Search Needs Computational Linguistics to Solve Its Problems

The increased use of mobile devices means search must learn to answer questions posed in natural language. Research and technology development at Google on natural language processing are filtering into the search results, so SEOs need to step beyond the keyword into computational linguistics.

As users have become increasingly dependent on their digital devices, they expect to search on them using more natural language to shape their queries. Search is deeply embedded in the fabric of our lives, and we expect more from it than ever before.

We spend hours on our mobile devices every day and keep devices in our homes that rely on natural language processing to turn the television on or entertain us. Every search is a quest, and users are constantly looking for, and expecting, answers.

The terrain and contours of most e-commerce quests are reasonably easy to interpret, and SEOs have carefully developed methods for identifying keywords and concepts that apply to the most important quests that buyers/searchers will undertake for the products on offer.

Does this extend far enough? Hardly.

We must stay with our consumers and develop an understanding of the challenges of search and how they are being addressed by those who build and operate search technology.

What’s Going On?

Each day, Google processes billions of searches, and it has publicly noted that 15% of those queries have never been seen before. This means Google has no history of which pages are the most relevant to deliver for the query. These queries represent unfamiliar terrain, and Google has built ways to navigate this space.

What Needs to Happen?

The increased use of mobile devices that encourage the use of natural language means search must learn to answer questions posed in natural language. Current research and technology development at Google on natural language processing are filtering into the search results. SEOs need to step beyond the keyword into — are you ready — the arcane science of computational linguistics.

Computational linguistics is an interdisciplinary field that studies language from a computational perspective. Computational linguists build statistical or rule-based models and approaches to linguistic problems, such as natural language and search. The huge computational power available today has opened the door for rapid advances in the last five years. It is time for SEOs to integrate some of these learnings into their SEO practice.

Improving Natural Language Search

In October 2019, Google announced that it would be rolling out the BERT algorithm worldwide. BERT, short for Bidirectional Encoder Representations from Transformers, is a neural network-based technique for natural language processing (NLP) pre-training. Training and tuning are very important steps in developing working search algorithms. (For more on the science, see this Google blog.)

Google expects this improved model to impact 10% of all searches. It will be particularly helpful for improving queries written or spoken in natural, conversational language.

Some savvy searchers search in keyword-ese, putting long strings of disconnected words together in their queries. By keyword stuffing their query, they hope to get better results.

My own research has shown that the most frequent queries are multiple nouns strung together with an occasional adjective for refinement — long (adjective) house (noun) coat (noun). This is a simple example, but queries that are questions are much more difficult to parse. BERT will go a long way toward eliminating the need to use keyword-ese.
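
If you want to see that pattern for yourself, part-of-speech tagging makes it visible. Here is a minimal sketch using the open-source spaCy library (my choice of tool for illustration; nothing in this column depends on it), tagging a keyword-ese query next to a conversational question:

```python
# A minimal sketch: tag a keyword-ese query and a conversational question
# to show the noun-string pattern described above.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

queries = [
    "long house coat",                           # keyword-ese: ADJ + NOUN + NOUN
    "what should I wear over a long house coat"  # conversational, harder to parse
]

for query in queries:
    doc = nlp(query)
    print(query, "->", [(token.text, token.pos_) for token in doc])
```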

BERT is not to be confused with Google’s improved AI-based system of neural matching that is used to understand how words and concepts relate to one another, a super-synonym system. Combine BERT with the other advances, and we can surely expect better quality results.

Search, as a Study, Is Not Static

SEOs need to learn as much as they can about these technologies. Although it seems — at first blush — that we cannot optimize for it, we can create better content that responds better to the new technology, watch our performance metrics to see whether and how much we are improving, and then make more changes as needed. Now, isn’t that optimizing?

3 Reasons Why Achieving Organic Search Success Has Gotten Harder

If you think that it has gotten harder to achieve organic search success, you may be right. Most marketers recognize organic search’s tremendous value as an acquisition channel and focus on optimizing for organic search.

Even if you are following all of the guidelines and work hard to keep your site in tune with the current demands, you may still be watching your results falter or not grow at levels that had once been easy to achieve. The rewards are still there, but organic search success has gotten harder.

This article will explore three reasons why, despite best efforts, achieving significant search traffic gains may be eluding you. The reasons are structural, outside your site: increased competition for top organic listings; more screens, each with its own demands; and changing consumer expectations.

More Players, Smaller Field of Play

Early adopters of search were richly rewarded. Many online businesses that recognized the potential of search cashed in by optimizing their sites.

At the same time, the search industry landscape was more diverse than it is today, and the technology was much less complex and easier to game. There were more search engines to consider in building an optimization plan, but that also meant more baskets to put your eggs in.

As the landscape changed and Google became increasingly dominant, search marketers had to focus their efforts toward pleasing an ever-more-sophisticated algorithm. The unfortunate side effect is that a mistake or a misbegotten tactic could and would catastrophically impact a site’s results. Add in that it was no longer a secret that search really works, and the number of businesses seeking those top results grew exponentially.

With the continued growth of e-commerce and the stumbling of bricks-and-mortar retailers, such as Sears, retail has rushed into the organic space. The increased competition of more players seeking the top spots on just a few engines has increased the amount of effort that must go into successful search optimization. This view assumes that the site owner is making all the right moves to keep up with the improving technology. In short, it has gotten harder, even net of technology advances.

More Screens, Less Space

The growth of mobile and its impact on organic search cannot be overstated.

Previous posts have discussed mobile rankings and Google’s own move to a mobile-first index.

Mobile makes the work and the chances for success harder for several reasons. Many sites are still developed in ways that make them mobile hostile – too-small text, color schemes that are hard to see on smaller screens, buttons that are too small, layouts that are difficult to maneuver around.

In moving to a mobile-first index and ranking scheme, Google has upped the ante for search success. Additionally, because the algorithm rewards content creation, site owners must balance the demands of the small screen against content presentation. The real downer is that on the small screen, the organic listings are pushed below the fold, off the screen, more readily.

With the recent announcement of new Gallery and Discovery ad formats, it remains to be seen how much screen real estate will be available for organic results. Being No. 1 has never had greater value than it does today.

Consumer Expectations Drive Search

Consumers drive search — they always have. Gone are the days of clunky keyword-stuffed copy (written to impress an algorithm, not a human). Deceptive titles and descriptions are a thing of the past.

Consumers’ role has been reaffirmed. They are savvy enough to click away from a page that does not meet the expectation set in the search result. Google’s use of snippets is a measure of how well or how poorly your page matches user queries. If Google is always pulling a snippet and never using your description, it may be time to rethink your scheme for writing metadata.

As consumers grow more demanding, it is essential that we, as marketers, provide what they want. As consumer wants change, so we, too, must change.

Change is hard. And today, it is harder than ever to create and execute organic search strategies that work.

The ‘Algorithmification‘ of Everything

If I had asked any of my schoolmates what an “algorithm” was, their eyes would have glazed over and they would probably have asked me what I had been smoking. Fast-forward a few decades and we’ve got the algorithmification of everything, including marketing.

Those glazed looks would’ve happened a long time ago, long before Facebook was a glimmer in Mark Zuckerberg’s eye and before he had started to bring together the more than 2 billion people who log in at least once a month. That Facebook population, as Evan Osnos of The New Yorker observes, were it a country, “ … would have the largest population on Earth … [and] as many adherents as Christianity.” When they log in, those users are shaking hands with unnumbered algorithms and putting into those invisible fingers their faith and their data, to be parsed, analyzed and manipulated, and hopefully not stolen.

What is an algorithm? Programmers like to say it is a word used by them when they don’t want to bother explaining what they do.

And because algorithms have become so ubiquitous, we seldom give them a thought — except when our IT colleagues start telling us why making any small change in our marketing program will take weeks or months and cost a bundle, or when something goes badly wrong, as Facebook and others have discovered with their hacked data.

Our legislators, not usually well-versed on technology matters, have now started making a lot of noise about regulations: They are closing the server door after the data has bolted — an unlikely way to solve the essential problems.

Automation has always been the Holy Grail for marketers; not surprising when the ability, speed and relatively low cost of using artificial intelligence (AI) to number-crunch and to manage the segmentation of media and the analysis of data get better and better every year. eMarketer reports: “About four in 10 of the worldwide advertisers surveyed by MediaMath and Econsultancy said they use AI for media spend optimization. This is another application of AI that is increasing among marketers as their demand-side platforms add AI features to increase the probability that a given programmatic bid will win its auction.”

Where is it headed? No one knows for sure. It’s all in the hands of the algorithms and they appear to be multiplying like rabbits. If you revere Darwin, as I do, you’ll expect them to get better and better. But before you totally buy into that, you would do well to read Melanie Mitchell’s thoughtful New York Times article “Artificial Intelligence Hits the Barrier of Meaning.”

There are more and more times when we applaud the use of the algorithms and can see that, if properly created, they offer many benefits for almost every area of our marketing practice, as well as other areas of our lives. We really don’t have to panic (yet) about the machines and their algorithms taking over. As Neil Hughes wrote here last month: “The reality is that machines learn from systems and processes that are programmed by humans, so our destiny is still very much in our own hands.”

Machines screw up just like we do, and all the more so because they are doing just what we told them to do.

All this machine thinking doesn’t come without dangerous side effects. Sometimes, when we try to communicate with inflexible AI systems supposedly designed to simplify and ease customer interactions, the “I” in “AI” becomes an “S,” replacing “A-Intelligence” with “A-stupidity.”

If, as defined, “an algorithm is a procedure or formula for solving a problem, based on conducting a sequence of specified actions,” we can only optimistically hope that the specified actions will take into account individual customer differences and make allowances for them. The moments when they don’t are when we start screaming and swearing; especially if we are on the customer end of the transactions.

As The New York Times wrote in a recent article: “The truth is that nobody knows what algorithmification of the human experience will bring.”

“It’s telling that companies like Facebook are only beginning to understand, much less manage, any harm caused by their decision to divert an ever-growing share of human social relations through algorithms. Whether they set out to or not, these companies are conducting what is arguably the largest social re-engineering experiment in human history, and no one has the slightest clue what the consequences are.”

However important algorithmification may seem to us, our marketing efforts and our use of AI and its algorithms are not very significant in the greater scheme of things outside of our limited business perspective. But don’t dismiss their growing impact on every facet of our future lives. As data guru Stephen H. Yu opined in his recent piece “Replacing Unskilled Data Marketers With AI”:

“In the future, people who can wield machines will be in secure places — whether they are coders or not — while new breeds of logically illiterate people will be replaced by the machines, one-by-one.”

You had better start to develop a meaningful relationship with your algorithms — while there is still time.

How Machine Learning Is Changing the SEO Rules

More than 40 updates in four years — that’s how often Google updates its search engine algorithms. And while most of these updates only caused ripples, others made waves that left digital marketers scrambling for solid ground. What if search engine algorithms evolved seamlessly without updates?

Thanks to machine learning, the days of potentially jarring updates could someday be behind us. Machine learning occurs when programs can make predictions or determinations based on a wide range of signals or parameters. Uber, Auto Trader and Expedia are among the many large companies that employ machine learning; the technology is also proving useful in the fields of fraud detection, data security and financial trading. And yes, machine learning is already commonplace within Google and Microsoft, two of the world’s largest search and technology giants.

Don’t expect Google’s programmers to bow down to artificial intelligence anytime soon. However, there’s no denying that machine learning will play a big role in SEO.

Machine Learning’s Place in Google

You don’t need to travel far back in history to find Google casting doubt on the quality of machine learning.

Back in 2008, Google officials still believed their human programmers were more capable and less error-prone than the artificial intelligence available at the time, according to the marketing analysis blog Datawocky. In a 2011 discussion on Quora, a poster who claimed to work at Google from 2008 through 2010 said the company’s search team preferred a rule-based system over a machine-learning system because it could implement faster and more definitive algorithm changes.

However, machine learning was a core component of Google AdWords by 2012. The platform’s machine learning system, referred to as SmartASS, could determine whether users would be interested enough in an ad to click it. One year later, Google officials were speaking publicly about working machine learning into their search engine algorithms.

Today, Google uses machine learning with its search algorithms mostly for “coming up with new signals and signal aggregations,” Gary Illyes of Google told Search Engine Land in October. He explained how Google’s search team uses machine learning to predict which algorithm adjustments are most worthwhile.

Illyes also talked about RankBrain, a machine-learning system implemented by Google in 2015.

RankBrain plays a vital role in Google’s ability to interpret long-tail search terms – like those often spoken into smartphones — and return relevant search results. In a Bloomberg article published in October 2015, Google senior research scientist Greg Corrado said the machine-learning system had become the third-most important page-ranking factor out of roughly 200 signals that impact the search algorithm. RankBrain was rolled out after a year of programming and testing, and it’s regularly fed loads of new data to improve its capabilities, Corrado said.

So, we know Google uses machine learning to test and shape its algorithms. We also know Google is much more open now to embracing this technology. That raises the question: What’s next?

What Machine Learning Means for SEO

The more machine learning plays a role in search engine algorithms, the more digital marketers will need to be proactive about maximizing the user experience of their websites and landing pages. Machine-learning systems will result in more fluid search algorithms that make real-time determinations based on positive and negative reactions to content.

With that in mind, SEO experts can prepare for the machine-learning revolution by focusing on the following questions.

  1. Is your landing page relevant?
    Visitors who arrive at your site on the most appropriate landing pages are much less likely to bounce back to the search engine results page (SERP), and high bounce rates are easily detectable red flags of a poor user experience.
  2. Could my landing pages be more engaging?
    You’re halfway there if your visitors are arriving on the right pages. Now, think of new ways to capture their attention. Can you add videos, guides or additional products that add value for your visitors and make each visit more compelling?

3 Organic Search Predictions for 2016

It is always fun at the beginning of the New Year to pull out the crystal ball and make bold pronouncements about what the New Year will bring. I confess my ball is not crystal but, instead, is one of those plastic eight balls that always says “future is cloudy,” or some other meaningless sentiment. This still does not keep me from looking ahead and postulating as to what the future may bring. Here are my three rock-solid predictions:

Prediction 1 – Google Will Make Hundreds of Changes to Its Algorithm
It is easy to predict that Google will make hundreds of changes to its algorithm during the span of the year, because the search engine regularly claims to make hundreds of changes each year. So why highlight the obvious? The point is that it is pointless to try to chase every tweak and change made by Google. Once upon a time, algorithm chasing was very much a practiced art among SEOs. Most wise practitioners no longer chase changes, endlessly looking for ways to defeat the mighty algorithm. Unfortunately, many of us work with clients who still believe that there is value in aggressive algo-chasing. As a result, it is one of my missions to dispel this thinking and replace it with a more balanced approach.

Prediction 2 – Google Will Announce Some Big Change
This is another can’t-miss prediction that I am sure will come true. Each year, amid the hundreds of minor changes, Google announces some big change or other that is predicted to have a serious impact on search rankings and the resulting traffic. The real question is whether the next much-ballyhooed change will be another fizzle, like Mobilegeddon, or something more consequential. Giving preference to mobile-friendly sites was supposed to be the proverbial Armageddon of search, but it was a bit of a fizzle. The reason it fizzled was that the well-optimized sites – those at the top of the rankings – were already tooled up for the change, and their results did not suffer much at all. An answer appears to be emerging for how to avoid being caught up in responding to announced changes and the inevitable unannounced major changes, like the Phantom Update. The ability to float unscathed above the turbulence of all this change is possible and does not require prescience or magic.

Missing Data Can Be Meaningful

No matter how big the Big Data gets, we will never know everything about everything. Well, according to the super-duper computer called “Deep Thought” in the movie “The Hitchhiker’s Guide to the Galaxy” (don’t bother to watch it if you don’t care for the British sense of humour), the answer to “The Ultimate Question of Life, the Universe, and Everything” is “42.” Coincidentally, that is also my favorite number to bet on (I have my reasons), but I highly doubt that even that huge fictitious computer with unlimited access to “everything” provided that numeric answer with conviction after 7½ million years of computing and checking. At best, that “42” is an estimated figure of a sort, based on some fancy algorithm. And in the movie, even Deep Thought pointed out that “the answer is meaningless, because the beings who instructed it never actually knew what the Question was.” Ha! Isn’t that what I have been saying all along? For any type of analytics to be meaningful, one must properly define the question first. And what to do with the answer that comes out of an algorithm is entirely up to us humans, or in the business world, the decision-makers. (Who are probably human.)

Analytics is about making the best of what we know. Good analysts do not wait for a perfect dataset (it will never come by, anyway). And businesspeople have no patience to wait for anything. Big Data is big because we digitize everything, and everything that is digitized is stored somewhere in forms of data. For example, even if we collect mobile device usage data from just pockets of the population with certain brands of mobile services in a particular area, the sheer size of the resultant dataset becomes really big, really fast. And most unstructured databases are designed to collect and store what is known. If you flip that around to see if you know every little behavior through mobile devices for “everyone,” you will be shocked to see how small the size of the population associated with meaningful data really is. Let’s imagine that we can describe human beings with 1,000 variables coming from all sorts of sources, out of 200 million people. How many would have even 10 percent of the 1,000 variables filled with some useful information? Not many, and definitely not 100 percent. Well, we have more data than ever in the history of mankind, but still not for every case for everyone.

In my previous columns, I pointed out that decision-making is about ranking different options, and that to rank anything properly, we must employ predictive analytics (refer to “It’s All About Ranking”). And for ranking based on the scores resulting from predictive models to be effective, the datasets must be summarized to the level that is to be ranked (e.g., individuals, households, companies, emails, etc.). That is why transaction- or event-level datasets must be transformed into “buyer-centric” portraits before any modeling activity begins. Again, it is not about the transaction or the products; it is about the buyers, if you are doing all this to do business with people.
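
As an illustration, here is a minimal sketch of that transaction-to-buyer rollup using pandas; the column names are hypothetical, and real data feeds will look messier:

```python
# A minimal sketch of rolling transaction-level records up to a
# buyer-centric view with pandas. Column names are hypothetical.
import pandas as pd

transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "channel":     ["online", "offline", "offline", "online"],
    "amount":      [25.0, 40.0, 15.0, 60.0],
})

# One row per buyer, with spend and transaction counts per channel.
buyer_view = (
    transactions
    .groupby(["customer_id", "channel"])["amount"]
    .agg(["sum", "count"])
    .unstack("channel")
)
buyer_view.columns = ["_".join(col) for col in buyer_view.columns]
print(buyer_view)
# Customer 2 never bought online, so the online columns come out empty (NaN).
# Those holes are exactly what the next paragraph is about.
```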

The trouble with buyer- or individual-centric databases is that such transformation of the data structure creates lots of holes. Even if you have meticulously collected every transaction record that matters (and that will be the day), if someone did not buy a certain item, any variable that is created based on the purchase record of that particular item will have nothing to report for that person. Likewise, if you have a whole series of variables to differentiate online and offline channel behaviors, what would the online portion contain if the consumer in question never bought anything through the Web? Absolutely nothing. But in the business of predictive analytics, what did not happen is as important as what happened. Even a simple concept of “response” is only meaningful when compared to “non-response,” and the difference between the two groups becomes the basis for the “response” model algorithm.

Capturing the Meanings Behind Missing Data
Missing data are all around us. And there are many reasons why they are missing, too. It could be that there is nothing to report, as in aforementioned examples. Or, there could be errors in data collection—and there are lots of those, too. Maybe you don’t have access to certain pockets of data due to corporate, legal, confidentiality or privacy reasons. Or, maybe records did not match properly when you tried to merge disparate datasets or append external data. These things happen all the time. And, in fact, I have never seen any dataset without a missing value since I left school (and that was a long time ago). In school, the professors just made up fictitious datasets to emphasize certain phenomena as examples. In real life, databases have more holes than Swiss cheese. In marketing databases? Forget about it. We all make do with what we know, even in this day and age.

Then, let’s ask a philosophical question here:

  • If missing data are inevitable, what do we do about it?
  • How would we record them in databases?
  • Should we just leave them alone?
  • Or should we try to fill in the gaps?
  • If so, how?

The answer to all this is definitely not 42, but I’ll tell you this: Even missing data have meanings, and not all missing data are created equal, either.

Furthermore, missing data often contain interesting stories behind them. For example, certain demographic variables may be missing only for extremely wealthy people and very poor people, as their residency data are generally not exposed (for different reasons, of course). And that, in itself, is a story. Likewise, some data may be missing in certain geographic areas or for certain age groups. Collection of certain types of data may be illegal in some states. “Not” having any data on online shopping behavior or mobile activity may mean something interesting for your business, if we dig deeper into it without falling into the trap of predicting legal or corporate boundaries, instead of predicting consumer behaviors.

In terms of how to deal with missing data, let’s start with numeric data, such as dollars, days, counters, etc. Some numeric data simply may not be there, if there is no associated transaction to report. Now, if they are about “total dollar spending” and “number of transactions” in a certain category, for example, they can be initiated as zero and remain as zero in cases like this. The counter simply did not start clicking, and it can be reported as zero if nothing happened.

Some numbers are incalculable, though. If you are calculating “Average Amount per Online Transaction,” and if there is no online transaction for a particular customer, that is a situation for mathematical singularity—as we can’t divide anything by zero. In such cases, the average amount should be recorded as: “.”, blank, or any value that represents a pure missing value. But it should never be recorded as zero. And that is the key in dealing with missing numeric information; that zero should be reserved for real zeros, and nothing else.

I have seen too many cases where missing numeric values are filled with zeros, and I must say that such a practice is definitely frowned-upon. If you have to pick just one takeaway from this article, that’s it. Like I emphasized, not all missing values are the same, and zero is not the way you record them. Zeros should never represent lack of information.
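
Here is a minimal sketch of that rule in pandas, with hypothetical column names: counters and totals keep their true zeros, while the incalculable ratio is recorded as a missing value, never zero.

```python
# A minimal sketch of the zero-vs-missing rule: true zeros stay zero,
# undefined ratios become NaN. Column names are hypothetical.
import numpy as np
import pandas as pd

buyers = pd.DataFrame({
    "online_spend":  [120.0, 0.0, 75.0],
    "online_orders": [3,     0,   2],
})

# "Nothing happened" is a real zero for counters and totals.
# "Average per order" is undefined when there are no orders, so record NaN.
buyers["avg_online_order"] = np.where(
    buyers["online_orders"] > 0,
    buyers["online_spend"] / buyers["online_orders"],
    np.nan,
)
print(buyers)
```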

Take the example of a popular demographic variable, “Number of Children in the Household.” This is a very predictive variable—not just for purchase behavior of children’s products, but for many other things. Now, it is a simple number, but it should never be treated as a simple variable—as, in this case, lack of information is not evidence of non-existence. Let’s say that you are purchasing this data from a third-party data compiler (or a data broker). If you don’t see a positive number in that field, it could be because:

  1. The household in question really does not have a child;
  2. Even the data-collector doesn’t have the information; or
  3. The data collector has the information, but the household record did not match to the vendor’s record, for some reason.

If that field contains a number like 1, 2 or 3, that’s easy, as it represents the number of children in that household. But zero should be reserved for cases where the data collector has a positive confirmation that the household in question indeed does not have any children. If it is unknown, it should be marked as blank or “.” (many statistical software packages, such as SAS, record missing values this way), or as “U” (though an alpha character should not be in a numeric field).

If it is a case of a non-match to the external data source, then there should be a separate indicator for it. The fact that the record did not match a professional data compiler’s list may mean something. And I’ve seen cases where such non-matching indicators made it into model algorithms along with other valid data, as in the case where missing indicators for income display the same directional tendency as high-income households.
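
Putting those rules together, here is a minimal sketch with made-up records: zero only for a confirmed zero, a missing value for unknowns, and a separate flag for the non-matches.

```python
# A minimal sketch of the coding rules above. Values and column names
# are illustrative, not taken from any real compiler file.
import pandas as pd

households = pd.DataFrame({
    "matched_to_vendor": [True, True, True, False],
    # None = the vendor had no information (or there was no match at all)
    "children_reported": [2, 0, None, None],
})

# Zero survives only as a positive confirmation of "no children";
# unknowns remain missing (NaN), never zero.
households["num_children"] = households["children_reported"]
households["num_children_unknown"] = households["num_children"].isna().astype(int)
# A non-match to the external source gets its own indicator.
households["vendor_nonmatch"] = (~households["matched_to_vendor"]).astype(int)
print(households)
```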

Now, if the data compiler in question boldly inputs zeros for the cases of unknowns? Take a deep breath, fire the vendor, and don’t deal with the company again, as it is a sign that its representatives do not know what they are doing in the data business. I have done so in the past, and you can do it, too. (More on how to shop for external data in future articles.)

For non-numeric categorical data, similar rules apply. Some values could be truly “blank,” and those should be treated separately from “Unknown,” or “Not Available.” As a practice, let’s list all kinds of possible missing values in codes, texts or other character fields:

  • ” “—blank or “null”
  • “N/A,” “Not Available,” or “Not Applicable”
  • “Unknown”
  • “Other”—If it is originating from some type of multiple choice survey or pull-down menu
  • “Not Answered” or “Not Provided”—This indicates that the subjects were asked, but they refused to answer. Very different from “Unknown.”
  • “0”—In this case, the answer can be expressed in numbers. Again, only for known zeros.
  • “Non-match”—Not matched to other internal or external data sources
  • Etc.

It is entirely possible that all these values may be highly correlated to each other and move along the same predictive direction. However, there are many cases where they do not. And if they are combined into just one value, such as zero or blank, we will never be able to detect such nuances. In fact, I’ve seen many cases where one or more of these missing indicators move together with other “known” values in models. Again, missing data have meanings, too.
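
A minimal sketch of keeping those flavors apart, using pandas and made-up raw codes from the list above, looks like this:

```python
# A minimal sketch: standardize raw missing-value codes into distinct
# categories instead of collapsing them into one blank. Raw codes are
# hypothetical.
import pandas as pd

raw = pd.Series(["Y", "N", "", "N/A", "Unknown", "Not Answered", "NONMATCH"])

standardize = {
    "":             "blank",
    "N/A":          "not_applicable",
    "Unknown":      "unknown",
    "Not Answered": "refused",
    "NONMATCH":     "non_match",
}
clean = raw.replace(standardize)
print(clean.value_counts())
# Each flavor keeps its own level, so a model can learn whether "refused"
# behaves differently from "unknown" -- a nuance lost if both become blank.
```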

Filling in the Gaps
Nonetheless, missing data do not have to be left as missing, blank or unknown all the time. With statistical modeling techniques, we can fill in the gaps with projected values. You didn’t think that all those data compilers really knew the income level of every household in the country, did you? It is not a big secret that many of those figures are modeled from other available data.

Such inferred statistics are everywhere. Popular variables, such as householder age, home owner/renter indicator, housing value, household income or—in the case of business data—the number of employees and sales volume contain modeled values. And there is nothing wrong with that, in the world where no one really knows everything about everything. If you understand the limitations of modeling techniques, it is quite alright to employ modeled values—which are much better alternatives to highly educated guesses—in decision-making processes. We just need to be a little careful, as models often fail to predict extreme values, such as household incomes over $500,000/year, or specific figures, such as incomes of $87,500. But “ranges” of household income, for example, can be predicted at a high confidence level, though it technically requires many separate algorithms and carefully constructed input variables in various phases. But such technicality is an issue that professional number crunchers should deal with, like in any other predictive businesses. Decision-makers should just be aware of the reality of real and inferred data.

Such imputation practices can be applied to any data source, not just compiled databases by professional data brokers. Statisticians often impute values when they encounter missing values, and there are many different methods of imputation. I haven’t met two statisticians who completely agree with each other when it comes to imputation methodologies, though. That is why it is important for an organization to have a unified rule for each variable regarding its imputation method (or lack thereof). When multiple analysts employ different methods, it often becomes the very source of inconsistent or erroneous results at the application stage. It is always more prudent to have the calculation done upfront, and store the inferred values in a consistent manner in the main database.
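
For what that can look like in practice, here is a minimal sketch using scikit-learn’s SimpleImputer, filling a numeric field with its median while keeping an explicit “was inferred” flag; the median strategy is my illustrative choice, not a recommendation.

```python
# A minimal sketch: impute missing income values and keep an indicator
# column so inferred values are never confused with real ones.
import numpy as np
from sklearn.impute import SimpleImputer

income = np.array([[52000.0], [np.nan], [87500.0], [np.nan], [61000.0]])

imputer = SimpleImputer(strategy="median", add_indicator=True)
filled = imputer.fit_transform(income)
# Column 0: income with missing values replaced by the median of the knowns.
# Column 1: 1.0 where the value was imputed, 0.0 where it was real.
print(filled)
```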

In terms of how that is done, there could be a long debate among the mathematical geeks. Will it be a simple average of non-missing values? If such a method is to be employed, what is the minimum required fill-rate of the variable in question? Surely, you do not want to project 95 percent of the population with 5 percent known values? Or will the missing values be replaced with modeled values, as in previous examples? If so, what would be the source of target data? What about potential biases that may exist because of data collection practices and their limitations? What should be the target definition? In what kind of ranges? Or should the target definition remain as a continuous figure? How would you differentiate modeled and real values in the database? Would you embed indicators for inferred values? Or would you forego such flags in the name of speed and convenience for users?

The important matter is not the rules or methodologies, but the consistency of them throughout the organization and the databases. That way, all users and analysts will have the same starting point, no matter what the analytical purposes are. There could be a long debate in terms of what methodology should be employed and deployed. But once the dust settles, all data fields should be treated by pre-determined rules during the database update processes, avoiding costly errors in the downstream. All too often, inconsistent imputation methods lead to inconsistent results.

If, by some chance, individual statisticians end up with the freedom to come up with their own ways to fill in the blanks, then the model-scoring code in question must include missing-value imputation algorithms without exception, granted that such a practice will lengthen the model application process and significantly increase the chances for errors. It is also important that non-statistical users be educated about the basics of missing data and the associated imputation methods, so that everyone who has access to the database shares a common understanding of what they are dealing with. That list includes external data providers and partners, and it is strongly recommended that data dictionaries include the imputation rules employed wherever applicable.

Keep an Eye on the Missing Rate
Often, we find out that the missing rate of certain variables has gone out of control only because models become ineffective and campaigns start to yield disappointing results. Conversely, it can be stated that fluctuations in missing data ratios greatly affect the predictive power of models or any related statistical work. It goes without saying that a consistent influx of fresh data matters more than the construction and the quality of models and algorithms. It is a classic case of a garbage-in-garbage-out scenario, and that is why good data governance practices must include a time-series comparison of the missing rate of every critical variable in the database. If, all of a sudden, an important predictor’s fill-rate drops below a certain point, no analyst in this world can sustain the predictive power of the model algorithm, unless it is rebuilt with a whole new set of variables. The shelf life of models is definitely finite, but nothing deteriorates the effectiveness of models faster than inconsistent data. And a fluctuating missing rate is a good indicator of such an inconsistency.
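
A minimal sketch of such a time-series check, using pandas with a hypothetical monthly snapshot layout and an arbitrary 20-point alert threshold, might look like this:

```python
# A minimal sketch: track each variable's missing rate by monthly snapshot
# and flag sudden jumps. Layout, values and the 20-point threshold are
# all hypothetical.
import pandas as pd

snapshots = pd.DataFrame({
    "month":  ["2023-01"] * 4 + ["2023-02"] * 4,
    "income": [50000, None, 62000, 71000, None, None, None, 58000],
    "age":    [34, 41, None, 29, 38, None, 45, 52],
})

missing_rate = (
    snapshots.drop(columns="month")
    .isna()
    .groupby(snapshots["month"])
    .mean()
    .mul(100)
)
print(missing_rate.round(1))

# Alert on any variable whose missing rate jumped more than 20 points
# between consecutive snapshots.
jumps = missing_rate.diff()
print(jumps[jumps > 20].dropna(how="all"))
```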

Likewise, if the model score distribution starts to deviate from the original model curve from the development and validation samples, it is prudent to check the missing rate of every variable used in the model. Any sudden changes in model score distribution are a good indicator that something undesirable is going on in the database (more on model quality control in future columns).

These few guidelines regarding the treatment of missing data will add more flavors to statistical models and analytics in general. In turn, proper handling of missing data will prolong the predictive power of models, as well. Missing data have hidden meanings, but they are revealed only when they are treated properly. And we need to do that until the day we get to know everything about everything. Unless you are just happy with that answer of “42.”