Missing Data Can Be Meaningful

No matter how big the Big Data gets, we will never know everything about everything. Well, according to the super-duper computer called “Deep Thought” in the movie “The Hitchhiker’s Guide to the Galaxy” (don’t bother to watch it if you don’t care for the British sense of humour), the answer to “The Ultimate Question of Life, the Universe, and Everything” is “42.” Coincidentally, that is also my favorite number to bet on (I have my reasons), but I highly doubt that even that huge fictitious computer with unlimited access to “everything” provided that numeric answer with conviction after 7½ million years of computing and checking. At best, that “42” is an estimated figure of a sort, based on some fancy algorithm. And in the movie, even Deep Thought pointed out that “the answer is meaningless, because the beings who instructed it never actually knew what the Question was.” Ha! Isn’t that what I have been saying all along? For any type of analytics to be meaningful, one must properly define the question first. And what to do with the answer that comes out of an algorithm is entirely up to us humans, or in the business world, the decision-makers. (Who are probably human.)

Analytics is about making the best of what we know. Good analysts do not wait for a perfect dataset (it will never come by, anyway). And businesspeople have no patience to wait for anything. Big Data is big because we digitize everything, and everything that is digitized is stored somewhere in forms of data. For example, even if we collect mobile device usage data from just pockets of the population with certain brands of mobile services in a particular area, the sheer size of the resultant dataset becomes really big, really fast. And most unstructured databases are designed to collect and store what is known. If you flip that around to see if you know every little behavior through mobile devices for “everyone,” you will be shocked to see how small the size of the population associated with meaningful data really is. Let’s imagine that we can describe human beings with 1,000 variables coming from all sorts of sources, out of 200 million people. How many would have even 10 percent of the 1,000 variables filled with some useful information? Not many, and definitely not 100 percent. Well, we have more data than ever in the history of mankind, but still not for every case for everyone.

In my previous columns, I pointed out that decision-making is about ranking different options, and that to rank anything properly, we must employ predictive analytics (refer to “It’s All About Ranking”). And for ranking based on the scores resulting from predictive models to be effective, the datasets must be summarized to the level that is to be ranked (e.g., individuals, households, companies, emails, etc.). That is why transaction- or event-level datasets must be transformed into “buyer-centric” portraits before any modeling activity begins. Again, it is not about the transactions or the products; it is about the buyers, if you are doing all this to do business with people.

Trouble with buyer- or individual-centric databases is that such transformation of data structure creates lots of holes. Even if you have meticulously collected every transaction record that matters (and that will be the day), if someone did not buy a certain item, any variable that is created based on the purchase record of that particular item will have nothing to report for that person. Likewise, if you have a whole series of variables to differentiate online and offline channel behaviors, what would the online portion contain if the consumer in question never bought anything through the Web? Absolutely nothing. But in the business of predictive analytics, what did not happen is as important as what happened. Even a simple concept of “response” is only meaningful when compared to “non-response,” and the difference between the two groups becomes the basis for the “response” model algorithm.

Capturing the Meanings Behind Missing Data
Missing data are all around us. And there are many reasons why they are missing, too. It could be that there is nothing to report, as in aforementioned examples. Or, there could be errors in data collection—and there are lots of those, too. Maybe you don’t have access to certain pockets of data due to corporate, legal, confidentiality or privacy reasons. Or, maybe records did not match properly when you tried to merge disparate datasets or append external data. These things happen all the time. And, in fact, I have never seen any dataset without a missing value since I left school (and that was a long time ago). In school, the professors just made up fictitious datasets to emphasize certain phenomena as examples. In real life, databases have more holes than Swiss cheese. In marketing databases? Forget about it. We all make do with what we know, even in this day and age.

Then, let’s ask a few philosophical questions here:

  • If missing data are inevitable, what do we do about it?
  • How would we record them in databases?
  • Should we just leave them alone?
  • Or should we try to fill in the gaps?
  • If so, how?

The answer to all this is definitely not 42, but I’ll tell you this: Even missing data have meanings, and not all missing data are created equal, either.

Furthermore, missing data often contain interesting stories behind them. For example, certain demographic variables may be missing only for extremely wealthy people and very poor people, as their residency data are generally not exposed (for different reasons, of course). And that, in itself, is a story. Likewise, some data may be missing in certain geographic areas or for certain age groups. Collection of certain types of data may be illegal in some states. “Not” having any data on online shopping behavior or mobile activity may mean something interesting for your business, if we dig deeper into it without falling into the trap of predicting legal or corporate boundaries, instead of predicting consumer behaviors.

In terms of how to deal with missing data, let’s start with numeric data, such as dollars, days, counters, etc. Some numeric data simply may not be there, if there is no associated transaction to report. Now, if they are about “total dollar spending” and “number of transactions” in a certain category, for example, they can be initiated as zero and remain as zero in cases like this. The counter simply did not start clicking, and it can be reported as zero if nothing happened.

Some numbers are incalculable, though. If you are calculating “Average Amount per Online Transaction,” and if there is no online transaction for a particular customer, that is a situation for mathematical singularity—as we can’t divide anything by zero. In such cases, the average amount should be recorded as: “.”, blank, or any value that represents a pure missing value. But it should never be recorded as zero. And that is the key in dealing with missing numeric information; that zero should be reserved for real zeros, and nothing else.

I have seen too many cases where missing numeric values are filled with zeros, and I must say that such a practice is definitely frowned upon. If you have to pick just one takeaway from this article, that’s it. Like I emphasized, not all missing values are the same, and zero is not the way you record them. Zeros should never represent a lack of information.

Take the example of a popular demographic variable, “Number of Children in the Household.” This is a very predictive variable—not just for purchase behavior of children’s products, but for many other things. Now, it is a simple number, but it should never be treated as a simple variable—as, in this case, lack of information is not evidence of non-existence. Let’s say that you are purchasing this data from a third-party data compiler (or a data broker). If you don’t see a positive number in that field, it could be because:

  1. The household in question really does not have a child;
  2. Even the data-collector doesn’t have the information; or
  3. The data collector has the information, but the household record did not match to the vendor’s record, for some reason.

If that field contains a number like 1, 2 or 3, that’s easy, as they will represent the number of children in that household. But the zero should be reserved for cases where the data collector has a positive confirmation that the household in question indeed does not have any children. If it is unknown, it should be marked as blank or “.” (many statistical software packages, such as SAS, record missing values this way), or as “U” (though an alpha character should not be in a numeric field).
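The three scenarios above can be kept apart with an explicit encoding. Here is a minimal sketch with hypothetical field names; None plays the role of a pure missing value (like SAS’s “.”), and the non-match case gets its own separate indicator:

```python
def encode_children(matched, confirmed_count):
    """Return (num_children, non_match_flag).

    matched: did the household match the data compiler's record?
    confirmed_count: the compiler's positively confirmed count, or None.
    """
    if not matched:
        return None, 1          # unknown because of a non-match
    if confirmed_count is None:
        return None, 0          # matched, but even the compiler doesn't know
    return confirmed_count, 0   # known value; zero only if confirmed childless

print(encode_children(True, 2))      # (2, 0)
print(encode_children(True, 0))      # (0, 0) -- a real, confirmed zero
print(encode_children(True, None))   # (None, 0)
print(encode_children(False, None))  # (None, 1)
```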

If it is a case of non-match to the external data source, then there should be a separate indicator for it. The fact that the record did not match to a professional data compiler’s list may mean something. And I’ve seen cases where such non-match indicators made it into model algorithms along with other valid data, as in the case where missing indicators for income display the same directional tendency as high-income households.

Now, if the data compiler in question boldly inputs zeros for the cases of unknowns? Take a deep breath, fire the vendor, and don’t deal with the company again, as it is a sign that its representatives do not know what they are doing in the data business. I have done so in the past, and you can do it, too. (More on how to shop for external data in future articles.)

For non-numeric categorical data, similar rules apply. Some values could be truly “blank,” and those should be treated separately from “Unknown,” or “Not Available.” As a practice, let’s list all kinds of possible missing values in codes, texts or other character fields:

  • ” “—blank or “null”
  • “N/A,” “Not Available,” or “Not Applicable”
  • “Unknown”
  • “Other”—If it is originating from some type of multiple choice survey or pull-down menu
  • “Not Answered” or “Not Provided”—This indicates that the subjects were asked, but they refused to answer. Very different from “Unknown.”
  • “0”—In this case, the answer can be expressed in numbers. Again, only for known zeros.
  • “Non-match”—Not matched to other internal or external data sources
  • Etc.

It is entirely possible that all these values may be highly correlated to each other and move along the same predictive direction. However, there are many cases where they do not. And if they are combined into just one value, such as zero or blank, we will never be able to detect such nuances. In fact, I’ve seen many cases where one or more of these missing indicators move together with other “known” values in models. Again, missing data have meanings, too.
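One way to preserve those nuances is to carry each flavor of “missing” as its own indicator variable, rather than collapsing them all into a single blank or zero. A sketch, using a simplified version of the list above (the code values and field names are illustrative, not a standard):

```python
# One dummy variable per missing code, so a model can detect when, say,
# "Not Answered" behaves differently from "Unknown".
MISSING_CODES = ["", "N/A", "Unknown", "Other", "Not Answered", "Non-match"]

def missing_indicators(value):
    """Return one 0/1 indicator per missing code; all zeros for a known value."""
    return {f"is_{code or 'blank'}": int(value == code) for code in MISSING_CODES}

print(missing_indicators("Unknown"))  # only is_Unknown is set to 1
print(missing_indicators("Blue"))     # a real, known value: every indicator is 0
```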

Filling in the Gaps
Nonetheless, missing data do not have to be left as missing, blank or unknown all the time. With statistical modeling techniques, we can fill in the gaps with projected values. You didn’t think that all those data compilers really knew the income level of every household in the country, did you? It is not a big secret that many of those figures are modeled from other available data.

Such inferred statistics are everywhere. Popular variables, such as householder age, home owner/renter indicator, housing value, household income or—in the case of business data—the number of employees and sales volume contain modeled values. And there is nothing wrong with that, in the world where no one really knows everything about everything. If you understand the limitations of modeling techniques, it is quite alright to employ modeled values—which are much better alternatives to highly educated guesses—in decision-making processes. We just need to be a little careful, as models often fail to predict extreme values, such as household incomes over $500,000/year, or specific figures, such as incomes of $87,500. But “ranges” of household income, for example, can be predicted at a high confidence level, though it technically requires many separate algorithms and carefully constructed input variables in various phases. But such technicality is an issue that professional number crunchers should deal with, like in any other predictive businesses. Decision-makers should just be aware of the reality of real and inferred data.

Such imputation practices can be applied to any data source, not just compiled databases by professional data brokers. Statisticians often impute values when they encounter missing values, and there are many different methods of imputation. I haven’t met two statisticians who completely agree with each other when it comes to imputation methodologies, though. That is why it is important for an organization to have a unified rule for each variable regarding its imputation method (or lack thereof). When multiple analysts employ different methods, it often becomes the very source of inconsistent or erroneous results at the application stage. It is always more prudent to have the calculation done upfront, and store the inferred values in a consistent manner in the main database.

In terms of how that is done, there could be a long debate among the mathematical geeks. Will it be a simple average of non-missing values? If such a method is to be employed, what is the minimum required fill-rate of the variable in question? Surely, you do not want to project 95 percent of the population with 5 percent known values? Or will the missing values be replaced with modeled values, as in previous examples? If so, what would be the source of target data? What about potential biases that may exist because of data collection practices and their limitations? What should be the target definition? In what kind of ranges? Or should the target definition remain as a continuous figure? How would you differentiate modeled and real values in the database? Would you embed indicators for inferred values? Or would you forego such flags in the name of speed and convenience for users?
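As a rough illustration of that debate, here is one possible rule: a simple average of the non-missing values, guarded by a minimum fill-rate and accompanied by an inferred-value flag. The 50 percent threshold and the figures are arbitrary assumptions for the sketch, not recommendations:

```python
def impute_mean(values, min_fill_rate=0.5):
    """Fill None gaps with the mean of known values, flagging which
    entries were inferred. Refuses to impute when too few values are
    known (the fill-rate threshold is an assumed organizational rule).
    """
    known = [v for v in values if v is not None]
    fill_rate = len(known) / len(values)
    if fill_rate < min_fill_rate:
        raise ValueError(f"fill rate {fill_rate:.0%} below threshold")
    mean = sum(known) / len(known)
    filled = [v if v is not None else mean for v in values]
    inferred = [int(v is None) for v in values]  # 1 = modeled, 0 = real
    return filled, inferred

filled, inferred = impute_mean([40, None, 60, 50, None, 50])
print(filled)    # [40, 50.0, 60, 50, 50.0, 50]
print(inferred)  # [0, 1, 0, 0, 1, 0]
```

The point of the `inferred` flags is exactly the embedded-indicator question raised above: without them, downstream users cannot tell real values from modeled ones.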

The important matter is not the rules or methodologies, but the consistency of them throughout the organization and the databases. That way, all users and analysts will have the same starting point, no matter what the analytical purposes are. There could be a long debate in terms of what methodology should be employed and deployed. But once the dust settles, all data fields should be treated by pre-determined rules during the database update processes, avoiding costly errors in the downstream. All too often, inconsistent imputation methods lead to inconsistent results.

If, by some chance, individual statisticians end up with the freedom to come up with their own ways to fill in the blanks, then the model-scoring code in question must include missing-value imputation algorithms without exception, granted that such a practice will lengthen the model application process and significantly increase the chance of errors. It is also important that non-statistical users be educated about the basics of missing data and the associated imputation methods, so that everyone who has access to the database shares a common understanding of what they are dealing with. That list includes external data providers and partners, and it is strongly recommended that data dictionaries include the employed imputation rules wherever applicable.

Keep an Eye on the Missing Rate
Often, we get to find out that the missing rate of certain variables is going out of control because models become ineffective and campaigns start to yield disappointing results. Conversely, it can be stated that fluctuations in missing data ratios greatly affect the predictive power of models or any related statistical works. It goes without saying that a consistent influx of fresh data matters more than the construction and the quality of models and algorithms. It is a classic case of a garbage-in-garbage-out scenario, and that is why good data governance practices must include a time-series comparison of the missing rate of every critical variable in the database. If, all of a sudden, an important predictor’s fill-rate drops below a certain point, no analyst in this world can sustain the predictive power of the model algorithm, unless it is rebuilt with a whole new set of variables. The shelf life of models is definitely finite, but nothing deteriorates effectiveness of models faster than inconsistent data. And a fluctuating missing rate is a good indicator of such an inconsistency.
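A missing-rate monitor of the kind described above can be sketched simply: compute each variable’s current missing rate and compare it against a historical baseline. The variable names, rates and the 10-point tolerance below are all made-up illustrations, not thresholds from any real governance practice:

```python
def missing_rate(column):
    """Fraction of records with a pure missing value (None)."""
    return sum(v is None for v in column) / len(column)

def flag_drift(baseline_rates, current_rates, tolerance=0.10):
    """Return variables whose missing rate rose past the tolerance.

    A variable absent from the current snapshot counts as 100% missing.
    """
    return [name for name, base in baseline_rates.items()
            if current_rates.get(name, 1.0) - base > tolerance]

baseline = {"income": 0.20, "num_children": 0.35}
current = {"income": 0.45, "num_children": 0.38}
print(flag_drift(baseline, current))  # ['income']
```

In practice such a check would run as part of every database update, so that a fill-rate collapse is caught before it quietly degrades every model built on the affected variable.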

Likewise, if the model score distribution starts to deviate from the original model curve from the development and validation samples, it is prudent to check the missing rate of every variable used in the model. Any sudden changes in model score distribution are a good indicator that something undesirable is going on in the database (more on model quality control in future columns).

These few guidelines regarding the treatment of missing data will add more flavors to statistical models and analytics in general. In turn, proper handling of missing data will prolong the predictive power of models, as well. Missing data have hidden meanings, but they are revealed only when they are treated properly. And we need to do that until the day we get to know everything about everything. Unless you are just happy with that answer of “42.”

Beware of Dubious Data Providers: A 9-Point Checklist

Are you hounded by email pitches offering access to all kinds of prospective business targets? I am, and I hate it. As a B-to-B marketer, I am always interested in new customer data sources, so I feel compelled to at least give them a listen. But when I ask a few questions—like where their data comes from—answers come back like “A variety of sources” or “Sorry, that’s our intellectual property.” So, over time, I have come up with a nine-point assessment strategy to help marketers determine the likely legitimacy of a potential vendor, using approaches that can be replicated by anyone, at arm’s length.

Of course a lot of these emails are simply fraudulent. Early on, I stumbled upon an anonymous blog that reports on the most egregious of these emailers and connects them to unscrupulous spammers tracked by Spamhaus. It’s pretty hilarious to learn that many of these data sellers are complete fakes, sending identical emails from fake companies and fake addresses.

If you want to just delete them all as a matter of course, that’s a reasonable strategy. Myself, I’ve been throwing them in a folder called “suspicious data providers,” and every so often, I dig in to see if there’s any wheat among the chaff. And that is where this checklist was born.

I got some ideas from two colleagues who have written helpfully on this problem. Tim Slevin provides a nice 3-point assessment approach in the SLMA blog, where he recommends checking out the vendors’ physical address, researching them on LinkedIn, and asking them for a data sample so specific that you can tell whether their product is any good. All terrific ideas, which I have gladly incorporated in my approach.

Ken Magill, who writes an amusing and informative publication on email marketing, tackled this subject on behalf of one of his readers, who had unhappily prepaid for an email list that didn’t arrive. “You’re never going to see that $3000 again,” says Ken to the sucker. Ken offers a dozen or so red flags to look for when considering buying email addresses—and I have picked up some of his ideas, too. Magill wraps up his discussion with: “If you suspect you’d have trouble serving them with court papers, do not do business with them.”

So, to get to the point, here is my list of yes/no questions, which can be examined fairly easily, without any direct contact with the vendor.

  1. Do they have a website you can visit?
  2. Do they provide a physical business address?
  3. Do they have a company page on LinkedIn?
  4. Are the names of the management team provided on the website?
  5. Is there a client list on the website?
  6. Is there a testimonial on the website with a real name attached?
  7. Do they claim some kind of guaranteed level of accuracy for their data?
  8. Do they require 100 percent pre-payment?
  9. Is the sales rep using a Gmail or other email address unrelated to the company name?

For question Nos. 7, 8 and 9, a “no” is the right answer. For the first six, “yes” is what you’re looking for. I’d say that any vendor who gets more than one or two wrong answers should be avoided. Any other ideas out there?
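For illustration, that scoring rule can be written out mechanically; the vendor answers shown are hypothetical:

```python
# Desired yes/no answers for questions 1..9: "yes" for the first six,
# "no" for questions 7, 8 and 9.
DESIRED = [True] * 6 + [False] * 3

def wrong_answers(vendor_answers):
    """Count answers that differ from the desired ones."""
    return sum(a != d for a, d in zip(vendor_answers, DESIRED))

def should_avoid(vendor_answers, max_wrong=2):
    """Avoid any vendor with more than one or two wrong answers."""
    return wrong_answers(vendor_answers) > max_wrong

# A vendor with no website, no address, and 100% pre-payment required:
answers = [False, False, True, True, True, True, False, True, False]
print(wrong_answers(answers))  # 3
print(should_avoid(answers))   # True
```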

A version of this article appeared in Biznology, the digital marketing blog.

Prospecting to IT Buyers: How Nine Data Vendors Stack Up

Buyers of information technology (IT) are one of the most valued audiences targeted by business marketers. Globally, these professionals spend $3.6 trillion on hardware, software and technology services. My colleague Bernice Grossman and I recently investigated the availability of prospecting data available to tech marketers for reaching this desirable group, and we found some surprises.

We asked twenty companies that supply prospecting data to business marketers to share with us statistics about the quantity and quality of the data they have on IT buyers in the U.S. Nine vendors graciously participated in our study: specifically, Data.com, D&B, Harte-Hanks, Infogroup, Mardev-DM2, NetProspex, Stirista, Worldata and ZoomInfo. Our thanks to them for letting us poke around under their hoods.

We asked each participating vendor to report to us on the number of companies on their databases in ten industries, by SIC code. We also asked for the numbers of contacts with IT titles in a sampling of twenty firms in those SICs, ten large enterprises and ten small businesses. Finally, we sent them the names and addresses of ten actual IT professionals (people whom Bernice and I happen to know, and were able to persuade to let us submit their names), and we asked the vendors to share with us the exact record they have on those individuals. The results of our study can be downloaded here.

This is the same methodology we have used in past studies on prospecting data available to business marketers—although this was the first study we have done on a particular industry vertical. Our objective is, first, to get at the question of coverage, meaning the extent to which a business marketer can gain access to all the companies and contacts in the target market. And second, we want to show marketers the level of accuracy in the data available for prospecting. For example, is Joe Schmoe still the CIO at Acme Widgets, and can I get his correct phone number and email address?

The answers to these questions, in general, were YES. The data reported was surprisingly accurate, especially given how much business marketers complain about the data they get from vendors. And the coverage was wide, meaning there seem to be plenty of IT names in a variety of industries for us to contact.

But the data also revealed some interesting trends in business marketing in general, and in tech marketing in particular.

  • Prospecting data is being sold these days out of massive databases, which makes it far easier for marketers to select exactly the targets they want, by such criteria as title, company size and industry, irrespective of whether the names are “compiled” or “response” names.
  • Company counts by SIC varied widely among the vendors, reminding us that data providers may have their own proprietary systems for flagging a company by industry code.
  • Job titles are getting fuzzier than ever. We found real IT professionals using titles such as Platform Manager and Reporting Manager, which makes it tough to know what they really do.

Given these developments, we urge our fellow marketers to probe carefully on data sourcing and categorizing practices, and to specify in great detail exactly what targets you’re going after, when buying data for new customer acquisition. And we suggest that you source from multiple vendors, in order to expand your market coverage potential. Happy prospecting to all.