Data Atrophy

Not all data are created equal. There are one-dimensional demographic and firmographic data, and then there are more colorful behavioral data. The former are about how the targets look, and the latter are about what they do: what they click, browse, purchase and say. On top of these, if we are lucky, we may have access to attitudinal data, which are about what the target is thinking. If we get to have all three types of data about customers and prospects, the prediction business will definitely get to the next level (refer to “Big Data Must Get Smaller”). But the reality is that it is very difficult to know everything about anyone, and that is why analytics is really about making the best of what we know. Predictive modeling is useful not only because it predicts the future, but also because it fills gaps in data. And even in the age of abundant data, there are many holes, as we will never have a complete set of information (refer to “Why Model?”).

Among these data types, some are more useful for prediction than others. Behavioral data definitely possess more predictive power than simple demographic data. But alas, they are harder to come by. It could be that the target is new to the environment, so she may not have left much data behind at all. Maybe she just looked around and hasn’t bought anything yet. Or she is very privacy-conscious and diligent about erasing her behavioral trails, on the net or otherwise. Maybe she explicitly opted out of being tracked at all, giving up much of the convenience of being known by the merchants. That is when data coverage comes into the equation, and that is why analysts rely on demographic and geo-demographic data: they are readily available. Much of such data can easily be purchased and appended on a household or individual level, at least in the U.S. If we have some hint of the target’s identity, there are ways to merge disparate data sets together.
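To make the coverage point concrete, here is a minimal sketch in Python. Everything in it is a made-up assumption for illustration: the household IDs, the field names (purchases_90d, income_band) and the helper functions append_sources and coverage are hypothetical, not any vendor's actual format. It appends two sources to a target list by a shared key and compares how much of the list each one covers.

```python
# Hypothetical records keyed by a shared household_id; all names are illustrative.
targets = ["H1", "H2", "H3", "H4"]

# Behavioral data: rich but sparse (new or privacy-conscious targets leave none).
behavioral = {
    "H1": {"purchases_90d": 3},
}

# Appended demographic data: plainer but widely available.
demographic = {
    "H1": {"income_band": "C"},
    "H2": {"income_band": "B"},
    "H3": {"income_band": "D"},
}


def append_sources(ids, *sources):
    """Left-join every source onto the target list by household_id."""
    merged = []
    for hid in ids:
        row = {"household_id": hid}
        for source in sources:
            # Targets missing from a source simply get no fields from it.
            row.update(source.get(hid, {}))
        merged.append(row)
    return merged


def coverage(rows, field):
    """Share of rows in which the field is populated."""
    if not rows:
        return 0.0
    return sum(field in row for row in rows) / len(rows)


rows = append_sources(targets, behavioral, demographic)
print(coverage(rows, "purchases_90d"))  # 0.25
print(coverage(rows, "income_band"))    # 0.75
```

In this toy list, the behavioral field is populated for one target in four, while the demographic field reaches three in four. That is the trade-off in miniature: the more predictive variable is the one most often missing.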

What if we don’t get to know who is leaving the data trails? Again, it could be about the privacy concerns of the target, or the manner in which the data are collected. Some data collectors avoid personally identifiable information (PII), such as name, address or email, as they do not want to be seen as Big Brother. Even if collectors have access to such PII, they do not share it with outsiders, to maintain dominance and to avoid the data privacy issue altogether. And there are many instances where that “who” part is completely out of reach. Movement data would be an example of that.

Weaving multiple types of data together is often the main source of trouble when it comes to predictive analytics. I have been talking about the importance of a 360-degree view of a customer for proper personalization and attribution, but the main show-stopper there is often the inability to merge data sources with confidence, not the lack of technology or statistical skills. That would be the horizontal challenge when dealing with multiple types of data.

Then there is the time factor. Like living organisms, data get old and wither away, too. Let’s call it the “data atrophy” challenge. Data players must be mindful of it, because in the decision-making or prediction business, outdated information is often worse than having none at all.

Now, not all data types deteriorate at the same rate. The shelf life of demographic data is far longer than that of behavioral data. For example, people’s income levels or housing sizes do not change overnight, while the usefulness of what we call “hotline” data evaporates much faster. If you get to know that someone is searching for a new car, how long will he be in the market? What if it is about a ticket or pay-per-view purchase for tonight’s ball game? Data that are extremely valuable this minute could be totally irrelevant within the next hour.
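One simple way to picture different rates of atrophy is to give each data type its own half-life and discount a data point by its age. The sketch below is only an illustration under that assumption. The exponential form, the half-life values and the names HALF_LIFE and freshness_weight are all hypothetical; real decay rates would have to be estimated from your own data.

```python
from datetime import datetime, timedelta

# Hypothetical half-lives per data type; illustrative values, not measurements.
HALF_LIFE = {
    "demographic": timedelta(days=730),
    "behavioral": timedelta(days=30),
    "hotline": timedelta(hours=6),
}


def freshness_weight(data_type, observed_at, now):
    """Return a weight between 0 and 1 that halves with every half-life of age."""
    age = now - observed_at
    if age < timedelta(0):
        raise ValueError("observed_at is later than now")
    # Dividing one timedelta by another yields a plain float ratio.
    return 0.5 ** (age / HALF_LIFE[data_type])


now = datetime(2024, 1, 2, 12, 0)
six_hours_ago = now - timedelta(hours=6)
print(freshness_weight("hotline", six_hours_ago, now))             # 0.5
print(freshness_weight("demographic", six_hours_ago, now) > 0.99)  # True
```

With these made-up settings, a six-hour-old hotline signal has already lost half its weight, while a demographic attribute of the same age is practically as good as new.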