Even AI Needs Clean Data in Order to Be the Shiny Object

Users are quickly realizing that investing in AI is not the end of the road. Then again, in this analytics journey, there really is no end anyway; much like the scientific journey, it is a constant series of hypothesis, testing, and course corrections. And now, I’ll explain why that means even AI needs clean data.

If there is a book out there — many have asked me about it — it would look more like a long series of case studies, not some definitive roadmap for all. Why? Because prescribing analytics is much like a doctor’s work. It depends as much on the unique situation of the patient as on the list of solutions.

That is the main reason why one cannot just install AI and call it a day. Who’d give it a purpose, guide it, and constantly fine-tune it? Not itself, for sure.

Then there is a question about what goes into it. AI — or any type of analytics tool, for that matter — depends on clean and error-free data. If the data are dirty and unusable, you may end up automating inadequate decision-making processes, getting wrong answers really fast. I’d say that would be worse than not having any answer at all.

So far, you may say I am just stating the obvious here. Of course AI or machine learning requires clean and error-free data. The real trouble is that such data preparation often takes up as much as 80% (if not more) of the whole process of applying data-based intelligence to decision-making. In fact, users are finding out that the algorithmic part of the equation is the simplest to automate. The data refinement process is far more complicated than that, as it really depends on the shape of the available data. And some data are really messy (hence, the title of my series in this fine publication, “Big Data, Small Data, Clean Data, Messy Data”).

So, why aren’t data readily usable?

  • Data Are in Silos: This is so common that “siloed data” is actually a term that we commonly use in meeting rooms. Simply put, if the data are locked up somewhere, they won’t be of much use to anyone. Worse, each silo may sit on its own platform, with data formats that are incompatible with the others.
  • Data Are in One Place, But Not Connected: Putting the data in one place isn’t enough if they are not properly connected. Let’s say an organization is pursuing the coveted “Customer 360” (or more properly, a “360-degree view of a customer”) for personalized marketing. The first thing to do is to define what a “person” means in the eyes of the machine and algorithms. The key could be any form of PII or even biometric data, through which all related data would be merged and consolidated. If the online and offline shopping histories of a person aren’t connected properly, algorithms will treat them as two separate entities, undervaluing the target customer. This is just one example; all kinds of analytics (whether forecasting, segmentation, or product analysis) perform better with more than one type of data, and those data should be in one place to be useful.
  • Data Are Connected, But Many Fields Are Wrong or Empty: So what if the data are merged in one place? If the data are mostly empty or incorrect, having them will be worse than having none at all. Good luck forecasting or predicting anything with data fields that have really low fill rates. Unfortunately, we encounter tons of missing values in the case of “Customer 360.” What we call Big Data has lots of holes in it when everything is lined up around the target (i.e., it is nearly impossible to know everything about everyone). Plus, remember that most modern databases record and maintain what is available; but in predictive analytics, what we don’t know is equally important.
  • Data Are There, But They Are Not Readily Usable, as They Are in Free-Form Formats: You may have the data, but they may need some serious standardization, refinement, categorization, and transformation to be useful. Many times I have encountered hundreds, at times over a thousand, offer and promotion codes. To find out “what marketing efforts worked,” we would have to go through some serious data categorization to make them useful. (Refer to “The Art of Data Categorization.”) This is just one example of many. Too often, analytics work is stuck in the middle of too much free-form, unstructured data.
  • Data Are Usable, But They Are One-Dimensional: Bits and pieces of data, even if they are clean and accurate, do not provide a holistic portrait of target individuals (if the job is about 1:1 marketing). Most predictive analytics work requires diverse data of different kinds, and only after proper data consolidation and summarization can we obtain a multi-dimensional view. So-called relational databases and unstructured databases do not provide such a perspective without data summarization (or de-normalization) processes, as the entities in such databases are just lists of events and transactions (e.g., on such and such date, this individual clicked some email link and bought a particular item for a certain amount).
  • Data Are Cleaned, Consolidated, and Summarized, But There Is No Built-in Intelligence: To predict what the target individual is interested in, data players must rearrange the data to describe the person, not just events or transactions. Why do you think even large retailers, like Amazon, treat you like you are only about the very last transaction, sending the “likes” of the last item you purchased, ignoring years of interaction history? Because their data are not describing “you” as a target. And you are not just a sum of past transactions, either. For instance, your days between purchases in the home electronics category may be far greater than those in the apparel category, yet your average spending may be higher in the first category. This type of insight only comes out when the data are summarized properly to describe the buyer, not each transaction. Further, summarized data should be in the form of answers to questions, acting as building blocks of predictive scores. Intelligent variables always increase the predictive power of models, machine-based or not.
  • Data Variables Include Intelligence, But It Is Still Difficult to Derive Insights: Lists of intelligent variables are just basic necessities for advanced analytics, which would lead us to deeper and actionable insights. Even statisticians and analysts require a long training period to derive meaning out of seemingly beautiful charts and effectively develop stories around them. Yes, we can see that certain product sales went down, even with heavy promotion. But what does that really mean, and what should we do about it? For a machine to catch up with that level of storytelling, the data had best be served on a silver platter, in pristine condition, first. That is because changing assumptions based on “what is not there” or “what looks suspicious” is still in the realm of human intuition. Machines, for now, will read the results as if every bit of input data is correct and carries equal weight.
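To make the summarization point concrete, here is a minimal sketch of how transaction-level rows might be rolled up to describe the buyer instead. The records, field names, and the `summarize_by_buyer` helper are all made up for illustration; a real database would involve far more fields and far messier input.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical transaction-level records:
# (buyer_id, category, purchase_date, amount)
transactions = [
    ("B001", "home_electronics", date(2023, 1, 10), 800.0),
    ("B001", "home_electronics", date(2023, 7, 9), 1200.0),
    ("B001", "apparel", date(2023, 3, 1), 60.0),
    ("B001", "apparel", date(2023, 3, 31), 40.0),
    ("B001", "apparel", date(2023, 4, 30), 50.0),
]

def summarize_by_buyer(rows):
    """Collapse transaction rows into one descriptive record per buyer and category."""
    grouped = defaultdict(list)
    for buyer_id, category, purchased_on, amount in rows:
        grouped[(buyer_id, category)].append((purchased_on, amount))

    summary = defaultdict(dict)
    for (buyer_id, category), purchases in grouped.items():
        purchases.sort()  # chronological order
        dates = [purchased_on for purchased_on, _ in purchases]
        gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
        summary[buyer_id][category] = {
            "purchase_count": len(purchases),
            "avg_spend": mean(amount for _, amount in purchases),
            # None means "not enough history", which is not the same as zero
            "avg_days_between": mean(gaps) if gaps else None,
        }
    return dict(summary)

buyer_profiles = summarize_by_buyer(transactions)
```

In this toy data, the buyer shops home electronics far less often than apparel, yet spends much more per purchase there. That contrast is invisible in the raw rows and only shows up once the data describe the person.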

There are schools of thought that machines should be able to take raw data in any form, and somehow spit out answers for us mortals. But I do not subscribe to such a brute-force approach. Even if there is no human intervention in the data refinement process, machines will have to clean data in steps, like we have been doing. Simply put, a machine that is really good at identifying target individuals will be separately trained from the one that is designed for prediction of any kind.

So, what does clean and useful data mean? Just reverse the list above. In summary, good data must be:

  • Free from silos
  • Properly connected, if coming from disparate sources
  • Free from errors and too many missing values (i.e., must have good coverage)
  • Readily usable by non-specialists without having to manipulate them extensively
  • Multi-dimensional as a result of proper data summarization
  • In forms of variables with built-in intelligence
  • Presented in ways that provide insights, beyond a simple list of data points
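The coverage item on this list is the easiest one to measure. Below is a rough sketch of a fill-rate check, assuming customer records arrive as plain dictionaries where `None` or an empty string marks a missing value. The sample records, the `fill_rates` helper, and the 50% threshold are illustrative assumptions, not recommendations.

```python
# Hypothetical customer records; None or "" marks a value we do not have.
customers = [
    {"customer_id": "C1", "email": "a@example.com", "age": 34, "income_band": None},
    {"customer_id": "C2", "email": "", "age": None, "income_band": None},
    {"customer_id": "C3", "email": "c@example.com", "age": 51, "income_band": "B"},
    {"customer_id": "C4", "email": "d@example.com", "age": None, "income_band": None},
]

def fill_rates(records):
    """Return the share of records with a usable value for each field."""
    if not records:
        return {}
    fields = sorted({field for record in records for field in record})
    rates = {}
    for field in fields:
        filled = sum(1 for record in records if record.get(field) not in (None, ""))
        rates[field] = filled / len(records)
    return rates

MIN_FILL_RATE = 0.5  # illustrative threshold only

rates = fill_rates(customers)
sparse_fields = [field for field, rate in rates.items() if rate < MIN_FILL_RATE]
```

A low fill rate does not automatically disqualify a field, since in predictive analytics "unknown" can carry information of its own. The check simply tells you where the holes are before a model quietly treats them as facts.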

Then, what are the steps of the data refinement process? Again, if I may summarize the key steps out of the list above:

  1. Data collection (from various sources)
  2. Data consolidation (around the key object, such as individual target)
  3. Data hygiene and standardization
  4. Data categorization
  5. Data summarization
  6. Creation of intelligent variables
  7. Data visualization and/or modeling for business insights
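As an illustration of step 2, here is a minimal sketch of consolidating records from two channel silos around one individual. It assumes, purely for the sake of the example, that a normalized email address is good enough as the match key; real identity resolution usually needs more than that. The sample records and the `match_key` and `consolidate` helpers are hypothetical.

```python
from collections import defaultdict

# Hypothetical records from two channel silos; field names are illustrative.
online_orders = [
    {"email": "Jane.Doe@Example.com ", "item": "headphones", "amount": 120.0},
]
store_purchases = [
    {"email": "jane.doe@example.com", "item": "jacket", "amount": 85.0},
    {"email": "sam@example.com", "item": "sneakers", "amount": 70.0},
]

def match_key(email):
    """Normalize an email so trivial formatting differences do not split one person in two."""
    return email.strip().lower()

def consolidate(**sources):
    """Merge records from every source under one key per individual."""
    people = defaultdict(list)
    for source_name, records in sources.items():
        for record in records:
            tagged = {**record, "source": source_name}  # remember where it came from
            people[match_key(record["email"])].append(tagged)
    return dict(people)

customers = consolidate(online=online_orders, store=store_purchases)
```

Without the normalization step, the online and in-store Jane would count as two separate, less valuable customers, which is exactly the trap of data that sit in one place but are not connected.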


I have covered all of these steps in detail through this column over the years. Nevertheless, I just wanted to share these steps on a high level again, as the list will serve as a checklist, of sorts. Why? Because I see too many organizations, even the advanced ones, that miss whole categories of necessary activities. How many times have I seen unstructured and uncategorized data, and how many times have I seen very clean data but only on an event and transaction level? How can anyone predict the target individual’s future behavior that way, with or without the help of machines?

The No. 1 reason why AI or machine learning do not reach their full potential is inadequate input data. Imagine putting unrefined oil as fuel or lubricant for a brand new Porsche. If the engine stalls, is that the car’s fault? To that point, please remember that even the machines require clean and organized data. And if you are about to have machines do the clean-up, also remember that machines are not that smart (yet), and they work better when trained for a specific task, such as pattern recognition (for data categorization).
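For a sense of what a narrowly scoped categorization task looks like in its simplest, rule-based form, consider free-form promotion codes. The sketch below uses hand-written patterns rather than a trained model, and every pattern, category name, and sample code in it is invented for illustration.

```python
import re
from collections import Counter

# Hypothetical rules mapping free-form promo codes to categories.
# Rules are checked in order; the first match wins.
CATEGORY_RULES = [
    (re.compile(r"FREE\s*SHIP", re.IGNORECASE), "free_shipping"),
    (re.compile(r"\d+\s*(%|PCT|OFF)", re.IGNORECASE), "discount"),
    (re.compile(r"BOGO|2\s*FOR\s*1", re.IGNORECASE), "buy_one_get_one"),
]

def categorize(code):
    """Return the first matching category, or flag the code for human review."""
    for pattern, category in CATEGORY_RULES:
        if pattern.search(code):
            return category
    return "uncategorized"

raw_codes = [
    "FREESHIP50",
    "Free Ship - Holiday",
    "SAVE20OFF",
    "spring 15%",
    "BOGO-MAY",
    "XJ9",
]
category_counts = Counter(categorize(code) for code in raw_codes)
```

The "uncategorized" bucket matters as much as the rest: it is the pile that a person still has to look at, and its size tells you how far the rules (or the machine) have actually gotten.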

One last parting thought: I am not at all saying that one must wait for a perfect set of data. Such a day will never come. Errors are inevitable, and some data will be missing. There will be all kinds of collection problems, and the limitations of data collection mechanisms cannot be fully overcome, thanks to those annoying humans who don’t comply well with the system. Or, it could be that the target individual simply has not created an event for the category yet (i.e., data will be missing for the Home Electronics category if the buyer in question simply has not done anything in that category).

So, collect and clean the data as much as possible, but don’t pursue 100% either. Analytics, with or without machines, has always been about making the most of what we have. Leave it at “good enough,” though a machine wouldn’t understand what that means.

How to Consider the Buyer’s Journey, Not Just the Channel

We are obviously living in a multichannel marketing environment, whether we are marketers or consumers. Every conceivable channel is being optimized for marketing, and in a capitalistic society, that is only natural.

Someone has to pay for the maintenance of media channels, and marketers want to reach their target audiences through them. Voila! Demand meets supply, and the whole ecosystem is in perpetual motion.

So much so that many marketing organizations are organized by key media channels. The No. 1 reason many datasets are in silos? It is because data collected through different channels are hogged by the managers of those channels.

So the biggest hurdle towards a true 360-degree customer view is not the technology or lack of data, but the fact that interests of different channel managers do not meet in a common place, without heavy nudging from CEOs or CMOs. That is why I’ve been repeatedly saying that the first step towards proper data-readiness for advanced 1:1 marketing is the commitment from the top.

That being the reality, service providers (whether they be data compilers, database designers, CRM experts, analytics experts or campaign specialists) must comply with the channel-centric environment, which is unfortunately the source of inadequate 1:1 targeting and personalization.

With all the technologies available today, why do you think that consumers keep getting similar or conflicting offers from the same organization? It’s because each channel manager acts like she “owns” the names of buyers who touched “her” channel. Let’s just say that is the exact opposite of customer-centric marketing.

Further, it gets even more complicated, as each channel exists not only on a different plane, but also at a different spot on the timeline of the customer journey.

What is a customer journey? If I take a typical B2C engagement as an example (because there are so many versions of this concept out there), it may follow these high-level steps:

  1. Awareness
  2. Interest
  3. Trial
  4. Repeat
  5. Loyalty

If this were for B2B, we may consider “Decision” and “Action” as separate steps, but the general idea of a customer journey is not all that different.

Now, the important point here is that these phases may or may not converge nicely with the “marketer’s journey,” which may look like:

  1. Acquisition
  2. Relationship Development
  3. Retention
  4. Win-back

Clearly, the awareness and interest stages are closely related to acquisition; but after the purchase, we are moving into the CRM area from the marketer’s point of view, where cross-sell/up-sell, value-based targeting, various retention and churn prevention measures, and win-back efforts come into play. Some actions go way past the repeat and loyalty stages on the buyer’s side.

Now, add all the channels on top of this combination. No wonder there are lots of conflicts among channel managers. Who owns what stage of the game? Maybe that is just a wrong way to approach all of this.

Homework for Marketers

I’d say marketers should start with the customer’s journey first. Not just in the name of customer-centric marketing, but for practical reasons, too. So, list the five customer journey phases on the left-hand side of a piece of paper.

Then, let’s write down proper marketer’s effort categories, from acquisition to win-back.

Next to them, put down the data assets and technologies that you have available for each stage. You will find that distinctly different types of data and technologies should be applied to each.

For instance, third-party data are important for the acquisition and win-back stages, due to the lack of behavioral and transaction data. Conversely, to build proper cross-sell/up-sell, customer value or churn prevention models, you will need to use the rich transaction and interaction history you have with your customers. Then, of course, the technology that you need to employ will be different for each stage.

Then, and only then, write down the media channels that would be best utilized for each stage of your marketing efforts.

For example, in the acquisition stage, where only third-party data and non-transactional data are available, what would be the best acquisition channel for you to employ? Catalog? Postcard? Email? Social media? General media?

For relationship-building and retention efforts, yes, email is the dominant one; but should it be the only one? Let’s not just settle on one channel, just because it is readily available and less costly. If you have all of the rich transaction and response data, why not use direct marketing, with rather fancy catalogs or First Class mail? Surely, with such powerful data, we can build proper targeting models to make those more expensive channels worthwhile.

Turning Marketing on Its Head

The key message here is to reverse the way we think about our channels, and shake the whole marketing ecosystem up.

I got into a heated debate with one of my colleagues the other day about this. Many digital marketers think that the journey begins at the moment a visitor lands on a website or types in a search word (refer to “Customer Journeys Don’t Start on Your Website”).

Before someone magically shows up on some site, there had to be other efforts to raise awareness and pique interest for that visitor. It could have been a banner, billboard, TV, radio, magazine, paper or more targeted media, such as direct mail, catalogs or email. All of those channels play different roles in different stages of both the customer’s journey and the marketer’s journey.

Multichannel or Omnichannel concepts have been around for a long time; but to rise above the channel-centric mindset that hampers effective customer communication, marketers must be aware of the timeline view, as well.

In fact, as I described in the body of this article, you may have to reverse the whole process, and see it from the timeline view first, and then assign proper channels to each stage. Otherwise, how would you ever escape from channel silos?