Even AI Needs Clean Data in Order to Be the Shiny Object

Users are quickly realizing that investing in AI is not the end of the road. Then again, in this analytics journey, there really is no end anyway; much like the scientific journey, it is a constant series of hypothesis, testing, and course corrections. And now, I’ll explain why that means even AI needs clean data.

If there is a book out there — many have asked me about it — it would look more like a long series of case studies, not some definitive roadmap for all. Why? Because prescribing analytics is much like a doctor’s work. It depends as much on the unique situation of the patient as on the list of solutions.

That is the main reason why one cannot just install AI and call it a day. Who’d give it a purpose, guide it, and constantly fine-tune it? Not itself, for sure.

Then there is a question about what goes into it. AI — or any type of analytics tool, for that matter — depends on clean and error-free data. If the data are dirty and unusable, you may end up automating inadequate decision-making processes, getting wrong answers really fast. I’d say that would be worse than not having any answer at all.

So far, you may say I am just stating the obvious here. Of course, AI or machine learning requires clean and error-free data. The real trouble is that such data preparation often takes up as much as 80% (if not more) of the whole process of applying data-based intelligence to decision-making. In fact, users are finding out that the algorithmic part of the equation is the simplest to automate. The data refinement process is far more complicated than that, as it really depends on the shape of the available data. And some data are really messy (hence, the title of my series in this fine publication, “Big Data, Small Data, Clean Data, Messy Data”).

So, why aren’t data readily usable?

  • Data Are in Silos: This is so common that “siloed data” is a term we routinely use in meeting rooms. Simply put, if the data are locked up somewhere, they won’t be of much use to anyone. Worse, each silo may sit on its own platform, with data formats incompatible with the others.
  • Data Are in One Place, But Not Connected: Putting the data in one place isn’t enough if they are not properly connected. Let’s say an organization is pursuing the coveted “Customer 360” (or more properly, a “360-degree view of a customer”) for personalized marketing. The first thing to do is to define what a “person” means in the eyes of the machine and algorithms. The key could be any form of PII or even biometric data, through which all related data would be merged and consolidated. If the online and offline shopping histories of a person aren’t connected properly, algorithms will treat them as two separate entities, devaluing the target customer. This is just one example; all kinds of analytics (whether forecasting, segmentation, or product analysis) perform better with more than one type of data, and those data should be in one place to be useful.
  • Data Are Connected, But Many Fields Are Wrong or Empty: So what if the data are merged in one place? If the data are mostly empty or incorrect, they will be worse than not having any at all. Good luck forecasting or predicting anything with data fields that have really low fill rates. Unfortunately, we encounter tons of missing values in the case of “Customer 360.” What we call Big Data have lots of holes in them when everything is lined up around the target (i.e., it is nearly impossible to know everything about everyone). Plus, remember that most modern databases record and maintain only what is available; but in predictive analytics, what we don’t know is equally important.
  • Data Are There, But They Are Not Readily Usable, as They Are in Free-Form Formats: You may have the data, but they may need some serious standardization, refinement, categorization, and transformation to be useful. Many times I have encountered hundreds, at times over a thousand, offer and promotion codes. To find out “what marketing efforts worked,” we would have to go through some serious data categorization to make them useful. (Refer to “The Art of Data Categorization.”) This is just one example of many. Too often, analytics work gets stuck in the middle of too much free-form, unstructured data.
  • Data Are Usable, But They Are One-Dimensional: Bits and pieces of data, even if they are clean and accurate, do not provide a holistic portrait of target individuals (if the job is about 1:1 marketing). Most predictive analytics work requires diverse data of different natures, and only after proper data consolidation and summarization can we obtain a multi-dimensional view. So-called relational databases and unstructured databases do not provide such a perspective without data summarization (or de-normalization) processes, as entities in such databases are just lists of events and transactions (e.g., on such and such date, this individual clicked some email link and bought a particular item for a certain amount).
  • Data Are Cleaned, Consolidated, and Summarized, But There Is No Built-in Intelligence: To predict what the target individual is interested in, data players must rearrange the data to describe the person, not just events or transactions. Why do you think even large retailers, like Amazon, treat you as if you were only about the very last transaction, sending you the “likes” of the last item you purchased and ignoring years of interaction history? Because their data are not describing “you” as a target. And you are not just a sum of past transactions, either. For instance, your days between purchases in the home electronics category may be far greater than those in the apparel category, yet your average spending may be higher in the former. This type of insight only comes out when the data are summarized properly to describe the buyer, not each transaction. Further, summarized data should be in the form of answers to questions, acting as building blocks of predictive scores. Intelligent variables always increase the predictive power of models, machine-based or not.
  • Data Variables Include Intelligence, But It Is Still Difficult to Derive Insights: Lists of intelligent variables are just basic necessities for advanced analytics, which would lead us to deeper and actionable insights. Even statisticians and analysts require a long training period to derive meaning out of seemingly beautiful charts and effectively develop stories around them. Yes, we can see that certain product sales went down, even with heavy promotion. But what does that really mean, and what should we do about it? For a machine to catch up with that level of storytelling, the data had best be on silver platters in pristine condition first. Changing assumptions based on “what is not there” or “what looks suspicious” is still in the realm of human intuition. Machines, for now, will read the results as if every bit of input data were correct and carried equal weight.
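
To make the home electronics versus apparel example concrete, here is a minimal sketch of rolling event-level rows up into buyer-level descriptors. Everything in it is an assumption for illustration: the tuple layout, the buyer ID, the category names, and the amounts are made up, and a real summarization job would carry many more variables.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical transaction log: (buyer_id, category, purchase_date, amount)
TRANSACTIONS = [
    ("b1", "home_electronics", date(2023, 1, 10), 400.0),
    ("b1", "home_electronics", date(2023, 7, 9), 600.0),
    ("b1", "apparel", date(2023, 1, 5), 40.0),
    ("b1", "apparel", date(2023, 2, 4), 60.0),
    ("b1", "apparel", date(2023, 3, 6), 50.0),
]


def summarize_by_buyer(transactions):
    """Roll event-level rows up into one descriptive record per buyer and category."""
    grouped = defaultdict(list)
    for buyer_id, category, purchased_on, amount in transactions:
        grouped[(buyer_id, category)].append((purchased_on, amount))

    summary = defaultdict(dict)
    for (buyer_id, category), rows in grouped.items():
        rows.sort()  # chronological order, so gaps are measured between neighbors
        dates = [purchased_on for purchased_on, _ in rows]
        gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
        summary[buyer_id][category] = {
            "purchases": len(rows),
            "avg_spend": mean(amount for _, amount in rows),
            # None, not 0, when there is only one purchase: the gap is unknown.
            "avg_days_between": mean(gaps) if gaps else None,
        }
    return dict(summary)


profile = summarize_by_buyer(TRANSACTIONS)["b1"]
print(profile["home_electronics"])
print(profile["apparel"])
```

With this sample data, the buyer shows a much longer gap between home electronics purchases than between apparel purchases, along with a higher average spend in home electronics. That is the kind of statement a transaction list alone never makes about a person.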

There are schools of thought that machines should be able to take raw data in any form, and somehow spit out answers for us mortals. But I do not subscribe to such a brute-force approach. Even if there is no human intervention in the data refinement process, machines will have to clean data in steps, like we have been doing. Simply put, a machine that is really good at identifying target individuals will be separately trained from the one that is designed for prediction of any kind.

So, what does clean and useful data mean? Just reverse the list above. In summary, good data must be:

  • Free from silos
  • Properly connected, if coming from disparate sources
  • Free from errors and too many missing values (i.e., must have good coverage)
  • Readily usable by non-specialists without having to manipulate them extensively
  • Multi-dimensional as a result of proper data summarization
  • In forms of variables with built-in intelligence
  • Presented in ways that provide insights, beyond a simple list of data points
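
The coverage item in the list above is one of the few that can be checked mechanically. As a rough sketch, assuming records arrive as plain dictionaries and that `None` or an empty string means "missing" (both are my assumptions, as are the field names), a fill-rate report could look like this:

```python
def fill_rates(records, fields):
    """Return, per field, the share of records holding a usable (non-empty) value."""
    total = len(records)
    rates = {}
    for field in fields:
        filled = sum(1 for record in records if record.get(field) not in (None, ""))
        rates[field] = filled / total if total else 0.0
    return rates


# Hypothetical customer records with typical holes in them.
CUSTOMERS = [
    {"email": "ann@example.com", "age": 34, "blood_type": "A+"},
    {"email": "bob@example.com", "age": None, "blood_type": ""},
    {"email": "", "age": None},
    {"email": "dee@example.com", "age": None, "blood_type": None},
]

for field, rate in fill_rates(CUSTOMERS, ["email", "age", "blood_type"]).items():
    print(f"{field}: {rate:.0%} filled")
```

What counts as "too many" missing values is a judgment call for each project; the sketch only makes the holes visible.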

Then, what are the steps of the data refinement process? Again, if I may summarize the key steps out of the list above:

  1. Data collection (from various sources)
  2. Data consolidation (around the key object, such as individual target)
  3. Data hygiene and standardization
  4. Data categorization
  5. Data summarization
  6. Creation of intelligent variables
  7. Data visualization and/or modeling for business insights
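
As a toy illustration of how steps 2 through 5 chain together, consider the sketch below. It is not a production recipe. I am assuming that a normalized email address can serve as the key for a “person,” that two made-up silos (`ONLINE` and `OFFLINE`) share the same field names, and that promotion codes follow two invented patterns. Steps 1, 6, and 7 are left out.

```python
import re
from collections import defaultdict

# Hypothetical rows from two silos; field names and values are illustrative only.
ONLINE = [{"email": " Ann@Example.com ", "promo": "spring20-email", "amount": 80.0}]
OFFLINE = [
    {"email": "ann@example.com", "promo": "FS_STORE_0419", "amount": 120.0},
    {"email": "bob@example.com", "promo": "", "amount": 35.0},
]

# Step 4 rules (assumed code patterns); the first matching rule wins.
PROMO_RULES = [
    (re.compile(r"^FS[_-]"), "free_shipping"),
    (re.compile(r"^[A-Z]+\d{2}(?:-|$)"), "percent_off"),
]


def standardize(row):
    """Step 3: hygiene and standardization of free-form fields."""
    return {
        "person_key": row["email"].strip().lower(),
        "promo": row["promo"].strip().upper(),
        "amount": row["amount"],
    }


def categorize(promo):
    """Step 4: collapse free-form offer codes into a handful of categories."""
    if not promo:
        return "no_offer"
    for pattern, label in PROMO_RULES:
        if pattern.search(promo):
            return label
    return "uncategorized"


def refine(*sources):
    """Steps 2 and 5: consolidate around the person, then summarize per person."""
    profiles = defaultdict(lambda: {"total_spend": 0.0, "offers": defaultdict(int)})
    for source in sources:
        for row in source:
            clean = standardize(row)
            profile = profiles[clean["person_key"]]
            profile["total_spend"] += clean["amount"]
            profile["offers"][categorize(clean["promo"])] += 1
    return {
        key: {"total_spend": p["total_spend"], "offers": dict(p["offers"])}
        for key, p in profiles.items()
    }


for person, profile in refine(ONLINE, OFFLINE).items():
    print(person, profile)
```

Without the standardization step, the two spellings of Ann's email address would have produced two separate "people," which is exactly the connection problem described earlier.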

Conclusion

I have covered all of these steps in detail through this column over the years. Nevertheless, I wanted to share these steps at a high level again, as the list will serve as a checklist of sorts. Why? Because I see too many organizations, even the advanced ones, that miss whole categories of necessary activities. How many times have I seen unstructured and uncategorized data, and how many times have I seen very clean data but only on an event and transaction level? How can anyone predict the target individual’s future behavior that way, with or without the help of machines?

The No. 1 reason why AI or machine learning does not reach its full potential is inadequate input data. Imagine using unrefined oil as fuel or lubricant in a brand-new Porsche. If the engine stalls, is that the car’s fault? To that point, please remember that even machines require clean and organized data. And if you are about to have machines do the clean-up, also remember that machines are not that smart (yet), and they work better when trained for a specific task, such as pattern recognition (for data categorization).

One last parting thought: I am not at all saying that one must wait for a perfect set of data. Such a day will never come. Errors are inevitable, and some data will be missing. There will be all kinds of collection problems, and the limitations of data collection mechanisms cannot be fully overcome, thanks to those annoying humans who don’t comply well with the system. Or, it could be that the target individual simply has not created an event in the category yet (e.g., data will be missing for the Home Electronics category if the buyer in question simply did not do anything in that category).

So, collect and clean the data as much as possible, but don’t pursue 100% either. Analytics, with or without machines, has always been about making the most of what we have. Leave it at “good enough,” though a machine wouldn’t understand what that means.

Clean Up After Yourselves, Marketers

You know what can really suck? When a piece of marketing is spot on … until it isn’t. Let’s look at a couple email examples and see what went awry.

Let’s take a look at this email from American Red Cross … I’m a blood donor, and I regularly receive emails from the nonprofit, alerting me about blood drives and more. And hey, when the subject line is “MELISSA, This is Your Week’s Best Email,” it’s got to be good, right?

All right, this email is definitely on brand for me … photo of a puppy hugging a kitten? Check. Photo of a baby seal with a super cute smile on its face? Check. Let’s read on.

Ohmigod that puppy is so happy look at that … wait a second.

In the final paragraph, the email reads: “As an AB donor, MELISSA, your help is especially important.”

Oh really? I’m an A+ donor.

I’ve been donating blood off and on for the past 18 years. I’m registered as an A+ donor. So where did they get AB?

Look, it’s not the end of the world, but the incorrect personalized data stopped me dead in my tracks. And no, I didn’t schedule a donation in May.

And in April, a reader forwarded me the following email he received from Inc.:

The reader (who asked me to not share his name) let me know that, while Cornell University might hit the Inc. 5000 requirements, he does not work for Cornell. He’s also not an officer or trustee of the university. He is an alum (Go Big Red!) and an active volunteer, and sure, maybe his email address is @cornell.edu.

But so are the email addresses of a lot of Cornell students.

The lesson to be learned from these emails? Clean your lists, marketers. According to Experian Data Quality, dirty data costs marketers approximately 12 percent of their revenue. It makes you look bad, and it can cost you a sale or at least get people talking about you in ways you didn’t want.