When You Fail, Don’t Blame Data Scientists First — or Models

Last month, I talked about ways marketing automation projects go south (refer to “Why Many Marketing Automation Projects Go South”). This time, let’s be more specific about modeling, which is an essential element in converting mounds of data into actionable solutions to challenges.

Without modeling, all automation efforts would remain at the level of rudimentary rules. And that is one of the fastest routes to automating the wrong processes, leading to disappointing results in the name of marketing automation.

Nonetheless, when statistically sound models are employed, users tend to blame the models first when the results are less than satisfactory. As a consultant, I often get called in when clients are suspicious of a model’s performance. More often than not, however, I find that the model in question was the only thing that was done correctly in a long series of processes, from data manipulation and target setting to model scoring and deployment. I guess it is just easier to blame some black box, but most errors happen before and after modeling.

A model is nothing but an algorithmic expression measuring the likelihood of an object resembling (or not resembling) the target. As in, “I don’t know for sure, but that household is very likely to purchase high-end home electronics products,” based only on the information that we get to have. Or on a larger scale, “How many top-line TV sets over 65 inches will we sell during the Christmas shopping season this year?” Again, based only on past sales history, current marcom spending, some campaign results, and a few other factors, such as seasonality and virality rate.

These are made-up examples, of course, but I tried to make them as specific and realistic as possible here. Because when people think that a model went wrong, often it is because a wrong question was asked in the first place. Those “dumb” algorithms, unfortunately, only provide answers to specific questions. If a wrong question is presented? The result would seem off, too.

That is why the first step in analytics should be “formulating a question,” not data-crunching. Jumping into a data lake (or any other form of data depository, for that matter) without a clear definition of goals and specific targets is often a shortcut to the demise of the initiative itself. Imagine a case where one starts building a house without a blueprint. Just as a house is not a random pile of building materials, a model is not an arbitrary combination of raw data.

I would even argue that formulating the question is so difficult and critical that it is the deciding factor dividing analysts into seasoned data scientists and junior number-crunchers. Defining proper problem statements is challenging, because:

  • business goals are often far from perfectly constructed logical statements, and
  • available data are most likely incomplete or inadequate for advanced analytics.

Basically, good data players must be able to translate all those wishful marketing goals into mathematical expressions, using only the data handed to them. Such skill goes far beyond knowledge of regression models or machine learning.

That is why we must follow these specific steps for data-based solutioning:

[Figure: the roadmap data scientists use for data-based solutioning. Credit: Stephen H. Yu]
  1. Formulating Questions: Again, this is the most critical step of all. What are the immediate issues and pain points? For what type of marketing functions, and in what context? How will the solution be applied, who will use it, and through what channel? What are the specific domains where the solution is needed? I will share more details on how to ask these questions later in this series, but having a specific set of goals must be the first step. Without proper goal-setting, one can’t even define success criteria against which the results would be measured.
  2. Data Discovery: It is useless to dream up a solution with data that are not even available. So, what is available, and what kind of shape are they in? Check the inventory of transaction history; third-party data, such as demographic and geo-demographic data; campaign history and response data (often not in one place); user interaction data; survey data; marcom spending and budget; product information, etc. Now, dig through everything, but don’t waste time trying to salvage everything, either. Depending on the goal, some data may not even be necessary. Too many projects get stuck right here, not moving forward an inch. The goal isn’t having a perfect data depository — CDP, Data Lake, or whatever — but providing answers to questions posed in Step 1.
  3. Data Transformation: You will find that most data sources are NOT “analytics-ready,” no matter how clean and organized they may seem (they are often NOT well-organized, either). Disparate data sources must be merged and consolidated, inconsistent data must be standardized and categorized, different levels of information must be summarized onto the level of prediction (e.g., product, email, individual, or household levels), and intelligent predictors must be methodically created. Otherwise, the modelers would spend the majority of their time fixing and massaging the data. I often call this step creating an “Analytics Sandbox,” where all “necessary” data are in pristine condition, ready for any type of advanced analytics.
  4. Analytics/Model Development: This is where algorithms are created, considering all available data. This is the highlight of this analytics journey, and key to proper marketing automation. Ironically, this is the easiest part to automate, in comparison to previous steps and post-analytics steps. But only if the right questions — and right targets — are clearly defined, and data are ready for this critical step. This is why one shouldn’t just blame the models or modelers when the results aren’t good enough. There is no magic algorithm that can save ill-defined goals and unusable messy data.
  5. Knowledge Share: The models may be built, but the game isn’t over yet. It is one thing to develop algorithms with a few hundred thousand record samples, and it’s quite another to apply them to millions of live data records. There are many things that can go wrong here. Even slight differences in data values, categorization rules, or the ratio of missing data can render well-developed models ineffective. There are good reasons why many vendors charge high prices for model scoring. Once the scoring is done and proven correct, the resultant model scores must be shared with all relevant systems, through which decisions are made and campaigns are deployed.
  6. Application of Insights: Just because model scores are available, it doesn’t mean that decision-makers and campaign managers will use them. They may not even know that such things are available to them; or, even if they do, they may not know how to use them. For instance, let’s say that there is a score for “likely to respond to emails with no discount offer” (to weed out habitual bargain-seekers) for millions of individuals. What do those scores mean? The lower the better, or the higher the better? If 10 is the best score, is seven good enough? What if we need to mail to the whole universe? Can we differentiate offers, depending on other model scores — such as, “likely to respond to free-shipping offers”? Do we even have enough creative materials to do something like that? Without proper applications, no amount of mathematical work will seem useful. This is why someone in charge of data and analytics must serve as an “evangelist of analytics,” continually educating and convincing the end-users.
  7. Impact Analysis: Now, one must ask the ultimate question, “Did it work?” And “If it did, what elements worked (and didn’t work)?” Like all scientific approaches, marketing analytics and applications are about small successes and improvements, with continual hypothesizing and learning from past trials and mistakes. I’m sure you remember the age-old term “Closed-loop” marketing. All data and analytics solutions must be seen as continuous efforts, not some one-off thing that you try once or twice and forget about. No solution will just double your revenue overnight; that is more like wishful thinking than a data-based solution.
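
To make Step 3 a little more concrete, here is a minimal Python sketch of what “summarizing onto the level of prediction” can look like when the model is built at the household level. The records, the field names (household_id, order_date, category, amount), and the function summarize_by_household are all made up for illustration; a real sandbox would involve far more sources and predictors than this.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transaction-level records; every field name here is an assumption.
transactions = [
    {"household_id": "H1", "order_date": date(2023, 11, 24), "category": "TV", "amount": 1899.00},
    {"household_id": "H1", "order_date": date(2023, 12, 2), "category": "audio", "amount": 249.50},
    {"household_id": "H2", "order_date": date(2023, 6, 15), "category": "TV", "amount": 499.00},
]


def summarize_by_household(rows, as_of):
    """Roll transaction rows up to one row of predictors per household."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["household_id"]].append(row)

    summary = {}
    for household_id, items in grouped.items():
        last_order = max(item["order_date"] for item in items)
        summary[household_id] = {
            "order_count": len(items),
            "total_spend": round(sum(item["amount"] for item in items), 2),
            "max_order_amount": max(item["amount"] for item in items),
            # Recency is measured against a fixed reference date, not "today",
            # so the same rule can be reapplied when the model is scored later.
            "days_since_last_order": (as_of - last_order).days,
        }
    return summary


household_predictors = summarize_by_household(transactions, as_of=date(2023, 12, 31))
for household_id, predictors in household_predictors.items():
    print(household_id, predictors)
```

The point of the sketch is only that each household ends up as one row of consistently defined variables, which is what a modeler needs before any algorithm enters the picture.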

As you can see, there are many “before” and “after” steps around modeling and algorithmic solutioning. This is why one should not just blame the data scientist when things don’t work out as expected, and why even casual users must be aware of the basic ins and outs of analytics. Users must understand that they should not employ models or solutions outside of their original design specifications, either. There simply is no way to provide answers to illogical questions, now or in the future.
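
As one illustration of an “after” step, the kind of check implied in Step 5 can be sketched in a few lines of Python: before scoring, compare a predictor in the live file against the development sample, looking at the ratio of missing data and at category values the model never saw. The field name income_band, the sample rows, the helper names, and the 5-point tolerance are all assumptions chosen for the example, not a recommended standard.

```python
def missing_ratio(rows, field):
    """Share of rows where the field is absent, None, or an empty string."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(field) in (None, ""))
    return missing / len(rows)


def compare_field(dev_rows, live_rows, field, tolerance=0.05):
    """Compare one predictor between the development sample and the scoring file."""
    dev_missing = missing_ratio(dev_rows, field)
    live_missing = missing_ratio(live_rows, field)
    dev_values = {row.get(field) for row in dev_rows} - {None, ""}
    live_values = {row.get(field) for row in live_rows} - {None, ""}
    return {
        "field": field,
        "dev_missing": dev_missing,
        "live_missing": live_missing,
        # True when the missing ratio moved by more than the (arbitrary) tolerance.
        "missing_shift": abs(live_missing - dev_missing) > tolerance,
        # Category values present at scoring time but never seen in development.
        "unseen_values": sorted(live_values - dev_values),
    }


# Hypothetical data: the live file uses a lowercase code and has more blanks.
dev_sample = [
    {"income_band": "A"}, {"income_band": "B"}, {"income_band": "C"}, {"income_band": None},
]
live_file = [
    {"income_band": "a"}, {"income_band": "B"}, {"income_band": ""}, {"income_band": None},
]

report = compare_field(dev_sample, live_file, "income_band")
print(report)
```

In this toy case the model itself is untouched, yet the scores would still be suspect, because the data being scored no longer match the data the model was built on.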