Models Are Built, But the Job Isn’t Done Yet

In my line of business – data and analytics consulting and coaching – I often recommend some modeling work when confronted with complex targeting challenges. Through this series, I’ve shared many reasons why modeling becomes a necessity in data-rich environments (refer to “Why Model?”).

The history of model-based targeting goes back to the 1960s, but what is the number one reason to employ modeling techniques these days? We often have too much information, way beyond the cognitive and arithmetical capacities of our brains. Most of us mortals cannot effectively consider more than two or three variables at a time. Machines, on the other hand, have no such limitations when it comes to recognizing patterns among countless data variables. Subsequent marketing automation is just an added bonus.

We operate under a basic assumption that model-based targeting (with deep data) should outperform some man-made rules (with a handful of information). At times, however, I get calls as campaign results prove otherwise. Sometimes campaign segments selected by models show worse response rates than randomly selected test groups do.

When such disappointing results happen, most decision makers casually say, “The model did not work.” That may be true, but more often than not, I find that something went wrong “before” or “after” the modeling process. (Refer to “Know What to Automate With Machine Learning”, where I list major steps concerning the “before” of model-based targeting).

If the model is developed in an “analytics-ready” environment where most input errors are eradicated, then here are some common mishaps in post-modeling stages to consider.

Mishap #1: The Model Is Applied to the Wrong Universe

A model algorithm is nothing but a mathematical expression of the differences between the target and comparison universes. Yes, setting up the right target is the key to success in any modeling, but defining a proper comparison universe is equally important. And the comparison group must represent the campaign universe to which the resultant model is applied.

Sometimes such universes are defined by a series of pre-selection rules before the modeling even begins. For example, the campaign universes may be set by region (or business footprint), gender of the target, availability of email address or digital ID, income level, home ownership, etc. Once set, the rules must be enforced throughout the campaign execution.

What if the rules that define the modeling universe are even slightly different from the actual campaign universe? The project may be doomed from the get-go.
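One way to guard against this is to keep the pre-selection rules in a single place and run the same rules against the campaign file before scoring. The following is a minimal sketch, not a prescribed method: the rule names, field names (`region`, `email`, `home_owner`), and region codes are all hypothetical and stand in for whatever rules defined the modeling universe.

```python
# Hypothetical pre-selection rules; every field name and code is illustrative.
UNIVERSE_RULES = {
    "in_footprint": lambda r: r["region"] in {"NE", "SE"},
    "has_email": lambda r: bool(r.get("email")),
    "homeowner": lambda r: r["home_owner"] == "Y",
}

def rule_violations(records, rules=UNIVERSE_RULES):
    """Count how many records fail each universe rule."""
    counts = {name: 0 for name in rules}
    for rec in records:
        for name, rule in rules.items():
            if not rule(rec):
                counts[name] += 1
    return counts

# Made-up campaign file: the second record is outside the footprint,
# the third has no email address and is not a homeowner.
campaign_file = [
    {"region": "NE", "email": "a@example.com", "home_owner": "Y"},
    {"region": "MW", "email": "b@example.com", "home_owner": "Y"},
    {"region": "SE", "email": "", "home_owner": "N"},
]
violations = rule_violations(campaign_file)
```

Any nonzero count means the campaign universe has drifted from the universe the model was built on, which is worth resolving before a single score is used.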

For example, do not expect that models developed within a well-established business footprint will be equally effective in relatively new prospecting areas. Such expansion calls for yet another set of models, as target prospects are indeed in a different world.

If there are multiple distinct segments in the customer base, we often develop separate models within each key segment. Don’t even think about applying a model developed in one specific segment to another, just because they may look similar on the surface. And if you do something like that, don’t blame the modeler later.

Mishap #2: The Model Is Used Outside Design Specification

Even in the same modeling universe, we may develop multiple types of models for different purposes. Some models may be designed to predict the future lifetime value of customers, while others estimate campaign responsiveness. In this example, customer value and campaign responsiveness may actually be inversely related (e.g., potential high-value customers are less likely to be responsive to email campaigns).

If multiple response models are built for specific channels, do not use them interchangeably. Each model should be describing distinct channel behaviors, not just general responsiveness to given offers or products.

I’ve seen a case where a cruise ship company used an affinity model specifically designed for a seasonal European line for general purposes in the name of cost savings. The result? It would have been far more cost-effective to develop another model than to deal with the fallout from ineffective campaigns. Modeling cost is often a small slice of the whole pie of campaign expenses. Don’t get stingy on analytics, and do call for help when in doubt.

Mishap #3: There Are Scoring Errors

Applying a model algorithm to a validation sample is relatively simple, as such samples are not really large. Now, try to apply the same algorithm to over 100 million potential targets. You may encounter all kinds of performance issues caused by the sheer volume of data.

Then there are more fundamental errors stemming from the database structure itself. What if the main database structure is different from that of the development sample? That type of discrepancy – which is very common – often leads to disasters.

Always check if anything is different between the development samples and the main database:

  • Database Structure: There are so many types of database platforms, and the way they store simple transaction data may be vastly different. In general, to rank individuals, each data record must be scored on an individual level, not transaction or event levels. It is strongly recommended that data consolidation, summarization, and variable creation be done in an analytics-friendly environment “before” any modeling begins. Structural consistency eliminates many potential errors.
  • Variable List/Names: When you have hundreds, or even thousands of variables in the database, there will be similar sounding names. I’ve seen many different variable names that may represent “Total Individual Dollar Amount Past 12-month,” for example. It is a common mistake to use a wrong data field in the scoring process.
  • Variable Values: Not all similar-sounding variables have similar values in them. For example, the ever-so-popular “Household Income” may contain dollar values in thousand-dollar increments, or pre-coded values that look like letters. What if someone changed the grouping definition of such binned variables? It would be a miracle if the model scores came out correctly.
  • Imputation Assumptions: There are many ways to treat missing values (refer to “Missing Data Can Be Meaningful”). Depending on how they were transformed and stored, even missing values can be predictive in models. If missing values are substituted with imputed values, it is absolutely important to maintain their consistency throughout the process. Mistreatment of missing values is often the main cause of scoring errors.
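
Several of the checks above can be partly automated by profiling the same variables in the development sample and in the main database, then comparing the two profiles. The sketch below is only an illustration under stated assumptions: the field names (`hh_income`, `total_amt_12m`), the sample records, and the 5-point tolerance on missing rates are all made up, and it assumes both sources can be read as lists of dictionaries with at least one record each.

```python
def profile(records, fields):
    """Summarize each field: value types seen, missing rate, distinct values."""
    summary = {}
    for field in fields:
        present = [r[field] for r in records if r.get(field) is not None]
        summary[field] = {
            "types": {type(v).__name__ for v in present},
            "missing_rate": 1 - len(present) / len(records),
            "levels": set(present),
        }
    return summary

def compare(dev, prod, tolerance=0.05):
    """List differences between development and production profiles."""
    issues = []
    for field, d in dev.items():
        p = prod[field]
        if d["types"] != p["types"]:
            issues.append(
                f"{field}: type changed {sorted(d['types'])} -> {sorted(p['types'])}"
            )
        elif d["types"] == {"str"} and not p["levels"] <= d["levels"]:
            # Coded variable carries values the model never saw in development.
            issues.append(f"{field}: unseen codes {sorted(p['levels'] - d['levels'])}")
        if abs(d["missing_rate"] - p["missing_rate"]) > tolerance:
            issues.append(
                f"{field}: missing rate {d['missing_rate']:.2f} -> {p['missing_rate']:.2f}"
            )
    return issues

# Hypothetical data: income is letter-coded in development but stored in
# thousands of dollars in the main database, and spend is often missing there.
dev_sample = [
    {"hh_income": "A", "total_amt_12m": 120.0},
    {"hh_income": "B", "total_amt_12m": 0.0},
    {"hh_income": "C", "total_amt_12m": 45.5},
    {"hh_income": "A", "total_amt_12m": 10.0},
]
main_db = [
    {"hh_income": 55, "total_amt_12m": 80.0},
    {"hh_income": 120, "total_amt_12m": None},
    {"hh_income": 38, "total_amt_12m": 12.0},
    {"hh_income": 72, "total_amt_12m": None},
]
fields = ["hh_income", "total_amt_12m"]
issues = compare(profile(dev_sample, fields), profile(main_db, fields))
```

A check like this will not catch a wrong field that happens to share the same type and coverage, so it complements, rather than replaces, a careful review of the variable list.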

Mishap #4: Nature of Data Is Significantly Shifted

Data values change over time due to outside factors. For instance, if there is a major shift in the business model (e.g., the business moving to a subscription model), or a significant change in data collection methods or vendors, consider all previous models useless. Models should be predictors of customer behaviors, not reflections of changes in your business.
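
One way to put a number on such a shift, offered here as an assumption rather than a method named in this piece, is the population stability index (PSI). It compares how a variable (or a model score) was distributed across bins at development time with how it is distributed now, summing (actual share - expected share) × ln(actual share / expected share) over the bins. The bin counts below are made up for illustration, and the small `floor` value is simply a guard against empty bins.

```python
import math

def psi(expected_counts, actual_counts, floor=1e-6):
    """Population stability index between two binned distributions."""
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_pct = max(e / e_total, floor)  # floor avoids log(0) and division by zero
        a_pct = max(a / a_total, floor)
        total += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return total

# Hypothetical bin counts for one variable, same bin definitions in both periods.
development = [100] * 10
current_same = [100] * 10
current_shifted = [250, 150, 100, 100, 100, 100, 80, 60, 40, 20]

stable = psi(development, current_same)
shifted = psi(development, current_shifted)
```

A commonly quoted rule of thumb treats a PSI below 0.1 as stable and above 0.25 as a significant shift, but those cut-offs are conventions, not guarantees, and should be weighed against what you know changed in the business.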

Mishap #5: Scores Are Tampered With After the Fact

This one really breaks my heart, but it happens. I once saw a user in a major financial institution unilaterally change the ranges of model decile groups after observing significant fluctuations in model group counts. As you can imagine by now, uneven model group counts reveal serious inconsistencies caused by any of the factors that I have mentioned thus far. You cannot tape over a major wound. Just bite the bullet and commission a new model when you see uneven or inconsistent model decile counts.
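
Rather than adjusting the ranges, the uneven counts can be surfaced as a warning. The sketch below assumes that the nine decile cut-offs were fixed at model development and that each group should hold roughly 10% of the scored universe; the integer scores, the cut-off values, and the 3-point tolerance are hypothetical choices for illustration.

```python
from bisect import bisect_right

def decile_counts(scores, cutoffs):
    """Count scores per group using cut-offs fixed at model development."""
    counts = [0] * (len(cutoffs) + 1)
    for s in scores:
        counts[bisect_right(cutoffs, s)] += 1
    return counts

def uneven_groups(counts, tolerance=0.03):
    """Return indexes of groups whose share strays from an even split."""
    total = sum(counts)
    expected = 1 / len(counts)
    return [i for i, c in enumerate(counts) if abs(c / total - expected) > tolerance]

# Hypothetical cut-offs and scores on a 0-999 scale.
cutoffs = [100, 200, 300, 400, 500, 600, 700, 800, 900]
healthy_scores = list(range(1000))   # spread evenly across all groups
shifted_scores = list(range(500))    # everything piled into the low groups

healthy = decile_counts(healthy_scores, cutoffs)
shifted = decile_counts(shifted_scores, cutoffs)
flagged = uneven_groups(shifted)
```

If groups are flagged, the cut-offs are not the problem. The flag is a prompt to look for the universe, scoring, or data-shift issues described earlier.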

Mishap #6: There Are Selection Errors

When campaign targets are selected based on model scores, the users must be fully aware of what those scores mean. If the score is grouped into model groups 1 through 10, is the ideal target “1” or “10”?

I’ve seen cases where the campaign selection was completely off the mark, as someone sorted the raw score in ascending order, not descending order, pushing the worst prospects to the top. But I’ve also seen errors in documentation or judgment, as it can be really confusing to figure out which group is “better.”

I tend to put things on a 0-9 scale when designing a series of personas or affinity models to avoid confusion. If score groups range from 0 to 9, the user is much less likely to assume that “zero” is the best score. Without a doubt, a reversed score is far worse than not using the model at all.
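
Making the direction explicit in the selection logic also helps. The sketch below assumes that a higher raw score means a better prospect, which must be confirmed against the model documentation; the function names, IDs, and scores are hypothetical.

```python
def assign_groups(scores, n_groups=10):
    """Rank scores into groups 0..n_groups-1, where the highest group is best."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    groups = [0] * len(scores)
    for rank, i in enumerate(order):
        groups[i] = rank * n_groups // len(scores)
    return groups

def select_top(ids, scores, n):
    """Pick the n highest-scoring ids; the descending sort is stated outright."""
    ranked = sorted(zip(scores, ids), reverse=True)
    return [record_id for _, record_id in ranked[:n]]

# Made-up prospects and raw scores.
ids = list("ABCDEFGHIJ")
scores = [0.05, 0.91, 0.40, 0.77, 0.12, 0.63, 0.28, 0.99, 0.51, 0.36]

groups = assign_groups(scores)
top_three = select_top(ids, scores, 3)  # ["H", "B", "D"]
```

A quick sanity check before any campaign goes out is to confirm that the record with the highest raw score sits in group 9, and that the selected names are the ones with the highest scores, not the lowest.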

Final Thoughts

After all, the model algorithm itself can be wrong, too. Not all modelers are equally competent, and machine learning is only as good as the analyst who originally set it up. Of course, you must turn over that stone when investigating bad results. But you should trace all pre- and post-modeling steps, as well. After years of such detective work, my bet is firmly on errors outside the modeling process, unless the model validation smells fishy.

In any case, do not entirely give up on modeling just because you’ve had a few bad results. There are many things to be checked and tweaked, and model-based targeting is a long series of iterative adjustments. Be mindful that even a mediocre model is still better than someone’s gut feelings, if it is applied to campaigns properly.