It’s All About Ranking

The decision-making process is really all about ranking. As a marketer, to whom should you be talking first? What product should you offer through what channel? As a businessperson, whom should you hire among all the candidates? As an investor, what stocks or bonds should you purchase? As a vacationer, where should you visit first?

Yes, “choice” is the keyword in all of these questions. And if you picked Paris over other places as an answer to the last question, you just made a choice based on some ranking order in your mind. The world is big, and there could have been many factors that contributed to that decision, such as culture, art, cuisine, attractions, weather, hotels, airlines, prices, deals, distance, convenience, language, etc., and I am pretty sure that not all factors carried the same weight for you. For example, if you put more weight on “cuisine,” I can see why London would lose a few points to Paris in that ranking order.

As a citizen, for whom should I vote? That’s the choice based on your ranking among candidates, too. Call me overly analytical (and I am), but I see the difference in political stances as differences in “weights” for many political (and sometimes not-so-political) factors, such as economy, foreign policy, defense, education, tax policy, entitlement programs, environmental issues, social issues, religious views, local policies, etc. Every voter puts different weights on these factors, and the sum of them becomes the score for each candidate in their minds. No one thinks that education is not important, but among all these factors, how much weight should it receive? Well, that is different for everybody; hence, the political differences.

I didn’t bring this up to start a political debate, but rather to point out that the decision-making process is based on ranking, and the ranking scores are made of many factors with different weights. And that is how the statistical models are designed in a nutshell (so, that means the models are “nuts”?). Analysts call those factors “independent variables,” which describe the target.

In my past columns, I talked about the importance of statistical models in the age of Big Data (refer to “Why Model?”), and why marketing databases must be “model-ready” (refer to “Chicken or the Egg? Data or Analytics?”). Now let’s dig a little deeper into the design of the “model-ready” marketing databases. And surprise! That is also all about “ranking.”

Let’s step back into the marketing world, where folks are not easily offended by the subject matter. If I gave you a spreadsheet containing thousands of leads for your business, you wouldn’t easily be able to tell which ones are the “Glengarry Glen Ross” leads that came from Downtown, along with those infamous steak knives. What choice would you have then? Call everyone on the list? I guess you could start picking names out of a hat. If you think about it a little more, you might filter the list by first name, as first names may reflect the decade in which their owners were born. Or start calling folks who live in towns that sound affluent. Heck, you could start calling them in alphabetical order, but the point is that you would “sort” the list somehow.

Now, if the list came with some other valuable information, such as income, age, gender, education level, socio-economic status, housing type, number of children, etc., you may be able to pick and choose which variables to use to sort the list. You may start calling the high-income folks first. Not all product sales are positively related to income, but it is an easy way to start the process. Then, you would throw in other variables to break the ties in rich areas. I don’t know what you’re selling, but maybe you would want folks who live in a single-family house with kids. And sometimes, your “gut” feeling may lead you to the right place. But only sometimes. And only when the size of the list is not in the millions.

If the list was not for prospecting calls, but for a CRM application where you also need to analyze past transaction and interaction history, the list of the factors (or variables) that you need to consider would be literally nauseating. Imagine the list contains all kinds of dollars, dates, products, channels and other related numbers and figures in a seemingly endless series of columns. You’d have to scroll to the right for quite some time just to see what’s included in the chart.

In situations like that, how nice would it be if some analyst threw in just two model scores, for example, one for responsiveness to your product and one for the potential value of each customer? The analysts may have considered hundreds (or thousands) of variables to derive such scores for you, and all you need to know is that the higher the score, the more likely the lead is to be responsive or to have higher potential value. For your convenience, the analyst may have converted all those numbers with many decimal places into easy-to-understand 1-10 or 1-20 scales. That would be nice, wouldn’t it? Now you can just start calling the folks in model group No. 1.
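
The score-to-scale conversion the analyst performs can be sketched in a few lines. This is a hypothetical illustration (the function name and sample scores are my own invention, not from any particular tool): rank the records by raw score and cut them into equal-sized groups, with group 1 on top.

```python
def to_decile_groups(scores, n_groups=10):
    """Convert raw model scores into 1..n_groups rank groups,
    where group 1 holds the highest-scoring records."""
    # Indices ordered from highest raw score to lowest
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    groups = [0] * len(scores)
    for rank, i in enumerate(order):
        # Integer division spreads records evenly across the groups
        groups[i] = rank * n_groups // len(scores) + 1
    return groups

raw = [0.91, 0.12, 0.55, 0.78, 0.33, 0.67, 0.05, 0.88, 0.42, 0.21]
groups = to_decile_groups(raw)
```

With a scale like this, the person doing the calling never has to look at the raw decimals at all.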

But let me throw in a curveball here. Let’s go back to the list with all those transaction data attached, but without the model scores. You may say, “Hey, that’s OK, because I’ve been doing alright without any help from a statistician so far, and I’ll just use the past dollar amount as their primary value and sort the list by it.” And that is a fine plan, in many cases. Then, when you look deeper into the list, you find out there are multiple entries for the same name all over the place. How can you sort the list of leads if the list is not even on an individual level? Welcome to the world of relational databases, where every transaction deserves an entry in a table.

Relational databases are optimized to store every transaction and retrieve them efficiently. In a relational database, tables are connected by match keys, and many times, tables are connected in what we call “1-to-many” relationships. Imagine a shopping basket. There is a buyer, and we need to record the buyer’s ID number, name, address, account number, status, etc. Each buyer may have multiple transactions, and for each transaction, we now have to record the date, dollar amount, payment method, etc. Further, if the buyer put multiple items in a shopping basket, that transaction, in turn, is in yet another 1-to-many relationship to the item table. You see, in order to record everything that just happened, this relational structure is very useful. If you are the person who has to create the shipping package, yes, you need to know all the item details, transaction value and the buyer’s information, including the shipping and billing address. Database designers love this completeness so much, they even call this structure the “normal” state.
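
The 1-to-many chain described above can be sketched with toy tables (all the names, keys and values here are made up for illustration). Note how answering even a simple buyer-level question means hopping across two joins:

```python
# Toy "normalized" tables: buyers -> transactions -> items, linked by match keys.
buyers = [{"buyer_id": 1, "name": "A. Smith", "city": "Chicago"}]
transactions = [
    {"txn_id": 10, "buyer_id": 1, "date": "2013-06-01", "amount": 120.00},
    {"txn_id": 11, "buyer_id": 1, "date": "2013-06-15", "amount": 45.50},
]
items = [
    {"txn_id": 10, "sku": "CHAIR-01", "qty": 1},
    {"txn_id": 10, "sku": "LAMP-02", "qty": 2},
    {"txn_id": 11, "sku": "RUG-03", "qty": 1},
]

def items_for_buyer(buyer_id):
    """Walk the 1-to-many chain: one buyer -> many transactions -> many items."""
    txn_ids = {t["txn_id"] for t in transactions if t["buyer_id"] == buyer_id}
    return [i for i in items if i["txn_id"] in txn_ids]
```

One buyer, two transactions, three item lines: perfect for packing a shipping box, and exactly why no single row describes the buyer.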

But the trouble with the relational structure is that each line describes transactions or items, not the buyers. Sure, one can “filter” people out by interrogating every line in the transaction table, say, “Select buyers who had any transaction over $100 in the past 12 months.” That is what I call rudimentary filtering, but once we start asking complex questions such as, “What is the buyer’s average transaction amount for the past 12 months in the outdoor sports category, and what is the overall future value of the customers through online channels?” then you will need what we call “buyer-centric” portraits, not transaction- or item-centric records. Better yet, if I ask you to rank every customer in order of such future value, well, good luck doing that when all the tables describe transactions, not people. That would be exactly like the case where you have multiple lines for one individual when you need to sort the leads from high value to low.

So, how do we remedy this? If you would like to sort the leads on an individual level, the database must be summarized on an individual level. If the goal is to rank households, email addresses, companies, business sites or products, then the summarization should be done on those levels, too. Database designers call this the “de-normalization” process, and the tables tend to get “wide” along the way, but that is the necessary step in order to rank the entities properly.
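
As a minimal sketch (with hypothetical field names), the de-normalization step collapses many transaction rows into one wide row per buyer, which can then finally be sorted:

```python
from collections import defaultdict

def summarize_by_buyer(transactions):
    """Collapse 1-to-many transaction rows into one row per buyer,
    so individuals -- not transactions -- can be ranked."""
    summary = defaultdict(lambda: {"txn_count": 0, "total_amount": 0.0})
    for t in transactions:
        row = summary[t["buyer_id"]]
        row["txn_count"] += 1
        row["total_amount"] += t["amount"]
    for row in summary.values():
        row["avg_amount"] = row["total_amount"] / row["txn_count"]
    return dict(summary)

txns = [
    {"buyer_id": 1, "amount": 120.00},
    {"buyer_id": 1, "amount": 45.50},
    {"buyer_id": 2, "amount": 300.00},
]
by_buyer = summarize_by_buyer(txns)
# Rank buyers high-to-low by total spend -- impossible on the raw txn table.
ranked = sorted(by_buyer, key=lambda b: by_buyer[b]["total_amount"], reverse=True)
```

The same pattern applies at the household, company or product level; only the key you group on changes.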

Now, the starting point of all the summarizations is proper identification numbers for those levels. It won’t be possible to summarize any table on a household level without a reliable household ID. One may think that such things are a given, but I would have to disagree. I’ve seen so many so-called “state of the art” (another cliché that makes me nauseous) databases that do not have consistent IDs of any kind. If your database managers say they are using “plain name” or “email address” fields for matching or summarization, be afraid. Be very afraid. For starters, you know how many email addresses one person may have. To add to that, consider how many people move around each year.

Things get even worse for ranking by model scores when it comes to “unstructured” databases. We see more and more of those, as the data sources reach into uncharted territories and the size of the databases grows exponentially. There, all these bits and pieces of data are sitting on mysterious “clouds” as entries of their own. Here again, it is one thing to select or filter based on collected data, but ranking based on statistical modeling is simply not possible in such a structure (or lack thereof). Just ask the database managers how many 24-month active customers they really have, considering that a great many people move in that time period and change their addresses, creating multiple entries. If you get an answer like “2 million-ish,” well, that’s another scary moment. (Refer to “Cheat Sheet: Is Your Database Marketing Ready?”)

In order to develop models using variables that are descriptors of customers, not transactions, we must convert those relational or unstructured data into a structure that matches the level by which you would like to rank the records. Even temporarily. As databases get bigger and bigger and storage gets cheaper and cheaper, I’d say that the temporary time period could be, well, indefinite. And because the word “data-mart” is overused and confusing to many, let me just call that place the “Analytical Sandbox.” Sandboxes are fun, and yes, all kinds of fun stuff for marketers and analysts happen there.

The Analytical Sandbox is where samples are created for model development; actual models are built; models are scored for every record—no matter how many there are—without hiccups; targets are easily sorted and selected by model scores; reports are created in meaningful and consistent ways (consistency is even more important than sheer accuracy in what we do); and analytical languages such as SAS, SPSS or R are spoken without being frowned upon by other computing folks. Here, analysts will spend their time pondering target definitions and methodologies, not database structures and incomplete data fields. Have you heard the fancy term “in-database scoring”? This is where that happens, too.

And what comes out of the Analytical Sandbox and goes back into the world of relational or unstructured databases—IT folks often ask this question—is going to be very simple. Instead of having to move mountains of data back and forth, all the variables will be in the form of model scores, providing answers to marketing questions without any missing values (by definition, every record can be scored by models). While the scores pack tons of information into them, their sizes could be as small as a couple of bytes or even less. Even if you carry over a few hundred affinity scores for 100 million people (or any other type of entity), I wouldn’t call the resultant file large, as it would be as small as a few video files, really.

In my future columns, I will explain how to create model-ready (and human-ready) variables using all kinds of numeric, character or free-form data. In Exhibit A, you will see what we call traditional analytical activities colored in dark blue on the right-hand side. In order to make those processes really hum, we must follow all the steps that are on the left-hand side of that big cylinder in the middle. This is where all the data get collected in uniform fashion, properly converted, edited and standardized by uniform rules, categorized based on preset meta-tables, consolidated with consistent IDs, summarized to desired levels, and turned into meaningful variables for more advanced analytics. That is how garbage-in, garbage-out situations are prevented.

Even more than statistical methodologies, consistent and creative variables in the form of “descriptors” of the target audience make or break the marketing plan. Many people think that purchasing expensive analytical software will provide all the answers. But lest we forget, fancy software only answers the right-hand side of Exhibit A, not all of it. Creating a consistent template for all useful information in a uniform fashion is the key to maximizing the power of analytics. If you look into any modeling bakeoff in the industry, you will see that the differences in methodologies are measured in fractions. Conversely, inconsistent and incomplete data create disasters in the real world. And in many cases, companies can’t even attempt advanced analytics while sitting on mountains of data, due to structural inadequacies.

I firmly believe the Big Data movement should be about

  1. getting rid of the noise, and
  2. providing simple answers to decision-makers.

Bragging about the size and the speed element alone will not bring us to the next level, which is to “humanize” the data. At the end of the day (another cliché that I hate), it is all about supporting the decision-making processes, and the decision-making process is all about ranking different options. So, in the interest of keeping it simple, let’s start by creating an analytical haven where all those rankings become easy (call it that, in case you think “sandbox” sounds too juvenile).

Why Model?

Why model? Uh, because someone is ridiculously good looking, like Derek Zoolander? No, seriously, why model when we have so much data around?

The short answer is because we will never know the whole truth. That would be the philosophical answer. Physicists construct models to make new quantum field theories more attractive theoretically and more testable physically. If a scientist already knows the secrets of the universe, well, then that person is on a first-name basis with God Almighty, and he or she doesn’t need any models to describe things like particles or strings. And the rest of us should just hope the scientist isn’t one of those evil beings in “Star Trek.”

Another answer to “why model?” is because we don’t really know the future, not even the immediate future. If some object is moving toward a certain direction at a certain velocity, we can safely guess where it will end up in one hour. Then again, nothing in this universe is just one-dimensional like that, and there could be a snowstorm brewing up on its path, messing up the whole trajectory. And that weather “forecast” that predicted the snowstorm is a result of some serious modeling, isn’t it?

What does all this mean for the marketers who are not necessarily masters of mathematics, statistics or theoretical physics? Plenty, actually. And the use of models in marketing goes way back to the days of punch cards and mainframes. If you are too young to know what those things are, well, congratulations on your youth, and let’s just say that it was around the time when humans first stepped on the moon using a crude rocket ship equipped with less computing power than an inexpensive modern passenger car.

Anyhow, in that ancient time, some smart folks in the publishing industry figured that they would save tons of money if they could correctly “guess” who the potential buyers were “before” they dropped any expensive mail pieces. Even with basic regression models—and they only had one or two chances to get it right with glacially slow tools before the all-too-important Christmas season came around every year—they could safely cut the mail quantity by 80 percent to 90 percent. The savings added up really fast by not talking to everyone.

Fast-forward to the 21st Century. There is still beauty in knowing who the potential buyers are before we start engaging anyone. As I wrote in my previous columns, analytics should answer:

1. To whom you should be talking; and
2. What you should offer once you’ve decided to engage someone.

At least the first part will be taken care of by knowing who is more likely to respond to you.

But in these days, when the cost of contacting a person through various channels is dropping rapidly, deciding to whom to talk can’t be the only reason for all this statistical work. Of course not. There are plenty more reasons why being a statistician (or a data scientist, nowadays) is one of the best career choices in this century.

Here is a quick list of benefits of employing statistical models in marketing. Basically, models are constructed to:

  • Reduce cost by contacting prospects more wisely
  • Increase targeting accuracy
  • Maintain consistent results
  • Reveal hidden patterns in data
  • Automate marketing procedures by being more repeatable
  • Expand the prospect universe while minimizing the risk
  • Fill in the gaps and summarize complex data into an easy-to-use format—A must in the age of Big Data
  • Stay relevant to your customers and prospects

We talked enough about the first point, so let’s jump to the second one. It is hard to argue about the “targeting accuracy” part, though there still are plenty of non-believers in this day and age. Why are statistical models more accurate than someone’s gut feeling or sheer guesswork? Let’s just say that in my years of dealing with lots of smart people, I have not met anyone who can think about more than two to three variables at the same time, not to mention potential interactions among them. Maybe some are very experienced in using RFM and demographic data. Maybe they have been reasonably successful with choices of variables handed down to them by their predecessors. But can they really go head-to-head against carefully constructed statistical models?

What is a statistical model, and how is it built? In short, a model is a mathematical expression of “differences” between dichotomous groups. Too much of a mouthful? Just imagine two groups of people who do not overlap. They may be buyers vs. non-buyers; responders vs. non-responders; credit-worthy vs. not-credit-worthy; loyal customers vs. attrition-bound, etc. The first step in modeling is to define the target, and that is the most important step of all. If the target is hanging in the wrong place, you will be shooting at the wrong place, no matter how good your rifle is.

And the target should be expressed in mathematical terms, as computers can’t read our minds, not just yet. Defining the target is a job in itself:

  • If you’re going after frequent flyers, how frequent is frequent enough for you? Five times a year or 10 times a year? Or somewhere in between? Or should it remain continuous?
  • What if the target is too small or too large? What then?
  • If you are looking for more valuable prospects, how would you express that? In terms of average spending, lifetime spending or sheer number of transactions?
  • What if there is an inverse relationship between frequency and dollar spending (i.e., high spenders shopping infrequently)?
  • And what would be the borderline number to be “valuable” in all this?
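
To make one of these judgment calls concrete, here is a minimal sketch of turning a “frequent flyer” definition into the mathematical 0/1 flag that modeling software needs. The threshold and field names are hypothetical, which is exactly the point: someone has to choose them and defend the choice.

```python
def define_target(customers, min_flights=8):
    """Tag each record 1 (target) or 0 (non-target).
    The cutoff of 8 flights a year is a judgment call the modeler
    must make explicit before any model building can start."""
    return [1 if c["flights_per_year"] >= min_flights else 0 for c in customers]

sample = [
    {"id": "a", "flights_per_year": 12},
    {"id": "b", "flights_per_year": 3},
    {"id": "c", "flights_per_year": 8},
]
flags = define_target(sample)
```

Move the cutoff and the target universe shrinks or grows; everything downstream, including the final rankings, moves with it.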

Once the target is set, after much pondering, then the job is to select the variables that describe the “differences” between the two groups. For example, I know how much marketers love to use income variables in various situations. But if that popular variable does not explain the differences between the two groups (target and non-target), the mathematics will mercilessly throw it out. This rigorous exercise of examining hundreds or even thousands of variables is one of the most critical steps, during which many variables go through various types of transformations. Statisticians have different preferences in terms of ideal numbers of variables in a model, while non-statisticians like us don’t need to be too concerned, as long as the resultant model works. Who cares if a cat is white or black, as long as it catches mice?

Not all selected variables are equally important in model algorithms, either. More powerful variables will be assigned higher weights, and the sum of these weighted values is what we call the model score. Now, non-statisticians who have been slightly allergic to math since the third grade only need to know that the higher the score, the more likely the record in question is to be like the target. To make the matter even simpler, let’s just say that you want higher scores over lower scores. If you are a salesperson, just call the high-score prospects first. And would you care how many variables are packed into that score, as long as you get the good “Glengarry Glen Ross” leads on top?
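
A model score of this kind is, at heart, just a weighted sum. The sketch below is illustrative only: in practice the weights come out of a fitting procedure such as regression, not out of anyone’s head, and these variable names and numbers are invented.

```python
# Hypothetical learned weights: each selected variable carries its own weight.
WEIGHTS = {"months_since_purchase": -0.08, "purchase_count": 0.50, "avg_order": 0.02}
INTERCEPT = 1.0

def model_score(record):
    """Score = intercept + sum of (variable value * learned weight).
    Higher score means more like the target group."""
    return INTERCEPT + sum(w * record[var] for var, w in WEIGHTS.items())

leads = [
    {"name": "lead_1", "months_since_purchase": 2, "purchase_count": 6, "avg_order": 80.0},
    {"name": "lead_2", "months_since_purchase": 18, "purchase_count": 1, "avg_order": 40.0},
]
# A salesperson only needs the ranking, not the variables behind it.
ranked = sorted(leads, key=model_score, reverse=True)
```

Whether three variables or three hundred went into the weights, the output the marketer touches is one number per record.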

So, let me ask again. Does this sound like something a rudimentary selection rule with two to three variables can beat when it comes to identifying the right target? Maybe someone can get lucky once or twice, but not consistently.

That leads to the next point, “consistency.” Because models do not rely on a few popular variables, they are far less volatile than simple selection rules or queries. In this age of Big Data, there are more transaction and behavioral data in the mix than ever, and they are far more volatile than demographic and geo-demographic data. Put simply, people’s purchasing behavior and preferences change much faster than family composition or income, and that volatility factor calls for more statistical work. Plus, all facets of marketing are now more about measurable results (ah, that dreaded ROI, or “Roy,” as I call it), and businesses call for consistent hitters over one-hit wonders.

“Revealing hidden patterns in data” is my favorite. When marketers are presented with thousands of variables, I see a majority of them just sticking to a few popular ones all the time. Some basic recency and frequency data are there, and among hundreds of demographic variables, the list often stops after income, age, gender, presence of children, and some regional variables. But seriously, do you think that the difference between a luxury car buyer and an SUV buyer is just income and age? You see, these variables are just the ones that human minds are accustomed to. The mathematics has no such preconceived notions. Sticking to a few popular variables is like children repeatedly using three favorite colors out of a whole box of crayons.

I once saw a neighborhood-level U.S. Census variable called “% Households with Septic Tanks” in a model built for a high-end furniture catalog. Really, the variable was “percentage of houses with septic tanks in the neighborhood.” Then I realized it made a lot of sense. That variable was revealing how far away that neighborhood was from populous city centers: the higher the percentage of septic tanks, the farther away the residents were from the city center. And maybe those folks who live in sparsely populated areas were more likely to shop for furniture through catalogs than the folks who live closer to commercial areas.

This is where we all have that “aha” moment. But you and I would never pick that variable in anything that we do, not in a million years, no matter how effective it may be in finding the target prospects. The word “septic” may scare some people off at “hello.” In any case, modeling procedures reveal hidden connections like that all of the time, and that is a very important function in data-rich environments. Otherwise, we will not know what to throw out without fear, and the databases will continuously become larger and more unusable.

Moving on to the next points, “Repeatable” and “Expandable” are somewhat related. Let’s say a marketer has been using a very innovative selection logic that she came across almost by accident. In pursuing special types of wealthy people, she stumbled upon a piece of data called “owner of swimming pool.” Now, she may have even had a few good runs with it, too. But eventually, that success will lead to the question of:

1. Having to repeat that success again and again; and
2. Having to expand that universe, when the “known” universe of swimming pool owners becomes depleted or saturated.

Ah, the chagrin of a one-hit-wonder begins.

Use of statistical models, with the help of multiple variables and scalable scoring, would avoid all of those issues. You want to expand the prospect universe? No trouble. Just dial down the scores on the scale a little further. We can even measure the risk of reaching into the lower-scoring groups. And you don’t have to worry about coverage issues related to a few variables, as those won’t be the only ones in the model. Want to automate the selection process? No problem there, as using a score, which is a summary of key predictors, is far simpler than having to carry a long list of data variables into any automated system.
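
The “dial down the scores” expansion can be sketched in a line or two, assuming (hypothetically) that each scored record carries a 1-to-10 group number with group 1 on top:

```python
def select_universe(scored_records, max_group):
    """Expand the prospect universe by reaching into deeper score groups:
    raise max_group from, say, 3 to 5 for more volume, and measure
    the added risk of the lower-scoring names separately."""
    return [r for r in scored_records if r["score_group"] <= max_group]

records = [{"id": i, "score_group": g} for i, g in enumerate([1, 2, 3, 4, 5, 5, 7, 9])]
core = select_universe(records, max_group=3)      # the safest names only
expanded = select_universe(records, max_group=5)  # dialed down for more volume
```

Contrast that with a hard-coded rule like “swimming pool owners only,” which has no dial to turn at all.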

Now, that leads to the next point, “Filling in the gaps and summarizing the complex data into an easy-to-use format.” In the age of ubiquitous and “Big” data, this is the single most important point, way beyond the previous examples for traditional 1-to-1 marketing applications. We are definitely going through massive data overloads everywhere, and someone had better refine the data and provide some usable answers.

As I mentioned earlier, we build models because we will never know the whole truth. I believe that the Big Data movement should be all about:

1. Filtering the noise from valuable information; and
2. Filling the gaps.

“Gaps,” you say? Believe me, there are plenty of gaps in any dataset, big or small.

When information continues to get piled on, the resultant database may look big. And they are physically large. But in marketing, as I repeatedly emphasized in my previous columns, the data must be realigned to “buyer-centric” formats, with every data point describing each individual, as marketing is all about people.

Sure, you may have tons of mobile phone-related data. In fact, it could be quite huge in size. But let me turn that upside down for you (more like sideways-up, in practice). Now, try to describe everyone in your footprint in terms of certain activities. Say, “every smartphone owner who used more than 80 percent of his or her monthly data allowance on average for the past 12 months, regardless of the carrier.” Hey, don’t blame me for asking these questions just because it’s inconvenient for data handlers to answer them. Some marketers would certainly benefit from information like that, and no one cares about just bits and pieces of data, other than for some interesting tidbits at a party.

Here’s the main trouble when you start asking buyer-related questions like that. Once we try to look at the world from the “buyer-centric” point of view, we will realize there are tons of missing data (i.e., a whole bunch of people with not much information). It may be that you will never get this kind of data from all carriers. Maybe not everyone is tracked this way. In terms of individuals, you may end up with less than 10 percent in the database with mobile information attached to them. In fact, many interesting variables may have less than 1 percent coverage. Holes are everywhere in so-called Big Data.

Models can fill in those blanks for you. For all those data compilers who sell age and income data for every household in the country, do you believe that they really “know” everyone’s age and income? A good majority of the information is based on carefully constructed models. And there is nothing wrong with that.
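
A toy sketch of that gap-filling follows. The coefficients here are pure invention, standing in for whatever fitted model a compiler would actually use; the point is that every record ends up with a usable value, flagged as modeled rather than known.

```python
def impute_income(record, base=22000.0, per_year_of_age=950.0):
    """Fill a missing income with a (hypothetical) model-based estimate:
    income ~ base + per_year_of_age * age. Real compilers use far richer
    models; what matters is that no hole is left behind, and modeled
    values are flagged as such."""
    if record.get("income") is None:
        record["income"] = base + per_year_of_age * record["age"]
        record["income_is_modeled"] = True
    else:
        record["income_is_modeled"] = False
    return record

known = impute_income({"age": 40, "income": 90000.0})   # real value kept as-is
filled = impute_income({"age": 40, "income": None})     # gap filled by the model
```

Keeping the “is modeled” flag alongside the value is a cheap way to stay honest about which numbers are known and which are estimates.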

If we don’t get to “know” something, we can get to a “likelihood” score—of “being like” that something. And in that world, every measurement is on a scale, with no missing values. For example, the higher the score of a model built for a telecommunications company, the more likely the prospect is to use a high-speed data plan or international long distance services, depending on the purpose of the model. Or the more likely the person will buy sports packages via cable or satellite. Or the more likely the person will subscribe to premium movie channels. Etc., etc. With scores like these, a marketer can initiate the conversation with—not just talk to—a particular prospect with customized product packages in hand.

And that leads us to the final point in all this, “Staying relevant to your customers and prospects.” That is what Big Data should be all about—at least for us marketers. We know plenty about a lot of people. And they are asking us why we are still so random about marketing messages. With all these data that are literally floating around, marketers can do so much better. But not without statistical models that fill in the gaps and turn pieces of data into marketing-ready answers.

So, why model? Because a big pile of information doesn’t provide answers on its own, and that pile has more holes than Swiss cheese if you look closely. That’s my final answer.