How to Outsource Analytics

In this series, I have been emphasizing the importance of statistical modeling in almost every article. While there are plenty of benefits of using statistical models in a more traditional sense (refer to “Why Model?”), in the days when “too much” data is the main challenge, I would dare to say that the most important function of statistical models is that they summarize complex data into simple-to-use “scores.”

In this series, I have been emphasizing the importance of statistical modeling in almost every article. While there are plenty of benefits of using statistical models in a more traditional sense (refer to “Why Model?”), in the days when “too much” data is the main challenge, I would dare to say that the most important function of statistical models is that they summarize complex data into simple-to-use “scores.”

The next important feature would be that models fill in the gaps, transforming “unknowns” to “potentials.” You see, even in the age of ubiquitous data, no one will ever know everything about everybody. For instance, out of 100,000 people you have permission to contact, only a fraction will be “known” wine enthusiasts. With modeling, we can assign scores for “likelihood of being a wine enthusiast” to everyone in the base. Sure, models are not 100 percent accurate, but I’ll take “70 percent chance of afternoon shower” over not knowing the weather forecast for the day of the company picnic.

I’ve already explained other benefits of modeling in detail earlier in this series, but if I may cut it really short, models will help marketers:

1. In deciding whom to engage, as they cannot afford to spam the world and annoy everyone who can read, and

2. In determining what to offer once they decide to engage someone, as consumers are savvier than ever and they will ignore and discard any irrelevant message, no matter how good it may look.

OK, then. I hope you are sold on this idea by now. The next question is, who is going to do all that mathematical work? In a country where jocks rule over geeks, it is clear to me that many folks are more afraid of mathematics than public speaking; which, in its own right, ranks higher than death in terms of the fear factor for many people. If I may paraphrase “Seinfeld,” many folks are figuratively more afraid of giving a eulogy than being in the coffin at a funeral. And thanks to a sub-par math education in the U.S. (and I am not joking about this, having graduated high school on foreign soil), yes, the fear of math tops them all. Scary, heh?

But that’s OK. This is a big world, and there are plenty of people who are really good at mathematics and statistics. That is why I purposefully never got into the mechanics of modeling techniques and related programming issues in this series. Instead, I have been emphasizing how to formulate questions, how to express business goals in a more logical fashion and where to invest to create analytics-ready environments. Then the next question is, “How will you find the right math geeks who can make all your dreams come true?”

If you have a plan to create an internal analytics team, there are a few things to consider before committing to that idea. Too many organizations just hire one or two statisticians, dump all the raw data onto them, and hope to God that they will figure some ways to make money with data, somehow. Good luck with that idea, as:

1. I’ve seen so many failed attempts like that (actually, I’d be shocked if it actually worked), and

2. I am sure God doesn’t micromanage statistical units.

(Similarly, I am almost certain that she doesn’t care much for football or baseball scores of certain teams, either. You don’t think God cares more for the Red Sox than the Yankees, do ya?)

The first challenge is locating good candidates. If you post any online ad for “Statistical Analysts,” you will receive a few hundred resumes per day. But the hiring process is not that simple, as you should ask the right questions to figure out who is a real deal, and who is a poser (and there are many posers out there). Even among qualified candidates with ample statistical knowledge, there are differences between the “Doers” and “Vendor Managers.” Depending on your organizational goal, you must differentiate the two.

Then the next challenge is keeping the team intact. In general, mathematicians and statisticians are not solely motivated by money; they also want constant challenges. Like any smart and creative folks, they will simply pack up and leave, if “they” determine that the job is boring. Just a couple of modeling projects a year with some rudimentary sets of data? Meh. Boring! Promises of upward mobility only work for a fraction of them, as the majority would rather deal with numbers and figures, showing no interest in managing other human beings. So, coming up with interesting and challenging projects, which will also benefit the whole organization, becomes a job in itself. If there are not enough challenges, smart ones will quit on you first. Then they need constant mentoring, as even the smartest statisticians will not know everything about challenges associated with marketing, target audiences and the business world, in general. (If you stumble into a statistician who is even remotely curious about how her salary is paid for, start with her.)

Further, you would need to invest to set up an analytical environment, as well. That includes software, hardware and other supporting staff. Toolsets are becoming much cheaper, but they are not exactly free yet. In fact, some famous statistical software, such as SAS, could be quite expensive year after year, although there are plenty of alternatives now. And they need an “analytics-ready” data environment, as I emphasized countless times in this series (refer to “Chicken or the Egg? Data or Analytics?” and “Marketing and IT; Cats and Dogs”). Such data preparation work is not for statisticians, and most of them are not even good at cleaning up dirty data, anyway. That means you will need different types of developers/programmers on the analytics team. I pointed out that analytical projects call for a cohesive team, not some super-duper analyst who can do it all (refer to “How to Be a Good Data Scientist”).

By now you would say “Jeez Louise, enough already,” as all this is just too much to manage to build just a few models. Suddenly, outsourcing may sound like a great idea. Then you would realize there are many things to consider when outsourcing analytical work.

First, where would you go? Everyone in the data industry and their cousins claim that they can take care of analytics. But in reality, it is a scary place where many who have “analytics” in their taglines do not even touch “predictive analytics.”

Analytics is a word that is abused as much as “Big Data,” so we really need to differentiate them. “Analytics” may mean:

  • Business Intelligence (BI) Reporting: This is mostly about the present, such as the display of key success metrics and dashboard reporting. While it is very important to know about the current state of business, much of so-called “analytics” unfortunately stops right here. Yes, it is good to have a dashboard in your car now, but do you know where you should be going?
  • Descriptive Analytics: This is about how the targets “look.” Common techniques such as profiling, segmentation and clustering fall under this category. These techniques are mainly for describing the target audience to enhance and optimize messages to them. But using these segments as a selection mechanism is not recommended, while many dare to do exactly that (more on this subject in future articles).
  • Predictive Modeling: This is about answering the questions about the future. Who would be more likely to behave certain ways? What communication channels will be most effective for whom? How much is the potential spending level of a prospect? Who is more likely to be a loyal and profitable customer? What are their preferences? Response models, various of types of cloning models, value models, and revenue models, attrition models, etc. all fall under this category, and they require hardcore statistical skills. Plus, as I emphasized earlier, these model scores compact large amounts of complex data into nice bite-size packages.
  • Optimization: This is mostly about budget allocation and attribution. Marketing agencies (or media buyers) generally deal with channel optimization and spending analysis, at times using econometrics models. This type of statistical work calls for different types of expertise, but many still insist on calling it simply “analytics.”

Let’s say that for the purpose of customer-level targeting and personalization, we decided to outsource the “predictive” modeling projects. What are our options?

We may consider:

  • Individual Consultants: In-house consultants are dedicated to your business for the duration of the contract, guaranteeing full access like an employee. But they are there for you only temporarily, with one foot out the door all the time. And when they do leave, all the knowledge walks away with them. Depending on the rate, the costs can add up.
  • Standalone Analytical Service Providers: Analytical work is all they do, so you get focused professionals with broad technical and institutional knowledge. Many of them are entrepreneurs, but that may work against you, as they could often be understaffed and stretched thin. They also tend to charge for every little step, with not many freebies. They are generally open to use any type of data, but the majority of them do not have secure sources of third-party data, which could be essential for certain types of analytics involving prospecting.
  • Database Service Providers: Almost all data compilers and brokers have statistical units, as they need to fill in the gap within their data assets with statistical techniques. (You didn’t think that they knew everyone’s income or age, did you?) For that reason, they have deep knowledge in all types of data, as well as in many industry verticals. They provide a one-stop shop environment with deep resource pools and a variety of data processing capabilities. However, they may not be as agile as smaller analytical shops, and analytics units may be tucked away somewhere within large and complex organizations. They also tend to emphasize the use of their own data, as after all, their main cash cows are their data assets.
  • Direct Marketing Agencies: Agencies are very strategic, as they touch all aspects of marketing and control creative processes through segmentation. Many large agencies boast full-scale analytical units, capable of all types of analytics that I explained earlier. But some agencies have very small teams, stretched really thin—just barely handling the reporting aspect, not any advanced analytics. Some just admit that predictive analytics is not part of their core competencies, and they may outsource such projects (not that it is a bad thing).

As you can see here, there is no clear-cut answer to “with whom you should you work.” Basically, you will need to check out all types of analysts and service providers to determine the partner best suitable for your long- and short-term business purposes, not just analytical goals. Often, many marketers just go with the lowest bidder. But pricing is just one of many elements to be considered. Here, allow me to introduce “10 Essential Items to Consider When Outsourcing Analytics.”

1. Consulting Capabilities: I put this on the top of the list, as being a translator between the marketing and the technology world is the most important differentiator (refer to “How to Be a Good Data Scientist”). They must understand the business goals and marketing needs, prescribe suitable solutions, convert such goals into mathematical expressions and define targets, making the best of available data. If they lack strategic vision to set up the data roadmap, statistical knowledge alone will not be enough to achieve the goals. And such business goals vary greatly depending on the industry, channel usage and related success metrics. Good consultants always ask questions first, while sub-par ones will try to force-fit marketers’ goals into their toolsets and methodologies.

Translating marketing goals into specific courses of action is a skill in itself. A good analytical partner should be capable of building a data roadmap (not just statistical steps) with a deep understanding of the business impact of resultant models. They should be able to break down larger goals into smaller steps, creating proper phased approaches. The plan may call for multiple models, all kinds of pre- and post-selection rules, or even external data acquisition, while remaining sensitive to overall costs.

The target definition is the core of all these considerations, which requires years of experience and industry knowledge. Simply, the wrong or inadequate targeting decision leads to disastrous results, no matter how sound the mathematical work is (refer to “Art of Targeting”).

Another important quality of a good analytical partner is the ability to create usefulness out of seemingly chaotic and unstructured data environments. Modeling is not about waiting for the perfect set of data, but about making the best of available data. In many modeling bake-offs, the winners are often decided by the creative usage of provided data, not just statistical techniques.

Finally, the consultative approach is important, as models do not exist in a vacuum, but they have to fit into the marketing engine. Be aware of the ones who want to change the world around their precious algorithms, as they are geeks not strategists. And the ones who understand the entire marketing cycle will give advice on what the next phase should be, as marketing efforts must be perpetual, not transient.

So, how will you find consultants? Ask the following questions:

  • Are they “listening” to you?
  • Can they repeat “your” goals in their own words?
  • Do their roadmaps cover both short- and long-term goals?
  • Are they confident enough to correct you?
  • Do they understand “non-statistical” elements in marketing?
  • Have they “been there, done that” for real, or just in theories?

2. Data Processing Capabilities: I know that some people look down upon the word “processing.” But data manipulation is the most important key step “before” any type of advanced analytics even begins. Simply, “garbage-in, garbage out.” And unfortunately, most datasets are completely unsuitable for analytics and modeling. In general, easily more than 80 percent of model development time goes into “fixing” the data, as most are unstructured and unrefined. I have been repeatedly emphasizing the importance of a “model-ready” (or “analytics-ready”) environment for that reason.

However, the reality dictates that the majority of databases are indeed NOT model-ready, and most of them are not even close to it. Well, someone has to clean up the mess. And in this data business, the last one who touches the dataset becomes responsible for all the errors and mistakes made to it thus far. I know it is not fair, but that is why we need to look at the potential partner’s ability to handle large and really messy data, not just the statistical savviness displayed in glossy presentations.

Yes, that dirty work includes data conversion, edit/hygiene, categorization/tagging, data summarization and variable creation, encompassing all kinds of numeric, character and freeform data (refer to “Beyond RFM Data” and “Freeform Data Aren’t Exactly Free”). It is not the most glorious part of this business, but data consistency is the key to successful implementation of any advanced analytics. So, if a model-ready environment is not available, someone had better know how to make the best of whatever is given. I have seen too many meltdowns in “before” and “after” modeling steps due to inconsistencies in databases.

So, grill the candidates with the following questions:

  • If they support file conversions, edit, categorization and summarization
  • How big of a dataset is too big, and how many files/tables are too many for them
  • How much free-form data are too much for them
  • Ask for sample model variables that they have created in the past

3. Track Records in the Industry: It can be argued that industry knowledge is even more crucial for the success than statistical know-how, as nuances are often “Lost in Translation” without relevant industry experience. In fact, some may not even be able to carry on a proper conversation with a client without it, leading to all kinds of wrong assumptions. I have seen a case where “real” rocket scientists messed up models for credit card campaigns.

The No. 1 reason why industry experience is important is everyone’s success metrics are unique. Just to name a few, financial services (banking, credit card, insurance, investment, etc.), travel and hospitality, entertainment, packaged goods, online and offline retail, catalogs, publication, telecommunications/utilities, non-profit and political organizations all call for different types of analytics and models, as their business models and the way they interact with target audiences are vastly different. For example, building a model (or a database, for that matter) for businesses where they hand over merchandise “before” they collect money is fundamentally different than the ones where exchange happens simultaneously. Even a simple concept of payment date or transaction date cannot be treated the same way. For retailers, recent dates could be better for business, but for subscription business, older dates may carry more weight. And these are just some examples with “dates,” before touching any dollar figures or other fun stuff.

Then the job gets even more complicated, if we further divide all of these industries by B-to-B vs. B-to-C, where available data do not even look similar. On top of that, divisional ROI metrics may be completely different, and even terminology and culture may play a role in all of this. When you are a consultant, you really don’t want to stop the flow of a meeting to clarify some unfamiliar acronyms, as you are supposed to know them all.

So, always demand specific industry references and examine client roasters, if allowed. (Many clients specifically ask vendors not to use their names as references.) Basically, watch out for the ones who push one-size-fits-all cookie-cutter solutions. You deserve way more than that.

4. Types of Models Supported: Speaking of cookie-cutter stuff, we need to be concerned with types of models that the outsourcing partner would support. Sure, nobody employs every technique, and no one can be good at everything. But we need to watch out for the “One-trick Ponies.”

This could be a tricky issue, as we are going into a more technical domain. Plus, marketers should not self-prescribe with specific techniques, instead of clearly stating their business goals (refer to “Marketing and IT; Cats and Dogs”). Some of the modeling goals are:

  • Rank and select prospect names
  • Lead scoring
  • Cross-sell/upsell
  • Segment the universe for messaging strategy
  • Pinpoint the attrition point
  • Assign lifetime values for prospects and customers
  • Optimize media/channel spending
  • Create new product packages
  • Detect fraud
  • Etc.

Unless you have successfully dealt with the outsourcing partner in the past (or you have a degree in statistics), do not blurt out words like Neural-net, CHAID, Cluster Analysis, Multiple Regression, Discriminant Function Analysis, etc. That would be like demanding specific medication before your new doctor even asks about your symptoms. The key is meeting your business goals, not fulfilling buzzwords. Let them present their methodology “after” the goal discussion. Nevertheless, see if the potential partner is pushing one or two specific techniques or solutions all the time.

5. Speed of Execution: In modern marketing, speed to action is the king. Speed wins, and speed gains respect. However, when it comes to modeling or other advanced analytics, you may be shocked by the wide range of time estimates provided by each outsourcing vendor. To be fair they are covering themselves, mainly because they have no idea what kind of messy data they will receive. As I mentioned earlier, pre-model data preparation and manipulation are critical components, and they are the most time-consuming part of all; especially when available data are in bad shape. Post-model scoring, audit and usage support may elongate the timeline. The key is to differentiate such pre- and post-modeling processes in the time estimate.

Even for pure modeling elements, time estimates vary greatly, depending on the complexity of assignments. Surely, a simple cloning model with basic demographic data would be much easier to execute than the ones that involve ample amounts of transaction- and event-level data, coming from all types of channels. If time-series elements are added, it will definitely be more complex. Typical clustering work is known to take longer than regression models with clear target definitions. If multiple models are required for the project, it will obviously take more time to finish the whole job.

Now, the interesting thing about building a model is that analysts don’t really finish it, but they just run out of time—much like the way marketers work on PowerPoint presentations. The commonality is that we can basically tweak models or decks forever, but we have to stop at some point.

However, with all kinds of automated tools and macros, model development time has decreased dramatically in past decades. We really came a long way since the first application of statistical techniques to marketing, and no one should be quoting a 1980s timeline in this century. But some still do. I know vendors are trained to follow the guideline “always under-promise and over-deliver,” but still.

An interesting aspect of this dilemma is that we can negotiate the timeline by asking for simpler and less sophisticated versions with diminished accuracy. If, hypothetically, it takes a week to be 98 percent accurate, but it only takes a day to be 90 percent accurate, what would you pick? That should be the business decision.

So, what is a general guideline? Again, it really depends on many factors, but allow me to share a version of it:

  • Pre-modeling Processing

– Data Conversions: from half a day to weeks

– Data Append/Enhancement: between overnight and two days

– Data Edit and Summarization: Data-dependent

  • Modeling: Ranges from half a day to weeks

– Depends on type, number of models and complexity

  • Scoring: from half a day to one week

– Mainly depends on number of records and state of the database to be scored

I know these are wide ranges, but watch out for the ones that routinely quote 30 days or more for simple clone models. They may not know what they are doing, or worse, they may be some mathematical perfectionists who don’t understand the marketing needs.

6. Pricing Structure: Some marketers would put this on top of the checklist, or worse, use the pricing factor as the only criterion. Obviously, I disagree. (Full disclosure: I have been on the service side of the fence during my entire career.) Yes, every project must make an economic sense in the end, but the budget should not and cannot be the sole deciding factor in choosing an outsourcing partner. There are many specialists under famous brand names who command top dollars, and then there are many data vendors who throw in “free” models, disrupting the ecosystem. Either way, one should not jump to conclusions too fast, as there is no free lunch, after all. In any case, I strongly recommend that no one should start the meeting with pricing questions (hence, this article). When you get to the pricing part, ask what the price includes, as the analytical journey could be a series of long and winding roads. Some of the biggest factors that need to be considered are:

  • Multiple Model Discounts—Less for second or third models within a project?
  • Pre-developed (off-the-shelf) Models—These can be “much” cheaper than custom models, while not custom-fitted.
  • Acquisition vs. CRM—Employing client-specific variables certainly increases the cost.
  • Regression Models vs. Other Types—At times, types of techniques may affect the price.
  • Clustering and Segmentations—They are generally priced much higher than target-specific models.

Again, it really depends on the complexity factor more than anything else, and the pre- and post-modeling process must be estimated and priced separately. Non-modeling charges often add up fast, and you should ask for unit prices and minimum charges for each step.

Scoring charges in time can be expensive, too, so negotiate for discounts for routine scoring of the same models. Some may offer all-inclusive package pricing for everything. The important thing is that you must be consistent with the checklist when shopping around with multiple candidates.

7. Documentation: When you pay for a custom model (not pre-developed, off-the-shelf ones), you get to own the algorithm. Because algorithms are not tangible items, the knowledge is to be transformed in model documents. Beware of the ones who offer “black-box” solutions with comments like, “Oh, it will work, so trust us.”

Good model documents must include the following, at the minimum:

  • Target and Comparison Universe Definitions: What was the target variable (or “dependent” variable) and how was it defined? How was the comparison universe defined? Was there any “pre-selection” for either of the universes? These are the most important factors in any model—even more than the mechanics of the model itself.
  • List of Variables: What are the “independent” variables? How were they transformed or binned? From where did they originate? Often, these model variables describe the nature of the model, and they should make intuitive sense.
  • Model Algorithms: What is the actual algorithm? What are the assigned weight for each independent variable?
  • Gains Chart: We need to examine potential effectiveness of the model. What are the “gains” for each model group, from top to bottom (e.g., 320 percent gain at the top model group in comparison to the whole universe)? How fast do such gains decrease as we move down the scale? How do the gains factors compare against the validation sample? A graphic representation would be nice, too.

For custom models, it is customary to have a formal model presentation, full documentation and scoring script in designated programming languages. In addition, if client files are provided, ask for a waterfall report that details input and output counts of each step. After the model scoring, it is also customary for the vendor to provide a scored universe count by model group. You will be shocked to find out that many so-called analytical vendors do not provide thorough documentation. Therefore, it is recommended to ask for sample documents upfront.

8. Scoring Validation: Models are built and presented properly, but the job is not done until the models are applied to the universe from which the names are ranked and selected for campaigns. I have seen too many major meltdowns at this stage. Simply, it is one thing to develop models with a few hundred thousand record samples, but it is quite another to apply the algorithm to millions of records. I am not saying that the scoring job always falls onto the developers, as you may have an internal team or a separate vendor for such ongoing processes. But do not let the model developer completely leave the building until everything checks out.

The model should have been validated against the validation sample by then, but live scoring may reveal all kinds of inconsistencies. You may also want to back-test the algorithms with past campaign results, as well. In short, many things go wrong “after” the modeling steps. When I hear customers complaining about models, I often find that the modeling is the only part that was done properly, and “before” and “after” steps were all messed up. Further, even machines misunderstand each other, as any differences in platform or scripting language may cause discrepancies. Or, maybe there was no technical error, but missing values may have caused inconsistencies (refer to “Missing Data Can Be Meaningful”). Nonetheless, the model developers would have the best insight as to what could have gone wrong, so make sure that they are available for questions after models are presented and delivered.

9. Back-end Analysis: Good analytics is all about applying learnings from past campaigns—good or bad—to new iterations of efforts. We often call it “closed-loop marketing—while many marketers often neglect to follow up. Any respectful analytics shop must be aware of it, while they may classify such work separately from modeling or other analytical projects. At the minimum, you need to check out if they even offer such services. In fact, so-called “match-back analysis” is not as simple as just matching campaign files against responders in this omnichannel environment. When many channels are employed at the same time, allocation of credit (i.e., “what worked?”) may call for all kinds of business rules or even dedicated models.

While you are at it, ask for a cheaper version of “canned” reports, as well, as custom back-end analysis can be even more costly than the modeling job itself, over time. Pre-developed reports may not include all the ROI metrics that you’re looking for (e.g., open, clickthrough, conversion rates, plus revenue and orders-per-mailed, per order, per display, per email, per conversion. etc.). So ask for sample reports upfront.

If you start breaking down all these figures by data source, campaign, time series, model group, offer, creative, targeting criteria, channel, ad server, publisher, keywords, etc., it can be unwieldy really fast. So contain yourself, as no one can understand 100-page reports, anyway. See if the analysts can guide you with such planning, as well. Lastly, if you are so into ROI analysis, get ready to share the “cost” side of the equation with the selected partner. Some jobs are on the marketers.

10. Ongoing Support: Models have a finite shelf life, as all kinds of changes happen in the real world. Seasonality may be a factor, or the business model or strategy may have changed. Fluctuations in data availability and quality further complicate the matter. Basically assumptions like “all things being equal” only happen in textbooks, so marketers must plan for periodic review of models and business rules.

A sure sign of trouble is decreasing effectiveness of models. When in doubt, consult the developers and they may recommend a re-fit or complete re-development of models. Quarterly reviews would be ideal, but if the cost becomes an issue, start with 6-month or yearly reviews, but never go past more than a year without any review. Some vendors may offer discounts for redevelopment, so ask for the price quote upfront.

I know this is a long list of things to check, but picking the right partner is very important, as it often becomes a long-term relationship. And you may find it strange that I didn’t even list “technical capabilities” at all. That is because:

1. Many marketers are not equipped to dig deep into the technical realm anyway, and

2. The difference between the most mathematically sound models and the ones from the opposite end of the spectrum is not nearly as critical as other factors I listed in this article.

In other words, even the worst model in the bake-off would be much better than no model, if these other business criterion are well-considered. So, happy shopping with this list, and I hope you find the right partner. Employing analytics is not an option when living in the sea of data.

Beyond RFM Data

In the world of predictive analytics, the transaction data is the king of the hill. The master of the domain. The protector of the realm. Why? Because they are hands-down the most powerful predictors. If I may borrow the term that my mentor coined for our cooperative venture more than a decade ago (before anyone even uttered the word “Big Data”), “The past behavior is the best predictor of the future behavior.” Indeed. Back then, we had built a platform that nowadays could easily have qualified as Big Data. The platform predicted people’s future behaviors on a massive scale, and it worked really well, so I still stand by that statement.

In the world of predictive analytics, the transaction data is the king of the hill. The master of the domain. The protector of the realm. Why? Because they are hands-down the most powerful predictors. If I may borrow the term that my mentor coined for our cooperative venture more than a decade ago (before anyone even uttered the word “Big Data”), “The past behavior is the best predictor of the future behavior.” Indeed. Back then, we had built a platform that nowadays could easily have qualified as Big Data. The platform predicted people’s future behaviors on a massive scale, and it worked really well, so I still stand by that statement.

How so? At the risk of sounding like a pompous mathematical smartypants (I’m really not), it is because people do not change that much, or if so, not so rapidly. Every move you make is on some predictive curve. What you been buying, clicking, browsing, smelling or coveting somehow leads to the next move. Well, not all the time. (Maybe you just like to “look” at pretty shoes?) But with enough data, we can calculate the probability with some confidence that you would be an outdoors type, or a golfer, or a relaxing type on a cruise ship, or a risk-averse investor, or a wine enthusiast, or into fashion, or a passionate gardener, or a sci-fi geek, or a professional wrestling fan. Beyond affinity scores listed here, we can predict future value of each customer or prospect and possible attrition points, as well. And behind all those predictive models (and I have seen countless algorithms), the leading predictors are mostly transaction data, if you are lucky enough to get your hands on them. In the age of ubiquitous data and at the dawn of the “Internet of Things,” more marketers will be in that lucky group if they are diligent about data collection and refinement. Yes, in the near future, even a refrigerator will be able to order groceries, but don’t forget that only the collection mechanism will be different there. We still have to collect, refine and analyze the transaction data.

Last month, I talked about three major types of data (refer to “Big Data Must Get Smaller“), which are:
1. Descriptive Data
2. Behavioral Data (mostly Transaction Data)
3. Attitudinal Data.

If you gain access to all three elements with decent coverage, you will have tremendous predictive power when it comes to human behaviors. Unfortunately, it is really difficult to accumulate attitudinal data on a large scale with individual-level details (i.e., knowing who’s behind all those sentiments). Behavioral data, mostly in forms of transaction data, are also not easy to collect and maintain (non-transaction behavioral data are even bigger and harder to handle), but I’d say it is definitely worth the effort, as most of what we call Big Data fall under this category. Conversely, one can just purchase descriptive data, which are what we generally call demographic or firmographic data, from data compilers or brokers. The sellers (there are many) will even do the data-append processing for you and they may also throw in a few free profile reports with it.

Now, when we start talking about the transaction data, many marketers will respond “Oh, you mean RFM data?” Well, that is not completely off-base, because “Recency, Frequency and Monetary” data certainly occupy important positions in the family of transaction data. But they hardly are the whole thing, and the term is misused as frequently as “Big Data.” Transaction data are so much more than simple RFM variables.

RFM Data Is Just a Good Start
The term RFM should be used more as a checklist for marketers, not as design guidelines—or limitations in many cases—for data professionals. How recently did this particular customer purchase our product, and how frequently did she do that and how much money did she spend with us? Answering these questions is a good start, but stopping there would seriously limit the potential of transaction data. Further, this line of questioning would lead the interrogation efforts to simple “filtering,” as in: “Select all customers who purchased anything with a price tag over $100 more than once in past 12 months.” Many data users may think that this query is somewhat complex, but it really is just a one-dimensional view of the universe. And unfortunately, no customer is one-dimensional. And this query is just one slice of truth from the marketer’s point of view, not the customer’s. If you want to get really deep, the view must be “buyer-centric,” not product-, channel-, division-, seller- or company-centric. And the database structure should reflect that view (refer to “It’s All About Ranking,” where the concept of “Analytical Sandbox” is introduced).

Transaction data by definition describe the transactions, not the buyers. If you would like to describe a buyer or if you are trying to predict the buyer’s future behavior, you need to convert the transaction data into “descriptors of the buyers” first. What is the difference? It is the same data looked at through a different window—front vs. side window—but the effect is huge.

Even if we think about just one simple transaction with one item, instead of describing the shopping basket as “transaction happened on July 3, 2014, containing the Coldplay’s latest CD ‘Ghost Stories’ priced at $11.88,” a buyer-centric description would read: “A recent CD buyer in Rock genre with an average spending level in the music category under $20.” The trick is to describe the buyer, not the product or the transaction. If that customer has many orders and items in his purchase history (let’s say he downloaded a few songs to his portable devices, as well), the description of the buyer would become much richer. If you collect all of his past purchase history, it gets even more colorful, as in: “A recent music CD or MP3 buyer in rock, classical and jazz genres with 24-month purchase totaling to 13 orders containing 16 items with total spending valued in $100-$150 range and $11 average order size.” Of course you would store all this using many different variables (such as genre indicators, number of orders, number of items, total dollars spent during the past 24 months, average order amount and number of weeks since last purchase in the music category, etc.). But the point is that the story would come out this way when you change the perspective.

Creating a Buyer-Centric Portrait
The whole process of creating a buyer-centric portrait starts with data summarization (or de-normalization). A typical structure of the table (or database) that needs to capture every transaction detail, such as transaction date and amount, would require an entry for every transaction, and the database designers call it the “normal” state. As I explained in my previous article (“Ranking is the key”), if you would like to rank in terms of customer value, the data record must be on a customer level, as well. If you are ranking households or companies, you would then need to summarize the data on those levels, too.

Now, this summarization (or de-normalization) is not a process of eliminating duplicate entries of names, as you wouldn’t want to throw away any transaction details. If there are multiple orders per person, what is the total number of orders? What is the total amount of spending on an individual level? What would be average spending level per transaction, or per year? If you are allowed to have only one line of entry per person, how would you summarize the purchase dates, as you cannot just add them up? In that case, you can start with the first and last transaction date of each customer. Now, when you have the first and last transaction date for every customer, what would be the tenure of each customer and what would be the number of days since the last purchase? How many days, on average, are there in between orders then? Yes, all these figures are related to basic RFM metrics, but they are far more colorful this way.

The attached exhibit displays a very simple example of a before and after picture of such summarization process. On the left-hand side, there resides a typical order table containing customer ID, order number, order date and transaction amount. If a customer has multiple orders in a given period, an equal number of lines are required to record the transaction details. In real life, other order level information, such as payment method (very predictive, by the way), tax amount, discount or coupon amount and, if applicable, shipping amount would be on this table, as well.

On the right-hand side of the chart, you will find there is only one line per customer. As I mentioned in my previous columns, establishing consistent and accurate customer ID cannot be neglected—for this reason alone. How would you rely on the summary data if one person may have multiple IDs? The customer may have moved to a new address, or shopped from multiple stores or sites, or there could have been errors in data collections. Relying on email address is a big no-no, as we all carry many email addresses. That is why the first step of building a functional marketing database is to go through the data hygiene and consolidation process. (There are many data processing vendors and software packages for it.) Once a persistent customer (or individual) ID system is in place, you can add up the numbers to create customer-level statistics, such as total orders, total dollars, and first and last order dates, as you see in the chart.

Remember R, F, M, P and C
The real fun begins when you combine these numeric summary figures with product, channel and other important categorical variables. Because product (or service) and channel are the most distinctive dividers of customer behaviors, let’s just add P and C to the famous RFM (remember, we are using RFM just as a checklist here), and call it R, F, M, P and C.

Product (rather, product category) is an important separator, as people often show completely different spending behavior for different types of products. For example, you can send me fancy-shmancy fashion catalogs all you want, but I won’t look at it with an intention of purchase, as most men will look at the models and not what they are wearing. So my active purchase history in the sports, home electronics or music categories won’t mean anything in the fashion category. In other words, those so-called “hotline” names should be treated differently for different categories.

Channel information is also important, as there are active online buyers who would never buy certain items, such as apparel or home furnishing products, without physically touching them first. For example, even in the same categories, I would buy guitar strings or golf balls online. But I would not purchase a guitar or a driver without trying them out first. Now, when I say channel, I mean the channel that the customer used to make the purchase, not the channel through which the marketer chose to communicate with him. Channel information should be treated as a two-way street, as no marketer “owns” a customer through a particular channel (refer to “The Future of Online is Offline“).

As an exercise, let’s go back to the basic RFM data and create some actual variables. For “each” customer, we can start with basic RFM measures, as exhibited in the chart:

· Number of Transactions
· Total Dollar Amount
· Number of Days (or Weeks) since the Last Transaction
· Number of Days (or Weeks) since the First Transaction

Notice that the days are counted from today’s point of view (practically the day the database is updated), as the actual date’s significance changes as time goes by (e.g., a day in February would feel different when looked back on from April vs. November). “Recency” is a relative concept; therefore, we should relativize the time measurements to express it.

From these basic figures, we can derive other related variables, such as:

· Average Dollar Amount per Customer
· Average Dollar Amount per Transaction
· Average Dollar Amount per Year
· Lifetime Highest Amount per Item
· Lifetime Lowest Amount per Transaction
· Average Number of Days Between Transactions
· Etc., etc…

Now, imagine you have all these measurements by channels, such as retail, Web, catalog, phone or mail-in, and separately by product categories. If you imagine a gigantic spreadsheet, the summarized table would have fewer numbers of rows, but a seemingly endless number of columns. I will discuss categorical and non-numeric variables in future articles. But for this exercise, let’s just imagine having these sets of variables for all major product categories. The result is that the recency factor now becomes more like “Weeks since Last Online Order”—not just any order. Frequency measurements would be more like “Number of Transactions in Dietary Supplement Category”—not just for any product. Monetary values can be expressed in “Average Spending Level in Outdoor Sports Category through Online Channel”—not just the customer’s average dollar amount, in general.

Why stop there? We may slice and dice the data by offer type, customer status, payment method or time intervals (e.g., lifetime, 24-month, 48-months, etc.) as well. I am not saying that all the RFM variables should be cut out this way, but having “Number of Transaction by Payment Method,” for example, could be very revealing about the customer, as everybody uses multiple payment methods, while some may never use a debit card for a large purchase, for example. All these little measurements become building blocks in predictive modeling. Now, too many variables can also be troublesome. And knowing the balance (i.e., knowing where to stop) comes from the experience and preliminary analysis. That is when experts and analysts should be consulted for this type of uniform variable creation. Nevertheless, the point is that RFM variables are not just three simple measures that happen be a part of the larger transaction data menu. And we didn’t even touch non-transaction based behavioral elements, such as clicks, views, miles or minutes.

The Time Factor
So, if such data summarization is so useful for analytics and modeling, should we always include everything that has been collected since the inception of the database? The answer is yes and no. Sorry for being cryptic here, but it really depends on what your product is all about; how the buyers would relate to it; and what you, as a marketer, are trying to achieve. As for going back forever, there is a danger in that kind of data hoarding, as “Life-to-Date” data always favors tenured customers over new customers who have a relatively short history. In reality, many new customers may have more potential in terms of value than a tenured customer with lots of transaction records from a long time ago, but with no recent activity. That is why we need to create a level playing field in terms of time limit.

If a “Life-to-Date” summary is not ideal for predictive analytics, then where should you place the cutoff line? If you are selling cars or home furnishing products, we may need to look at a 4- to 5-year history. If your products are consumables with relatively short purchase cycles, then a 1-year examination would be enough. If your product is seasonal in nature—like gardening, vacation or heavily holiday-related items, then you may have to look at a minimum of two consecutive years of history to capture seasonal patterns. If you have mixed seasonality or longevity of products (e.g., selling golf balls and golf clubs sets through the same store or site), then you may have to summarize the data with multiple timelines, where the above metrics would be separated by 12 months, 24 months, 48 months, etc. If you have lifetime value models or any time-series models in the plan, then you may have to break the timeline down even more finely. Again, this is where you may need professional guidance, but marketers’ input is equally important.

Analytical Sandbox
Lastly, who should be doing all of this data summary work? I talked about the concept of the “Analytical Sandbox,” where all types of data conversion, hygiene, transformation, categorization and summarization are done in a consistent manner, and analytical activities, such as sampling, profiling, modeling and scoring are done with proper toolsets like SAS, R or SPSS (refer to “It’s All About Ranking“). The short and final answer is this: Do not leave that to analysts or statisticians. They are the main players in that playground, not the architects or developers of it. If you are serious about employing analytics for your business, plan to build the Analytical Sandbox along with the team of analysts.

My goal as a database designer has always been serving the analysts and statisticians with “model-ready” datasets on silver platters. My promise to them has been that the modelers would spend no time fixing the data. Instead, they would be spending their valuable time thinking about the targets and statistical methodologies to fulfill the marketing goals. After all, answers that we seek come out of those mighty—but often elusive—algorithms, and the algorithms are made of data variables. So, in the interest of getting the proper answers fast, we must build lots of building blocks first. And no, simple RFM variables won’t cut it.

Cheat Sheet: Is Your Database Marketing Ready?

Many data-related projects end up as big disappointments. And, in many cases, it is because they did not have any design philosophy behind them. Because many folks are more familiar with buildings and cars than geeky databases, allow me to use them as examples here.

Many data-related projects end up as big disappointments. And, in many cases, it is because they did not have any design philosophy behind them. Because many folks are more familiar with buildings and cars than geeky databases, allow me to use them as examples here.

Imagine someone started constructing a building without a clear purpose. What is it going to be? An office building or a residence? If residential, for how many people? For a family, or for 200 college kids? Are they going to just eat and sleep in there, or are they going to engage in other activities in it? What is the budget for development and ongoing maintenance?

If someone starts building a house without answering these basic questions, well, it is safe to say that the guy who commissioned such a project is not in the right state of mind. Then again, he may be a filthy rich rock star with some crazy ideas. But let us just say that is an exceptional case. Nonetheless, surprisingly, a great many database projects start out exactly this way.

Just like a house is not just a sum of bricks, mortar and metal, a database is not just a sum of data, and there has to be design philosophy behind it. And yet, many companies think that putting all available data in one place is just good enough. Call it a movie without a director or a building without an architect; you know and I know that such a project cannot end well.

Even when a professional database designer gets involved, too often the project goes out of control—as the business requirement document ends up being a summary of
everyone’s wish lists, without any prioritization or filtering. It is a case of a movie without a director. The goal becomes something like “a database that stores all conceivable marketing, accounting and payment activities, handling both prospecting and customer relationship management through all conceivable channels, including face-to-face sales and lead management for big accounts. And it should include both domestic and international activities, and the update has to be done in real time.”

Really. Someone in that organization must have attended a database marketing conference recently to get all that listed. It might be simpler and cheaper building a 2-ton truck that flies. But before we commission something like this from the get-go, shall we discuss why the truck has to fly, too? For one, if you want real-time updates, do you have a business case for it? (As in, someone in the field must make real-time decisions with real-time data.) Or do you just fancy a large object, moving really fast?

Companies that primarily sell database tools often do not help the matter, either. Some promise that the tool sets will categorize all kinds of input data, based on some auto-generated meta-tables. (Really?) The tool will clean the data automatically. (Is it a self-cleaning oven?) The tool will establish key links (by what?), build models on its own (with what target data?), deploy campaigns (every Monday?), and conduct result analysis (with responses from all channels?).

All these capabilities sound really wonderful, but does that system set long- and short-term marketing goals for you, too? Does it understand the subtle nuances in human behaviors and intentions?

Sorry for being a skeptic here. But in such cases, I think someone watched “Star Trek” too much. I have never seen a company that does not regret spending seven figures on a tool set that was supposed to do everything. Do you wonder why? It is not because such activities cannot be automated, but because:

  1. Machines do not think for us (not quite yet); and
  2. Such a system is often very expensive, as it needs to cover all contingencies (the opposite of “goal-oriented” cheaper options).

So it becomes nearly impossible to justify the cost with incremental improvements in marketing efficiency. Even if the response rates double, all related marketing costs go down by a quarter, and revenue jumps up by 200 percent, there are not many companies that can easily justify that kind of spending.

Worse yet, imagine that you just paid 10 times more for some factory-made suit than you would have paid for a custom-made Italian suit. Since when is an automated, cookie-cutter answer more desirable than custom-tailored ones? Ever since computing and storage costs started to go down significantly, and more so in this age of Big Data that has an “everything, all the time” mentality.

But let me ask you again: Do you really have a marketing database?

Let us just say that I am a car designer. A potential customer who has been doing a lot of research on the technology front presents me with a spec for a vehicle that is as big as a tractor-trailer and as quick as a passenger car. I guess that someone really needs to move lots of stuff, really fast. Now, let us assume that it will cost about $8 million or more to build a car like that, and that estimate is without the rocket booster (ah, my heart breaks). If my business model is to take a percentage out of that budget, I would say, “Yeah sure, we can build a car like that for you. When can we start?”

But let us stop for a moment and ask why the client would “need” (not “want”) a car like that in the first place. After some user interviews and prioritization, we may collectively conclude that a fleet of full-size vans can satisfy 98 percent of the business needs, saving about $7 million. If that client absolutely and positively has to get to that extra 2 percent to satisfy every possible contingency in his business and spend that money, well, that is his prerogative, is it not? But I have to ask the business questions first before initiating that inevitable long and winding journey without a roadmap.

Knowing exactly what the database is supposed to be doing must be the starting point. Not “let’s just gather everything in one place and hope to God that some user will figure something out eventually.” Also, let’s not forget that constantly adding new goals in any phase of the project will inevitably complicate the matter and increase the cost.

Conversely, repurposing a database designed for some other goal will cause lots of troubles down the line. Yeah, sure. Is it not possible to move 100 people from A to B with a 2-seater sports car, if you are willing to make lots of quick trips and get some speeding tickets along the way? Yes, but that would not be my first recommendation. Instead, here are some real possibilities.

Databases support many different types of activities. So let us name a few:

  • Order fulfillment
  • Inventory management and accounting
  • Contact management for sales
  • Dashboard and report generation
  • Queries and selections
  • Campaign management
  • Response analysis
  • Trend analysis
  • Predictive modeling and scoring
  • Etc., etc.

The list goes on, and some of the databases may be doing fine jobs in many areas already. But can we safely call them “marketing” databases? Or are marketers simply tapping into the central data depository somehow, just making do with lots of blood, sweat and tears?

As an exercise, let me ask a few questions to see if your organization has a functioning marketing database for CRM purposes:

  • What is the average order size per year for customers with tenure of more than one year? —You may have all the transaction data, but maybe not on an individual level in order to know the average.
  • What is the number of active and dormant customers based on the last transaction date? —You will be surprised to find out that many companies do not know exactly how many customers they really have. Beep! 1 million-“ish” is not a good answer.
  • What is the average number of days between activities for each channel for each customer? —With basic transaction data summarized “properly,” this is not a difficult question to answer. But it’s very difficult if there are divisional “channel-centric” databases scattered all over.
  • What is the average number of touches through all channels that you employ before your customer reaches the projected value potential? —This is a hard one. Without all the transaction and contact history by all channels in a “closed-loop” structure, one cannot even begin to formulate an answer for this one. And the “value potential” is a result of statistical modeling, is it not?
  • What are typical gateway products, and how are they correlated to other product purchases? —This may sound like a product question, but without knowing each customer’s purchase history lined up properly with fully standardized product categories, it may take a while to figure this one out.
  • Are basic RFM data—such as dollars, transactions, dates and intervals—routinely being used in predictive models? —The answer is a firm “no,” if the statisticians are spending the majority of their time fixing the data; and “not even close,” if you are still just using RFM data for rudimentary filtering.

Now, if your answer is “Well, with some data summarization and inner/outer joins here and there—though we don’t have all transaction records from last year, and if we can get all the campaign histories from all seven vendors who managed our marketing campaigns, except for emails—maybe?”, then I am sorry to inform you that you do not have a marketing database. Even if you can eventually get to the answer if some programmer takes two weeks to draw a 7-page flow chart.

Often, I get extra comments like “But we have a relational database!” Or, “We stored every transaction for the past 10 years in Hadoop and we can retrieve any one of them in less than a second!” To these comments, I would say “Congratulations, your car has four wheels, right?”

To answer the important marketing questions, the database should be organized in a “buyer-centric” format. Going back to the database philosophy question, the fundamental design of the database changes based on its main purpose, much like the way a sports sedan and an SUV that share the same wheel base and engine end up shaped differently.

Marketing is about people. And, at the center of the marketing database, there have to be people. Every data element in the base should be “describing” those people.

Unfortunately, most relational databases are transaction-, channel- or product-centric, describing events and transactions—but not the people. Unstructured databases that are tuned primarily for massive storage and rapid retrieval may just have pieces of data all over the place, necessitating serious rearrangement to answer some of the most basic business questions.

So, the question still stands. Is your database marketing ready? Because if it is, you would have taken no time to answer my questions listed above and say: “Yeah, I got this. Anything else?”

Now, imagine the difference between marketers who get to the answers with a few clicks vs. the ones who have no clue where to begin, even when sitting on mounds of data. The difference between the two is not the size of the investment, but the design philosophy.

I just hope that you did not buy a sports car when you needed a truck.