Resistance Is Futile

Any serious Trekkie would immediately recognize this title. But I am not talking about the Borg, who are coming to assimilate us into their hive-minded collective. I am talking about a rather benign-sounding subject — and my profession — analytics.

When you look at job descriptions of analytics leads in various organizations, you will often find the word “evangelization.” If every stakeholder were a believer in analytics, we would not need such a word to describe the position at all. We use that word because an important part of an analytics lead’s job is to convert non-believers to believers. And that is often the hardest part of our profession.

I smile when I see memes (or T-shirts) like “Science doesn’t care about your beliefs.” I’m sure some geek who got frustrated by the people who treat scientific facts as just an opinion came up with this phrase. From their point of view, it may be shocking to realize that scientifically proven facts can be disputed by people without any scientific training. But that is just human nature; most really don’t want to change either their beliefs or their behaviors.

Now, without being too political about this whole subject, I must confess that I face resistance to change all of the time in business environments, too. Why is that? How did making decisions based on numbers and figures become something to resist?

My first guess is that people do not like even remotely complicated stuff. Maybe the word “analytics” or talk of “modeling” brings back all of the childhood memories of their scary math teachers. Maybe that kind of headache is so bad that some would reject things that could actually be helpful to them.

If the users of information feel that way, analysts must aspire to make analytics easier to consume and digest. Customers are not always right, but without the consumers of information, all analytical activities become meaningless — at least in non-academic places.

An 80-page report filled with numbers and figures dumped on someone’s desk should not even be called analytics. Literally, that’s still an extension of an unfiltered data dump. Analysts should never leave the most important part of the job — deriving insights out of mounds of data — to the end-users of analytics. True, the answer may lie somewhere in that pile, but that is like a weather forecaster listing all of the input variables to the general public without providing any useful information. Hey, is it going to rain this morning, or what?

I frequently talk about this issue with fellow analytics professionals. Even in organizations with more advanced data and analytics infrastructure, heads of analytics often worry about low acceptance of data-based decision-making. In many instances, the size of the data and how smoothly it flows, often measured in terabytes per second as a bragging point, do not really matter.

Information should be in nugget-sizes for easy consumption (refer to “Big Data Must Get Smaller”). Mining the data to come up with fewer than five bullet points is the hardest part, and should not be left to the users. That is the primary reason why fewer and fewer people are talking about “Big Data” nowadays, as even non-data professionals are waking up to realize that “big” is not the answer at all.

However, resistance to analytics doesn’t disappear, even when data are packaged in beautifully summarized reports or model scores. That is because often, the results of analytics uncover an inconvenient truth for many stakeholders — as in, “Dang, we’ve been doing it wrong all of this time?”

If a person or a department is called out as ineffective by some analytical geeks, I can see how involved parties may want to dispute the results any which way they can. Who cares about the facts when their jobs or reputations are at stake? Even if their jobs are safe, who are these analytics guys asking us to “change”? That is not any different from cases where cigarette companies disputed in the past that smoking was harmful, and oil and gas companies have an allergic reaction when the words “climate” and “change” are uttered together in the present day.

I’ve seen cases where analytical departments were completely decimated because their analytics revealed other divisions’ shortcomings and caused big political hoopla. Maybe the analysts should have had better bedside manners; but in some cases I’ve heard about, that didn’t even matter — as the big boss used the results of analytics to scold people who were just doing their jobs based on an old set of rules.

You can guess the outcome of that kind of political struggle. The lesson is that newly discovered “facts” should never be used to blame the followers of existing paradigms. Such reactions from the top will further alienate analytics from the rest of the company, as people get genuinely scared of it. Adoption of data-based decision-making? Not when people are afraid of the truth. Forget about the good of the company; that will never win vs. people’s desire for their job security.

Now, at the opposite end of the spectrum, too much unfiltered information forcing decisions can also hurt the organization. Some may call that “Death by KPI.” When there are too many indicators floating around, even seemingly sound decisions based on numbers and figures may lead to unintended consequences, very often hurting the overall performance of the company. The question is always, “Which variable should get higher weight over others?” And that type of prioritization comes from clearly defined business goals. When all KPIs are treated as equally important? Then nothing really is. Not in this complex world.

Misguided interpretation of numbers leads to distrust in analytics. Just because someone quoted an interesting figure, with or without proper context, that doesn’t mean that there is just one version of an explanation behind it. Contextual understanding of data is the key to beneficial insights, and in the age of abundant information, even casual users of analytics must understand the differences. Running away from it is not the answer. Blindly driving the business just based on certain indicators should be avoided, as well. Both extremes will turn out to be harmful.

Nevertheless, the No. 1 reason why people do not adopt analytics is that many have gotten burned by “wrong” analytics in the past, often by the posers (refer to “Don’t Hire Data Posers”). In some circles, the reputation of analytics got so bad that I even met a group of executives who boldly claimed that the whole practice of statistical modeling was totally bogus and it just didn’t work. Jeez. In the age of machine learning, one doesn’t believe in modeling at all? What do you think that “learning” is based on?

No matter how much data we may have in our custody, we use modeling techniques to predict the future, derive answers out of seemingly disjointed data and fill in the gaps in data — as we will never have every piece of the puzzle nicely lined up all of the time.

In a case of such deep mistrust in basic activities like modeling, I definitely blame the analysts of the past. Maybe those posers overpromised about what models could do. (No, nothing in analytics happens overnight.) Maybe they aimed for the wrong target. Maybe they didn’t clean the data enough before plugging them into some off-the-shelf modeling engine. Maybe they didn’t properly apply the model to real-life situations, and left the building. No matter. It is their fault if the users didn’t receive a clear benefit from analytical exercises.

I often tell analysts and data scientists that analytics is not about the data journey that they embarked on or the mathematical adventure that they dove into. In the business world, it is about the bottom line. Did the report in question or model in action lead to an increase in revenue or a reduction in cost? It is really that clear-cut.

So, dear data geeks, please spare the rest of the human collective from technical details, and get to the point fast. Talking about the sample size or arguing about the merits of neural net models (unless the users are as geeky as you) will only further alienate decision-makers from analytics.

And the folks who think they can still rely on their gut feelings over analytics? Resistance to analytics is indeed futile. You must embrace it — not for all of the buzzwords uttered by the posers out there, but for the survival of your business. When your competitors are embracing advanced analytics, what are you going into the battle with? More unsolicited emails without targeting or personalization? Without knowing what elements of promotions are the key drivers of responses? Without even basic behavioral profiles of your own customer base? Not in this century. Not when consumers are as informed as marketers.

One may think that this whole analytics thing is overly hyped-up. Maybe. But definitely not as much as someone’s gut feelings or so-called business instincts. If analytics didn’t work for you in the past, find ways to make it work. Avoiding it certainly isn’t the answer.

Resistance is futile.

How I Leveraged My 5-Year-Old to Prepare for AI

Over the span of my career, I have had opportunities to mentor future data-driven business leaders. The advice I used to give primarily revolved around the hottest analytical tools and certifications and how to tell stories through data. Five years ago, however, my advice evolved in a very dramatic way, based on a reasonably benign event.

My wife, our two daughters and I were on a multi-state road trip. Early on, we decided to make a pit stop. My wife gave the girls $5 each to buy goodies for the road — with no conditions. Unleashed from the shackles of healthy snacking, my older daughter set about making the most of her newfound economic freedom. Analytically inclined, she began optimizing for the right combination of quantity, quality and taste that would provide her with the maximum overall satisfaction. My younger daughter (five years old at the time) quickly picked up her favorite fruit candy, asked my wife for a suggestion and purchased that, as well. Eager to get back on the road, I asked my older daughter to finalize her decision quickly. My request was met with a look of sheer horror and frustration as she frantically searched for the optimal basket of goods that $5 would buy. Hoping that the optimal solution was only minutes away, she begged for more time, to no avail.

Back on the road, my younger daughter offered my wife a substantial portion of the candy she had recommended. Astonished, my wife said, “Sweetie, if you share that with me, you will have less for the trip.” To which my daughter replied, “That’s okay, Mom. I know you like these candies. Can I have another five dollars?” To which my wife uncharacteristically replied: “Of course!” Shocked at this turn of events, my older daughter protested, “What? No fair, you can do that!?”

Data Is an Equal Opportunity Enabler

I often think about that incident, especially when I am trying to help clients achieve better results through analytics. This incident is a great allegorical example of why data-driven decisions, when done well, can improve specific results, but many times fail to change the overall game. A 2015 study by KPMG identified operational efficiencies as the primary beneficiary of data and analytics in the near term, and a more recent study in HBR also confirms that most data and analytics success is still focused on low-hanging operational opportunities. In both reports, business leaders also recognize the transformational opportunities of data and analytics. However, they also identify an acute need for new and unique skill sets to make those transformational changes a reality.

This brings me back to the car ride. Before you assume this is a lesson about how customer empathy beats algorithms, I can assure you it is not. Not only has my younger daughter’s strategy failed on several other occasions, but I have also seen plenty of well-researched market advice from customer-centric strategy firms fail, as well. Nor do I believe this anecdote implies optimization leads to strategic myopia. (This is also not about which kid I am betting on, as they both manage to amaze and worry me in equal doses.) Instead, the lesson for me is that while analytical rigor can be foundational to disruptive innovation, the optimal solutions algorithms provide only reflect the audacity of the optimizer’s vision.

The body of recent research on successful disruptors dispels the belief that they are solely the product of a brilliant idea conceived by a highly intuitive visionary. Instead, their very existence is often an optimization exercise involving many experiments. Not only do successful new entrants go through many failed iterations, but they also emerge through the crucible of other competing ventures with similar industry-disrupting objectives. Once emerged and unleashed, there is still no guarantee that the new ventures are the absolutely optimal solution. One needs only to think of MySpace, AOL and Yahoo if there is doubt. As a result, the body of knowledge on innovation is now focusing on the concept of failing fast, failing early and failing often. A critical component of the “failing for success” strategy involves testing, measuring, and optimizing rapidly and regularly, but it also involves having a broad view of the playing field and the bravery to challenge existing assumptions.

AI Whisperers Wanted

The career implications of these trends for data-driven talent are significant. As analytics takes a central role in strategic business functions, it does not necessarily mean that my fellow quant jocks will rule the future. This is because traditional optimization algorithms are just beginning to transition into artificial intelligence-based solutions with the ability to learn on their own, and at some point human talent will no longer be needed to build models. If you are in analytics today, it will be important to keep up with the evolution of AI solutions, but even more critical is developing your analytical creativity and bravery.

Evangelizing Analytics Through Baseball

A great many people are simply allergic to mathematics. Maybe even more so than to public speaking. They just hate the subject and the very thought of it gives them a big headache. In many cases though, I just have to blame their math teachers in their youth for not providing enough appreciation for the subject, as the same people have no trouble understanding baseball stats of their favorite teams and players.

What do you think that ERA (Earned Run Average) stands for? It is nothing but an index value made of a numerator and a denominator, multiplied by a factor. If you can paint the quality of a baseball pitcher with a bunch of statistics and indices like that, yes, you do have a basic aptitude to be analytical. Maybe not enough to be a professional analyst, but enough to be a consumer of analytics.
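To make the "numerator, denominator and factor" point concrete, here is a minimal sketch of the standard ERA calculation: earned runs divided by innings pitched, multiplied by nine (the innings in a regulation game). The function name and the sample pitcher line are made up for illustration, and innings are assumed to be passed as a plain decimal number rather than in the scorebook's thirds notation.

```python
def earned_run_average(earned_runs: int, innings_pitched: float,
                       innings_per_game: int = 9) -> float:
    """Return ERA: earned runs allowed per full game's worth of innings."""
    if innings_pitched <= 0:
        raise ValueError("innings_pitched must be positive")
    # Numerator (earned runs) over denominator (innings), times a factor (9).
    return innings_per_game * earned_runs / innings_pitched


# Hypothetical season line: 60 earned runs allowed over 180 innings.
era = earned_run_average(earned_runs=60, innings_pitched=180.0)
print(f"ERA: {era:.2f}")
```

Anyone who can read that number and conclude "three runs a game, pretty solid" is already consuming an index value the way an analytics user would.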

In the near future, that kind of basic aptitude may be all we need to navigate through this complex world woven from numbers and figures originating from humans, machines and the networks that connect them all.

All the headachy equational problems will be taken care of by the smart machines anyway, right? Maybe.

The way this old analyst sees it, the answer could be yes or no, because there is no way for any machine to provide good answers to illogical questions (as Mr. Spock would point out). And the logical mind comes from mathematical training — with a little help from the DNA with which the subjects were born.

Why do I worry about such things now? Simple. I see too many decision-makers who say they must get more into analytics, yet their behaviors say otherwise. It is unfortunate for them, as the verdict is already in on the effectiveness of good analytics. In fact, the question is no longer about whether an organization must embrace more analytics-based decision-making processes, but about how deeply it must get into them. The winners and losers in the business world will not solely be determined by the business models, but by the effectiveness of execution, enhanced and measured by analytics. Gut feelings may have worked well for many in the beginning of the last century, but that won’t be enough when competitors are armed with data and analytical toolsets.

We are undoubtedly living in a complex world now. The differences between people who freely wield technology and toolsets and people who are afraid of the changes won’t be just income levels; they may even form new social classes. And yet, the way many are dealing with perceived and real challenges isn’t much different from the past era. No, you can’t just work hard and hope that everything will be alright. There are people who see what is coming before anyone else does. Not all may be able to “see” it completely per se, but at least some predict the future better than others. And those who do properly employ predictive analytics will clearly have an edge over those who don’t.

Then, why is it so difficult to “sell” analytics? There are many reasons. The first one, I think, is the fear of the unknown (or unfamiliar territory). People hate to spend money and resources on the things that they don’t understand. The majority of the population does not understand how the internal combustion engine works, so the car companies sell coolness and other perceived benefits of their products.

Unfortunately, there is absolutely nothing sexy — for the general population, not for the geeks — about algorithms that may increase sales and reduce costs. So, the analysts must try to emphasize the benefits of it all; yet too many fall into the trap of believing that everyone will appreciate the beauty of the solution they agonized over. Well, most people simply don’t care for the details. That is why engineers don’t sell cars, but salespeople do. Analysts must get to the point fast; possibly within a minute, as most don’t have the patience for anyone’s mathematical journey.

Another reason why selling the concept of analytics is difficult is collective resistance to change. For most people, change is scary; and even if it’s not, it’s terribly inconvenient. All organizations and people in them are accustomed to some existing ways of doing their businesses. Analytics inevitably brings changes in existing behaviors. It may start with a simple request for some extra data collection (“What do you mean I have to enter more data into my Salesforce account?”), and ultimately move on to changes in decision-making processes (“Oh, now I have to look at those model scores while I’m on the phone with the customer?”).

Patients Aren’t Ready for Treatment?

In my job of being “a guy who finds money-making opportunities using data,” I get to meet all kinds of businesspeople in various industries. Thanks to the business trend around analytics (and to that infamous “Big Data” fad), I don’t have to spend a long time explaining what I do any more; I just say I am in the field of analytics, or to sound a bit fancier, I say data science. Then most marketers seem to understand where the conversation will go from there. Things are never that simple in real life, though, as there are many types of analytics — business intelligence, descriptive analytics, predictive analytics, optimization, forecasting, etc., even at a high level — but figuring out what type of solutions should be prescribed is THE job for a consultant, anyway (refer to “Prescriptive Analytics at All Stages”).

The key to an effective prescription is to listen to the client first. Why do they lose sleep at night? What are their key success metrics? What are the immediate pain points? What are their long-term goals? And how would we get there within the limits of provided resources and put out the fire at the same time? Building a sound data and analytics roadmap is critical, as no one wants to have an “Oh dang, we should have done that a year ago!” moment after a complex data project is well on its way. Reconstruction in any line of business is costly, and unfortunately, it happens all of the time, as many marketers and decision-makers often jump into the data pool out of desperation under organizational pressure (or under false promises by toolset providers, as in “all your dreams will come true with this piece of technology”). It is a sad sight when users realize that they don’t know how to swim in it “after” they jumped into it.

Why does that happen all of the time? At the risk of sounding like a pompous doctor, I must say that it is quite often the patient’s fault, too; there are lots of bad patients. When it comes to the data and analytics business, not all marketers are experts in it, though some are. Most do have a mid-level understanding, and they actually know when to call for help. And there are complete novices, too. Now, regardless of their understanding level, bad patients are the ones who show up with self-prescribed solutions, and won’t hear about any other options or precautions. Once, I even met a client who demanded a neural-net model right after we exchanged pleasantries. My response? “Whoa, hold your horses for a minute here, why do you think that you need one?” (Though I didn’t quite say it like that.) Maybe you just came back from some expensive analytics conference, but can we talk about your business case first? After that conversation, I could understand why doctors wouldn’t appreciate patients who would trust WebMD over living, breathing doctors who are in front of them.

Then there are opposite types of cases, too. Some marketers are so insecure about the state of their data assets (or their level of understanding) that they wouldn’t even want to hear about any solutions that sound even remotely complex or difficult, although they may be in desperate need of them. A typical response is something like “Our datasets are so messy that we can’t possibly entertain anything statistical.” You know what that sounds like? It sounds like a patient refusing any surgical treatment in an ER because “he” is not ready for it. No, doctors should be ready to perform the surgery, not the patient.

Messy datasets are surely no excuse for not taking the right path. If we had to wait for a perfect set of data all of the time, there wouldn’t be any need for statisticians or data scientists. In fact, we need such specialists precisely because most data sets are messy and incomplete, and they need to be enhanced by statistical techniques.

Analytics is about making the best of what we have. Cleaning dirty and messy data is part of the job, and should never be an excuse for not doing the right thing. If anyone assumes that simple reports don’t require data cleansing steps because the results look simple, nothing could be further from the truth. Most reporting errors stem from dirty data, and most datasets — big or small, new or old — are not ready to be just plugged into analytical engines.

Besides, different types of analytics are needed because there are so many variations of business challenges, and no analytics is supposed to happen in some preset order. In other words, we get into predictive modeling because the business calls for it, not because a marketer finished some basic Reporting 101 class and now wants to move on to an Analytics 202 course. I often argue that deriving insights out of a series of simple reports could be a lot more difficult than building models or complex data management. Conversely, regardless of the sophistication level, marketers are not supposed to get into advanced analytics just for intellectual curiosity. Every data and analytics activity must be justified with business purposes, carefully following the strategic data roadmap, not the difficulty level of the task.

How to Outsource Analytics

In this series, I have been emphasizing the importance of statistical modeling in almost every article. While there are plenty of benefits of using statistical models in a more traditional sense (refer to “Why Model?”), in the days when “too much” data is the main challenge, I would dare to say that the most important function of statistical models is that they summarize complex data into simple-to-use “scores.”

The next important feature would be that models fill in the gaps, transforming “unknowns” to “potentials.” You see, even in the age of ubiquitous data, no one will ever know everything about everybody. For instance, out of 100,000 people you have permission to contact, only a fraction will be “known” wine enthusiasts. With modeling, we can assign scores for “likelihood of being a wine enthusiast” to everyone in the base. Sure, models are not 100 percent accurate, but I’ll take “70 percent chance of afternoon shower” over not knowing the weather forecast for the day of the company picnic.
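As a rough illustration of turning "unknowns" into "potentials," the sketch below assigns every contact a "likelihood of being a wine enthusiast" score with a logistic function. Everything specific here is hypothetical: the feature names, the coefficients and the three sample profiles are invented for the example, and a real model would estimate its coefficients from the "known" enthusiasts in the base rather than hard-code them.

```python
import math

# Hypothetical coefficients, standing in for values a fitted model would supply.
INTERCEPT = -3.0
WEIGHTS = {
    "gourmet_purchases": 0.8,          # count of gourmet-food orders
    "restaurant_spend_hundreds": 0.5,  # dining spend, in hundreds of dollars
    "owns_wine_fridge": 2.0,           # 1 if known to own one, else 0
}


def enthusiast_score(profile: dict) -> float:
    """Return a 0-to-1 likelihood score; missing attributes count as zero."""
    z = INTERCEPT + sum(w * profile.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))  # logistic function


# Made-up contacts: nobody here is a "known" enthusiast, yet everyone gets a score.
base = {
    "A": {"gourmet_purchases": 3, "restaurant_spend_hundreds": 2, "owns_wine_fridge": 1},
    "B": {"restaurant_spend_hundreds": 1},
    "C": {},
}

scores = {contact: enthusiast_score(profile) for contact, profile in base.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
for contact in ranked:
    print(contact, round(scores[contact], 2))
```

The point is not the arithmetic but the output: one simple number per person, including the people about whom almost nothing is known.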

I’ve already explained other benefits of modeling in detail earlier in this series, but if I may cut it really short, models will help marketers:

1. In deciding whom to engage, as they cannot afford to spam the world and annoy everyone who can read, and

2. In determining what to offer once they decide to engage someone, as consumers are savvier than ever and they will ignore and discard any irrelevant message, no matter how good it may look.

OK, then. I hope you are sold on this idea by now. The next question is, who is going to do all that mathematical work? In a country where jocks rule over geeks, it is clear to me that many folks are more afraid of mathematics than public speaking; which, in its own right, ranks higher than death in terms of the fear factor for many people. If I may paraphrase “Seinfeld,” many folks are figuratively more afraid of giving a eulogy than being in the coffin at a funeral. And thanks to a sub-par math education in the U.S. (and I am not joking about this, having graduated high school on foreign soil), yes, the fear of math tops them all. Scary, heh?

But that’s OK. This is a big world, and there are plenty of people who are really good at mathematics and statistics. That is why I purposefully never got into the mechanics of modeling techniques and related programming issues in this series. Instead, I have been emphasizing how to formulate questions, how to express business goals in a more logical fashion and where to invest to create analytics-ready environments. Then the next question is, “How will you find the right math geeks who can make all your dreams come true?”

If you have a plan to create an internal analytics team, there are a few things to consider before committing to that idea. Too many organizations just hire one or two statisticians, dump all the raw data onto them, and hope to God that they will figure out some way to make money with data, somehow. Good luck with that idea, as:

1. I’ve seen so many failed attempts like that (actually, I’d be shocked if it actually worked), and

2. I am sure God doesn’t micromanage statistical units.

(Similarly, I am almost certain that she doesn’t care much for football or baseball scores of certain teams, either. You don’t think God cares more for the Red Sox than the Yankees, do ya?)

The first challenge is locating good candidates. If you post any online ad for “Statistical Analysts,” you will receive a few hundred resumes per day. But the hiring process is not that simple, as you should ask the right questions to figure out who is the real deal, and who is a poser (and there are many posers out there). Even among qualified candidates with ample statistical knowledge, there are differences between the “Doers” and “Vendor Managers.” Depending on your organizational goal, you must differentiate the two.

Then the next challenge is keeping the team intact. In general, mathematicians and statisticians are not solely motivated by money; they also want constant challenges. Like any smart and creative folks, they will simply pack up and leave, if “they” determine that the job is boring. Just a couple of modeling projects a year with some rudimentary sets of data? Meh. Boring! Promises of upward mobility only work for a fraction of them, as the majority would rather deal with numbers and figures, showing no interest in managing other human beings. So, coming up with interesting and challenging projects, which will also benefit the whole organization, becomes a job in itself. If there are not enough challenges, smart ones will quit on you first. Then they need constant mentoring, as even the smartest statisticians will not know everything about challenges associated with marketing, target audiences and the business world, in general. (If you stumble into a statistician who is even remotely curious about how her salary is paid for, start with her.)

Further, you would need to invest to set up an analytical environment, as well. That includes software, hardware and other supporting staff. Toolsets are becoming much cheaper, but they are not exactly free yet. In fact, some famous statistical software, such as SAS, could be quite expensive year after year, although there are plenty of alternatives now. And they need an “analytics-ready” data environment, as I emphasized countless times in this series (refer to “Chicken or the Egg? Data or Analytics?” and “Marketing and IT; Cats and Dogs”). Such data preparation work is not for statisticians, and most of them are not even good at cleaning up dirty data, anyway. That means you will need different types of developers/programmers on the analytics team. I pointed out that analytical projects call for a cohesive team, not some super-duper analyst who can do it all (refer to “How to Be a Good Data Scientist”).

By now you would say “Jeez Louise, enough already,” as all this is just too much to manage to build just a few models. Suddenly, outsourcing may sound like a great idea. Then you would realize there are many things to consider when outsourcing analytical work.

First, where would you go? Everyone in the data industry and their cousins claim that they can take care of analytics. But in reality, it is a scary place where many who have “analytics” in their taglines do not even touch “predictive analytics.”

Analytics is a word that is abused as much as “Big Data,” so we really need to differentiate them. “Analytics” may mean:

  • Business Intelligence (BI) Reporting: This is mostly about the present, such as the display of key success metrics and dashboard reporting. While it is very important to know about the current state of business, much of so-called “analytics” unfortunately stops right here. Yes, it is good to have a dashboard in your car now, but do you know where you should be going?
  • Descriptive Analytics: This is about how the targets “look.” Common techniques such as profiling, segmentation and clustering fall under this category. These techniques are mainly for describing the target audience to enhance and optimize messages to them. But using these segments as a selection mechanism is not recommended, while many dare to do exactly that (more on this subject in future articles).
  • Predictive Modeling: This is about answering the questions about the future. Who would be more likely to behave certain ways? What communication channels will be most effective for whom? How much is the potential spending level of a prospect? Who is more likely to be a loyal and profitable customer? What are their preferences? Response models, various types of cloning models, value models, revenue models, attrition models, etc. all fall under this category, and they require hardcore statistical skills. Plus, as I emphasized earlier, these model scores compact large amounts of complex data into nice bite-size packages.
  • Optimization: This is mostly about budget allocation and attribution. Marketing agencies (or media buyers) generally deal with channel optimization and spending analysis, at times using econometrics models. This type of statistical work calls for different types of expertise, but many still insist on calling it simply “analytics.”

Let’s say that for the purpose of customer-level targeting and personalization, we decided to outsource the “predictive” modeling projects. What are our options?

We may consider:

  • Individual Consultants: In-house consultants are dedicated to your business for the duration of the contract, guaranteeing full access like an employee. But they are there for you only temporarily, with one foot out the door all the time. And when they do leave, all the knowledge walks away with them. Depending on the rate, the costs can add up.
  • Standalone Analytical Service Providers: Analytical work is all they do, so you get focused professionals with broad technical and institutional knowledge. Many of them are entrepreneurs, but that may work against you, as they could often be understaffed and stretched thin. They also tend to charge for every little step, with not many freebies. They are generally open to using any type of data, but the majority of them do not have secure sources of third-party data, which could be essential for certain types of analytics involving prospecting.
  • Database Service Providers: Almost all data compilers and brokers have statistical units, as they need to fill in the gap within their data assets with statistical techniques. (You didn’t think that they knew everyone’s income or age, did you?) For that reason, they have deep knowledge in all types of data, as well as in many industry verticals. They provide a one-stop shop environment with deep resource pools and a variety of data processing capabilities. However, they may not be as agile as smaller analytical shops, and analytics units may be tucked away somewhere within large and complex organizations. They also tend to emphasize the use of their own data, as after all, their main cash cows are their data assets.
  • Direct Marketing Agencies: Agencies are very strategic, as they touch all aspects of marketing and control creative processes through segmentation. Many large agencies boast full-scale analytical units, capable of all types of analytics that I explained earlier. But some agencies have very small teams, stretched really thin—just barely handling the reporting aspect, not any advanced analytics. Some just admit that predictive analytics is not part of their core competencies, and they may outsource such projects (not that it is a bad thing).

As you can see here, there is no clear-cut answer to “with whom should you work?” Basically, you will need to check out all types of analysts and service providers to determine the partner best suited to your long- and short-term business purposes, not just analytical goals. Often, many marketers just go with the lowest bidder. But pricing is just one of many elements to be considered. Here, allow me to introduce “10 Essential Items to Consider When Outsourcing Analytics.”

1. Consulting Capabilities: I put this on the top of the list, as being a translator between the marketing and the technology world is the most important differentiator (refer to “How to Be a Good Data Scientist”). They must understand the business goals and marketing needs, prescribe suitable solutions, convert such goals into mathematical expressions and define targets, making the best of available data. If they lack strategic vision to set up the data roadmap, statistical knowledge alone will not be enough to achieve the goals. And such business goals vary greatly depending on the industry, channel usage and related success metrics. Good consultants always ask questions first, while sub-par ones will try to force-fit marketers’ goals into their toolsets and methodologies.

Translating marketing goals into specific courses of action is a skill in itself. A good analytical partner should be capable of building a data roadmap (not just statistical steps) with a deep understanding of the business impact of resultant models. They should be able to break down larger goals into smaller steps, creating proper phased approaches. The plan may call for multiple models, all kinds of pre- and post-selection rules, or even external data acquisition, while remaining sensitive to overall costs.

The target definition is the core of all these considerations, which requires years of experience and industry knowledge. Simply, the wrong or inadequate targeting decision leads to disastrous results, no matter how sound the mathematical work is (refer to “Art of Targeting”).

Another important quality of a good analytical partner is the ability to create usefulness out of seemingly chaotic and unstructured data environments. Modeling is not about waiting for the perfect set of data, but about making the best of available data. In many modeling bake-offs, the winners are often decided by the creative usage of provided data, not just statistical techniques.

Finally, the consultative approach is important, as models do not exist in a vacuum, but they have to fit into the marketing engine. Beware of the ones who want to change the world around their precious algorithms, as they are geeks, not strategists. And the ones who understand the entire marketing cycle will give advice on what the next phase should be, as marketing efforts must be perpetual, not transient.

So, how will you find consultants? Ask the following questions:

  • Are they “listening” to you?
  • Can they repeat “your” goals in their own words?
  • Do their roadmaps cover both short- and long-term goals?
  • Are they confident enough to correct you?
  • Do they understand “non-statistical” elements in marketing?
  • Have they “been there, done that” for real, or just in theories?

2. Data Processing Capabilities: I know that some people look down upon the word “processing.” But data manipulation is the most important step “before” any type of advanced analytics even begins. Simply, “garbage in, garbage out.” And unfortunately, most datasets are completely unsuitable for analytics and modeling. In general, easily more than 80 percent of model development time goes into “fixing” the data, as most are unstructured and unrefined. I have been repeatedly emphasizing the importance of a “model-ready” (or “analytics-ready”) environment for that reason.

However, the reality dictates that the majority of databases are indeed NOT model-ready, and most of them are not even close to it. Well, someone has to clean up the mess. And in this data business, the last one who touches the dataset becomes responsible for all the errors and mistakes made to it thus far. I know it is not fair, but that is why we need to look at the potential partner’s ability to handle large and really messy data, not just the statistical savviness displayed in glossy presentations.

Yes, that dirty work includes data conversion, edit/hygiene, categorization/tagging, data summarization and variable creation, encompassing all kinds of numeric, character and freeform data (refer to “Beyond RFM Data” and “Freeform Data Aren’t Exactly Free”). It is not the most glorious part of this business, but data consistency is the key to successful implementation of any advanced analytics. So, if a model-ready environment is not available, someone had better know how to make the best of whatever is given. I have seen too many meltdowns in “before” and “after” modeling steps due to inconsistencies in databases.
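To make the “data summarization and variable creation” step a little more concrete, here is a minimal Python sketch. The record layout, the field names and the `summarize` function are assumptions made up for illustration, not any vendor's actual process. It rolls transaction-level rows up to one row of variables per customer, and it counts missing amounts instead of quietly hiding them.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transaction records: (customer_id, transaction_date, amount).
transactions = [
    ("C001", date(2023, 1, 15), 120.00),
    ("C001", date(2023, 6, 2), 80.00),
    ("C002", date(2022, 11, 30), 45.50),
    ("C001", date(2023, 9, 20), 200.00),
    ("C003", date(2023, 8, 5), None),  # missing amount: typical dirty data
]


def summarize(transactions, as_of):
    """Roll transaction-level rows up to one row of variables per customer."""
    grouped = defaultdict(list)
    for customer_id, txn_date, amount in transactions:
        grouped[customer_id].append((txn_date, amount))

    summary = {}
    for customer_id, rows in grouped.items():
        dates = [d for d, _ in rows]
        amounts = [a for _, a in rows if a is not None]
        summary[customer_id] = {
            # Recency and tenure are kept as separate variables, since
            # different businesses weigh recent and older dates differently.
            "days_since_last_txn": (as_of - max(dates)).days,
            "days_since_first_txn": (as_of - min(dates)).days,
            "txn_count": len(rows),
            "total_amount": sum(amounts),
            "missing_amount_count": len(rows) - len(amounts),
        }
    return summary


variables = summarize(transactions, as_of=date(2023, 12, 31))
for customer_id in sorted(variables):
    print(customer_id, variables[customer_id])
```

The point of the sketch is consistency: every customer ends up with the same set of variables, computed the same way, whether the underlying data are clean or not.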

So, grill the candidates with the following questions:

  • Do they support file conversions, edit, categorization and summarization?
  • How big of a dataset is too big, and how many files/tables are too many for them?
  • How much free-form data are too much for them?
  • Can they show sample model variables that they have created in the past?

3. Track Records in the Industry: It can be argued that industry knowledge is even more crucial for the success than statistical know-how, as nuances are often “Lost in Translation” without relevant industry experience. In fact, some may not even be able to carry on a proper conversation with a client without it, leading to all kinds of wrong assumptions. I have seen a case where “real” rocket scientists messed up models for credit card campaigns.

The No. 1 reason why industry experience is important is that everyone’s success metrics are unique. Just to name a few, financial services (banking, credit card, insurance, investment, etc.), travel and hospitality, entertainment, packaged goods, online and offline retail, catalogs, publication, telecommunications/utilities, non-profit and political organizations all call for different types of analytics and models, as their business models and the way they interact with target audiences are vastly different. For example, building a model (or a database, for that matter) for businesses where they hand over merchandise “before” they collect money is fundamentally different than the ones where exchange happens simultaneously. Even a simple concept of payment date or transaction date cannot be treated the same way. For retailers, recent dates could be better for business, but for subscription business, older dates may carry more weight. And these are just some examples with “dates,” before touching any dollar figures or other fun stuff.

Then the job gets even more complicated, if we further divide all of these industries by B-to-B vs. B-to-C, where available data do not even look similar. On top of that, divisional ROI metrics may be completely different, and even terminology and culture may play a role in all of this. When you are a consultant, you really don’t want to stop the flow of a meeting to clarify some unfamiliar acronyms, as you are supposed to know them all.

So, always demand specific industry references and examine client rosters, if allowed. (Many clients specifically ask vendors not to use their names as references.) Basically, watch out for the ones who push one-size-fits-all cookie-cutter solutions. You deserve way more than that.

4. Types of Models Supported: Speaking of cookie-cutter stuff, we need to be concerned with types of models that the outsourcing partner would support. Sure, nobody employs every technique, and no one can be good at everything. But we need to watch out for the “One-trick Ponies.”

This could be a tricky issue, as we are going into a more technical domain. Plus, marketers should not self-prescribe with specific techniques, instead of clearly stating their business goals (refer to “Marketing and IT; Cats and Dogs”). Some of the modeling goals are:

  • Rank and select prospect names
  • Lead scoring
  • Cross-sell/upsell
  • Segment the universe for messaging strategy
  • Pinpoint the attrition point
  • Assign lifetime values for prospects and customers
  • Optimize media/channel spending
  • Create new product packages
  • Detect fraud
  • Etc.

Unless you have successfully dealt with the outsourcing partner in the past (or you have a degree in statistics), do not blurt out words like Neural-net, CHAID, Cluster Analysis, Multiple Regression, Discriminant Function Analysis, etc. That would be like demanding specific medication before your new doctor even asks about your symptoms. The key is meeting your business goals, not fulfilling buzzwords. Let them present their methodology “after” the goal discussion. Nevertheless, see if the potential partner is pushing one or two specific techniques or solutions all the time.

5. Speed of Execution: In modern marketing, speed to action is king. Speed wins, and speed gains respect. However, when it comes to modeling or other advanced analytics, you may be shocked by the wide range of time estimates provided by each outsourcing vendor. To be fair, they are covering themselves, mainly because they have no idea what kind of messy data they will receive. As I mentioned earlier, pre-model data preparation and manipulation are critical components, and they are the most time-consuming part of all, especially when available data are in bad shape. Post-model scoring, audit and usage support may elongate the timeline. The key is to differentiate such pre- and post-modeling processes in the time estimate.

Even for pure modeling elements, time estimates vary greatly, depending on the complexity of assignments. Surely, a simple cloning model with basic demographic data would be much easier to execute than the ones that involve ample amounts of transaction- and event-level data, coming from all types of channels. If time-series elements are added, it will definitely be more complex. Typical clustering work is known to take longer than regression models with clear target definitions. If multiple models are required for the project, it will obviously take more time to finish the whole job.

Now, the interesting thing about building a model is that analysts don’t really finish it, but they just run out of time—much like the way marketers work on PowerPoint presentations. The commonality is that we can basically tweak models or decks forever, but we have to stop at some point.

However, with all kinds of automated tools and macros, model development time has decreased dramatically in past decades. We have really come a long way since the first application of statistical techniques to marketing, and no one should be quoting a 1980s timeline in this century. But some still do. I know vendors are trained to follow the guideline “always under-promise and over-deliver,” but still.

An interesting aspect of this dilemma is that we can negotiate the timeline by asking for simpler and less sophisticated versions with diminished accuracy. If, hypothetically, it takes a week to be 98 percent accurate, but it only takes a day to be 90 percent accurate, what would you pick? That should be the business decision.

So, what is a general guideline? Again, it really depends on many factors, but allow me to share a version of it:

  • Pre-modeling Processing
    – Data Conversions: from half a day to weeks
    – Data Append/Enhancement: between overnight and two days
    – Data Edit and Summarization: Data-dependent
  • Modeling: Ranges from half a day to weeks
    – Depends on type, number of models and complexity
  • Scoring: from half a day to one week
    – Mainly depends on number of records and state of the database to be scored

I know these are wide ranges, but watch out for the ones that routinely quote 30 days or more for simple clone models. They may not know what they are doing, or worse, they may be some mathematical perfectionists who don’t understand the marketing needs.

6. Pricing Structure: Some marketers would put this on top of the checklist, or worse, use the pricing factor as the only criterion. Obviously, I disagree. (Full disclosure: I have been on the service side of the fence during my entire career.) Yes, every project must make economic sense in the end, but the budget should not and cannot be the sole deciding factor in choosing an outsourcing partner. There are many specialists under famous brand names who command top dollars, and then there are many data vendors who throw in “free” models, disrupting the ecosystem. Either way, one should not jump to conclusions too fast, as there is no free lunch, after all. In any case, I strongly recommend that no one should start the meeting with pricing questions (hence, this article). When you get to the pricing part, ask what the price includes, as the analytical journey could be a series of long and winding roads. Some of the biggest factors that need to be considered are:

  • Multiple Model Discounts—Less for second or third models within a project?
  • Pre-developed (off-the-shelf) Models—These can be “much” cheaper than custom models, while not custom-fitted.
  • Acquisition vs. CRM—Employing client-specific variables certainly increases the cost.
  • Regression Models vs. Other Types—At times, types of techniques may affect the price.
  • Clustering and Segmentations—They are generally priced much higher than target-specific models.

Again, it really depends on the complexity factor more than anything else, and the pre- and post-modeling process must be estimated and priced separately. Non-modeling charges often add up fast, and you should ask for unit prices and minimum charges for each step.

Scoring charges in time can be expensive, too, so negotiate for discounts for routine scoring of the same models. Some may offer all-inclusive package pricing for everything. The important thing is that you must be consistent with the checklist when shopping around with multiple candidates.

7. Documentation: When you pay for a custom model (not pre-developed, off-the-shelf ones), you get to own the algorithm. Because algorithms are not tangible items, the knowledge is to be transferred through model documents. Beware of the ones who offer “black-box” solutions with comments like, “Oh, it will work, so trust us.”

Good model documents must include the following, at the minimum:

  • Target and Comparison Universe Definitions: What was the target variable (or “dependent” variable) and how was it defined? How was the comparison universe defined? Was there any “pre-selection” for either of the universes? These are the most important factors in any model—even more than the mechanics of the model itself.
  • List of Variables: What are the “independent” variables? How were they transformed or binned? From where did they originate? Often, these model variables describe the nature of the model, and they should make intuitive sense.
  • Model Algorithms: What is the actual algorithm? What are the assigned weights for each independent variable?
  • Gains Chart: We need to examine potential effectiveness of the model. What are the “gains” for each model group, from top to bottom (e.g., 320 percent gain at the top model group in comparison to the whole universe)? How fast do such gains decrease as we move down the scale? How do the gains factors compare against the validation sample? A graphic representation would be nice, too.
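For readers who have never seen the numbers behind a gains chart, a minimal Python sketch follows. The `gains_table` function and the ten sample records are hypothetical, and the sketch assumes that “gain” means the response rate of a model group indexed against the response rate of the whole universe (so 320 would mean 3.2 times the overall rate). Ask your vendor how they define it, as conventions differ.

```python
def gains_table(scored, n_groups=10):
    """Build a gains table from (model_score, responded) pairs.

    `responded` is 1 for a responder and 0 otherwise. Records are ranked
    by score from high to low and cut into equal-sized model groups.
    """
    if len(scored) < n_groups:
        raise ValueError("need at least one record per model group")
    ranked = sorted(scored, key=lambda r: r[0], reverse=True)
    overall_rate = sum(r[1] for r in ranked) / len(ranked)
    if overall_rate == 0:
        raise ValueError("no responders, so gains cannot be indexed")

    table = []
    for g in range(n_groups):
        start = g * len(ranked) // n_groups
        end = (g + 1) * len(ranked) // n_groups
        group = ranked[start:end]
        rate = sum(r[1] for r in group) / len(group)
        table.append({
            "model_group": g + 1,
            "count": len(group),
            "response_rate": rate,
            # Index of 100 means "same as the whole universe".
            "index": round(100 * rate / overall_rate),
        })
    return table


# Hypothetical validation sample: (score, responded).
sample = [
    (0.95, 1), (0.90, 1), (0.80, 1), (0.70, 0), (0.60, 1),
    (0.50, 0), (0.40, 0), (0.30, 0), (0.20, 0), (0.10, 0),
]
table = gains_table(sample, n_groups=5)
for row in table:
    print(row)
```

A healthy model shows the index falling steadily from the top group to the bottom one. Running the same function on the development and validation samples side by side is one simple way to compare the two.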

For custom models, it is customary to have a formal model presentation, full documentation and scoring script in designated programming languages. In addition, if client files are provided, ask for a waterfall report that details input and output counts of each step. After the model scoring, it is also customary for the vendor to provide a scored universe count by model group. You will be shocked to find out that many so-called analytical vendors do not provide thorough documentation. Therefore, it is recommended to ask for sample documents upfront.
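A waterfall report is nothing more than input and output counts for each processing step. The Python sketch below is one hypothetical way to produce it. The step names, the record fields and the `waterfall` helper are illustrative assumptions, not a standard format.

```python
def waterfall(records, steps):
    """Apply each step in order and record input/output counts.

    `steps` is a list of (step_name, function) pairs, where each function
    takes a list of records and returns the records that survive the step.
    """
    report = []
    current = list(records)
    for name, step in steps:
        survivors = step(current)
        report.append({
            "step": name,
            "input": len(current),
            "output": len(survivors),
            "dropped": len(current) - len(survivors),
        })
        current = survivors
    return current, report


def drop_duplicates(records):
    """Keep the first record seen for each id."""
    seen, unique = set(), []
    for record in records:
        if record["id"] not in seen:
            seen.add(record["id"])
            unique.append(record)
    return unique


# Hypothetical client file.
client_file = [
    {"id": 1, "address": "12 Main St", "opted_out": False},
    {"id": 2, "address": None, "opted_out": False},
    {"id": 3, "address": "9 Oak Ave", "opted_out": True},
    {"id": 4, "address": "4 Elm Rd", "opted_out": False},
    {"id": 1, "address": "12 Main St", "opted_out": False},  # duplicate
]

steps = [
    ("has mailing address", lambda rs: [r for r in rs if r["address"]]),
    ("not opted out", lambda rs: [r for r in rs if not r["opted_out"]]),
    ("de-duplicate by id", drop_duplicates),
]

final_records, report = waterfall(client_file, steps)
for line in report:
    print(line)
```

When the counts in such a report do not add up, you know exactly which step to question.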

8. Scoring Validation: Models are built and presented properly, but the job is not done until the models are applied to the universe from which the names are ranked and selected for campaigns. I have seen too many major meltdowns at this stage. Simply, it is one thing to develop models with a few hundred thousand record samples, but it is quite another to apply the algorithm to millions of records. I am not saying that the scoring job always falls onto the developers, as you may have an internal team or a separate vendor for such ongoing processes. But do not let the model developer completely leave the building until everything checks out.

The model should have been validated against the validation sample by then, but live scoring may reveal all kinds of inconsistencies. You may also want to back-test the algorithms with past campaign results, as well. In short, many things go wrong “after” the modeling steps. When I hear customers complaining about models, I often find that the modeling is the only part that was done properly, and “before” and “after” steps were all messed up. Further, even machines misunderstand each other, as any differences in platform or scripting language may cause discrepancies. Or, maybe there was no technical error, but missing values may have caused inconsistencies (refer to “Missing Data Can Be Meaningful”). Nonetheless, the model developers would have the best insight as to what could have gone wrong, so make sure that they are available for questions after models are presented and delivered.
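One simple check at this stage is to compare how often each model variable is missing in the development sample versus the universe being scored. The Python sketch below is a hypothetical illustration. The variable names, the 5-point tolerance and the `compare_missing` helper are assumptions, and a real audit would look at far more than missing rates.

```python
def missing_rates(records, variables):
    """Share of records where each variable is missing (None)."""
    total = len(records)
    return {
        v: sum(1 for r in records if r.get(v) is None) / total
        for v in variables
    }


def compare_missing(development, live, variables, tolerance=0.05):
    """Return variables whose missing rate shifted by more than `tolerance`."""
    dev_rates = missing_rates(development, variables)
    live_rates = missing_rates(live, variables)
    return {
        v: (dev_rates[v], live_rates[v])
        for v in variables
        if abs(dev_rates[v] - live_rates[v]) > tolerance
    }


# Hypothetical development sample and live scoring universe.
development = [
    {"age": 34, "income": 72000},
    {"age": 51, "income": None},
    {"age": 45, "income": 58000},
    {"age": 29, "income": 61000},
]
live = [
    {"age": 40, "income": None},
    {"age": 38, "income": None},
    {"age": 62, "income": 49000},
    {"age": 27, "income": None},
]

flagged = compare_missing(development, live, ["age", "income"])
print(flagged)
```

In this made-up example, income is missing far more often in the live universe than in the development sample, which is exactly the kind of discrepancy to take back to the model developers.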

9. Back-end Analysis: Good analytics is all about applying learnings from past campaigns, good or bad, to new iterations of efforts. We often call it “closed-loop marketing,” while many marketers often neglect to follow up. Any respectable analytics shop must be aware of it, while they may classify such work separately from modeling or other analytical projects. At the minimum, you need to check out if they even offer such services. In fact, so-called “match-back analysis” is not as simple as just matching campaign files against responders in this omnichannel environment. When many channels are employed at the same time, allocation of credit (i.e., “what worked?”) may call for all kinds of business rules or even dedicated models.
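To show what a “business rule” for credit allocation can look like, here is a minimal Python sketch of one common choice: a last-touch rule with a response window. The 30-day window, the data layout and the `match_back` function are assumptions for illustration only. Real match-back work often needs more elaborate rules or dedicated models.

```python
from datetime import date, timedelta

# Hypothetical campaign contacts: (customer_id, channel, contact_date).
contacts = [
    ("C001", "email", date(2023, 3, 1)),
    ("C001", "direct_mail", date(2023, 3, 10)),
    ("C002", "email", date(2023, 3, 5)),
    ("C003", "display", date(2023, 1, 1)),
]

# Hypothetical orders: (customer_id, order_date, amount).
orders = [
    ("C001", date(2023, 3, 20), 150.0),
    ("C002", date(2023, 3, 8), 60.0),
    ("C003", date(2023, 3, 15), 40.0),  # contact is outside the window
    ("C004", date(2023, 3, 12), 25.0),  # never contacted
]


def match_back(contacts, orders, window_days=30):
    """Credit each order to the latest contact within the response window."""
    window = timedelta(days=window_days)
    credit = {}
    unmatched = 0.0
    for customer_id, order_date, amount in orders:
        eligible = [
            (contact_date, channel)
            for cid, channel, contact_date in contacts
            if cid == customer_id
            and timedelta(0) <= order_date - contact_date <= window
        ]
        if not eligible:
            unmatched += amount
            continue
        # Latest contact wins; same-day ties fall back to channel name order.
        _, channel = max(eligible)
        credit[channel] = credit.get(channel, 0.0) + amount
    return credit, unmatched


credit, unmatched = match_back(contacts, orders)
print(credit, unmatched)
```

Change the rule to first touch, or widen the window, and the same orders get credited to different channels. That is why the rules should be agreed upon before anyone looks at the results.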

While you are at it, ask for a cheaper version of “canned” reports, as well, as custom back-end analysis can be even more costly than the modeling job itself, over time. Pre-developed reports may not include all the ROI metrics that you’re looking for (e.g., open, clickthrough, conversion rates, plus revenue and orders-per-mailed, per order, per display, per email, per conversion, etc.). So ask for sample reports upfront.

If you start breaking down all these figures by data source, campaign, time series, model group, offer, creative, targeting criteria, channel, ad server, publisher, keywords, etc., it can be unwieldy really fast. So contain yourself, as no one can understand 100-page reports, anyway. See if the analysts can guide you with such planning, as well. Lastly, if you are so into ROI analysis, get ready to share the “cost” side of the equation with the selected partner. Some jobs are on the marketers.

10. Ongoing Support: Models have a finite shelf life, as all kinds of changes happen in the real world. Seasonality may be a factor, or the business model or strategy may have changed. Fluctuations in data availability and quality further complicate the matter. Basically, assumptions like “all things being equal” only happen in textbooks, so marketers must plan for periodic review of models and business rules.

A sure sign of trouble is decreasing effectiveness of models. When in doubt, consult the developers and they may recommend a re-fit or complete re-development of models. Quarterly reviews would be ideal, but if the cost becomes an issue, start with 6-month or yearly reviews, and never go more than a year without any review. Some vendors may offer discounts for redevelopment, so ask for the price quote upfront.

I know this is a long list of things to check, but picking the right partner is very important, as it often becomes a long-term relationship. And you may find it strange that I didn’t even list “technical capabilities” at all. That is because:

1. Many marketers are not equipped to dig deep into the technical realm anyway, and

2. The difference between the most mathematically sound models and the ones from the opposite end of the spectrum is not nearly as critical as other factors I listed in this article.

In other words, even the worst model in the bake-off would be much better than no model, if these other business criteria are well-considered. So, happy shopping with this list, and I hope you find the right partner. Employing analytics is not an option when living in the sea of data.

Marketing and IT; Cats and Dogs

Cats and dogs do not get along unless they grew up together from birth. That is because cats and dogs have rather fundamental communication problems with each other. A dog would wag his tail in an upward position when he wants to play. To a cat though, upward-tail is a sure sign of hostility, as in “What’s up, dawg?!” In fact, if you observe an angry or nervous cat, you will see that everything is up: tail, hair, toes, even her spine. So imagine the dog’s confusion in this situation, where he just sent a friendly signal that he wants to play with the cat, and what he gets back are loud hisses and scary evil eyes, but along with an upward tail that “looks” like a peace sign to him. Yeah, I admit that I am a bona-fide dog person, so I looked at this from his perspective first. But I sympathize with the cat, too. From her point of view, the dog started to mess with her, disrupting an afternoon slumber in her favorite sunny spot by wagging his stupid tail. Encounters like this cannot end well. Thank goodness that we Homo sapiens lost our tails during our evolutionary journey, as that would have been one more thing that clueless guys would have to decode regarding the mood of our female companions. Imagine a conversation like “How could you not see that I didn’t mean it? My tail was pointing to the ground when I said that!” Then a guy would say, “Oh jeez, because I was looking at your lips moving up and down when you were saying something?”

 


Of course I am generalizing for a comedic effect here, but I see communication breakdowns like this all the time in business environments, especially between the marketing and IT teams. You think men are from Mars and women are from Venus? I think IT folks are from Vulcan and marketing people are from Betazed (if you didn’t get this, find a Trekkie around you and ask).

Now that we are living in the age of Big Data where marketing messages must be custom-tailored based on data, we really need to find a way to narrow the gap between the marketing and the IT world. I wouldn’t dare to say which side is more like a dog or a cat, as I will surely offend someone. But I think even non-Trekkies would agree that it could be terribly frustrating to talk to a Vulcan who thinks that every sentence must be logically impeccable, or a Betazed who thinks that someone’s emotional state is the way it is just because she read it that way. How do they meet in the middle? They need a translator—generally a “human” captain of a starship—between the two worlds, and that translator had better speak both languages fluently and understand both cultures without any preconceived notions.

Similarly, we need translators between the IT world and the marketing world, too. Call such translators “data scientists” if you want (refer to “How to Be a Good Data Scientist”). Or, at times a data strategist or a consultant like myself plays that role. Call us “bats” caught in between the beasts and the birds in an Aesop’s tale, as we need to be marginal people who don’t really belong to one specific world 100 percent. At times, it is a lonely place as we are understood by none, and often we are blamed for representing “the other side.” It is hard enough to be an expert in data and analytics, and we now have to master the artistry of diplomacy. But that is the reality, and I have seen plenty of evidence as to why people whose main job it is to harness meanings out of data must act as translators, as well.

IT is a very special function in modern organizations, regardless of their business models. Systems must run smoothly without errors, and all employees and outside collaborators must constantly be in connection through all imaginable devices and operating systems. Data must be securely stored and backed up regularly, and permissions to access them must be granted according to complex rules based on job levels and functions. Then there are constant requests to install and maintain new and strange software and technologies, which should be patched and updated diligently. And God forbid if anything fails to work even for a few seconds on a weekend, all hell will break loose. Simply, the end-users, many of them in positions of dealing with customers and clients directly, do not care about IT when things run smoothly, as they take them all for granted. But when they don’t, you know the consequences. Thankless job? You bet. It is like a utility company never getting praise when the lights are on, but everyone yelling and screaming if the service is disrupted, even for a natural cause.

On the other side of the world, there are marketers, salespeople and account executives who deal with customers, clients and their bosses, who would treat IT like their servants, not partners, when things do not “seem” to work properly or when “their” sales projections are not met. The craziest part is that most customers, clients and bosses state their goals and complaints in the most ambiguous terms, as in “This ad doesn’t look slick enough,” “This copy doesn’t talk to me,” “This app doesn’t stick” or “We need to find the right audience.” What the IT folks often do not grasp is that (1) it really stinks when you get yelled at by customers and clients for any reason, and (2) not all business goals are easily translatable to logical statements. And this is when all data elements and systems are functioning within normal parameters.

Without a proper translator, marketers often self-prescribe solutions that call for data work and analytics. Often, they think that all the problems will go away if they have unlimited access to every piece of data ever collected. So they ask for exactly that. IT will respond that such a request will put a terrible burden on the system, which has to support not just marketing but also other operations. Eventually they may meet in the middle and the marketer will have access to more data than was ever possible in the past. Then the marketers realize that their business issues do not go away just because they have more data in their hands. In fact, their job seems to have gotten even more complicated. They think that it is because data elements are too difficult to understand and they start blaming the data dictionary or lack thereof. They start using words like Data Governance and Quality Control, which may sound almost offensive to diligent IT personnel. IT will respond that they showed every useful bit of data they are allowed to share without breaking the security protocol, and the data dictionaries are all up to date. Marketers say the data dictionaries are hard to understand, and they are filled with too many similar variables and seemingly conflicting information. IT now says they need yet another tool set to properly implement data governance protocols and deploy them. Heck, I have seen cases where some heads of IT went for complete re-platforming of their system, as if that would answer all the marketing questions. Now, does this sound familiar so far? Does it sound like your own experience, like when you are reading “Dilbert” comic strips? It is because you are not alone in all this.

Allow me to be a little more specific with an example. Marketers often talk about “High-Value Customers.” To people who deal with 1s and 0s, that means less than nothing. What does that even mean? Because “high-value customers” could be:

  • High-dollar spenders—But what if they do not purchase often?
  • Frequent shoppers—But what if they don’t spend much at all?
  • Recent customers—Oh, those coveted “hotline” names … but will they stay that way, even for another few months?
  • Tenured customers—But are they loyal to your business, now?
  • Customers with high loyalty points—Or are they just racking up points and they would do anything to accumulate points?
  • High activity—Such as point redemptions and other non-monetary activities, but what if all those activities do not generate profit?
  • Profitable customers—The nice ones who don’t need much hand-holding. And where do we get the “cost” side of the equation on a personal level?
  • Customers who purchase extra items—Such as cruisers who drink a lot on board or diners who order many special items, as suggested.
  • Etc., etc …
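To see how quickly these definitions part ways, here is a minimal sketch in Python. The three customers and all of their figures are made up for illustration, and the field names (total_spend, orders, days_since_last) are my own assumptions, not anyone's actual schema. Each definition of "high-value" crowns a different customer.

```python
from dataclasses import dataclass

# Hypothetical customer summaries; every figure here is invented for illustration.
@dataclass
class Customer:
    name: str
    total_spend: float     # dollars spent over the chosen window
    orders: int            # number of purchases in the same window
    days_since_last: int   # days since the most recent purchase

customers = [
    Customer("A", total_spend=5000.0, orders=1, days_since_last=300),
    Customer("B", total_spend=400.0, orders=40, days_since_last=5),
    Customer("C", total_spend=1200.0, orders=6, days_since_last=2),
]

def top_by(key):
    """Return the name of the customer ranked first under one definition."""
    return max(customers, key=key).name

high_spender = top_by(lambda c: c.total_spend)        # biggest dollar total
frequent_shopper = top_by(lambda c: c.orders)         # most purchases
most_recent = top_by(lambda c: -c.days_since_last)    # fewest days since last purchase

print(high_spender, frequent_shopper, most_recent)
```

Three reasonable definitions, three different "best" customers. That is why the phrase alone means so little to the person who has to write the query.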

Now it gets more complex, as these definitions must be represented in numbers and figures, and depending on the industry, whether they be for retailers, airlines, hotels, cruise ships, credit cards, investments, utilities, non-profit or business services, variables that would be employed to define seemingly straightforward “high-value customers” would be vastly different. But let’s say that we pick an airline as an example. Let me ask you this: how frequent is frequent enough for anyone to be called a frequent flyer?

Let’s just assume that we are going through an exercise of defining a frequent flyer for an airline company, not for any other travel-related businesses or even travel agencies (that would deal with lots of non-flyers). Granted that we have access to all necessary data, we may consider using:

1. Number of Miles—But for how many years? If we go back too far, shouldn’t we have to examine further if the customer is still active with the airline in question? And what does “active” mean to you?

2. Dollars Spent—Again for how long? In what currency? Converted into U.S. dollars at what point in time?

3. Number of Full-Price Ticket Purchases—OK, do we get to see all the ticket codes that define full price? What about customers who purchased tickets through partners and agencies vs. direct buyers through the airline’s website? Do they share a common coding system?

4. Days Between Travel—What date shall we use? Booking date, payment date or travel date? What time zones should we use for consistency? If UTC/GMT is to be used, how will we know who is booking trips during business hours vs. evening hours in their own time zone?
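The time zone question in the last item is easy to underestimate, so here is a small Python sketch of it. The timestamp, the two UTC offsets and the 9-to-5 definition of business hours are all assumptions of mine for illustration. I also use fixed offsets and ignore daylight saving time, which a real system could not do.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical booking event, stored in UTC as a central system might keep it.
booking_utc = datetime(2024, 3, 5, 1, 30, tzinfo=timezone.utc)

def local_hour(ts_utc, utc_offset_hours):
    """Hour of day in the customer's own zone (fixed offset, no DST handling)."""
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    return ts_utc.astimezone(local_tz).hour

def is_business_hours(hour):
    """Assumed definition: 9:00 up to, but not including, 17:00 local time."""
    return 9 <= hour < 17

hour_a = local_hour(booking_utc, -5)   # a customer living at UTC-5
hour_b = local_hour(booking_utc, 9)    # a customer living at UTC+9

print(hour_a, is_business_hours(hour_a))
print(hour_b, is_business_hours(hour_b))
```

The same stored timestamp is an evening booking for one customer and a business-hours booking for the other. Unless the customer's own offset is kept alongside the UTC value, that distinction cannot be recovered later.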

After considerable hours of debate, let’s say that we reached a conclusion that all involved parties could live with. Then we find out that the databases from the IT department are all on “event” levels (such as clicks, views, bookings, payments, boarding, redemption, etc.), and we would have to realign and summarize the data—in terms of miles, dollars and trips—on an individual customer level to create a definition of “frequent flyers.”
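For readers who think in code, that realignment from event level to customer level might look something like the following Python sketch. The event rows, the column layout and the rule that one boarding counts as one trip are all hypothetical. The "two or more trips" cutoff is only a placeholder for whatever the debate above would settle on.

```python
from collections import defaultdict

# Hypothetical event-level rows: (customer_id, event_type, miles, dollars)
events = [
    ("C001", "booking", 0, 420.0),
    ("C001", "boarding", 2475, 0.0),
    ("C002", "booking", 0, 150.0),
    ("C001", "booking", 0, 380.0),
    ("C001", "boarding", 2475, 0.0),
    ("C002", "boarding", 740, 0.0),
    ("C002", "click", 0, 0.0),
]

def summarize_by_customer(rows):
    """Roll event-level rows up into one summary record per customer."""
    summary = defaultdict(lambda: {"miles": 0, "dollars": 0.0, "trips": 0})
    for customer_id, event_type, miles, dollars in rows:
        record = summary[customer_id]
        record["miles"] += miles
        record["dollars"] += dollars
        if event_type == "boarding":   # assumption: one boarding equals one trip
            record["trips"] += 1
    return dict(summary)

profiles = summarize_by_customer(events)

# Placeholder rule only; the real threshold is a business decision.
frequent_flyers = [cid for cid, p in profiles.items() if p["trips"] >= 2]

print(profiles)
print(frequent_flyers)
```

Only after this step does a question like "who is a frequent flyer?" become something a query can answer, because the rows now describe people rather than clicks and boardings.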

In other words, we would need to see the data from the customer-centric point of view, just to begin the discussion about frequent flyers, not to mention how to communicate with each customer in the future. Now, is that a job for IT or marketing? Who will put the bell on the cat’s neck? (Hint: Not the dog.) Well, it depends. But this definitely is not a traditional IT function, nor is it a standalone analytical project. It is something in between, requiring a translator.

Customer-Centric Database, Revisited
I have been emphasizing the importance of a customer-centric view throughout this series, and I also shared some details regarding databases designed for marketing functions (refer to: “Cheat Sheet: Is Your Database Marketing Ready?”). But allow me to reiterate this point.

In the age of abundant and ubiquitous data, omnichannel marketing communication—optimized based on customers’ past transaction history, product preferences, and demographic and behavioral personas—should be an effortless routine. The reality is far from it for many organizations, as it is very common that much of the vital information is locked in silos without being properly consolidated or governed by a standard set of business rules. It is not that creating such a marketing-oriented database (or data-mart) is solely the IT department’s responsibility, but having a dedicated information source for efficient personalization should be an organizational priority in modern days.

Most databases nowadays are optimized for data collection, storage and rapid retrieval, and such design in general does not provide a customer-centric view—which is essential for any type of personalized communication via all conceivable channels and devices of the present and future. Using brand-, division-, product-, channel- or device-centric datasets is often the biggest obstacle in the journey to an optimal customer experience, as those describe events and transactions, not individuals. Further, bits and pieces of information must be transformed into answers to questions through advanced analytics, including statistical models.

In short, all analytical efforts must be geared toward meeting business objectives, and databases must be optimized for analytics (refer to “Chicken or the Egg? Data or Analytics?”). Unfortunately, the situation is completely reversed in many organizations, where analytical maneuvering is limited due to inadequate source data, and decision-making processes are dictated by limitations of available analytics. Visible symptoms of such cases are, to list a few, elongated project cycle time, decreasing response rates, ineffective customer communication, saturation of data sources due to overexposure, and—as I was emphasizing in this article—communication breakdown among divisions and team members. I can even go as far as to say that the lack of a properly designed analytical environment is the No. 1 cause of miscommunications between IT and marketing.

Without a doubt, key pieces of data must reside in the centralized data depository—generally governed by IT—for effective marketing. But that is only the beginning and still is just a part of the data collection process. Collected data must be consolidated around the solid definition of a “customer,” and all product-, transaction-, event- and channel-level information should be transformed into descriptors of customers, via data standardization, categorization, transformation and summarization. Then the data may be further enhanced via third-party data acquisition and statistical modeling, using all available data. In other words, raw data must be refined through these steps to be useful in marketing and other customer interactions, online or offline (refer to “‘Big Data’ Is Like Mining Gold for a Watch—Gold Can’t Tell Time”). It does not matter how well the original transaction- or event-level data are stored in the main database without visible errors, or what kind of state-of-the-art communication tool sets a company is equipped with. Trying to use raw data for a near real-time personalization engine is like putting unrefined oil into a high-performance sports car.

This whole data refinement process may sound like a daunting task, but it is not nearly as painful as analytical efforts to derive meanings out of unstructured, unconsolidated and uncategorized data that are scattered all over the organization. A customer-centric marketing database (call it a data-mart if “database” sounds too much like it should solely belong to IT) created with standard business rules and uniform variable sets would, in turn, provide an “analytics-ready” environment, where statistical modeling and other advanced analytics efforts would gain tremendous momentum. In the end, the decision-making process would become much more efficient as analytics would provide answers to questions, not just bits and pieces of fragmented data, to the ultimate beneficiaries of data. And answers to questions do not require an enormous data dictionary, either; fast-acting marketing machines do not have time to look up dictionaries, anyway.

Data Roadmap—Phased Approach
For the effort to build a consolidated marketing data platform that is analytics-ready (hence, marketing-ready), I always recommend a phased approach, as (1) inevitable complexity of a data consolidation project will be contained and managed more efficiently in carefully defined phases, and (2) each phase will require different types of expertise, tool sets and technologies. Nevertheless, the overall project must be managed by an internal champion, along with a group of experts who possess long-term vision and tactical knowledge in both database and analytics technologies. That means this effort must reside above IT and marketing, and it should be seen as a strategic effort for an entire organization. If the company has already hired a Chief Data Officer, I would say that this should be one of the top priorities for that position. If not, outsourcing would be a good option, as an impartial decision-maker, who would play the role of a referee, may have to come from the outside.

The following are the major steps:

  1. Formulate Questions: “All of the above” is not a good way to start a complex project. In order to come up with the most effective way to build a centralized data depository, we first need to understand what questions must be answered by it. Too many database projects call for cars that must fly, as well.
  2. Data Inventory: Every organization has more data than it expected, and not all goldmines are in plain sight. All the gatekeepers of existing databases should be interviewed, and any data that could be valuable for customer descriptions or behavioral predictions should be considered, starting with product, transaction, promotion and response data, stemming from all divisions and marketing channels.
  3. Data Hygiene and Standardization: All available data fields should be examined and cleaned up, where some data may be discarded or modified. Free-form fields deserve special attention, as categorization and tagging are among the key steps to opening up new intelligence.
  4. Customer Definition: Any existing Customer ID systems (such as loyalty program ID, account number, etc.) will be examined. They may be further enhanced with available PII (personally identifiable information), as there could be inconsistencies among different systems, and customers often move their residency or use multiple email addresses, creating duplicate identities. A consistent and reliable Customer ID system becomes the backbone of a customer-centric database.
  5. Data Consolidation: Data from different silos and divisions will be merged together based on the master Customer ID. A customer-centric database begins to take shape here. The database update process should be thoroughly tested, as “incremental” updates are often found to be more difficult than the initial build. The job is simply not done until after a few successful iterations of updates.
  6. Data Transformation: Once a solid Customer ID system is in place, all transaction- and event-level data will be transformed to “descriptors” of individual customers, via summarization by categories and creation of analytical variables. For example, all product information will be aligned for each customer, and transaction data will be converted into personal-level monetary summaries and activities, in both static and time-series formats. Promotion and response history data will go through similar processes, yielding individual-level ROI metrics by channel and time period. This is the single-most critical step in all of this, requiring deep knowledge in business, data and analytics, as the stage is being set for all future analytics and reporting efforts. Due to variety and uniqueness of business goals in different industries, a one-size-fits-all approach will not work, either.
  7. Analytical Projects: Test projects will be selected and the entire process will be done on the new platform. Ad-hoc reporting and complex modeling projects will be conducted, and the results will be graded on timing, accuracy, consistency and user-friendliness. An iterative approach is required, as it is impossible to foresee all possible user requests and related complexities upfront. A database should be treated as a living, breathing organism, not something rigid and inflexible. Marketers will “break-in” the database as they use it more routinely.
  8. Applying the Knowledge: The outcomes of analytical projects will be applied to the entire customer base, and live campaigns will be run based on them. Often, major breakdowns happen at the large-scale deployment stages; especially when dealing with millions of customers and complex mathematical formulae at the same time. A model-ready database will definitely minimize the risk (hence, the term “in-database scoring”), but the process will still require some fine-tuning. To proliferate gained knowledge throughout the organization, some model scores—which pack deep intelligence in small sizes—may be transferred back to the main databases managed by IT. Imagine model scores driving operational decisions—live, on the ground.
  9. Result Analysis: Good marketing intelligence engines must be equipped with feedback mechanisms, effectively closing the “loop” where each iteration of marketing efforts improves its effectiveness with accumulated knowledge on a customer level. It is very unfortunate that many marketers just move through the tracks set up by their predecessors, mainly because existing database environments are not even equipped to link necessary data elements on a customer level. Too many back-end analyses are just event-, offer- or channel-driven, not customer-centric. Can you easily tell which customer is over-, under- or adequately promoted, based on a personal-level promotion-and-response ratio? With a customer-centric view established, you can.
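The personal-level promotion-and-response ratio mentioned in the last step can be sketched in a few lines of Python once promotion and response history are keyed by customer. The customer IDs and counts below are invented, and I deliberately leave out any cutoff for "over-promoted" or "under-promoted," since that line is a business call rather than a coding one.

```python
from collections import Counter

# Hypothetical history: one entry per promotion sent and per response received.
promotions_sent = ["C001", "C001", "C001", "C001", "C002", "C002", "C003"]
responses = ["C001", "C002", "C002"]

def response_ratio(promos, resps):
    """Responses divided by promotions, per customer who was promoted at all."""
    sent = Counter(promos)
    got = Counter(resps)   # a Counter returns 0 for customers who never responded
    return {cid: got[cid] / sent[cid] for cid in sent}

ratios = response_ratio(promotions_sent, responses)
print(ratios)
```

None of this is possible when promotions live in one event table and responses in another with no common customer key, which is the situation too many marketers inherit.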

These are just high-level summaries of key steps, and each step should be managed as an independent project within a large-scale initiative with common goals. Some steps may run concurrently to reduce the overall timeline, and tactical knowledge in all required technologies and tool sets is the key for the successful implementation of centralized marketing intelligence.
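As a toy illustration of steps 4 and 5 (Customer Definition and Data Consolidation), the Python sketch below assigns one master ID to records coming from two silos. Everything in it is hypothetical: the field names, the sample records, the "M0001" ID format, and above all the idea that a cleaned-up email address alone is enough to match people. Real identity resolution has to deal with moves, multiple emails and conflicting PII, as noted above.

```python
# Hypothetical records from two silos that use different ID systems.
loyalty_records = [
    {"loyalty_id": "L-17", "email": "Pat.Lee@Example.com "},
]
web_records = [
    {"account_no": "W-902", "email": "pat.lee@example.com"},
    {"account_no": "W-903", "email": "sam@example.com"},
]

def normalize_email(email):
    """Basic hygiene: trim whitespace and lowercase before matching."""
    return email.strip().lower()

def build_master_ids(*sources):
    """Assign one master ID per distinct normalized email across all sources."""
    master = {}
    for source in sources:
        for record in source:
            key = normalize_email(record["email"])
            if key not in master:
                master[key] = f"M{len(master) + 1:04d}"
    return master

master_ids = build_master_ids(loyalty_records, web_records)
print(master_ids)
```

The loyalty member and the first web account collapse into one person, and every later summary can hang off that single master ID.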

Who Will Do the Work?
Then, who will be in charge of all this and who will actually do the work? As I mentioned earlier, a job of this magnitude requires a champion, and a CDO may be a good fit. But each of these steps will require different skill sets, so some outsourcing may be inevitable (more on how to pick an outsourcing partner in future articles).

But what should not happen is the IT team or the analytics team solely dictating the whole process. Creating a central depository of marketing intelligence is something that sits between IT and marketing, and the decisions must be made with business goals in mind, not just limitations and challenges that IT faces. If the CDO or the champion of this type of initiative starts representing IT issues before overall business goals, then the project is doomed from the beginning. Again, it is not about touching the core database of the company, but realigning existing data assets to create new intelligence. Raw data (no matter how clean they are at the collection stage) are like unrefined raw materials to the users. What the decision-makers need are simple answers to their questions, not hundreds of data pieces.

From the user’s point of view, data should be:

  • Easy to understand and use (intuitive to non-mathematicians)
  • Bite-size (i.e., small answers, not mounds of raw data)
  • Useful and effective (consistently accurate)
  • Broad (answers should be available most of the time, not just “sometimes”)
  • Readily available (data should be easily accessible via users’ favorite devices/channels)

And getting to this point is the job of a translator who sits in between marketing and IT. Call them data scientists or data strategists, if you like. But they do not belong to just marketing or IT, even though they have to understand both sides really well. Do not be rigid, insisting that all pilots must belong to the Air Force; some pilots do belong to the Navy.

Lastly, let me add this at the risk of sounding like I am siding with technologists. Marketers, please don’t be bad patients. Don’t be that bad patient who shows up at a doctor’s office with a specific prescription, as in “Don’t ask me why, but just give me these pills, now.” I’ve even met an executive who wanted a neural-net model for his business without telling me why. I just said to myself, “Hmm, he must have been to one of those analytics conferences recently.” Then after listening to his “business” issues, I prescribed an entirely different solution package.

So, instead of blurting out requests for pieces of data variables or queries using cool-sounding, semi-technical terms, state the business issues and challenges that you are facing as clearly as possible. IT and analytics specialists will prescribe the right solution for you if they understand the ultimate goals better. Too often, requesters determine the solutions they want without any understanding of underlying technical issues. Don’t forget that the end-users of any technology are only exposed to symptoms, not the causes.

And if Mr. Spock doesn’t seem to understand your issues and keeps saying that your statements are illogical, then call in a translator, even if you have to hire him for just one day. I know this all too well, because after all, this one phrase summarizes my entire career: “A bridge person between the marketing world and the IT world.” Although it ain’t easy to live a life as a marginal person.

How to Be a Good Data Scientist

I guess no one wants to be a plain “Analyst” anymore; now “Data Scientist” is the title of the day. Then again, I never thought that there was anything wrong with titles like “Secretary,” “Stewardess” or “Janitor,” either. But somehow, someone decided “Administrative Assistant” should replace “Secretary” completely, and that someone was very successful in that endeavor. So much so that people actually get offended when they are called “Secretaries.” The same goes for “Flight Attendants.” If you want an extra bag of peanuts or the whole can of soda with ice on the side, do not dare to call any service personnel by the outdated title. The jury is still out on the title “Janitor,” as it could be replaced by “Custodial Engineer,” “Sanitary Engineer,” “Maintenance Technician,” or anything that gives an impression that the job requirement includes a degree in engineering. No matter. When the inflation-adjusted income of salaried workers is decreasing, I guess the number of words in the job title should go up instead. Something’s got to give, right?

Please do not ask me to be politically correct here. As an openly Asian person in America, I am not even sure why I should be offended when someone addresses me as an “Oriental.” Someone explained it to me a long time ago. The word is reserved for “things,” not for people. OK, then. I will be offended when someone knowingly addresses me as an Oriental, now that the memo has been out for a while. So, do me this favor and do not call me an Oriental (at least in front of my face), and I promise that I will not call anyone an “Occidental” in return.

In any case, anyone who touches data for a living now wants to be called a Data Scientist. Well, the title is longer than one word, and that is a good start. Did anyone get a raise along with that title inflation? I highly doubt it. But I’ve noticed the qualifications got much longer and more complicated.

I have seen some job requirements for data scientists that call for “all” of the following qualifications:

  • A master’s degree in statistics or mathematics; able to build statistical models proficiently using R or SAS
  • Strong analytical and storytelling skills
  • Hands-on knowledge in technologies such as Hadoop, Java, Python, C++, NoSQL, etc., being able to manipulate the data any which way, independently
  • Deep knowledge in ETL (extract, transform and load) to handle data from all sources
  • Proven experience in data modeling and database design
  • Data visualization skills using whatever tools that are considered to be cool this month
  • Deep business/industry/domain knowledge
  • Superb written and verbal communication skills, being able to explain complex technical concepts in plain English
  • Etc. etc…

I actually cut this list short, as it is already becoming ridiculous. I just want to see the face of a recruiter who got the order to find super-duper candidates based on this list—at the same salary level as a Senior Statistician (another fine title). Heck, while we’re at it, why don’t we add that the candidate must look like Brad Pitt and be able to tap-dance, too? The long and the short of it is maybe some executive wanted to hire just “1” data scientist with all these skillsets, hoping to God that this mad scientist will be able to make sense out of mounds of unstructured and unorganized data all on her own, and provide business answers without even knowing what the question was in the first place.

Over the years, I have worked with many statisticians, analysts and programmers (notice that they are all one-word titles), dealing with large, small, clean, dirty and, at times, really dirty data (hence the title of this series, “Big Data, Small Data, Clean Data, Messy Data”). And navigating through all those data has always been a team effort.

Yes, there are some exceptional musicians who can write music and lyrics, sing really well, play all instruments, program sequencers, record, mix, produce and sell music—all on their own. But if you insist that only such geniuses can produce music, there won’t be much to listen to in this world. Even Stevie Wonder, who can write and sing, and play keyboards, drums and harmonicas, had close to 100 names on the album credits in his heyday. Yes, the digital revolution changed the music scene as much as the data industry in terms of team sizes, but both aren’t and shouldn’t be one-man shows.

So, if being a “Data Scientist” means being a super businessman/analyst/statistician who can program, build models, write, present and sell, we should all just give up searching for one in the near future, at least within our budgets. Realistically, we may be able to find only a few qualified candidates in the entire national job market. Too bad that every industry report says we need tens of thousands of them, right now.

Conversely, if it is just a bloated new title for good old data analysts with some knowledge in statistical applications and the ability to understand business needs—yeah, sure. Why not? I know plenty of those people, and we can groom more of them. And I don’t even mind giving them new long-winded titles that are suitable for the modern business world and peer groups.

I have been in the data business for a long time. And even before the datasets became really large, I have always maintained the following division of labor when dealing with complex data projects involving advanced analytics:

  • Business Analysts
  • Programmers/Developers
  • Statistical Analysts

The reason is very simple: It is extremely difficult to be a master-level expert in just one of these areas. Out of hundreds of statisticians who I’ve worked with, I can count only a handful of people who even “tried” to venture into the business side. Of those, even fewer successfully transformed themselves into businesspeople, and they are now business owners of consulting practices or in positions with “Chief” in their titles (Chief Data Officer or Chief Analytics Officer being the title du jour).

On the other side of the spectrum, less than a 10th of decent statisticians are also good at coding to manipulate complex data. But even they are mostly not good enough to be completely independent from professional programmers or developers. The reality is, most statisticians are not very good at setting up workable samples out of really messy data. Simply put, handling data and developing analytical frameworks or models call for different mindsets on a professional level.

The Business Analysts, I think, are the closest to the modern-day Data Scientists, although the ones in the past were less technical, given the toolsets available back then. Nevertheless, granted that it is much easier to teach business aspects to statisticians or developers than to convert businesspeople or marketers into coders (no offense, but true), many of these “in-between” people—between the marketing world and technology world, for example—are rooted in the technology world (myself included) or at least have a deep understanding of it.

At times labeled as Research Analysts, they are the folks who would:

  • Understand the business requirements and issues at hand
  • Prescribe suitable solutions
  • Develop tangible analytical projects
  • Perform data audits
  • Procure data from various sources
  • Translate business requirements into technical specifications
  • Oversee the progress as project managers
  • Create reports and visual presentations
  • Interpret the results and create “stories”
  • And present the findings and recommended next steps to decision-makers

Sounds complex? You bet it is. And I didn’t even list all the job functions here. And to do this job effectively, these Business/Research Analysts (or Data Scientists) must understand the technical limitations of all related areas, including database, statistics, and general analytics, as well as industry verticals, uniqueness of business models and campaign/transaction channels. But they do not have to be full-blown statisticians or coders; they just have to know what they want and how to ask for it clearly. If they know how to code as well, great. All the more power to them. But that would be like a cherry on top, as the business mindset should be in front of everything.

So, now that the data are bigger and more complex than ever in human history, are we about to combine all aspects of data and analytics business and find people who are good at absolutely everything? Yes, various toolsets made some aspects of analysts’ lives easier and simpler, but not enough to get rid of the partitions between positions completely. Some third basemen may be able to pitch, too. But they wouldn’t go on the mound as starting pitchers—not on a professional level. And yes, analysts who advance up through the corporate and socioeconomic ladder are the ones who successfully crossed the boundaries. But we shouldn’t wait for the ones who are masters of everything. Like I said, even Stevie Wonder needs great sound engineers.

Then, what would be a good path to find Data Scientists in the existing pool of talent? I have been using the following four evaluation criteria to identify individuals with upward mobility in the technology world for a long time. Like I said, it is a lot simpler and easier to teach business aspects to people with technical backgrounds than the other way around.

So let’s start with the techies. These are the qualities we need to look for:

1. Skills: When it comes to the technical aspect of it, the skillset is the most important criterion. Generally a person has it, or doesn’t have it. If we are talking about a developer, how good is he? Can he develop a database without wasting time? A good coder is not just a little faster than mediocre ones; he can be 10 to 20 times faster. I am talking about the ones who don’t have to look through some manual or the Internet every five minutes, but the ones who just know all the shortcuts and options. The same goes for statistical analysts. How well is she versed in all the statistical techniques? Or is she a one-trick pony? How is her track record? Are her models performing in the market for a prolonged time? The thing about statistical work is that time is the ultimate test; we eventually get to find out how well the prediction holds up in the real world.

2. Attitude: This is a very important aspect, as many techies are locked up in their own little world. Many are socially awkward, like characters in Dilbert or “Big Bang Theory,” and most much prefer to deal with the machines (where things are clean-cut binary) than people (well, humans can be really annoying). Some do not work well with others and do not know how to compromise at all, as they do not know how to look at the world from a different perspective. And there are a lot of lazy ones. Yes, lazy programmers are the ones who are more motivated to automate processes (primarily to support their laissez faire lifestyle), but the ones who blow the deadlines all the time are just too much trouble for the team. In short, a genius with a really bad attitude won’t be able to move to the business or the management side, regardless of the IQ score.

3. Communication: Many technical folks are not good at written or verbal communications. I am not talking about just the ones who are foreign-born (like me), even though most technically oriented departments are full of them. The issue is many technical people (yes, even the ones who were born and raised in the U.S., speaking English) do not communicate with the rest of the world very well. Many can’t explain anything without using technical jargon, nor can they summarize messages to decision-makers. Businesspeople don’t need to hear the life story about how complex the project was or how messy the data sets were. Conversely, many techies do not understand marketers or businesspeople who speak plain English. Some fail to grasp the concept that human beings are not robots, and most mortals often fail to communicate every sentence as a logical expression. When a marketer says “Omit customers in New York and New Jersey from the next campaign,” the coder on the receiving end shouldn’t take that as proper Boolean logic. Yes, obviously a state cannot be New York “and” New Jersey at the same time. But most humans don’t (or can’t) distinguish such differences. Seriously, I’ve seen some developers who refuse to work with people whose command of logical expressions isn’t at the level of Mr. Spock. That’s the primary reason we need business analysts or project managers who work as translators between these two worlds. And obviously, the translators should be able to speak both languages fluently.

4. Business Understanding: Granted, the candidates in question are qualified in terms of criteria one through three. Their eagerness to understand the ultimate business goals behind analytical projects would truly set them apart from the rest on the path to becoming a data scientist. As I mentioned previously, many technically oriented people do not really care much about the business side of the deal, or even have slight curiosity about it. What is the business model of the company for which they are working? How do they make money? What are the major business concerns? What are the long- and short-term business goals of their clients? Why do they lose sleep at night? Before complaining about incomplete data, why are the databases so messy? How are the data being collected? What does all this data mean for their bottom line? Can you bring up the “So what?” question after a great scientific finding? And ultimately, how will we make our clients look good in front of “their” bosses? When we deal with technical issues, we often find ourselves at a crossroads. Picking the right path (or a path with the least amount of downsides) is not just an IT decision, but more of a business decision. The person who has a more holistic view of the world, without a doubt, would make a better decision—even for a minor difference in a small feature, in terms of programming. Unfortunately, it is very difficult to find such IT people who have a balanced view.
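The New York and New Jersey example from the Communication criterion is worth spelling out, because it shows exactly where the plain-English request and the literal logic split. Here is a minimal Python sketch with three made-up customers. The field names are assumptions for illustration.

```python
# Hypothetical customer list.
customers = [
    {"id": 1, "state": "NY"},
    {"id": 2, "state": "NJ"},
    {"id": 3, "state": "CT"},
]

# Literal reading: omit rows whose state is NY "and" NJ at once.
# No single row can satisfy that, so nobody is omitted.
literal_reading = [
    c["id"] for c in customers
    if not (c["state"] == "NY" and c["state"] == "NJ")
]

# Intended reading: omit anyone whose state is either one.
intended_reading = [
    c["id"] for c in customers
    if c["state"] not in {"NY", "NJ"}
]

print(literal_reading)    # everyone stays in the campaign
print(intended_reading)   # only the Connecticut customer remains
```

A good translator hears the marketer's "and" and writes the second version without making anyone feel illogical.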

And that is the punchline. We want data scientists who have the right balance of business and technical acumen—not just jacks of all trades who can do all the IT and analytical work all by themselves. Just like business strategy isn’t solely set by a data strategist, data projects aren’t done by one super techie. What we need are business analysts or data scientists who truly “get” the business goals and who will be able to translate them into functional technical specifications, with an understanding of all the limitations of each technology piece that is to be employed—which is quite different from being able to do it all.

If the career path for a data scientist ultimately leads to Chief Data Officer or Chief Analytics Officer, it is important for the candidates to understand that such “chief” titles are all about the business, not the IT. As soon as a CDO, CAO or CTO starts representing technology before business, that organization is doomed. They should be executives who understand the technology and employ it to increase profit and efficiency for the whole company. Movie directors don’t necessarily write scripts, hold the cameras, develop special effects or act out scenes. But they understand all aspects of the movie-making process and put all the resources together to create films that they envision. As soon as a director falls too deep into just one aspect, such as special effects, the resultant movie quickly becomes an unwatchable bore. Data business is the same way.

So what is my advice for young and upcoming data scientists? Master the basics and be a specialist first. Pick a field that fits your aptitude, whether it be programming, software development, mathematics or statistics, and try to be really good at it. But remain curious about other related IT fields.

Then travel the world. Watch lots of movies. Read a variety of books. Not just technical books, but books about psychology, sociology, philosophy, science, economics and marketing, as well. This data business is inevitably related to activities that generate revenue for some organization. Try to understand the business ecosystem, not just technical systems. As marketing will always be a big part of the Big Data phenomenon, be an educated consumer first. Then look at advertisements and marketing campaigns from the promotor’s point of view, not just from an annoyed consumer’s view. Be an informed buyer through all available channels, online or offline. Then imagine how the world will be different in the future, and how a simple concept of a monetary transaction will transform along with other technical advances, which will certainly not stop at ApplePay. All of those changes will turn into business opportunities for people who understand data. If you see some real opportunities, try to imagine how you would create a startup company around them. You will quickly realize answering technical challenges is not even the half of building a viable business model.

If you are already one of those data scientists, live up to that title and be solution-oriented, not technology-oriented. Don’t be a slave to technologies, or be whom we sometimes address as a “data plumber” (who just moves data from one place to another). Be a master who wields data and technology to provide useful answers. And most importantly, don’t be evil (like Google says), and never do things just because you can. Always think about the social consequences, as actions based on data and technology affect real people, often negatively (more on this subject in a future article). If you want to ride this Big Data wave for the foreseeable future, try not to annoy people who may not understand all the ins and outs of the data business. Don’t be the guy who spoils it for everyone else in the industry.

A while back, I started to see the unemployment rate as a rate of people who are being left behind during the progress (if we consider technical innovations as progress). Every evolutionary stage since the Industrial Revolution created gaps between supply and demand of new skillsets required for the new world. And this wave is not going to be an exception. It is unfortunate that, in this age of a high unemployment rate, we have such hard times finding good candidates for high tech positions. On one side, there are too many people who were educated under the old paradigm. And on the other side, there are too few people who can wield new technologies and apply them to satisfy business needs. If this new title “Data Scientist” means the latter, then yes. We need more of them, for sure. But we all need to be more realistic about how to groom them, as it would take a village to do so. And if we can’t even agree on what the job description for a data scientist should be, we will need lots of luck developing armies of them.

Not All Databases Are Created Equal

Not all databases are created equal. No kidding. That is like saying that not all cars are the same, or not all buildings are the same. But somehow, “judging” databases isn’t so easy. First off, there is no tangible “tire” that you can kick when evaluating databases or data sources. Actually, kicking the tire is quite useless, even when you are inspecting an automobile. Can you really gauge the car’s handling, balance, fuel efficiency, comfort, speed, capacity or reliability based on how it feels when you kick “one” of the tires? I can guarantee that your toes will hurt if you kick it hard enough, and even then you won’t be able to tell the tire pressure within 20 psi. If you really want to evaluate an automobile, you will have to sign some papers and take it out for a spin (well, more than one spin, but you know what I mean). Then, how do we take a database out for a spin? That’s when the tool sets come into play.

However, even when the database in question is attached to analytical, visualization, CRM or drill-down tools, it is not so easy to evaluate it completely, as such practice reveals only a few aspects of a database, hardly all of them. That is because such tools are like window treatments of a building, through which you may look into the database. Imagine a building inspector inspecting a building without ever entering it. Would you respect the opinion of the inspector who just parks his car outside the building, looks into the building through one or two windows, and says, “Hey, we’re good to go”? No way, no sir. No one should judge a book by its cover.

In the age of Big Data (you should know by now that I am not too fond of that word), everything digitized is considered data. And data reside in databases. And databases are supposed to be designed to serve specific purposes, just like buildings and cars are. Although many modern databases are just mindless piles of accumulated data, granted that the database design is decent and functional, we can still imagine many different types of databases depending on the purposes and their contents.

Now, most of the Big Data discussions these days are about the platform, environment, or tool sets. I’m sure you heard or read enough about those, so let me boldly skip all that and their related techie words, such as Hadoop, MongoDB, Pig, Python, MapReduce, Java, SQL, PHP, C++, SAS or anything related to that elusive “cloud.” Instead, allow me to show you the way to evaluate databases—or data sources—from a business point of view.

For businesspeople and decision-makers, it is not about NoSQL vs. RDB; it is just about the usefulness of the data. And the usefulness comes from the overall content and database management practices, not just platforms, tool sets and buzzwords. Yes, tool sets are important, but concert-goers do not care much about the types and brands of musical instruments that are being used; they just care if the music is entertaining or not. Would you be impressed with a mediocre guitarist just because he uses the same brand of guitar that his guitar hero uses? Nope. Likewise, the usefulness of a database is not about the tool sets.

In my past column, titled “Big Data Must Get Smaller,” I explained that there are three major types of data, with which marketers can holistically describe their target audience: (1) Descriptive Data, (2) Transaction/Behavioral Data, and (3) Attitudinal Data. In short, if you have access to all three dimensions of the data spectrum, you will have a more complete portrait of customers and prospects. Because I already went through that subject in-depth, let me just say that such types of data are not the basis of database evaluation here, though the contents should be on top of the checklist to meet business objectives.

In addition, throughout this series, I have been repeatedly emphasizing that the database and analytics management philosophy must originate from business goals. Basically, the business objective must dictate the course for analytics, and databases must be designed and optimized to support such analytical activities. Decision-makers—and all involved parties, for that matter—suffer a great deal when that hierarchy is reversed. And unfortunately, that is the case in many organizations today. Therefore, let me emphasize that the evaluation criteria that I am about to introduce here are all about usefulness for decision-making processes and supporting analytical activities, including predictive analytics.

Let’s start digging into key evaluation criteria for databases. This list would be quite useful when examining internal and external data sources. Even databases managed by professional compilers can be examined through these criteria. The checklist could also be applicable to investors who are about to acquire a company with data assets (as in, “Kick the tire before you buy it.”).

1. Depth
Let’s start with the most obvious one. What kind of information is stored and maintained in the database? What are the dominant data variables in the database, and what is so unique about them? Variety of information matters for sure, and uniqueness is often related to specific business purposes for which databases are designed and created, along the lines of business data, international data, specific types of behavioral data like mobile data, categorical purchase data, lifestyle data, survey data, movement data, etc. Then again, mindless compilation of random data may not be useful for any business, regardless of the size.

Generally, data dictionaries (the lack of one is a sure sign of trouble) reveal the depth of the database, but we need to dig deeper, as transaction and behavioral data are much more potent predictors and harder to manage in comparison to demographic and firmographic data, which are very much commoditized already. Likewise, lifestyle variables that are derived from surveys that may have been conducted a long time ago are far less valuable than actual purchase history data, as what people say they do and what they actually do are two completely different things. (For more details on the types of data, refer to the second half of “Big Data Must Get Smaller.”)

Innovative ideas should not be overlooked, as data packaging is often very important in the age of information overflow. If someone or some company transformed many data points into user-friendly formats using modeling or other statistical techniques (imagine pre-developed categorical models targeting a variety of human behaviors, or pre-packaged segmentation or clustering tools), such effort deserves extra points, for sure. As I emphasized numerous times in this series, data must be refined to provide answers to decision-makers. That is why the sheer size of the database isn’t so impressive, and the depth of the database is not just about the length of the variable list and the number of bytes that go along with it. So, data collectors, impress us—because we’ve seen a lot.

2. Width
No matter how deep the information goes, if the coverage is not wide enough, the database becomes useless. Imagine well-organized, buyer-level POS (Point of Sale) data coming from actual stores in “real-time” (though I am sick of this word, as it is also overused). The data go down to SKU-level details and payment methods. Now imagine that the data in question are collected in only two stores, one in Michigan and the other in Delaware. This, by the way, is not a completely made-up story, and I faced similar cases in the past. Needless to say, we had to make many assumptions that we didn’t want to make in order to make the data useful, somehow. And I must say that it was far from ideal.

Even in the age when data are collected everywhere by every device, no dataset is ever complete (refer to “Missing Data Can Be Meaningful”). The limitations are everywhere. It could be about brand, business footprint, consumer privacy, data ownership, collection methods, technical limitations, distribution of collection devices, and the list goes on. Yes, Apple Pay is making a big splash in the news these days. But would you believe that the data collected only through Apple iPhones can really show the overall consumer trend in the country? Maybe in the future, but not yet. If you can pick only one credit card type to analyze, such as American Express for example, would you think that the result of the study is free from any bias? No siree. We can easily assume that such analysis would skew toward the more affluent population. I am not saying that such analyses are useless. And in fact, they can be quite useful if we understand the limitations of data collection and the nature of the bias. But the point is that the coverage matters.

Further, even within multisource databases in the market, the coverage should be examined variable by variable, simply because some data points are really difficult to obtain even by professional data compilers. For example, any information that crosses between the business and the consumer world is sparsely populated in many cases, and the “occupation” variable remains mostly blank or unknown on the consumer side. Similarly, any data related to young children is difficult or even forbidden to collect, so a seemingly simple variable, such as “number of children,” is left unknown for many households. Automobile data used to be abundant on a household level in the past, but a series of laws made sure that the access to such data is forbidden for many users. Again, don’t be impressed with the existence of some variables in the data menu, but look into it to see “how much” is available.
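
As a rough illustration of checking “how much” is available, a fill-rate tally per variable might look like the following Python sketch. The household records, the variable names and the list of values treated as missing are all assumptions made up for this example.

```python
def fill_rates(records, variables, missing=(None, "", "unknown")):
    """Return the share of records holding a usable value for each variable."""
    total = len(records)
    rates = {}
    for var in variables:
        filled = 0
        for rec in records:
            value = rec.get(var)
            if isinstance(value, str):
                value = value.strip().lower()
            if value not in missing:
                filled += 1
        rates[var] = filled / total if total else 0.0
    return rates


# Hypothetical household records, for illustration only.
households = [
    {"income_band": "C", "occupation": "", "num_children": None},
    {"income_band": "A", "occupation": "Unknown", "num_children": 2},
    {"income_band": "B", "occupation": "teacher", "num_children": None},
    {"income_band": "D", "occupation": None, "num_children": None},
]

rates = fill_rates(households, ["income_band", "occupation", "num_children"])
for var, rate in rates.items():
    print(f"{var}: {rate:.0%}")
```

A variable that looks great on the data menu but comes back mostly empty in such a tally deserves a second look before anyone builds a model around it.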

3. Accuracy
In any scientific analysis, a “false positive” is a dangerous enemy. In fact, false positives are worse than not having the information at all. Many folks just assume that any data coming out of a computer is accurate (as in, “Hey, the computer says so!”). But data are not completely free from human errors.

Sheer accuracy of information is hard to measure, especially when the data sources are unique and rare. And the errors can happen in any stage, from data collection to imputation. If there are other known sources, comparing data from multiple sources is one way to ensure accuracy. Watching out for fluctuations in distributions of important variables from update to update is another good practice.

Nonetheless, the overall quality of the data is not just up to the person or department who manages the database. Yes, in this business, the last person who touches the data is responsible for all the mistakes that were made to it up to that point. However, when the garbage goes in, the garbage comes out. So, when there are errors, everyone who touched the database at any point must share in the burden of guilt.

Recently, I was part of a project that involved data collected from retail stores. We ran all kinds of reports and tallies to check the data, and edited many data values out when we encountered obvious errors. The funniest one that I saw was the first name “Asian” and the last name “Tourist.” As an openly Asian-American person, I was semi-glad that they didn’t put in “Oriental Tourist” (though I still can’t figure out who decided that word is for objects, but not people). We also found names like “No info” or “Not given.” Heck, I saw in the news that this refugee from Afghanistan (he was a translator for the U.S. troops) obtained a new first name as he was granted an entry visa, “Fnu.” That would be short for “First Name Unknown” as the first name in his new passport. Welcome to America, Fnu. Compared to that, “Andolini” becoming “Corleone” on Ellis Island is almost cute.

Data entry errors are everywhere. When I used to deal with data files from banks, I found that many last names were “Ira.” Well, it turned out that it wasn’t really the customers’ last names, but they all happened to have opened “IRA” accounts. Similarly, movie phone numbers like 777-555-1234 are very common. And fictitious names, such as “Mickey Mouse,” or profanities that are not fit to print are abundant, as well. At least fake email addresses can be tested and eliminated easily, and erroneous addresses can be corrected by time-tested routines, too. So, yes, maintaining a clean database is not so easy when people freely enter whatever they feel like. But it is not an impossible task, either.

We can also train employees regarding data entry principles, to a certain degree. (As in, “Do not enter your own email address,” “Do not use bad words,” etc.). But what about user-generated data? Search and kill is the only way to do it, and the job would never end. And the meta-table for fictitious names would grow longer and longer. Maybe we should just add “Thor” and “Sponge Bob” to that Mickey Mouse list, while we’re at it. Yet, dealing with this type of “text” data is the easy part. If the database manager in charge is not lazy, and if there is a bit of a budget allowed for data hygiene routines, one can avoid sending emails to “Dear Asian Tourist.”
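
The “search and kill” routine against a meta-table of fictitious names can be sketched in a few lines of Python. The starter list and the function names below are hypothetical, and a real list would keep growing with every update cycle.

```python
import re

# Hypothetical starter list; a real meta-table grows with every update.
PLACEHOLDER_NAMES = {
    "asian tourist", "no info", "not given", "no information",
    "mickey mouse", "sponge bob", "thor",
}


def normalize(text):
    """Lowercase, replace non-letters with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^a-z\s]", " ", (text or "").lower())
    return " ".join(cleaned.split())


def is_placeholder(first_name, last_name):
    """Flag a record whose full name matches the placeholder list."""
    full_name = normalize(f"{first_name or ''} {last_name or ''}")
    return full_name in PLACEHOLDER_NAMES


print(is_placeholder("Asian", "Tourist"))  # True
print(is_placeholder("NO INFO", None))     # True
print(is_placeholder("Maria", "Lopez"))    # False
```

Matching on the full name only is a deliberate choice in this sketch, so that a real customer whose first name happens to be Thor is not thrown out with the superheroes.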

Numeric errors are much harder to catch, as numbers do not look wrong to human eyes. That is when comparison to other known sources becomes important. If such examination is not possible on a granular level, then median value and distribution curves should be checked against historical transaction data or known public data sources, such as U.S. Census Data in the case of demographic information.

When it’s about the companies’ own data, follow your instincts and get rid of data that look too good or too bad to be true. We all can afford to lose a few records in our databases, and there is nothing wrong with deleting the “outliers” with extreme values. Erroneous names, like “No Information,” may be attached to a seven-figure lifetime spending sum, and you know that can’t be right.

The main takeaways are: (1) Never trust the data just because someone bothered to store them in computers, and (2) Constantly look for bad data in reports and listings, at times using old-fashioned eye-balling methods. Computers do not know what is “bad,” until we specifically tell them what bad data are. So, don’t give up, and keep at it. And if it’s about someone else’s data, insist on data tallies and data hygiene stats.

4. Recency
Outdated data are really bad for prediction or analysis, and that is a different kind of badness. Many call it a “Data Atrophy” issue, as no matter how fresh and accurate a data point may be today, it will surely deteriorate over time. Yes, data have a finite shelf-life, too. Let’s say that you obtained a piece of information called “Golf Interest” on an individual level. That information could be coming from a survey conducted a long time ago, or some golf equipment purchase data from a while ago. In any case, someone who is attached to that flag may have stopped shopping for new golf equipment, as he doesn’t play much anymore. Without a proper database update and a constant feed of fresh data, irrelevant data will continue to drive our decisions.

The crazy thing is that, the harder it is to obtain certain types of data—such as transaction or behavioral data—the faster they will deteriorate. By nature, transaction or behavioral data are time-sensitive. That is why it is important to install time parameters in databases for behavioral data. If someone purchased a new golf driver, when did he do that? Surely, having bought a golf driver in 2009 (“Hey, time for a new driver!”) is different from having purchased it last May.
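
One simple way to put a time parameter on behavioral data is to turn each purchase date into a recency band at update time. The following Python sketch assumes a hypothetical reporting date and arbitrary band cut-offs; neither comes from any particular database.

```python
from datetime import date


def months_since(purchase_date, as_of):
    """Whole calendar months between a purchase and the reporting date."""
    return (as_of.year - purchase_date.year) * 12 + (as_of.month - purchase_date.month)


def recency_band(purchase_date, as_of):
    """Bucket a purchase by age; the cut-offs are illustrative assumptions."""
    months = months_since(purchase_date, as_of)
    if months <= 3:
        return "0-3 months"
    if months <= 12:
        return "4-12 months"
    return "over 12 months"


as_of = date(2014, 11, 1)  # hypothetical reporting date
print(recency_band(date(2014, 5, 15), as_of))  # 4-12 months
print(recency_band(date(2009, 6, 1), as_of))   # over 12 months
```

The same golf-driver flag now carries its age with it, which is exactly what a model needs in order to tell a warm lead from a cold one.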

So-called “Hot Line Names” literally cease to be hot after two to three months, or in some cases much sooner. The evaporation period may be different for different product types, as one may stay longer in the market for an automobile than for a new printer. Part of the job of a data scientist is to defer the expiration date of data, finding leads or prospects who are still “warm,” or even “lukewarm,” with available valid data. But no matter how much statistical work goes into making the data “look” fresh, eventually the models will cease to be effective.

For decision-makers who do not make real-time decisions, a real-time database update could be an expensive solution. But the databases must be updated constantly (I mean daily, weekly, monthly or even quarterly). Otherwise, someone will eventually end up making a wrong decision based on outdated data.

5. Consistency
No matter how much effort goes into keeping the database fresh, not all data variables will be updated or filled in consistently. And that is the reality. The interesting thing is that, especially when using them for advanced analytics, we can still provide decent predictions if the data are consistent. It may sound crazy, but even not-so-accurate data can be used in predictive analytics, if they are “consistently” wrong. Modeling is developing an algorithm that differentiates targets and non-targets, and if the descriptive variables are “consistently” off (or outdated, like census data from five years ago) on both sides, the model can still perform.

Conversely, if there is a huge influx of a new type of data, or any drastic change in data collection or in a business model that supports such data collection, all bets are off. We may end up predicting such changes in business models or in methodologies, not the differences in consumer behavior. And that is one of the worst kinds of errors in the predictive business.

Last month, I talked about dealing with missing data (refer to “Missing Data Can Be Meaningful”), and I mentioned that data can be inferred via various statistical techniques. And such data imputation is OK, as long as it returns consistent values. I have seen so many so-called professionals messing up popular models, like “Household Income,” from update to update. If the inferred values jump dramatically due to changes in the source data, there is no amount of effort that can save the targeting models that employed such variables, short of re-developing them.

That is why a time-series comparison of important variables in databases is so important. Any changes of more than 5 percent in distribution of variables when compared to the previous update should be investigated immediately. If you are dealing with external data vendors, insist on having a distribution report of key variables for every update. Consistency of data is more important in predictive analytics than sheer accuracy of data.
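
The update-to-update check can be sketched in Python as below. I am assuming here that “more than 5 percent” means a value’s share of records moved by more than five percentage points, which is only one reasonable reading; the income bands and counts are made up.

```python
from collections import Counter


def shares(values):
    """Share of records falling into each value of a variable."""
    counts = Counter(values)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}


def shifted_values(previous, current, threshold=0.05):
    """Return values whose share moved by more than `threshold` (absolute)."""
    prev, curr = shares(previous), shares(current)
    flagged = {}
    for key in sorted(set(prev) | set(curr), key=str):
        delta = curr.get(key, 0.0) - prev.get(key, 0.0)
        if abs(delta) > threshold:
            flagged[key] = round(delta, 4)
    return flagged


# Hypothetical income-band values from two consecutive updates.
previous = ["A"] * 30 + ["B"] * 50 + ["C"] * 20
current = ["A"] * 22 + ["B"] * 52 + ["C"] * 26

flagged = shifted_values(previous, current)
print(flagged)  # {'A': -0.08, 'C': 0.06}
```

Anything that shows up in the flagged list is a reason to call the data provider before the models get refreshed, not after.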

6. Connectivity
As I mentioned earlier, there are many types of data. And the predictive power of data multiplies as different types of data get to be used together. For instance, demographic data, which is quite commoditized, still plays an important role in predictive modeling, even when dominant predictors are behavioral data. It is partly because no one dataset is complete, and because different types of data play different roles in algorithms.

The trouble is that many modern datasets do not share any common matching keys. On the demographic side, we can easily imagine using PII (Personally Identifiable Information), such as name, address, phone number or email address for matching. Now, if we want to add some transaction data to the mix, we would need some match “key” (or a magic decoder ring) by which we can link it to the base records. Unfortunately, many modern databases completely lack PII, right from the data collection stage. The result is that such a data source would remain in a silo. It is not like all is lost in such a situation, as they can still be used for trend analysis. But to employ multisource data for one-to-one targeting, we really need to establish the connection among various data worlds.

Even if the connection cannot be made to household, individual or email levels, I would not give up entirely, as we can still target based on IP addresses, which may lead us to some geographic denominations, such as ZIP codes. I’d take ZIP-level targeting anytime over no targeting at all, even though there are many analytical and summarization steps required for that (more on that subject in future articles).

Not having PII or any hard matchkey is not a complete deal-breaker, but the maneuvering space for analysts and marketers decreases significantly without it. That is why the existence of PII, or even ZIP codes, is the first thing that I check when looking into a new data source. I would like to free them from isolation.

7. Delivery Mechanisms
Users judge databases based on visualization or reporting tool sets that are attached to the database. As I mentioned earlier, that is like judging the entire building based just on the window treatments. But for many users, that is the reality. After all, how would a casual user without a programming or statistical background even “see” the data? Through tool sets, of course.

But that is only one end of it. There are so many types of platforms and devices, and the data must flow through them all. The important point is that data is useless if it is not in the hands of decision-makers through the device of their choice, at the right time. Such flow can be actualized via API feed, FTP, or good, old-fashioned batch installments, and no database should stay too far away from the decision-makers. In my earlier column, I emphasized that data players must be good at (1) Collection, (2) Refinement, and (3) Delivery (refer to “Big Data is Like Mining Gold for a Watch—Gold Can’t Tell Time”). Delivering the answers to inquirers properly closes one iteration of information flow. And they must continue to flow to the users.

8. User-Friendliness
Even when state-of-the-art (I apologize for using this cliché) visualization, reporting or drill-down tool sets are attached to the database, if the data variables are too complicated or not intuitive, users will get frustrated and eventually move away from it. If that happens after pouring a sick amount of money into any data initiative, that would be a shame. But it happens all the time. In fact, I am not going to name names here, but I saw a ridiculously hard-to-understand data dictionary from a major data broker in the U.S.; it looked like the data layout was designed for robots by robots. Please. Data scientists must try to humanize the data.

This whole Big Data movement has a momentum now, and in the interest of not killing it, data players must make every aspect of this data business easy for the users, not harder. Simpler data fields, intuitive variable names, meaningful value sets, pre-packaged variables in forms of answers, and completeness of a data dictionary are not too much to ask after the hard work of developing and maintaining the database.

This is why I insist that data scientists and professionals must be businesspeople first. The developers should never forget that end-users are not trained data experts. And guess what? Even professional analysts would appreciate intuitive variable sets and complete data dictionaries. So, pretty please, with sugar on top, make things easy and simple.

9. Cost
I saved this important item for last for a good reason. Yes, the dollar sign is a very important factor in all business decisions, but it should not be the sole deciding factor when it comes to databases. That means CFOs should not dictate the decisions regarding data or databases without considering the input from CMOs, CTOs, CIOs or CDOs who should be, in turn, concerned about all the other criteria listed in this article.

Playing with the data costs money. And, at times, a lot of money. When you add up all the costs for hardware, software, platforms, tool sets, maintenance and, most importantly, the man-hours for database development and maintenance, the sum becomes very large very fast, even in the age of the open-source environment and cloud computing. That is why many companies outsource the database work to share the financial burden of having to create infrastructures. But even in that case, the quality of the database should be evaluated based on all criteria, not just the price tag. In other words, don’t just pick the lowest bidder and hope to God that it will be alright.

When you purchase external data, you can also apply these evaluation criteria. A test-match job with a data vendor will reveal lots of details that are listed here; and metrics, such as match rate and variable fill-rate, along with the complete data dictionary, should be carefully examined. In short, what good is a lower unit price per 1,000 records, if the match rate is horrendous and even matched data are filled with missing or sub-par inferred values? Also consider that, once you commit to an external vendor and start building models and analytical framework around their data, it becomes very difficult to switch vendors later on.

When shopping for external data, consider the following when it comes to pricing options:

  • Number of variables to be acquired: Don’t just go for the full option. Pick the ones that you need (involve analysts), unless you get a fantastic deal for an all-inclusive option. Generally, most vendors provide multiple-packaging options.
  • Number of records: Processed vs. Matched. Some vendors charge based on “processed” records, not just matched records. Depending on the match rate, it can make a big difference in total cost.
  • Installment/update frequency: Real-time, weekly, monthly, quarterly, etc. Think carefully about how often you would need to refresh “demographic” data, which doesn’t change as rapidly as transaction data, and how big the incremental universe would be for each update. Obviously, a real-time API feed can be costly.
  • Delivery method: API vs. Batch Delivery, for example. Price, as well as the data menu, change quite a bit based on the delivery options.
  • Availability of a full-licensing option: When the internal database becomes really big, full installment becomes a good option. But you would need internal capability for a match and append process that involves “soft-match,” using similar names and addresses (imagine good-old name and address merge routines). It becomes a bit of commitment as the match and append becomes a part of the internal database update process.
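
To see how much the “processed vs. matched” billing basis, the match rate and the fill rate can move the real price, here is a back-of-the-envelope Python sketch. Every number in it (file size, rates, price per 1,000 records) is a made-up assumption, not a quote from any vendor.

```python
def cost_per_usable_record(records_sent, match_rate, fill_rate,
                           price_per_thousand, billed_on="matched"):
    """Return (total cost, cost per matched record with usable values)."""
    matched = records_sent * match_rate
    billable = records_sent if billed_on == "processed" else matched
    total_cost = billable / 1000 * price_per_thousand
    usable = matched * fill_rate
    if usable == 0:
        raise ValueError("no usable records; unit cost is undefined")
    return total_cost, total_cost / usable


# Hypothetical vendor A: cheaper list price, billed on processed records.
cost_a, unit_a = cost_per_usable_record(
    1_000_000, match_rate=0.40, fill_rate=0.50,
    price_per_thousand=20, billed_on="processed")

# Hypothetical vendor B: higher list price, billed on matched records only.
cost_b, unit_b = cost_per_usable_record(
    1_000_000, match_rate=0.75, fill_rate=0.80,
    price_per_thousand=30, billed_on="matched")

print(f"Vendor A: ${cost_a:,.0f} total, ${unit_a:.4f} per usable record")
print(f"Vendor B: ${cost_b:,.0f} total, ${unit_b:.4f} per usable record")
```

In this made-up case the vendor with the lower price tag ends up costing more per record that can actually be used, which is the whole point of insisting on a test match first.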

Business First
Evaluating a database is a project in itself, and these nine evaluation criteria will be a good guideline. Depending on the businesses, of course, more conditions could be added to the list. And that is the final point that I did not even include in the list: That the database (or all data, for that matter) should be useful to meet the business goals.

I have been saying that “Big Data Must Get Smaller,” and this whole Big Data movement should be about (1) Cutting down on the noise, and (2) Providing answers to decision-makers. If the data sources in question do not serve the business goals, cut them out of the plan, or cut loose the vendor if they are from external sources. It would be an easy decision if you “know” that the database in question is filled with dirty, sporadic and outdated data that cost lots of money to maintain.

But if that database is needed for your business to grow, clean it, update it, expand it and restructure it to harness better answers from it. Just like the way you’d maintain your cherished automobile to get more mileage out of it. Not all databases are created equal for sure, and some are definitely more equal than others. You just have to open your eyes to see the differences.

It’s All About Ranking

The decision-making process is really all about ranking. As a marketer, to whom should you be talking first? What product should you offer through what channel? As a businessperson, whom should you hire among all the candidates? As an investor, what stocks or bonds should you purchase? As a vacationer, where should you visit first?

Yes, “choice” is the keyword in all of these questions. And if you picked Paris over other places as an answer to the last question, you just made a choice based on some ranking order in your mind. The world is big, and there could have been many factors that contributed to that decision, such as culture, art, cuisine, attractions, weather, hotels, airlines, prices, deals, distance, convenience, language, etc., and I am pretty sure that not all factors carried the same weight for you. For example, if you put more weight on “cuisine,” I can see why London would lose a few points to Paris in that ranking order.

As a citizen, for whom should I vote? That’s the choice based on your ranking among candidates, too. Call me overly analytical (and I am), but I see the difference in political stances as differences in “weights” for many political (and sometimes not-so-political) factors, such as economy, foreign policy, defense, education, tax policy, entitlement programs, environmental issues, social issues, religious views, local policies, etc. Every voter puts different weights on these factors, and the sum of them becomes the score for each candidate in their minds. No one thinks that education is not important, but among all these factors, how much weight should it receive? Well, that is different for everybody; hence, the political differences.

I didn’t bring this up to start a political debate, but rather to point out that the decision-making process is based on ranking, and the ranking scores are made of many factors with different weights. And that is how the statistical models are designed in a nutshell (so, that means the models are “nuts”?). Analysts call those factors “independent variables,” which describe the target.
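
As a toy illustration of “factors with different weights,” the vacation example can be sketched as a weighted sum. The weights and ratings below are invented for illustration; a real statistical model would estimate the weights from data rather than have someone type them in.

```python
# Hypothetical weights: how much each factor matters to one traveler.
weights = {"cuisine": 0.5, "art": 0.3, "price": 0.2}

# Hypothetical 1-10 ratings of each destination on each factor.
ratings = {
    "Paris": {"cuisine": 9, "art": 9, "price": 4},
    "London": {"cuisine": 6, "art": 8, "price": 4},
    "Lisbon": {"cuisine": 7, "art": 6, "price": 8},
}


def weighted_score(option_ratings, weights):
    """Sum of weight * rating over all factors."""
    return sum(weights[factor] * option_ratings[factor] for factor in weights)


# Rank destinations from highest to lowest score.
ranked = sorted(ratings, key=lambda city: weighted_score(ratings[city], weights),
                reverse=True)
print(ranked)
```

Change the weights (say, put most of the weight on “price”) and the order changes, which is the whole point: same factors, different weights, different ranking.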

In my past columns, I talked about the importance of statistical models in the age of Big Data (refer to “Why Model?”), and why marketing databases must be “model-ready” (refer to “Chicken or the Egg? Data or Analytics?”). Now let’s dig a little deeper into the design of the “model-ready” marketing databases. And surprise! That is also all about “ranking.”

Let’s step back into the marketing world, where folks are not easily offended by the subject matter. If I give a spreadsheet that contains thousands of leads for your business, you wouldn’t be able to tell easily which ones are the “Glengarry Glen Ross” leads that came from Downtown, along with those infamous steak knives. What choice would you have then? Call everyone on the list? I guess you can start picking names out of a hat. If you think a little more about it, you may filter the list by the first name, as they may reflect the decade in which they were born. Or start calling folks who live in towns that sound affluent. Heck, you can start calling them in alphabetical order, but the point is that you would “sort” the list somehow.

Now, if the list came with some other valuable information, such as income, age, gender, education level, socio-economic status, housing type, number of children, etc., you may be able to pick and choose by which variables you would use to sort the list. You may start calling the high income folks first. Not all product sales are positively related to income, but it is an easy way to start the process. Then, you would throw in other variables to break the ties in rich areas. I don’t know what you’re selling, but maybe, you would want folks who live in a single-family house with kids. And sometimes, your “gut” feeling may lead you to the right place. But only sometimes. And only when the size of the list is not in millions.

If the list was not for prospecting calls, but for a CRM application where you also need to analyze past transaction and interaction history, the list of the factors (or variables) that you need to consider would be literally nauseating. Imagine the list contains all kinds of dollars, dates, products, channels and other related numbers and figures in a seemingly endless series of columns. You’d have to scroll to the right for quite some time just to see what’s included in the chart.

In situations like that, how nice would it be if some analyst threw in just two model scores for responsiveness to your product and the potential value of each customer, for example? The analysts may have considered hundreds (or thousands) of variables to derive such scores for you, and all you need to know is that the higher the score, the more likely the lead will be responsive or have higher potential value. For your convenience, the analyst may have converted all those numbers with many decimal places into easy-to-understand 1-10 or 1-20 scales. That would be nice, wouldn’t it? Now you can just start calling the folks in model group No. 1.
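
One simple way to turn raw scores into such groups is sketched below. It assumes equal-sized groups with group No. 1 holding the highest scores; the function name and the sample leads are hypothetical, and an analyst may well cut the groups differently.

```python
def assign_model_groups(scores, n_groups=10):
    """Map each record ID to a group number; group 1 holds the highest scores.

    scores: dict of record ID -> raw model score (higher is better).
    Groups are equal-sized, up to rounding.
    """
    ordered = sorted(scores, key=scores.get, reverse=True)
    n = len(ordered)
    return {rid: (rank * n_groups) // n + 1 for rank, rid in enumerate(ordered)}


# Hypothetical raw scores for 20 leads.
raw_scores = {f"lead_{i:02d}": i / 20 for i in range(20)}
groups = assign_model_groups(raw_scores, n_groups=10)

# The leads to call first.
call_first = sorted(rid for rid, g in groups.items() if g == 1)
print(call_first)
```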

But let me throw in a curveball here. Let’s go back to the list with all those transaction data attached, but without the model scores. You may say, “Hey, that’s OK, because I’ve been doing alright without any help from a statistician so far, and I’ll just use the past dollar amount as their primary value and sort the list by it.” And that is a fine plan, in many cases. Then, when you look deeper into the list, you find out there are multiple entries for the same name all over the place. How can you sort the list of leads if the list is not even on an individual level? Welcome to the world of relational databases, where every transaction deserves an entry in a table.

Relational databases are optimized to store every transaction and retrieve them efficiently. In a relational database, tables are connected by match keys, and many times, tables are connected in what we call “1-to-many” relationships. Imagine a shopping basket. There is a buyer, and we need to record the buyer’s ID number, name, address, account number, status, etc. Each buyer may have multiple transactions, and for each transaction, we now have to record the date, dollar amount, payment method, etc. Further, if the buyer put multiple items in a shopping basket, that transaction, in turn, is in yet another 1-to-many relationship to the item table. You see, in order to record everything that just happened, this relational structure is very useful. If you are the person who has to create the shipping package, yes, you need to know all the item details, transaction value and the buyer’s information, including the shipping and billing address. Database designers love this completeness so much, they even call this structure the “normal” state.

But the trouble with the relational structure is that each line is describing transactions or items, not the buyers. Sure, one can “filter” people out by interrogating every line in the transaction table, say “Select buyers who had any transaction over $100 in the past 12 months.” That is what I call rudimentary filtering, but once we start asking complex questions such as, “What is the buyer’s average transaction amount for the past 12 months in the outdoor sports category, and what is the overall future value of the customers through online channels?” then you will need what we call “buyer-centric” portraits, not transaction- or item-centric records. Better yet, if I ask you to rank every customer in the order of such future value, well, good luck doing that when all the tables are describing transactions, not people. That would be exactly like the case where you have multiple lines for one individual when you need to sort the leads from high value to low.

So, how do we remedy this? We need to summarize the database on an individual level, if you would like to sort the leads on an individual level. If the goal is to rank households, email addresses, companies, business sites or products, then the summarization should be done on those levels, too. Now, database designers call it the “de-normalization” process, and the tables tend to get “wide” along that process, but that is the necessary step in order to rank the entities properly.
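
Here is a minimal sketch of that summarization, going from one row per transaction to one row per buyer. The field names and sample data are assumptions for illustration, and it presumes a reliable buyer ID already exists.

```python
from collections import defaultdict

# Hypothetical transaction-level rows: several lines per buyer.
transactions = [
    {"buyer_id": "B1", "amount": 120.0},
    {"buyer_id": "B2", "amount": 45.0},
    {"buyer_id": "B3", "amount": 150.0},
    {"buyer_id": "B1", "amount": 80.0},
    {"buyer_id": "B3", "amount": 30.0},
    {"buyer_id": "B3", "amount": 60.0},
]


def summarize_by_buyer(rows):
    """Collapse transaction rows into one 'wide' record per buyer."""
    summary = defaultdict(lambda: {"n_transactions": 0, "total_amount": 0.0,
                                   "max_amount": 0.0})
    for row in rows:
        rec = summary[row["buyer_id"]]
        rec["n_transactions"] += 1
        rec["total_amount"] += row["amount"]
        rec["max_amount"] = max(rec["max_amount"], row["amount"])
    for rec in summary.values():
        rec["avg_amount"] = rec["total_amount"] / rec["n_transactions"]
    return dict(summary)


buyers = summarize_by_buyer(transactions)

# Now there is exactly one line per buyer, so ranking is a simple sort.
ranked = sorted(buyers, key=lambda b: buyers[b]["total_amount"], reverse=True)
print(ranked)
```

The same pattern applies to households, email addresses or companies; only the ID used for grouping changes.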

Now, the starting point in all the summarizations is proper identification numbers for those levels. It won’t be possible to summarize any table on a household level without a reliable household ID. One may think that such things are given, but I would have to disagree. I’ve seen so many so-called “state of the art” (another cliché that makes me nauseous) databases that do not have consistent IDs of any kind. If your database managers say they are using “plain name” or “email address” fields for matching or summarization, be afraid. Be very afraid. As a starter, you know how many email addresses one person may have. To add to that, consider how many people move around each year.

Things get worse in regard to ranking by model scores when it comes to “unstructured” databases. We see more and more of those, as the data sources are getting into uncharted territories, and the size of the databases is growing exponentially. There, all these bits and pieces of data are sitting on mysterious “clouds” as entries on their own. Here again, it is one thing to select or filter based on collected data, but ranking based on some statistical modeling is simply not possible in such a structure (or lack thereof). Just ask the database managers how many 24-month active customers they really have, considering a great many people move in that time period and change their addresses, creating multiple entries. If you get an answer like “2 million-ish,” well, that’s another scary moment. (Refer to “Cheat Sheet: Is Your Database Marketing Ready?”)

In order to develop models using variables that are descriptors of customers, not transactions, we must convert those relational or unstructured data into a structure that matches the level by which you would like to rank the records. Even temporarily. As databases are getting bigger and bigger and storage is getting cheaper and cheaper, I’d say that the temporary time period could be, well, indefinite. And because the word “data-mart” is overused and confusing to many, let me just call that place the “Analytical Sandbox.” Sandboxes are fun, and yes, all kinds of fun stuff for marketers and analysts happen there.

The Analytical Sandbox is where samples are created for model development; actual models are built; models are scored for every record, no matter how many there are, without hiccups; targets are easily sorted and selected by model scores; reports are created in meaningful and consistent ways (consistency is even more important than sheer accuracy in what we do); and analytical languages such as SAS, SPSS or R are spoken without being frowned upon by other computing folks. Here, analysts will spend their time pondering target definitions and methodologies, not database structures and incomplete data fields. Have you heard about a fancy term called “in-database scoring”? This is where that happens, too.

And what comes out of the Analytical Sandbox and back into the world of relational or unstructured databases (IT folks often ask this question) is going to be very simple. Instead of having to move mountains of data back and forth, all the variables will be in the form of model scores, providing answers to marketing questions, without any missing values (by definition, every record can be scored by models). While the scores pack tons of information into them, their size could be as small as a couple of bytes or even less. Even if you carry over a few hundred affinity scores for 100 million people (or any other types of entities), I wouldn’t call the resultant file large, as it would be as small as a few video files, really.
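
A quick back-of-the-envelope check of that size claim, as a sketch only: I am assuming 300 scores as “a few hundred” and one byte per score (a 1-20 scale fits comfortably in a byte), with no compression or file overhead.

```python
def score_file_size_gb(n_records, n_scores, bytes_per_score=1):
    """Raw size in gigabytes (10**9 bytes) of a fixed-width score file."""
    return n_records * n_scores * bytes_per_score / 1e9


# Assumed figures: 100 million people, 300 affinity scores, 1 byte each.
size_gb = score_file_size_gb(100_000_000, 300)
print(f"{size_gb:.0f} GB")
```

Under those assumptions the file lands in the tens of gigabytes, which is tiny next to the transaction history that went into the scores.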

In my future columns, I will explain how to create model-ready (and human-ready) variables using all kinds of numeric, character or free-form data. In Exhibit A, you will see what we call traditional analytical activities colored in dark blue on the right-hand side. In order to make those processes really hum, we must follow all the steps that are on the left-hand side of that big cylinder in the middle. To prevent garbage-in-garbage-out situations, this is where all the data get collected in uniform fashion, properly converted, edited and standardized by uniform rules, categorized based on preset meta-tables, consolidated with consistent IDs and summarized to desired levels, and where meaningful variables are created for more advanced analytics.

Even more than statistical methodologies, consistent and creative variables in the form of “descriptors” of the target audience make or break the marketing plan. Many people think that purchasing expensive analytical software will provide all the answers. But lest we forget, fancy software only answers the right-hand side of Exhibit A, not all of it. Creating a consistent template for all useful information in a uniform fashion is the key to maximizing the power of analytics. If you look into any modeling bakeoff in the industry, you will see that the differences in methodologies are measured in fractions. Conversely, inconsistent and incomplete data create disasters in the real world. And in many cases, companies can’t even attempt advanced analytics while sitting on mountains of data, due to structural inadequacies.

I firmly believe the Big Data movement should be about

  1. getting rid of the noise, and
  2. providing simple answers to decision-makers.

Bragging about the size and the speed element alone will not bring us to the next level, which is to “humanize” the data. At the end of the day (another cliché that I hate), it is all about supporting the decision-making processes, and the decision-making process is all about ranking different options. So, in the interest of keeping it simple, let’s start by creating an analytical haven where all those rankings become easy, in case you think that the sandbox is too juvenile.