Consumer Marketers, Looking to Test New Data Categories? Try These

We in the data marketing business love to test — at least, we should. And what we should test for is new data categories.

Expanding the marketing universe, and stretching the marketing budget, depends on higher efficiency in our lists, offers, and creative. We should be eager to test new proofs of concept and new categories of data sources as they enter the market, if only to learn whether or not they produce incremental results.

I’m still surprised when I hear some of my data-vendor friends say that a good number of their clients pass on testing — and just go all-in on new lists and data sources. It seems like testing is still too much work for some, or they feel the only way to test is with an entire data source. Guess these client-side folks have money to burn, or are operating very much on-the-fly.
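
For readers who think a test requires buying the entire data source: a sample is usually enough to compare response rates. The sketch below is one common way to do that (a two-proportion z-test), not a method prescribed in this column, and the mail quantities and response counts are hypothetical.

```python
import math

def two_proportion_z_test(resp_a, n_a, resp_b, n_b):
    """Compare response rate of test cell B against control cell A.

    Returns (relative lift of B over A, z statistic, two-sided p-value).
    """
    rate_a = resp_a / n_a
    rate_b = resp_b / n_b
    # Pooled response rate under the assumption that both cells perform the same.
    pooled = (resp_a + resp_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    lift = rate_b / rate_a - 1
    return lift, z, p_value

# Hypothetical numbers: the existing list (control) vs. a sample from a new data source.
lift, z, p_value = two_proportion_z_test(resp_a=200, n_a=10_000, resp_b=260, n_b=10_000)
print(f"lift={lift:.1%}  z={z:.2f}  p={p_value:.4f}")
```

A small p-value suggests the difference between the two cells is unlikely to be noise; whether the lift justifies the cost of the new source is still a business call.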

In some ways, digital marketers have it all over offline marketers in their ability to test, cycle, test again, and so on — often, many times over by the time a direct mail or direct-response print or broadcast test cycle has run its course. Yet, in this speed, have we sacrificed some quality in our prospecting strategies?

Online audience algorithms can produce some highly categorized niche segments, based on site visits and app usage — much of it de-identified, from a personal perspective. But how do these segments really stack up against a transaction database, or response lists, or even compiled lists, based on personally identifiable information? Thankfully, we can test for this, or even overlay data! (I am not advocating re-identification here, nor should you. Oh California, please don’t force us to identify non-PII. It’s soooo anti-privacy.)

Recently, the Direct Marketing Club of New York (DMCNY) held a very interesting breakfast program titled “Beyond Demographics: The Data You Need to Max Out Marketing Performance.”

Some Fresh Categories for New Reach and Affinity Discovery

Consider some of these data sources for testing:

  • Values Data: Test cohorts based on “shared values,” rather than simply choosing audiences based on demographics or psychographics. David Allison, principal, David Allison Inc., and author of “We Are All the Same Age Now,” pointed to his firm’s internal research that shows that popularly defined age groups rarely (or barely) match on what they agree upon, or value, as a generation. For example, Baby Boomers agree with each other about 13% of the time; Gen X, about 11% of the time; and Millennials, 15% of the time. Thus, targeting based on demographics alone can be extremely wasteful if the marketer is assuming some sort of shared attribute among them, other than age. However, when targeting is based on shared “values” (Adventurers, Savers, Techsters, and the like), all of a sudden affinities jump sky-high: in these cases, to 89%, 76%, and 81%, respectively. These “valuegraphics” are based on “big data” segments, rather than small data (response lists, for example). Still, when compared to demographics targeting alone, shared-value targeting offers an eightfold lift! Well, that’s worth testing.
  • Attitudinal Data: Another perspective on “beyond demographics” came from Mark Himmelsbach, co-founder, Episode Four, a creator of “brand hits,” such as one for Charles Schwab. We often have stereotypical views of many demographic and other audience categories, and too many algorithms, he said. But analyze the data for unusual patterns, and suddenly you can find “who knew?” commonalities among certain audience segments that would wow any of us. Who knew that ultra-high-net-worth individuals are electronic dance music enthusiasts? Who knew that African-American married women are high on the e-sports genre? Or that young Hispanic/Latino adventurers are really into escape rooms? These discoveries give brands new advertising, product placement, and sponsorship opportunities, for example, which might otherwise go untapped. I’m still trying to get my head around these reported affinities, biased no doubt by my own preconceptions.
  • Location Data: According to the World Economic Forum, 90% of the world will soon have or already has a supercomputer in their pocket: a smartphone. We’re actually closing in on four connected devices per person, reports Jeff White, founder and CEO, Gravy Analytics. With smartphones alone, as constant companions, we have a huge opportunity to make responsible use of location data. Location can provide huge “affinity” targeting opportunities. A casual wine user might search for and buy wine online. But a wine aficionado visits a winery (Location X), or attends a wine tasting (Location Y), and now you have a true affinity opportunity. Granted, location data has a level of sensitivity that carries, more often than not, an opt-in requirement, but the marketing lift can be a significant reward for the advertiser who strategically applies insights from it. Makes me want to tag every latitude and longitude for some hobby or interest!
  • Experiential Data: Live Nation may own concert venues, Ticketmaster, online game communities and music/culture festivals, but across these many first-party experiences, the company can provide deep analytics that help monetize its various audiences through enriched second-party relationships, said Anubhav Mehrotra, VP, Live Nation. Hilton, American Express, and Uber are just some of the brands Live Nation has teamed up with to enrich brand users with engaging experiences, such as backstage tours and “meet the artists.”

We are all trying to create and sustain customers, using data to discover new patterns, new audiences, and new prospects — and that requires a lot of testing, and innovative data sets to explore (responsibly). Let’s make it experiential, as well as experimental: I sure hope to meet some ultra-high-net-worth individuals at the next Electronic Dance Festival I attend. Or not.

It’s Decision Time for Data Privacy (or Will Be Soon)

Chet Dalzell’s recent thoughtful piece on “Our Digital Selves” came along at the same time I (and probably a gazillion others) were pondering the increasingly pressing question of data privacy in the digital age.

It’s a much bigger question than what data can be used to target potential customers for the latest widget or widget club, or to stop you in your tracks at the supermarket in front of the pet food shelves to tell you that Fido, your beloved Fido, seen in the picture on your cell phone, absolutely must have the new, nutritious and tasty Dogbits, or he may bite your fingers off if you try to give him anything else.

The data question goes to the heart of how we see ourselves in the digital world. And how we see ourselves is in no way clear — even to ourselves.

“Bottom line: If Facebook’s users in the United States are similar to most Americans (and studies suggest they are), large majorities don’t want personalized ads — and when they learn how companies find out information about them, even greater percentages don’t want them.”

That’s what Joseph Turow, a professor of communications, and Chris Jay Hoofnagle, an adjunct professor of law, say in The New York Times, using various research to support their thesis. The problem is that what people tell researchers is not always what they do. Facebook’s quarterly earnings statement showed these enlightening KPIs:

  • Monthly active users (MAUs) — MAUs were 2.32 billion as of Dec. 31, 2018, an increase of 9%, year-over-year.
  • We estimate that around 2.7 billion people now use Facebook, Instagram, WhatsApp or Messenger (our “Family” of services) each month, and more than 2 billion people use at least one of our Family of services every day, on average.

It has been said over and over again that everything has its price. Assuming that this is largely true, how much value or benefit should the consumer expect in return for how much and which data? As I wrote in a comment to Chet’s article, this is sure to be the data-use question we’ll all be turning in our minds as the algorithms get smarter and the temptations greater.

Imagine that you could put a value on each element of your personal, demographic, psychographic and behavioral data, and that anyone wanting to use that data would have to pay your price, whether or not you ended up making a purchase or taking a desired action. Imagine further that a data user wanted to use $20 worth of your data to try to sell you a product you wanted, priced at $100. It would be an easy transaction if the seller were willing to offer you a discount of 20% or more for the specific permission to use the data. You would have the product, the seller would have the sale and everyone would be happy.

However fanciful that scenario, it is not nearly as crazy as it sounds. In fact, in one form or another, that is exactly what is happening in the real marketplace, although without your specific permission. As a marketer, I have to spend money to acquire your data and, by making an attractive offer (say a 20% discount), I am offering to compensate you for your data, which allows me to talk to you.

Of course, I have over-simplified the argument. As stated earlier: How much value or benefit should the consumer expect in return for how much and which data?

I think we would all agree that this determination is much too complicated, so we let the “invisible hand of the market” do its magic. Which reduces the decision to a very simple one: Do we perceive that we get enough value from having our data out there in the marketplace to be manipulated however the marketers wish to and simply lie back and enjoy all the offers and benefits? Or should we bite the bullet, give our cell phones to a needy child, do without Waze and get lost again and again, be prepared to stand in the endless line at the bank, throw the “delete everything” switch and effectively remove ourselves from the digital economy? It is getting near decision time for all of us.

I remember many years ago in London, as “one of those Americans,” being lectured over lunch by a very traditional British publisher about the horrors of books being sold by mail order and direct mail. She assured me that the British wouldn’t have anything to do with book clubs or the like. Just when the bill had been paid and we were preparing to depart, she reached into her handbag and pulled out an all-singing, all-dancing mailing piece from Reader’s Digest, offering a very handsome discount on its superb motorist bible, the “Book of the Road.”

She was going to order it right away.

Marketing Success Sans ‘Every Breath They Take, Every Move They Make’

Last month, I talked about how to measure success when there are many conflicting goals and available metrics flying around (refer to “Marketing Success Metrics: Response or Dollars?”). This time, let’s start thinking about how to act on data and intelligence that we’ve gathered. And that means we get to touch different kinds of advanced analytics.

But before we get into boring analytics talk, citing words like “predictive analytics” and “segmentation,” let’s talk about what kind of data are required to make predictions better and more accurate. After all, no data, no analytics.

I often get asked what the “best” kind of data is. And my answer is, to the inquirer’s disappointment, “it depends.” It really depends on what you are trying to predict, or ultimately, do. If you would like to have an accurate forecast of future sales, such an effort calls for a past sales history (but not necessarily on an individual or transactional level); past and current marcom spending by channel; web and other channel traffic data; and environmental data, such as economic indicators, just to start off.

Conversely, if you’d like to predict an individual’s product affinity, preferred offer types or likelihood to respond to certain promotion types, such predictive modeling requires data about the past behavior of the target. And that word “behavior” may evoke different responses, even among seasoned marketers. Yes, we are all reflections of our past behavior, but what does that mean? Every breath you take, every move you make?

Thanks to the Big Data hype a few years back, many now believe that we should just collect anything and everything about everybody. Surely, the cost of data collection, storage and maintenance has decreased quite a bit over the years, but that doesn’t mean that we should just hoard data mindlessly. Doing so merely defers the inevitable data hygiene, standardization, categorization and consolidation to future users (or machines), who must sort out unorganized and unrefined data and provide applicable insights.

So, going back to that question of what makes up data about human behavior, let’s define what that means in a categorical fashion. With proliferation of digital data collection and analytics, the definition of behavioral data has expanded considerably.

In short, what people casually refer to as “behavioral data” may include any of the following:

  • Online Behavior: Web data regarding click, view and other shopping behavior.
  • Purchase: Transactional data, made of who, what, when, how much and through what channel.
  • Response: Response history, in relation to specific promotions, covering open, click-through, opt-out, view, shopping basket, conversion/transaction. Offline response may be as simple as product purchase.
  • Channel: Channel usage data, not necessarily limited to shopping behavior.
  • Payment: Payment and related delinquent history — essential for credit purchases and continuity and subscription businesses.
  • Communication: Call, chat or other communication log data, positive or negative in nature.
  • Movement: Physical proximity or movement data, in store or store area, for example.
  • Survey: Responses to various surveys.
  • Opt-in/Opt-out: Sign-ups for specific two-way communications and channel-specific opt-out requests.
  • Social Media: Product review, social media posting and product/service-related sentiment data.

I am sure some will think of more categories. But before we create an exhaustive list of data types, let’s pause and think about what we are trying to do here.

First off, all of these data traceable to a person are being collected for one major reason (at least for marketers): To sell more things to them. If the goal is to predict the who, what, when and why of buying behavior, do we really need all of this?

The ‘Who’ of Buying Behavior

In the prediction business, predicting “who” (as in “who will buy this product?”) is the simplest kind of action. We’d need some PII (personally identifiable information) that can link to buying behaviors of the target. After all, the whole modeling technique was invented to rank target individuals and set up contact priority, in that order. Think of sending expensive catalogs only to individuals with high “likely to respond” scores, or of B2B sales teams contacting high “likely to convert” targets first.
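
To make “rank, then set contact priority” concrete, here is a minimal sketch. It assumes a response model has already produced a score per individual; the IDs, scores and `budget` parameter are all hypothetical.

```python
def contact_priority(scores, budget):
    """Return the IDs of the `budget` highest-scoring targets, best first."""
    # Step 1: rank every individual by model score, highest first.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # Step 2: contact only as many as the budget allows.
    return [target_id for target_id, _ in ranked[:budget]]

# Hypothetical "likely to respond" scores keyed by customer ID.
model_scores = {"A101": 0.82, "A102": 0.15, "A103": 0.47, "A104": 0.91, "A105": 0.33}

# Suppose the catalog budget covers only two recipients.
mail_list = contact_priority(model_scores, budget=2)
print(mail_list)
```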

The ‘What’ of Buying Behavior

The next difficulty level lies with the prediction of “what” (as in “what is that target individual going to buy next?”). This type of prediction is generally hit-or-miss, so even mighty Amazon displays multiple product offers at the end of a successful transaction, by saying “Customers who purchased this item are also interested in these products.” Such a gentle push, based on collaborative filtering, requires massive purchase history by many buyers to be effective. But, provided with ample amounts of data, it is not terribly difficult, and the risk of being wrong is relatively low. Pinpointing the very next product for 1:1 messaging can be challenging, but product basket analysis can easily lead to popular combinations of products, at the minimum.
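
The simplest form of basket analysis is just counting which products show up together. The sketch below does only that (pair co-occurrence counts); it is not collaborative filtering proper and not anyone’s production recommender, and the baskets are made-up examples.

```python
from collections import Counter
from itertools import combinations

def top_pairs(baskets, n=3):
    """Count how often each pair of products appears in the same basket."""
    pair_counts = Counter()
    for basket in baskets:
        # Sort and de-duplicate so (a, b) and (b, a) count as the same pair.
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    return pair_counts.most_common(n)

# Hypothetical transactions, one list of products per order.
baskets = [
    ["tent", "stove", "lantern"],
    ["tent", "stove"],
    ["stove", "fuel"],
    ["tent", "lantern"],
    ["stove", "fuel", "tent"],
]

for pair, count in top_pairs(baskets):
    print(pair, count)
```

With enough real transactions, the most frequent pairs become candidates for “also bought” offers, even before any 1:1 next-product model exists.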

6 More Thorny Data Problems That Vex B-to-B Marketers, and How to Solve Them

B-to-B data continues to challenge marketers, who need to identify and communicate with customers and prospects, but who run into thorny issues every day. Problems range from duplicates, to key-entry errors, to missing data elements, and beyond. Recently, Bernice Grossman and I worked with a group of savvy B-to-B marketers at a DMA conference to compile a list of difficult data problems. Here are six that will bring tears to your eyes—but don’t worry, we also offer some solutions.

  1. How do I find out the names of individuals who visit my website?
    There are two ways to de-anonymize the website visit. First, add a registration invitation to your site. This could be an email sign-up, or a piece of gated content, like a white paper or research report, in exchange for providing important data elements like name, title, company name, address, phone and email.
    Second, use the IP address to identify the company from which the visitor arrived. This can be done by hand, using Google Analytics, or more easily by using any number of services that enable IP address look-up. Marketing automation systems are increasingly baking this option into their tools.

    But the IP address method will still not get you the name of the visitor. You can infer the visitor’s interests and, possibly, role by looking at the time spent on various pages. And you can drop a cookie and retarget the visitor with text or banner ads later.

  2. Job titles are increasingly inconsistent, and proliferating. Categories like marketing manager and financial analyst don’t seem to work anymore.
    Several companies offer job title standardization services, called something like title mapping, title translation or title beautification. A resource like that is a good first step.

    Then, consider sending an outbound email, perhaps with a follow-up phone call, positioned as a “contact verification” message. Invite the target to indicate his or her functional job title, from a list.

    After that, you will be left with a relatively smaller list of remaining titles. At that point, you need to decide on a default for the rest of them. For example, anything that sounds like IT will go in an IT functional bucket. And, depending on how often you query your customers, you can always gather answers to this question over time.

    Then, you are faced with the remaining issue, which is far more difficult, namely the crazy new titles that some people are using these days. We’ve seen bizarre titles like Chief Instigating Officer and Marketing Diva. With these, you have two options.

    • Force aberrant titles into your standards, by hand, using your best guess. Use a default code for anything you can’t really figure out.
    • Leave them as they are, and link them to a table of standardized job functions. But maintain the self-reported wacky title, too, so you can still address the person the way he or she wants to be addressed.

    You might also consider using forced drop-down menus for job function and job title, at the point of key entry.

  3. How should I handle job changes? When an employee leaves and goes to another company, does his or her history with my company go along?
    We are going to assume—a big assumption—that you actually know the person has gone to a new company. It’s more likely that you will not know. This is why it’s a good idea to do periodic de-duplications by functional title to get a sense of new names that have popped up at the companies in your database.

    When you know that there is a job change and you have the new information, you must move the contact to the new company in your database. It’s a good idea to send along behavioral data like communications preferences. You might also add a LinkedIn profile URL to the record. If you believe the prior behavioral data is important, then take it as a duplicate, and put it in a separate field, not attributing it to the new company record.

    The purchase history belongs with the original company, and should stay there. Indicate in the company record that the individual has left.
    As a general rule, in marketing databases, never overwrite. Keep everything date stamped.

  4. We want our sales people to be selling, and keep administrative tasks to a minimum. But these people are also the closest resources to our customers. How can we motivate them to capture important data about the customers and prospects they are interacting with?
    Boil down the mission to just one or two key data points that reps are asked to collect and report. Job title, buying role and email address are among the most likely to change, and perhaps the most important to keep current. Train and reward the reps on consistent reporting on the selected elements.
  5. In an effort to improve web-form response rates, we are asking for only name and email address. What’s the best way to create a company record in this situation?
    We recommend that you consider hiring a service that will fill in the company record on the spot, as a start. Or send the file out to a third-party compiler to append the records you need.

    Another way is to parse the email address. Take the letters after the @ and before the .com. For example, if the email is formatted as firstname.lastname@hp.com, the meaningful letters are hp. Search for other emails with these letters in this position in your file, and build a business rule that every email with these letters shall be assigned that company name. If you have a standard record on your file, import it.

    If the email address is a generic one, like gmail.com or yahoo.com, it’s more difficult. Email the prospect and ask for more data. You could also consider preventing email addresses other than those from company domains from being accepted on the web form. But keep in mind that there is some evidence that individuals filling out web forms with personal email addresses tend to be more responsive over time.

  6. We need to get our international customer data under control. Where should we start?
    First, add country name as a required field in your web forms and other response vehicles, so that future data collection will be set. Use a dropdown menu to improve capture of a standardized country name. Prevent the record from moving forward until the country is specified.

    Then, look at what parts of the world you do business in. Estimate how many countries, and how many customer records in each country, so you can see how big an issue this is.

    Then, figure out which records in the database are non-U.S. This will take some effort. Many databases don’t have a non-domestic indicator. There is no easy way around it.

    Country names are increasingly important as laws change. Consider Canada’s onerous new email law, which requires proven opt-in before emailing. You can’t assume that those email addresses ending with .ca are the only Canadian emails on your file. One suggestion is to update your web forms with a message like “If you are in Canada, opt in here.”
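
The email-parsing rule from item 5 can be sketched in a few lines. This is a minimal illustration under assumptions, not a finished matching routine: the function names and sample records are hypothetical, the generic-domain list contains only the two domains named above, and the sketch matches on the full domain (hp.com) rather than just the letters before .com, so that non-.com addresses are handled too.

```python
# Personal email domains that say nothing about the employer (extend as needed).
GENERIC_DOMAINS = {"gmail.com", "yahoo.com"}

def domain_of(email):
    """Return the lowercased part of the address after the last @."""
    return email.strip().lower().rpartition("@")[2]

def build_domain_rules(known_records):
    """Learn domain -> company name from records where the company is known."""
    rules = {}
    for email, company in known_records:
        domain = domain_of(email)
        if company and domain not in GENERIC_DOMAINS:
            # Keep the first company name seen for each domain.
            rules.setdefault(domain, company)
    return rules

def assign_company(email, rules):
    """Return the company for an address, or None if it cannot be inferred."""
    domain = domain_of(email)
    if domain in GENERIC_DOMAINS:
        return None  # personal address: ask the prospect for more data instead
    return rules.get(domain)

# Hypothetical records already on file.
rules = build_domain_rules([("jane.doe@hp.com", "HP"), ("sam@gmail.com", "Acme")])
print(assign_company("Firstname.Lastname@HP.com", rules))
print(assign_company("someone@gmail.com", rules))
```

Records that come back as None are the ones to route to a follow-up email or an append service.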

You can find more thorny data issues and solutions in our new white paper, available for free download. Please submit any other issues you may be facing, using the comments section here, and we’ll be happy to suggest some solutions.

Not All Databases Are Created Equal

Not all databases are created equal. No kidding. That is like saying that not all cars are the same, or not all buildings are the same. But somehow, “judging” databases isn’t so easy. First off, there is no tangible “tire” that you can kick when evaluating databases or data sources. Actually, kicking the tire is quite useless, even when you are inspecting an automobile. Can you really gauge the car’s handling, balance, fuel efficiency, comfort, speed, capacity or reliability based on how it feels when you kick “one” of the tires? I can guarantee that your toes will hurt if you kick it hard enough, and even then you won’t be able to tell the tire pressure within 20 psi. If you really want to evaluate an automobile, you will have to sign some papers and take it out for a spin (well, more than one spin, but you know what I mean). Then, how do we take a database out for a spin? That’s when the tool sets come into play.

However, even when the database in question is attached to analytical, visualization, CRM or drill-down tools, it is not so easy to evaluate it completely, as such practice reveals only a few aspects of a database, hardly all of them. That is because such tools are like window treatments of a building, through which you may look into the database. Imagine a building inspector inspecting a building without ever entering it. Would you respect the opinion of the inspector who just parks his car outside the building, looks into the building through one or two windows, and says, “Hey, we’re good to go”? No way, no sir. No one should judge a book by its cover.

In the age of Big Data (you should know by now that I am not too fond of that word), everything digitized is considered data. And data reside in databases. And databases are supposed to be designed to serve specific purposes, just like buildings and cars are. Although many modern databases are just mindless piles of accumulated data, granted that the database design is decent and functional, we can still imagine many different types of databases depending on their purposes and contents.

Now, most of the Big Data discussions these days are about the platform, environment, or tool sets. I’m sure you heard or read enough about those, so let me boldly skip all that and their related techie words, such as Hadoop, MongoDB, Pig, Python, MapReduce, Java, SQL, PHP, C++, SAS or anything related to that elusive “cloud.” Instead, allow me to show you the way to evaluate databases—or data sources—from a business point of view.

For businesspeople and decision-makers, it is not about NoSQL vs. RDB; it is just about the usefulness of the data. And the usefulness comes from the overall content and database management practices, not just platforms, tool sets and buzzwords. Yes, tool sets are important, but concert-goers do not care much about the types and brands of musical instruments that are being used; they just care if the music is entertaining or not. Would you be impressed with a mediocre guitarist just because he uses the same brand of guitar that his guitar hero uses? Nope. Likewise, the usefulness of a database is not about the tool sets.

In my past column, titled “Big Data Must Get Smaller,” I explained that there are three major types of data, with which marketers can holistically describe their target audience: (1) Descriptive Data, (2) Transaction/Behavioral Data, and (3) Attitudinal Data. In short, if you have access to all three dimensions of the data spectrum, you will have a more complete portrait of customers and prospects. Because I already went through that subject in-depth, let me just say that such types of data are not the basis of database evaluation here, though the contents should be on top of the checklist to meet business objectives.

In addition, throughout this series, I have been repeatedly emphasizing that the database and analytics management philosophy must originate from business goals. Basically, the business objective must dictate the course for analytics, and databases must be designed and optimized to support such analytical activities. Decision-makers—and all involved parties, for that matter—suffer a great deal when that hierarchy is reversed. And unfortunately, that is the case in many organizations today. Therefore, let me emphasize that the evaluation criteria that I am about to introduce here are all about usefulness for decision-making processes and supporting analytical activities, including predictive analytics.

Let’s start digging into key evaluation criteria for databases. This list would be quite useful when examining internal and external data sources. Even databases managed by professional compilers can be examined through these criteria. The checklist could also be applicable to investors who are about to acquire a company with data assets (as in, “Kick the tire before you buy it.”).

1. Depth
Let’s start with the most obvious one. What kind of information is stored and maintained in the database? What are the dominant data variables in the database, and what is so unique about them? Variety of information matters for sure, and uniqueness is often related to specific business purposes for which databases are designed and created, along the lines of business data, international data, specific types of behavioral data like mobile data, categorical purchase data, lifestyle data, survey data, movement data, etc. Then again, mindless compilation of random data may not be useful for any business, regardless of the size.

Generally, data dictionaries (the lack of one is a sure sign of trouble) reveal the depth of the database, but we need to dig deeper, as transaction and behavioral data are much more potent predictors and harder to manage in comparison to demographic and firmographic data, which are very much commoditized already. Likewise, lifestyle variables derived from surveys that may have been conducted a long time ago are far less valuable than actual purchase history data, as what people say they do and what they actually do are two completely different things. (For more details on the types of data, refer to the second half of “Big Data Must Get Smaller.”)

Innovative ideas should not be overlooked, as data packaging is often very important in the age of information overflow. If someone or some company transformed many data points into user-friendly formats using modeling or other statistical techniques (imagine pre-developed categorical models targeting a variety of human behaviors, or pre-packaged segmentation or clustering tools), such effort deserves extra points, for sure. As I emphasized numerous times in this series, data must be refined to provide answers to decision-makers. That is why the sheer size of the database isn’t so impressive, and the depth of the database is not just about the length of the variable list and the number of bytes that go along with it. So, data collectors, impress us—because we’ve seen a lot.

2. Width
No matter how deep the information goes, if the coverage is not wide enough, the database becomes useless. Imagine well-organized, buyer-level POS (Point of Sale) data coming from actual stores in “real-time” (though I am sick of this word, as it is also overused). The data go down to SKU-level details and payment methods. Now imagine that the data in question are collected in only two stores: one in Michigan, and the other in Delaware. This, by the way, is not a completely made-up story, and I faced similar cases in the past. Needless to say, we had to make many assumptions that we didn’t want to make in order to make the data useful, somehow. And I must say that it was far from ideal.

Even in the age when data are collected everywhere by every device, no dataset is ever complete (refer to “Missing Data Can Be Meaningful”). The limitations are everywhere. It could be about brand, business footprint, consumer privacy, data ownership, collection methods, technical limitations, distribution of collection devices, and the list goes on. Yes, Apple Pay is making a big splash in the news these days. But would you believe that the data collected only through Apple iPhone can really show the overall consumer trend in the country? Maybe in the future, but not yet. If you can pick only one credit card type to analyze, such as American Express for example, would you think that the result of the study is free from any bias? No siree. We can easily assume that such analysis would skew toward the more affluent population. I am not saying that such analyses are useless. And in fact, they can be quite useful if we understand the limitations of data collection and the nature of the bias. But the point is that the coverage matters.

Further, even within multisource databases in the market, the coverage should be examined variable by variable, simply because some data points are really difficult to obtain even by professional data compilers. For example, any information that crosses between the business and the consumer world is sparsely populated in many cases, and the “occupation” variable remains mostly blank or unknown on the consumer side. Similarly, any data related to young children is difficult or even forbidden to collect, so a seemingly simple variable, such as “number of children,” is left unknown for many households. Automobile data used to be abundant on a household level in the past, but a series of laws made sure that the access to such data is forbidden for many users. Again, don’t be impressed with the existence of some variables in the data menu, but look into it to see “how much” is available.
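
To make “how much is available” concrete, here is a minimal Python sketch of a variable-by-variable fill-rate tally. The field names, the sample records and the list of values treated as blank are all made up for illustration; a real file would need its own rules for what counts as unknown.

```python
def fill_rates(records, blank_values=("", None, "unknown")):
    """Return the share of records with a usable value for each variable."""
    if not records:
        return {}
    fields = sorted({name for rec in records for name in rec})
    rates = {}
    for name in fields:
        filled = 0
        for rec in records:
            value = rec.get(name)
            if isinstance(value, str):
                value = value.strip().lower()
            # A zero is a real answer (e.g., no children), so it counts as filled.
            if value not in blank_values:
                filled += 1
        rates[name] = filled / len(records)
    return rates


# Hypothetical sample: ZIP is always there, occupation almost never is.
sample = [
    {"zip": "07030", "occupation": "", "num_children": None},
    {"zip": "10001", "occupation": "Unknown", "num_children": 2},
    {"zip": "19801", "occupation": "teacher", "num_children": None},
    {"zip": "48104", "occupation": "", "num_children": 0},
]
rates = fill_rates(sample)
for name, rate in rates.items():
    print(f"{name}: {rate:.0%} filled")
```

A variable that shows up on the data menu but fills only a quarter of the records deserves a very different price than one that is nearly complete.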

3. Accuracy
In any scientific analysis, a “false positive” is a dangerous enemy. In fact, false positives are worse than not having the information at all. Many folks just assume that any data coming out of a computer is accurate (as in, “Hey, the computer says so!”). But data are not completely free from human errors.

Sheer accuracy of information is hard to measure, especially when the data sources are unique and rare. And the errors can happen in any stage, from data collection to imputation. If there are other known sources, comparing data from multiple sources is one way to ensure accuracy. Watching out for fluctuations in distributions of important variables from update to update is another good practice.

Nonetheless, the overall quality of the data is not just up to the person or department who manages the database. Yes, in this business, the last person who touches the data is responsible for all the mistakes that were made to it up to that point. However, when the garbage goes in, the garbage comes out. So, when there are errors, everyone who touched the database at any point must share in the burden of guilt.

Recently, I was part of a project that involved data collected from retail stores. We ran all kinds of reports and tallies to check the data, and edited many data values out when we encountered obvious errors. The funniest one that I saw was the first name “Asian” and the last name “Tourist.” As an openly Asian-American person, I was semi-glad that they didn’t put in “Oriental Tourist” (though I still can’t figure out who decided that word is for objects, but not people). We also found names like “No info” or “Not given.” Heck, I saw in the news that this refugee from Afghanistan (he was a translator for the U.S. troops) obtained a new first name as he was granted an entry visa, “Fnu.” That would be short for “First Name Unknown” as the first name in his new passport. Welcome to America, Fnu. Compared to that, “Andolini” becoming “Corleone” on Ellis Island is almost cute.

Data entry errors are everywhere. When I used to deal with data files from banks, I found that many last names were “Ira.” Well, it turned out that it wasn’t really the customers’ last names, but they all happened to have opened “IRA” accounts. Similarly, movie phone numbers like 777-555-1234 are very common. And fictitious names, such as “Mickey Mouse,” or profanities that are not fit to print are abundant, as well. At least fake email addresses can be tested and eliminated easily, and erroneous addresses can be corrected by time-tested routines, too. So, yes, maintaining a clean database is not so easy when people freely enter whatever they feel like. But it is not an impossible task, either.

We can also train employees regarding data entry principles, to a certain degree. (As in, “Do not enter your own email address,” “Do not use bad words,” etc.). But what about user-generated data? Search and kill is the only way to do it, and the job would never end. And the meta-table for fictitious names would grow longer and longer. Maybe we should just add “Thor” and “Sponge Bob” to that Mickey Mouse list, while we’re at it. Yet, dealing with this type of “text” data is the easy part. If the database manager in charge is not lazy, and if there is a bit of a budget allowed for data hygiene routines, one can avoid sending emails to “Dear Asian Tourist.”
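
The “meta-table for fictitious names” can be as simple as a lookup that flags suspect records for review. The sketch below is only that, a sketch: the two lists are seeded with examples from this article, the function names are hypothetical, and a production routine would need a far longer table plus a human looking at the borderline cases (a real customer could well be named Thor).

```python
# Hypothetical starter tables, seeded with the examples mentioned above.
SUSPECT_FULL_NAMES = {
    ("asian", "tourist"),
    ("mickey", "mouse"),
    ("sponge", "bob"),
}
SUSPECT_FIELD_VALUES = {"no info", "not given", "no information", "fnu"}


def normalize(text):
    """Lowercase and collapse whitespace so 'No  Info ' matches 'no info'."""
    return " ".join(text.lower().split())


def flag_suspect_name(first, last):
    """Return a reason string if the name looks like a placeholder, else None."""
    first_n, last_n = normalize(first), normalize(last)
    if (first_n, last_n) in SUSPECT_FULL_NAMES:
        return "fictitious or placeholder full name"
    if first_n in SUSPECT_FIELD_VALUES or last_n in SUSPECT_FIELD_VALUES:
        return "placeholder value in a name field"
    return None


for first, last in [("Asian", "Tourist"), ("No info", ""), ("Vito", "Corleone")]:
    print(first, last, "->", flag_suspect_name(first, last))
```

Flagging rather than deleting keeps the decision with a person, which matters because the table will never be complete.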

Numeric errors are much harder to catch, as numbers do not look wrong to human eyes. That is when comparison to other known sources becomes important. If such examination is not possible on a granular level, then median value and distribution curves should be checked against historical transaction data or known public data sources, such as U.S. Census Data in the case of demographic information.
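
As a rough illustration of checking a median against a known source, the sketch below computes the relative gap between a file’s median and a reference median. Every number here is invented for the example, including the reference figure, which is not a real Census value.

```python
from statistics import median


def median_gap(values, reference_median):
    """Relative gap between a file's median and a trusted reference median."""
    return (median(values) - reference_median) / reference_median


# Hypothetical income values from a file, with one suspicious extreme entry.
file_incomes = [42_000, 48_000, 51_000, 55_000, 60_000, 63_000, 990_000]
gap = median_gap(file_incomes, reference_median=53_000)
print(f"Median is {gap:+.1%} away from the reference")
```

The median barely moves when one extreme value sneaks in, which is why it is a safer yardstick than the average for this kind of comparison. The extreme value itself still needs to be found through a distribution check.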

When it’s about the companies’ own data, follow your instincts and get rid of data that look too good or too bad to be true. We all can afford to lose a few records in our databases, and there is nothing wrong with deleting the “outliers” with extreme values. Erroneous names, like “No Information,” may be attached to a seven-figure lifetime spending sum, and you know that can’t be right.

The main takeaways are: (1) Never trust the data just because someone bothered to store them in computers, and (2) Constantly look for bad data in reports and listings, at times using old-fashioned eye-balling methods. Computers do not know what is “bad,” until we specifically tell them what bad data are. So, don’t give up, and keep at it. And if it’s about someone else’s data, insist on data tallies and data hygiene stats.

4. Recency
Outdated data are really bad for prediction or analysis, and that is a different kind of badness. Many call it a “Data Atrophy” issue, as no matter how fresh and accurate a data point may be today, it will surely deteriorate over time. Yes, data have a finite shelf-life, too. Let’s say that you obtained a piece of information called “Golf Interest” on an individual level. That information could be coming from a survey conducted a long time ago, or some golf equipment purchase data from a while ago. In any case, someone who is attached to that flag may have stopped shopping for new golf equipment, as he doesn’t play much anymore. Without a proper database update and a constant feed of fresh data, irrelevant data will continue to drive our decisions.

The crazy thing is that, the harder it is to obtain certain types of data—such as transaction or behavioral data—the faster they will deteriorate. By nature, transaction or behavioral data are time-sensitive. That is why it is important to install time parameters in databases for behavioral data. If someone purchased a new golf driver, when did he do that? Surely, having bought a golf driver in 2009 (“Hey, time for a new driver!”) is different from having purchased it last May.

So-called “Hot Line Names” literally cease to be hot after two to three months, or in some cases much sooner. The evaporation period may be different for different product types, as one may stay longer in the market for an automobile than for a new printer. Part of the job of a data scientist is to defer the expiration date of data, finding leads or prospects who are still “warm,” or even “lukewarm,” with available valid data. But no matter how much statistical work goes into making the data “look” fresh, eventually the models will cease to be effective.

For decision-makers who do not make real-time decisions, a real-time database update could be an expensive solution. But the databases must be updated constantly (I mean daily, weekly, monthly or even quarterly). Otherwise, someone will eventually end up making a wrong decision based on outdated data.

5. Consistency
No matter how much effort goes into keeping the database fresh, not all data variables will be updated or filled in consistently. And that is the reality. The interesting thing is that, especially when using them for advanced analytics, we can still provide decent predictions if the data are consistent. It may sound crazy, but even not-so-accurate data can be used in predictive analytics, if they are “consistently” wrong. Modeling is developing an algorithm that differentiates targets and non-targets, and if the descriptive variables are “consistently” off (or outdated, like census data from five years ago) on both sides, the model can still perform.

Conversely, if there is a huge influx of a new type of data, or any drastic change in data collection or in a business model that supports such data collection, all bets are off. We may end up predicting such changes in business models or in methodologies, not the differences in consumer behavior. And that is one of the worst kinds of errors in the predictive business.

Last month, I talked about dealing with missing data (refer to “Missing Data Can Be Meaningful”), and I mentioned that data can be inferred via various statistical techniques. And such data imputation is OK, as long as it returns consistent values. I have seen so many so-called professionals messing up popular models, like “Household Income,” from update to update. If the inferred values jump dramatically due to changes in the source data, there is no amount of effort that can save the targeting models that employed such variables, short of re-developing them.

That is why a time-series comparison of important variables in databases is so important. Any changes of more than 5 percent in distribution of variables when compared to the previous update should be investigated immediately. If you are dealing with external data vendors, insist on having a distribution report of key variables for every update. Consistency of data is more important in predictive analytics than sheer accuracy of data.
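
A distribution report of this kind does not need fancy tooling. The sketch below compares the share of each value between two updates and flags anything that moved by more than 5 percentage points. Reading “5 percent” as percentage points of share is my assumption for this example (a relative change would be an equally defensible reading), and the income bands and counts are made up.

```python
from collections import Counter


def value_shares(values):
    """Share of records falling under each distinct value."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def shifted_values(previous, current, threshold=0.05):
    """List values whose share moved by more than `threshold` between updates."""
    prev_shares = value_shares(previous)
    curr_shares = value_shares(current)
    flagged = {}
    for value in set(prev_shares) | set(curr_shares):
        change = curr_shares.get(value, 0.0) - prev_shares.get(value, 0.0)
        if abs(change) > threshold:
            flagged[value] = round(change, 4)
    return flagged


# Hypothetical income-band variable in two consecutive updates.
previous_update = ["low"] * 30 + ["mid"] * 50 + ["high"] * 20
current_update = ["low"] * 31 + ["mid"] * 41 + ["high"] * 28
flagged = shifted_values(previous_update, current_update)
print(flagged)
```

Here the “mid” band lost 9 points and the “high” band gained 8, so both would be investigated before any model that uses the variable is scored against the new update.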

6. Connectivity
As I mentioned earlier, there are many types of data. And the predictive power of data multiplies as different types of data get to be used together. For instance, demographic data, which is quite commoditized, still plays an important role in predictive modeling, even when dominant predictors are behavioral data. It is partly because no one dataset is complete, and because different types of data play different roles in algorithms.

The trouble is that many modern datasets do not share any common matching keys. On the demographic side, we can easily imagine using PII (Personally Identifiable Information), such as name, address, phone number or email address for matching. Now, if we want to add some transaction data to the mix, we would need some match “key” (or a magic decoder ring) by which we can link it to the base records. Unfortunately, many modern databases completely lack PII, right from the data collection stage. The result is that such a data source would remain in a silo. It is not like all is lost in such a situation, as they can still be used for trend analysis. But to employ multisource data for one-to-one targeting, we really need to establish the connection among various data worlds.

Even if the connection cannot be made to household, individual or email levels, I would not give up entirely, as we can still target based on IP addresses, which may lead us to some geographic denominations, such as ZIP codes. I’d take ZIP-level targeting anytime over no targeting at all, even though there are many analytical and summarization steps required for that (more on that subject in future articles).

Not having PII or any hard matchkey is not a complete deal-breaker, but the maneuvering space for analysts and marketers decreases significantly without it. That is why the existence of PII, or even ZIP codes, is the first thing that I check when looking into a new data source. I would like to free them from isolation.

7. Delivery Mechanisms
Users judge databases based on visualization or reporting tool sets that are attached to the database. As I mentioned earlier, that is like judging the entire building based just on the window treatments. But for many users, that is the reality. After all, how would a casual user without a programming or statistical background even “see” the data? Through tool sets, of course.

But that is only one end of it. There are so many types of platforms and devices, and the data must flow through them all. The important point is that data is useless if it is not in the hands of decision-makers through the device of their choice, at the right time. Such flow can be actualized via API feed, FTP, or good, old-fashioned batch installments, and no database should stay too far away from the decision-makers. In my earlier column, I emphasized that data players must be good at (1) Collection, (2) Refinement, and (3) Delivery (refer to “Big Data is Like Mining Gold for a Watch—Gold Can’t Tell Time”). Delivering the answers to inquirers properly closes one iteration of information flow. And the answers must continue to flow to the users.

8. User-Friendliness
Even when state-of-the-art (I apologize for using this cliché) visualization, reporting or drill-down tool sets are attached to the database, if the data variables are too complicated or not intuitive, users will get frustrated and eventually move away from it. If that happens after pouring a sick amount of money into any data initiative, that would be a shame. But it happens all the time. In fact, I am not going to name names here, but I saw a ridiculously hard-to-understand data dictionary from a major data broker in the U.S.; it looked like the data layout was designed for robots by robots. Please. Data scientists must try to humanize the data.

This whole Big Data movement has a momentum now, and in the interest of not killing it, data players must make every aspect of this data business easy for the users, not harder. Simpler data fields, intuitive variable names, meaningful value sets, pre-packaged variables in forms of answers, and completeness of a data dictionary are not too much to ask after the hard work of developing and maintaining the database.

This is why I insist that data scientists and professionals must be businesspeople first. The developers should never forget that end-users are not trained data experts. And guess what? Even professional analysts would appreciate intuitive variable sets and complete data dictionaries. So, pretty please, with sugar on top, make things easy and simple.

9. Cost
I saved this important item for last for a good reason. Yes, the dollar sign is a very important factor in all business decisions, but it should not be the sole deciding factor when it comes to databases. That means CFOs should not dictate the decisions regarding data or databases without considering the input from CMOs, CTOs, CIOs or CDOs who should be, in turn, concerned about all the other criteria listed in this article.

Playing with the data costs money. And, at times, a lot of money. When you add up all the costs for hardware, software, platforms, tool sets, maintenance and, most importantly, the man-hours for database development and maintenance, the sum becomes very large very fast, even in the age of the open-source environment and cloud computing. That is why many companies outsource the database work to share the financial burden of having to create infrastructures. But even in that case, the quality of the database should be evaluated based on all criteria, not just the price tag. In other words, don’t just pick the lowest bidder and hope to God that it will be alright.

When you purchase external data, you can also apply these evaluation criteria. A test-match job with a data vendor will reveal lots of details that are listed here; and metrics, such as match rate and variable fill-rate, along with the complete data dictionary, should be carefully examined. In short, what good is a lower unit price per 1,000 records if the match rate is horrendous and even matched data are filled with missing or sub-par inferred values? Also consider that, once you commit to an external vendor and start building models and an analytical framework around its data, it becomes very difficult to switch vendors later on.

When shopping for external data, consider the following when it comes to pricing options:

  • Number of variables to be acquired: Don’t just go for the full option. Pick the ones that you need (involve analysts), unless you get a fantastic deal for an all-inclusive option. Generally, most vendors provide multiple-packaging options.
  • Number of records: Processed vs. Matched. Some vendors charge based on “processed” records, not just matched records. Depending on the match rate, it can make a big difference in total cost.
  • Installment/update frequency: Real-time, weekly, monthly, quarterly, etc. Think carefully about how often you would need to refresh “demographic” data, which doesn’t change as rapidly as transaction data, and how big the incremental universe would be for each update. Obviously, a real-time API feed can be costly.
  • Delivery method: API vs. Batch Delivery, for example. Price, as well as the data menu, change quite a bit based on the delivery options.
  • Availability of a full-licensing option: When the internal database becomes really big, full installment becomes a good option. But you would need internal capability for a match and append process that involves “soft-match,” using similar names and addresses (imagine good-old name and address merge routines). It becomes a bit of commitment as the match and append becomes a part of the internal database update process.
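
To see why unit price alone can mislead, consider a back-of-the-envelope calculation of cost per usable record, where “usable” means matched and actually filled. The function, its parameter names and both vendor quotes below are hypothetical; plug in the numbers from your own test-match job.

```python
def cost_per_usable_record(records_sent, match_rate, fill_rate,
                           price_per_thousand, billed_on="processed"):
    """Cost of each matched-and-filled record under a given pricing basis."""
    if billed_on not in ("processed", "matched"):
        raise ValueError("billed_on must be 'processed' or 'matched'")
    matched = records_sent * match_rate
    billable = records_sent if billed_on == "processed" else matched
    total_cost = billable / 1000 * price_per_thousand
    usable = matched * fill_rate
    return total_cost / usable


# Hypothetical Vendor A: cheap per thousand, billed on every processed record.
vendor_a = cost_per_usable_record(1_000_000, match_rate=0.40, fill_rate=0.50,
                                  price_per_thousand=20.0, billed_on="processed")
# Hypothetical Vendor B: twice the unit price, billed on matched records only.
vendor_b = cost_per_usable_record(1_000_000, match_rate=0.80, fill_rate=0.90,
                                  price_per_thousand=40.0, billed_on="matched")
print(f"Vendor A: ${vendor_a:.3f} per usable record")
print(f"Vendor B: ${vendor_b:.3f} per usable record")
```

In this made-up comparison, the vendor with half the unit price ends up costing more than twice as much per record you can actually use.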

Business First
Evaluating a database is a project in itself, and these nine evaluation criteria will be a good guideline. Depending on the businesses, of course, more conditions could be added to the list. And that is the final point that I did not even include in the list: That the database (or all data, for that matter) should be useful to meet the business goals.

I have been saying that “Big Data Must Get Smaller,” and this whole Big Data movement should be about (1) Cutting down on the noise, and (2) Providing answers to decision-makers. If the data sources in question do not serve the business goals, cut them out of the plan, or cut loose the vendor if they are from external sources. It would be an easy decision if you “know” that the database in question is filled with dirty, sporadic and outdated data that cost lots of money to maintain.

But if that database is needed for your business to grow, clean it, update it, expand it and restructure it to harness better answers from it. Just like the way you’d maintain your cherished automobile to get more mileage out of it. Not all databases are created equal for sure, and some are definitely more equal than others. You just have to open your eyes to see the differences.

Beyond RFM Data

In the world of predictive analytics, the transaction data is the king of the hill. The master of the domain. The protector of the realm. Why? Because they are hands-down the most powerful predictors. If I may borrow the term that my mentor coined for our cooperative venture more than a decade ago (before anyone even uttered the word “Big Data”), “The past behavior is the best predictor of the future behavior.” Indeed. Back then, we had built a platform that nowadays could easily have qualified as Big Data. The platform predicted people’s future behaviors on a massive scale, and it worked really well, so I still stand by that statement.

How so? At the risk of sounding like a pompous mathematical smartypants (I’m really not), it is because people do not change that much, or if so, not so rapidly. Every move you make is on some predictive curve. What you have been buying, clicking, browsing, smelling or coveting somehow leads to the next move. Well, not all the time. (Maybe you just like to “look” at pretty shoes?) But with enough data, we can calculate the probability with some confidence that you would be an outdoors type, or a golfer, or a relaxing type on a cruise ship, or a risk-averse investor, or a wine enthusiast, or into fashion, or a passionate gardener, or a sci-fi geek, or a professional wrestling fan. Beyond affinity scores listed here, we can predict future value of each customer or prospect and possible attrition points, as well. And behind all those predictive models (and I have seen countless algorithms), the leading predictors are mostly transaction data, if you are lucky enough to get your hands on them. In the age of ubiquitous data and at the dawn of the “Internet of Things,” more marketers will be in that lucky group if they are diligent about data collection and refinement. Yes, in the near future, even a refrigerator will be able to order groceries, but don’t forget that only the collection mechanism will be different there. We still have to collect, refine and analyze the transaction data.

Last month, I talked about three major types of data (refer to “Big Data Must Get Smaller”), which are:
1. Descriptive Data
2. Behavioral Data (mostly Transaction Data)
3. Attitudinal Data.

If you gain access to all three elements with decent coverage, you will have tremendous predictive power when it comes to human behaviors. Unfortunately, it is really difficult to accumulate attitudinal data on a large scale with individual-level details (i.e., knowing who’s behind all those sentiments). Behavioral data, mostly in forms of transaction data, are also not easy to collect and maintain (non-transaction behavioral data are even bigger and harder to handle), but I’d say it is definitely worth the effort, as most of what we call Big Data fall under this category. Conversely, one can just purchase descriptive data, which are what we generally call demographic or firmographic data, from data compilers or brokers. The sellers (there are many) will even do the data-append processing for you and they may also throw in a few free profile reports with it.

Now, when we start talking about the transaction data, many marketers will respond “Oh, you mean RFM data?” Well, that is not completely off-base, because “Recency, Frequency and Monetary” data certainly occupy important positions in the family of transaction data. But they hardly are the whole thing, and the term is misused as frequently as “Big Data.” Transaction data are so much more than simple RFM variables.

RFM Data Is Just a Good Start
The term RFM should be used more as a checklist for marketers, not as design guidelines—or limitations in many cases—for data professionals. How recently did this particular customer purchase our product, and how frequently did she do that and how much money did she spend with us? Answering these questions is a good start, but stopping there would seriously limit the potential of transaction data. Further, this line of questioning would lead the interrogation efforts to simple “filtering,” as in: “Select all customers who purchased anything with a price tag over $100 more than once in past 12 months.” Many data users may think that this query is somewhat complex, but it really is just a one-dimensional view of the universe. And unfortunately, no customer is one-dimensional. And this query is just one slice of truth from the marketer’s point of view, not the customer’s. If you want to get really deep, the view must be “buyer-centric,” not product-, channel-, division-, seller- or company-centric. And the database structure should reflect that view (refer to “It’s All About Ranking,” where the concept of “Analytical Sandbox” is introduced).

Transaction data by definition describe the transactions, not the buyers. If you would like to describe a buyer or if you are trying to predict the buyer’s future behavior, you need to convert the transaction data into “descriptors of the buyers” first. What is the difference? It is the same data looked at through a different window—front vs. side window—but the effect is huge.

Even if we think about just one simple transaction with one item, instead of describing the shopping basket as “transaction happened on July 3, 2014, containing Coldplay’s latest CD ‘Ghost Stories’ priced at $11.88,” a buyer-centric description would read: “A recent CD buyer in Rock genre with an average spending level in the music category under $20.” The trick is to describe the buyer, not the product or the transaction. If that customer has many orders and items in his purchase history (let’s say he downloaded a few songs to his portable devices, as well), the description of the buyer would become much richer. If you collect all of his past purchase history, it gets even more colorful, as in: “A recent music CD or MP3 buyer in rock, classical and jazz genres with 24-month purchases totaling 13 orders containing 16 items, with total spending in the $100-$150 range and an $11 average order size.” Of course you would store all this using many different variables (such as genre indicators, number of orders, number of items, total dollars spent during the past 24 months, average order amount and number of weeks since last purchase in the music category, etc.). But the point is that the story would come out this way when you change the perspective.

Creating a Buyer-Centric Portrait
The whole process of creating a buyer-centric portrait starts with data summarization (or de-normalization). A typical structure of the table (or database) that needs to capture every transaction detail, such as transaction date and amount, would require an entry for every transaction, and the database designers call it the “normal” state. As I explained in my previous article (“Ranking is the key”), if you would like to rank in terms of customer value, the data record must be on a customer level, as well. If you are ranking households or companies, you would then need to summarize the data on those levels, too.

Now, this summarization (or de-normalization) is not a process of eliminating duplicate entries of names, as you wouldn’t want to throw away any transaction details. If there are multiple orders per person, what is the total number of orders? What is the total amount of spending on an individual level? What would be average spending level per transaction, or per year? If you are allowed to have only one line of entry per person, how would you summarize the purchase dates, as you cannot just add them up? In that case, you can start with the first and last transaction date of each customer. Now, when you have the first and last transaction date for every customer, what would be the tenure of each customer and what would be the number of days since the last purchase? How many days, on average, are there in between orders then? Yes, all these figures are related to basic RFM metrics, but they are far more colorful this way.

The attached exhibit displays a very simple example of a before and after picture of such a summarization process. On the left-hand side, there resides a typical order table containing customer ID, order number, order date and transaction amount. If a customer has multiple orders in a given period, an equal number of lines are required to record the transaction details. In real life, other order level information, such as payment method (very predictive, by the way), tax amount, discount or coupon amount and, if applicable, shipping amount would be on this table, as well.

On the right-hand side of the chart, you will find there is only one line per customer. As I mentioned in my previous columns, establishing a consistent and accurate customer ID cannot be neglected, for this reason alone. How would you rely on the summary data if one person may have multiple IDs? The customer may have moved to a new address, or shopped from multiple stores or sites, or there could have been errors in data collection. Relying on email address is a big no-no, as we all carry many email addresses. That is why the first step of building a functional marketing database is to go through the data hygiene and consolidation process. (There are many data processing vendors and software packages for it.) Once a persistent customer (or individual) ID system is in place, you can add up the numbers to create customer-level statistics, such as total orders, total dollars, and first and last order dates, as you see in the chart.
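
For readers who think better in code than in charts, the summarization described above can be sketched in a few lines of Python. The order rows and customer IDs are invented, and the sketch assumes the ID consolidation step has already been done, so each person carries exactly one ID.

```python
from datetime import date


def summarize_orders(orders):
    """Collapse an order table into one summary row per customer ID."""
    summary = {}
    # The order number stays in the detail table; it is not needed for totals.
    for customer_id, _order_no, order_date, amount in orders:
        row = summary.setdefault(customer_id, {
            "total_orders": 0,
            "total_dollars": 0.0,
            "first_order": order_date,
            "last_order": order_date,
        })
        row["total_orders"] += 1
        row["total_dollars"] += amount
        row["first_order"] = min(row["first_order"], order_date)
        row["last_order"] = max(row["last_order"], order_date)
    return summary


# Hypothetical order table: customer ID, order number, order date, amount.
orders = [
    ("C001", "A100", date(2013, 11, 2), 45.00),
    ("C001", "A245", date(2014, 3, 15), 120.50),
    ("C001", "A310", date(2014, 7, 3), 11.88),
    ("C002", "A101", date(2014, 1, 20), 300.00),
]
by_customer = summarize_orders(orders)
for customer_id, row in by_customer.items():
    print(customer_id, row)
```

Nothing is thrown away here. The order table stays intact, and the one-line-per-customer view is built on top of it.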

Remember R, F, M, P and C
The real fun begins when you combine these numeric summary figures with product, channel and other important categorical variables. Because product (or service) and channel are the most distinctive dividers of customer behaviors, let’s just add P and C to the famous RFM (remember, we are using RFM just as a checklist here), and call it R, F, M, P and C.

Product (rather, product category) is an important separator, as people often show completely different spending behavior for different types of products. For example, you can send me fancy-shmancy fashion catalogs all you want, but I won’t look at it with an intention of purchase, as most men will look at the models and not what they are wearing. So my active purchase history in the sports, home electronics or music categories won’t mean anything in the fashion category. In other words, those so-called “hotline” names should be treated differently for different categories.

Channel information is also important, as there are active online buyers who would never buy certain items, such as apparel or home furnishing products, without physically touching them first. For example, even in the same categories, I would buy guitar strings or golf balls online. But I would not purchase a guitar or a driver without trying them out first. Now, when I say channel, I mean the channel that the customer used to make the purchase, not the channel through which the marketer chose to communicate with him. Channel information should be treated as a two-way street, as no marketer “owns” a customer through a particular channel (refer to “The Future of Online is Offline”).

As an exercise, let’s go back to the basic RFM data and create some actual variables. For “each” customer, we can start with basic RFM measures, as exhibited in the chart:

· Number of Transactions
· Total Dollar Amount
· Number of Days (or Weeks) since the Last Transaction
· Number of Days (or Weeks) since the First Transaction

Notice that the days are counted from today’s point of view (practically the day the database is updated), as the actual date’s significance changes as time goes by (e.g., a day in February would feel different when looked back on from April vs. November). “Recency” is a relative concept; therefore, we should relativize the time measurements to express it.
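To make the "relative" part concrete, here is a minimal Python sketch that expresses the same order dates as day counts from two different update dates. The function name recency_measures and the sample dates are assumptions for illustration only.

```python
from datetime import date

def recency_measures(first_order_date, last_order_date, as_of):
    """Express order dates as counts relative to the database update date."""
    days_since_last = (as_of - last_order_date).days
    return {
        "days_since_last": days_since_last,
        "days_since_first": (as_of - first_order_date).days,
        "weeks_since_last": days_since_last // 7,
    }

# Hypothetical customer, measured from two different update dates.
first, last = date(2023, 2, 10), date(2024, 2, 10)
april = recency_measures(first, last, as_of=date(2024, 4, 10))
november = recency_measures(first, last, as_of=date(2024, 11, 10))
```

The same February order is 60 days old when seen from April and 274 days old when seen from November, so the stored variable should be the day count, recalculated at each update, rather than the raw date.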

From these basic figures, we can derive other related variables, such as:

· Average Dollar Amount per Customer
· Average Dollar Amount per Transaction
· Average Dollar Amount per Year
· Lifetime Highest Amount per Item
· Lifetime Lowest Amount per Transaction
· Average Number of Days Between Transactions
· Etc., etc…
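A few of these derived variables can be sketched in Python as follows. This is only an illustration for one customer: the function name derived_measures is hypothetical, the highest and lowest amounts are taken per transaction (item-level detail is assumed to be unavailable), and per-year averages are left out because they depend on how tenure is defined.

```python
from datetime import date

def derived_measures(order_dates, amounts):
    """Derive averages from one customer's order dates and dollar amounts."""
    n = len(amounts)
    total = sum(amounts)
    ordered = sorted(order_dates)
    span_days = (ordered[-1] - ordered[0]).days
    return {
        "avg_dollars_per_transaction": total / n,
        "highest_transaction": max(amounts),
        "lowest_transaction": min(amounts),
        # Undefined for single-order customers, so return None in that case.
        "avg_days_between_transactions": span_days / (n - 1) if n > 1 else None,
    }

# Hypothetical customer with three orders.
example = derived_measures(
    [date(2023, 1, 1), date(2023, 1, 31), date(2023, 3, 2)],
    [100.0, 50.0, 150.0],
)
```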

Now, imagine you have all these measurements by channels, such as retail, Web, catalog, phone or mail-in, and separately by product categories. If you imagine a gigantic spreadsheet, the summarized table would have fewer rows, but a seemingly endless number of columns. I will discuss categorical and non-numeric variables in future articles. But for this exercise, let’s just imagine having these sets of variables for all major product categories. The result is that the recency factor now becomes more like “Weeks since Last Online Order”—not just any order. Frequency measurements would be more like “Number of Transactions in Dietary Supplement Category”—not just for any product. Monetary values can be expressed in “Average Spending Level in Outdoor Sports Category through Online Channel”—not just the customer’s average dollar amount, in general.
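One way to picture that "few rows, many columns" table is the Python sketch below. The row layout, channel and category labels, and column-naming scheme are all assumptions made up for this example, and it stores dollar totals per slice, from which averages can be derived.

```python
from datetime import date

# Hypothetical rows: (customer_id, order_date, channel, category, dollars).
rows = [
    ("C001", date(2024, 1, 8), "online", "outdoor_sports", 60.0),
    ("C001", date(2024, 3, 4), "retail", "outdoor_sports", 200.0),
    ("C001", date(2024, 3, 18), "online", "dietary_supplement", 25.0),
    ("C001", date(2024, 2, 5), "online", "outdoor_sports", 40.0),
]

def channel_category_variables(rows, as_of):
    """Build one wide record per customer with RFM-style columns per slice."""
    wide = {}
    for customer_id, order_date, channel, category, dollars in rows:
        rec = wide.setdefault(customer_id, {})
        weeks = (as_of - order_date).days // 7
        # Recency by channel: keep the smallest number of weeks seen so far.
        r_key = f"weeks_since_last_{channel}_order"
        rec[r_key] = min(rec.get(r_key, weeks), weeks)
        # Frequency by category.
        f_key = f"num_transactions_{category}"
        rec[f_key] = rec.get(f_key, 0) + 1
        # Monetary by category and channel (a total; divide by count for an average).
        m_key = f"total_dollars_{category}_{channel}"
        rec[m_key] = rec.get(m_key, 0.0) + dollars
    return wide

wide = channel_category_variables(rows, as_of=date(2024, 4, 1))
```

Every new channel or category adds columns rather than rows, which is exactly how the column count grows so quickly.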

Why stop there? We may slice and dice the data by offer type, customer status, payment method or time intervals (e.g., lifetime, 24-month, 48-month, etc.) as well. I am not saying that all the RFM variables should be cut out this way, but having “Number of Transactions by Payment Method,” for example, could be very revealing about the customer, as everybody uses multiple payment methods, while some may never use a debit card for a large purchase. All these little measurements become building blocks in predictive modeling. Now, too many variables can also be troublesome. And knowing the balance (i.e., knowing where to stop) comes from experience and preliminary analysis. That is when experts and analysts should be consulted for this type of uniform variable creation. Nevertheless, the point is that RFM variables are not just three simple measures that happen to be a part of the larger transaction data menu. And we didn’t even touch non-transaction based behavioral elements, such as clicks, views, miles or minutes.

The Time Factor
So, if such data summarization is so useful for analytics and modeling, should we always include everything that has been collected since the inception of the database? The answer is yes and no. Sorry for being cryptic here, but it really depends on what your product is all about; how the buyers would relate to it; and what you, as a marketer, are trying to achieve. As for going back forever, there is a danger in that kind of data hoarding, as “Life-to-Date” data always favors tenured customers over new customers who have a relatively short history. In reality, many new customers may have more potential in terms of value than a tenured customer with lots of transaction records from a long time ago, but with no recent activity. That is why we need to create a level playing field in terms of time limit.

If a “Life-to-Date” summary is not ideal for predictive analytics, then where should you place the cutoff line? If you are selling cars or home furnishing products, you may need to look at a 4- to 5-year history. If your products are consumables with relatively short purchase cycles, then a 1-year examination would be enough. If your product is seasonal in nature—like gardening, vacation or heavily holiday-related items, then you may have to look at a minimum of two consecutive years of history to capture seasonal patterns. If you have mixed seasonality or longevity of products (e.g., selling golf balls and golf club sets through the same store or site), then you may have to summarize the data with multiple timelines, where the above metrics would be separated by 12 months, 24 months, 48 months, etc. If you have lifetime value models or any time-series models in the plan, then you may have to break the timeline down even more finely. Again, this is where you may need professional guidance, but marketers’ input is equally important.
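Summarizing with multiple timelines could look something like this Python sketch. The helper names months_before and windowed_totals, the window lengths and the sample orders are assumptions for illustration; the right windows depend on the product, as discussed above.

```python
from datetime import date

def months_before(d, months):
    """Return the date `months` calendar months before d (day clamped to 28)."""
    total = d.year * 12 + (d.month - 1) - months
    return date(total // 12, total % 12 + 1, min(d.day, 28))

def windowed_totals(orders, as_of, windows=(12, 24, 48)):
    """Summarize one customer's (order_date, dollars) pairs per time window."""
    out = {}
    for m in windows:
        cutoff = months_before(as_of, m)
        in_window = [amt for d, amt in orders if cutoff < d <= as_of]
        out[f"num_transactions_{m}mo"] = len(in_window)
        out[f"total_dollars_{m}mo"] = sum(in_window)
    return out

# Hypothetical tenured customer whose biggest order is too old for every window.
orders = [
    (date(2024, 1, 10), 50.0),
    (date(2023, 1, 20), 400.0),
    (date(2021, 3, 5), 30.0),
    (date(2019, 7, 1), 900.0),
]
totals = windowed_totals(orders, as_of=date(2024, 6, 15))
```

In this made-up case the 2019 order never enters any window, so a long-ago big spender does not outrank a newer customer on history alone.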

Analytical Sandbox
Lastly, who should be doing all of this data summary work? I talked about the concept of the “Analytical Sandbox,” where all types of data conversion, hygiene, transformation, categorization and summarization are done in a consistent manner, and analytical activities, such as sampling, profiling, modeling and scoring are done with proper toolsets like SAS, R or SPSS (refer to “It’s All About Ranking”). The short and final answer is this: Do not leave that to analysts or statisticians. They are the main players in that playground, not the architects or developers of it. If you are serious about employing analytics for your business, plan to build the Analytical Sandbox along with the team of analysts.

My goal as a database designer has always been serving the analysts and statisticians with “model-ready” datasets on silver platters. My promise to them has been that the modelers would spend no time fixing the data. Instead, they would be spending their valuable time thinking about the targets and statistical methodologies to fulfill the marketing goals. After all, answers that we seek come out of those mighty—but often elusive—algorithms, and the algorithms are made of data variables. So, in the interest of getting the proper answers fast, we must build lots of building blocks first. And no, simple RFM variables won’t cut it.

Big Data Must Get Smaller

Like many folks who worked in the data business for a long time, I don’t even like the words “Big Data.” Yeah, data is big now, I get it. But so what? Faster and bigger have been the theme in the computing business since the first calculator was invented. In fact, I don’t appreciate the common definition of Big Data that is often expressed in the three Vs: volume, velocity and variety. So, if any kind of data are big and fast, it’s all good? I don’t think so. If you have lots of “dumb” data all over the place, how does that help you? Well, as much as all the clutter that’s been piled on in your basement since 1971. It may yield some profit on an online auction site one day. Who knows? Maybe some collector will pay good money for some obscure Coltrane or Moody Blues albums that you never even touched since your last turntable (Ooh, what is that?) died on you. Those oversized album jackets were really cool though, weren’t they?

Seriously, the word “Big” only emphasizes the size element, and that is a sure way to miss the essence of the data business. And many folks are missing even that little point by calling all decision-making activities that involve even small-sized data “Big Data.” It is entirely possible that this data stuff seems all new to someone, but the data-based decision-making process has been with us for a very long time. If you use that “B” word to differentiate old-fashioned data analytics of yesteryear and ridiculously large datasets of the present day, yes, that is a proper usage of it. But we all know most people do not mean it that way. One side benefit of this bloated and hyped-up buzzword is that data professionals like myself no longer have to spend 20 minutes explaining what we do for a living; we can simply utter the words “Big Data,” though that is a lot like a grandmother declaring that all her grandchildren work on computers for a living. Better yet, that magic “B” word sometimes opens doors to new business opportunities (or at least a chance to grab a microphone in non-data-related meetings and conferences) that data geeks of the past never dreamed of.

So, I guess it is not all that bad. But lest we forget, all hypes lead to overinvestments, and all overinvestments lead to disappointments, and all disappointments lead to purging of related personnel and vendors that bear that hyped-up dirty word in their titles or division names. If this Big Data stuff does not yield significant profit (or reduction in cost), I am certain that those investment bubbles will burst soon enough. Yes, some data folks may be lucky enough to milk it for another two or three years, but brace for impact if all those collected data do not lead to some serious dollar signs. I know that storage and processing costs have decreased significantly in recent years, but they ain’t totally free, and related man-hours aren’t exactly cheap, either. Also, if this whole data business is a new concept to an organization, any money spent on the promise of Big Data easily becomes a liability for the reluctant bunch.

This is why I open up my speeches and lectures with this question: “Have you made any money with this Big Data stuff yet?” Surely, you didn’t spend all that money to provide faster toys and nicer playgrounds to IT folks? Maybe the head of IT had some fun with it, but let’s ask that question to CFOs, not CTOs, CIOs or CDOs. I know some colleagues (i.e., fellow data geeks) who are already thinking about a new name for this—“decision-making activities, based on data and analytics”—because many of us will be still doing that “data stuff” even after Big Data ceases to be cool after the judgment day. Yeah, that Gangnam Style dance was fun for a while, but who still jumps around like a horse?

Now, if you ask me (though nobody did yet), I’d say the Big Data should have been “Smart Data,” “Intelligent Data” or something to that effect. Because data must provide insights. Answers to questions. Guidance to decision-makers. To data professionals, piles of data—especially the ones that are fragmented, unstructured and unformatted, no matter what kind of fancy names the operating system and underlying database technology may bear—are just a good start. For non-data-professionals, unrefined data—whether they are big or small—would remain distant and obscure. Offering mounds of raw data to end-users is like providing a painting kit when someone wants a picture on the wall. Bragging about the size of the data with impressive-sounding new measurements that end with “bytes” is like counting grains of rice in California in front of a hungry man.

Big Data must get smaller. People want yes/no answers to their specific questions. If such clarity is not possible, probability figures to such questions should be provided; as in, “There’s an 80 percent chance of thunderstorms on the day of the company golf outing,” “An above-average chance to close a deal with a certain prospect” or “Potential value of a customer who is repeatedly complaining about something on the phone.” It is about easy-to-understand answers to business questions, not a quintillion bytes of data stored in some obscure cloud somewhere. As I stated at the end of my last column, the Big Data movement should be about (1) Getting rid of the noise, and (2) Providing simple answers to decision-makers. And getting to such answers is indeed the process of making data smaller and smaller.

In my past columns, I talked about the benefits of statistical models in the age of Big Data, as they are the best way to compact big and complex information into the form of simple answers (refer to “Why Model?”). Models built to predict (or point out) who is more likely to be into outdoor sports, to be a risk-averse investor, to go on a cruise vacation, to be a member of a discount club, to buy children’s products, to be a big-time donor or to be a NASCAR fan, are all providing specific answers to specific questions, while each model score is a result of serious reduction of information, often compressing thousands of variables into one answer. That simplification process in itself provides incredible value to decision-makers, as most wouldn’t know where to cut out unnecessary information to answer specific questions. Using mathematical techniques, we can cut down the noise with conviction.
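To illustrate how a model score compresses many variables into one answer, here is a toy Python sketch using a logistic function, which is one common choice and not necessarily what any given statistician would use. The variable names, weights and intercept are hypothetical and not taken from any real model.

```python
import math

# Hypothetical weights, as if fitted by a statistician; not from any real model.
WEIGHTS = {
    "num_transactions_outdoor_sports": 0.40,
    "weeks_since_last_online_order": -0.05,
    "avg_dollars_per_transaction": 0.01,
}
INTERCEPT = -2.0

def propensity_score(variables):
    """Compress many variables into one probability via a logistic function."""
    z = INTERCEPT + sum(w * variables.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical customer record reduced to a single number between 0 and 1.
score = propensity_score({
    "num_transactions_outdoor_sports": 5,
    "weeks_since_last_online_order": 4,
    "avg_dollars_per_transaction": 100.0,
})
```

A real model would start from hundreds or thousands of candidate variables, but the output has the same shape: one number per person that a decision-maker can rank and act on.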

In model development, “Variable Reduction” is the first major step after the target variable is determined (refer to “The Art of Targeting”). It is often the most rigorous and laborious exercise in the whole model development process, where the characteristics of models are often determined as each statistician has his or her unique approach to it. Now, I am not about to initiate a debate about the best statistical method for variable reduction (I haven’t met two statisticians who completely agree with each other in terms of methodologies), but I happen to know that many effective statistical analysts separate variables in terms of data types and treat them differently. In other words, not all data variables are created equal. So, what are the major types of data that database designers and decision-makers (i.e., non-mathematical types) should be aware of?

In the business of predictive analytics for marketing, the following three types of data make up three dimensions of a target individual’s portrait:

  1. Descriptive Data
  2. Transaction Data / Behavioral Data
  3. Attitudinal Data

In other words, if we get to know all three aspects of a person, it will be much easier to predict what the person is about and/or what the person will do. Why do we need these three dimensions? If an individual has a high income and is living in a highly valued home (demographic element, which is descriptive); and if he is an avid golfer (behavioral element often derived from his purchase history), can we just assume that he is politically conservative (attitudinal element)? Well, not really, and not all the time. Sometimes we have to stop and ask what the person’s attitude and outlook on life is all about. Now, because it is not practical to ask everyone in the country about every subject, we often build models to predict the attitudinal aspect with available data. If you got a phone call from a political party that “assumes” your political stance, that incident was probably not random or accidental. Like I emphasized many times, analytics is about making the best of what is available, as there is no such thing as a complete dataset, even in this age of ubiquitous data. Nonetheless, these three dimensions of the data spectrum occupy a unique and distinct place in the business of predictive analytics.

So, in the interest of obtaining, maintaining and utilizing all possible types of data—or, conversely, reducing the size of data with conviction by knowing what to ignore, let us dig a little deeper:

Descriptive Data
Generally, demographic data—such as people’s income, age, number of children, housing size, dwelling type, occupation, etc.—fall under this category. For B-to-B applications, “Firmographic” data—such as number of employees, sales volume, year started, industry type, etc.—would be considered descriptive data. It is about what the targets “look like” and, generally, they are frozen in the present time. Many prominent data compilers (or data brokers, as the U.S. government calls them) collect, compile and refine the data and make hundreds of variables available to users in various industry sectors. They also fill in the blanks using predictive modeling techniques. In other words, the compilers may not know the income range of every household, but using statistical techniques and other available data—such as age, home ownership, housing value, and many other variables—they provide their best estimates in case of missing values. People often have some allergic reaction to such data compilation practices, citing privacy concerns, but these types of data are not about looking up one person at a time, but about analyzing and targeting groups (or segments) of individuals and households. In terms of predictive power, they are quite effective and results are very consistent. The best part is that most of the variables are available for every household in the country, whether they are actual or inferred.

Other types of descriptive data include geo-demographic data, and the Census Data by the U.S. Census Bureau falls under this category. These datasets are organized by geographic denominations such as Census Block Group, Census Tract, County or ZIP Code Tabulation Area (ZCTA, much like postal ZIP codes, but not exactly the same). Although they are not available on an individual or a household level, the Census data are very useful in predictive modeling, as every target record can be enhanced with them, even when name and address are not available, and the data themselves are very stable. The downside is that while the datasets are free through the Census Bureau, the raw datasets contain more than 40,000 variables. Plus, due to budget cuts and changes in survey methods during the past decade, the sample size (yes, they sample) decreased significantly, rendering some variables useless at lower geographic denominations, such as Census Block Group. There are professional data companies that narrowed down the list of variables to manageable sizes (300 to 400 variables) and filled in the missing values. Because they are geo-level data, variables are in the forms of percentages, averages or median values of elements, such as gender, race, age, language, occupation, education level, real estate value, etc. (as in, percent male, percent Asian, percent white-collar professionals, average income, median school years, median rent, etc.).

There are many instances where marketers cannot pinpoint the identity of a person due to privacy issues or challenges in data collection, and the Census Data play the role of an effective substitute for individual- or household-level demographic data. In predictive analytics, duller variables that are available nearly all the time are often more valuable than precise information with limited availability.
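Enhancing a target record with geo-level variables amounts to a lookup on a geographic key, roughly as in this Python sketch. The ZCTA keys are placeholders, the values are invented for illustration and are not real Census figures, and the function name append_geo_variables is hypothetical.

```python
# Hypothetical geo-level variables keyed by ZCTA; keys are placeholders and
# values are invented for illustration, not real Census figures.
geo_data = {
    "ZCTA_A": {"pct_white_collar": 62.0, "median_rent": 2400},
    "ZCTA_B": {"pct_white_collar": 41.5, "median_rent": 1300},
}

def append_geo_variables(record, geo_lookup, key="zcta"):
    """Return a copy of the record enhanced with geo-level variables, if found."""
    enhanced = dict(record)
    geo = geo_lookup.get(record.get(key), {})
    for name, value in geo.items():
        enhanced[f"geo_{name}"] = value
    # Flag the match so analysts can tell appended values from missing ones.
    enhanced["geo_matched"] = bool(geo)
    return enhanced

# No name or address is needed, only the geographic key.
matched = append_geo_variables({"record_id": 1, "zcta": "ZCTA_A"}, geo_data)
unmatched = append_geo_variables({"record_id": 2, "zcta": "ZCTA_Z"}, geo_data)
```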

Transaction Data/Behavioral Data
While descriptive data are about what the targets look like, behavioral data are about what they actually did. Often, behavioral data are in the form of transactions, so many just call them transaction data. What marketers commonly refer to as RFM (Recency, Frequency and Monetary) data fall under this category. In terms of predictive power, they are truly at the top of the food chain. Yes, we can build models to guess who potential golfers are with demographic data, such as age, gender, income, occupation, housing value and other neighborhood-level information, but if you get to “know” that someone is a buyer of a box of golf balls every six weeks or so, why guess? Further, models built with transaction data can even predict the nature of future purchases, in terms of monetary value and frequency intervals. Unfortunately, many who have access to RFM data are using them only in rudimentary filtering, as in “select everyone who spent more than $200 in a gift category during the past 12 months,” or something like that. But we can do so much more with rich transaction data in every stage of the marketing life cycle for prospecting, cultivating, retaining and winning back.
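For reference, the rudimentary filter quoted above is only a few lines of Python, which is part of why it is so common. This sketch approximates 12 months as 365 days; the row layout and the function name gift_spenders are assumptions for illustration.

```python
from datetime import date, timedelta

def gift_spenders(rows, as_of, min_dollars=200.0, days=365):
    """Select customers whose gift-category spend in the window exceeds the minimum."""
    cutoff = as_of - timedelta(days=days)
    totals = {}
    for customer_id, order_date, category, dollars in rows:
        if category == "gift" and cutoff < order_date <= as_of:
            totals[customer_id] = totals.get(customer_id, 0.0) + dollars
    return sorted(cid for cid, total in totals.items() if total > min_dollars)

# Hypothetical rows: (customer_id, order_date, category, dollars).
rows = [
    ("C001", date(2024, 1, 10), "gift", 150.0),
    ("C001", date(2024, 3, 1), "gift", 75.0),
    ("C002", date(2024, 2, 1), "gift", 200.0),
    ("C003", date(2022, 12, 1), "gift", 500.0),
    ("C004", date(2024, 4, 1), "toys", 900.0),
]
selected = gift_spenders(rows, as_of=date(2024, 6, 1))
```

A filter like this returns a yes/no list and nothing else: it says nothing about how likely each selected customer is to buy again, or how much, which is where models built on the same data go further.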

Other types of behavioral data include non-transaction data, such as click data, page views, abandoned shopping baskets or movement data. This type of behavioral data is getting a lot of attention as it is truly “big.” The data have been out of reach for many decision-makers before the emergence of new technology to capture and store them. In terms of predictability, nevertheless, they are not as powerful as real transaction data. These non-transaction data may provide directional guidance, as they are what some data geeks call “a-camera-on-everyone’s-shoulder” type of data. But we all know that there is a clear dividing line between people’s intentions and their commitments. And it can be very costly to follow every breath you take, every move you make, and every step you take. Due to their distinct characteristics, transaction data and non-transaction data must be managed separately. And if used together in models, they should be clearly labeled, so the analysts will never treat them the same way by accident. You really don’t want to mix intentions and commitments.

The trouble with behavioral data is that (1) they are difficult to compile and manage, (2) they get big; sometimes really big, (3) they are generally confined within divisions or companies, and (4) they are not easy to analyze. In fact, most of the examples that I used in this series are about the transaction data. Now, No. 3 here could be really troublesome, as it equates to availability (or lack thereof). Yes, you may know everything that happened with your customers, but do you know where else they are shopping? Fortunately, there are co-op companies that can answer that question, as they are compilers of transaction data across multiple merchants and sources. And combined data can be exponentially more powerful than data in silos. Now, because transaction data are not always available for every person in databases, analysts often combine behavioral data and descriptive data in their models. Transaction data usually become the dominant predictors in such cases, while descriptive data play the supporting roles filling in the gaps and smoothing out the predictive curves.

As I stated repeatedly, predictive analytics in marketing is all about finding out (1) whom to engage, and (2) if you decided to engage someone, what to offer to that person. Using carefully collected transaction data for most of their customers, there are supermarket chains that achieved 100 percent customization rates for their coupon books. That means no two coupon books are exactly the same, which is quite an impressive accomplishment. And that is all transaction data in action, and it is a great example of “Big Data” (or rather, “Smart Data”).

Attitudinal Data
In the past, attitudinal data came from surveys, primary research and focus groups. Now, basically all social media channels function as gigantic focus groups. Through virtual places, such as Facebook, Twitter or other social media networks, people are freely volunteering what they think and feel about certain products and services, and many marketers are learning how to “listen” to them. Sentiment analysis falls under that category of analytics, and many automatically think of this type of analytics when they hear “Big Data.”

The trouble with social data is:

  1. We often do not know who’s behind the statements in question, and
  2. They are in silos, and it is not easy to combine such data with transaction or demographic data, due to lack of identity of their sources.

Yes, we can see that a certain political candidate is trending high after an impressive speech, but how would we connect that piece of information to who will actually donate money for the candidate’s causes? If we can find out “where” the target is via an IP address and related ZIP codes, we may be able to connect the voter to geo-demographic data, such as the Census. But, generally, personally identifiable information (PII) is only accessible by the data compilers, if they even bothered to collect it.

Therefore, most such studies are on a macro level, citing trends and directions, and types of analysts in that field are quite different from the micro-level analysts who deal with behavioral data and descriptive data. Now, the former provide important insights regarding the “why” part of the equation, which is often the hardest thing to predict; while the latter provide answers to “who, what, where and when.” (“Who” is the easiest to answer, and “when” is the hardest.) That “why” part may dictate a product development part of the decision-making process at the conceptual stage (as in, “Why would customers care for a new type of dishwasher?”), while “who, what, where and when” are more about selling the developed products (as in “Let’s sell those dishwashers in the most effective ways.”). So, it can be argued that these different types of data call for different types of analytics for different cycles in the decision-making processes.

Obviously, there are more types of data out there. But for marketing applications dealing with humans, these three types of data complete the buyers’ portraits. Now, depending on what marketers are trying to do with the data, they can prioritize where to invest first and what to ignore (for now). If they are early in the marketing cycle trying to develop a new product for the future, they need to understand why people want something and behave in certain ways. If signing up as many new customers as possible is the immediate goal, finding out who and where the ideal prospects are becomes the most imminent task. If maximizing the customer value is the ongoing objective, then you’d better start analyzing transaction data more seriously. If preventing attrition is the goal, then you will have to line up the transaction data in time series format for further analysis.
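Lining up transaction data in time series format for attrition analysis could start with something like the following Python sketch, which counts one customer's orders per calendar month. The function name monthly_series, the monthly grain and the sample dates are assumptions; weekly or quarterly grains may fit other purchase cycles better.

```python
from datetime import date

def monthly_series(order_dates, start, end):
    """Count one customer's orders per calendar month from start to end, inclusive."""
    counts = {}
    y, m = start.year, start.month
    while (y, m) <= (end.year, end.month):
        counts[(y, m)] = 0
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
    for d in order_dates:
        if (d.year, d.month) in counts:
            counts[(d.year, d.month)] += 1
    return [counts[k] for k in sorted(counts)]

# Hypothetical customer whose activity tapers off over five months.
series = monthly_series(
    [date(2023, 11, 3), date(2023, 11, 20), date(2023, 12, 15), date(2024, 2, 1)],
    start=date(2023, 11, 1),
    end=date(2024, 3, 31),
)
```

Months with no orders are kept as zeros on purpose, since the gaps are the signal in attrition analysis.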

The business goals must dictate the analytics, and the analytics call for specific types of data to meet the goals, and the supporting datasets should be in “analytics-ready” formats. Not the other way around, where businesses are dictated by the limitations of analytics, and analytics are hampered by inadequate data clutter. That type of business-oriented hierarchy should be the main theme of effective data management, and with clear goals and proper data strategy, you will know where to invest first and what data to ignore as a decision-maker, not necessarily as a mathematical analyst. And that is the first step toward making the Big Data smaller. Don’t be impressed by the size of the data, as they often blur the big picture and not all data are created equal.