Data Must Flow, But Not All of Them

Three quarters of this planet’s surface is covered with water. Yet, human collectives have to work constantly to maintain a steady supply of fresh water. When one area is flooded, another region may be going through some serious drought. It is about distribution of resources, not about the sheer amount of them.

Data management is the same way. We are clearly living in the age of abundant data, but many decision-makers complain that there are not enough “useful” data or insights. Why is that?

Like water or any other resource, data may be locked in the wrong places or in inadequate forms. We hear about all kinds of doomsday scenarios related to the water supply in Africa, and it is because of the uneven distribution of water, thanks to drastic climate change and border disputes. Conversely, California is running out of its water sources, even as the state sits right next to a huge pond called the Pacific Ocean. Water, in that case, is in the wrong form for the end-users there.

Data must flow through organizations like water; and to be useful, they must be in consumable formats. I have been emphasizing the importance of the data refinement process throughout this series (refer to “Cheat Sheet: Is Your Database Marketing Ready?” and “It’s All about Ranking”). In the data business, too much emphasis has been put on data collection platforms and tool sets that enable user interfaces, but not enough on the middle part, where data are aligned, cleaned and reformatted through analytics. Most of the trouble, unfortunately, happens due to inadequate data, not because of storage platforms and reporting tools.

This month, nonetheless, let’s talk about the distribution of data. It doesn’t matter how clean and organized the data sources are, if they are locked in silos. Ironically, that is how this term “360-degree customer view” became popular, as most datasets are indeed channel- or division-centric, not customer-centric.

It is not so difficult to get to that consensus in any meeting. Yeah sure, let’s put all the data together in one place. Then, if we just open the flood gates and lead all of the data to a central location, will all the data issues go away? Can we just call that new data pond a “marketing database”? (Refer to “Marketing and IT; Cats and Dogs.”)

The short answer is “No way, no sir.” I have seen too many instances where IT and marketing try to move the river of data and fail miserably, thanks to the sheer size of such construction work. Maybe they should have thought about reducing the amount of data before constructing a monumental canal of data? Like in life, moving time is the best time to throw things away.

IT managers instinctively try to avoid any infrastructure failure, along with the countless questions that would arise from dumping “all” of the data in marketers’ laps. And for the sake of the users, who can’t really plow through every bit of data anyway, we’ve got to be smarter about moving the data around.

The first thing that data players must consider is the purpose of the data project. Depending on the goal, the list of “must-haves” changes drastically.

So, let’s make an example out of the aforementioned “360-degree customer view” (or “single customer view”). What is the purpose of building such a thing? It is to stay relevant with the target customers. How do we go about doing that? Just collect anything and everything about them? If we are to “predict” their future behavior, or to estimate their propensities in order to pamper them through every channel that we get to use, one may think that we have to know absolutely everything about the customers.

Marketing and IT; Cats and Dogs

Cats and dogs do not get along unless they grew up together since birth. That is because cats and dogs have rather fundamental communication problems with each other. A dog would wag his tail in an upward position when he wants to play. To a cat though, upward-tail is a sure sign of hostility, as in “What’s up, dawg?!” In fact, if you observe an angry or nervous cat, you will see that everything is up: tail, hair, toes, even her spine. So imagine the dog’s confusion in this situation, where he just sent a friendly signal that he wants to play with the cat, and what he gets back are loud hisses and scary evil eyes—but along with an upward tail that “looks” like a peace sign to him. Yeah, I admit that I am a bona-fide dog person, so I looked at this from his perspective first. But I sympathize with the cat, too. From her point of view, the dog started to mess with her, disrupting an afternoon slumber in her favorite sunny spot by wagging his stupid tail. Encounters like this cannot end well. Thank goodness that we Homo sapiens lost our tails during our evolutionary journey, as that would have been one more thing that clueless guys would have to decode regarding the mood of our female companions. Imagine a conversation like “How could you not see that I didn’t mean it? My tail was pointing at the ground when I said that!” Then a guy would say, “Oh jeez, because I was looking at your lips moving up and down when you were saying something?”

Of course I am generalizing for a comedic effect here, but I see communication breakdowns like this all the time in business environments, especially between the marketing and IT teams. You think men are from Mars and women are from Venus? I think IT folks are from Vulcan and marketing people are from Betazed (if you didn’t get this, find a Trekkie around you and ask).

Now that we are living in the age of Big Data, where marketing messages must be custom-tailored based on data, we really need to find a way to narrow the gap between the marketing and the IT world. I wouldn’t dare to say which side is more like a dog or a cat, as I will surely offend someone. But I think even non-Trekkies would agree that it could be terribly frustrating to talk to a Vulcan who thinks that every sentence must be logically impeccable, or a Betazoid who thinks that someone’s emotional state is the way it is just because she read it that way. How do they meet in the middle? They need a translator—generally a “human” captain of a starship—between the two worlds, and that translator had better speak both languages fluently and understand both cultures without any preconceived notions.

Similarly, we need translators between the IT world and the marketing world, too. Call such translators “data scientists” if you want (refer to “How to Be a Good Data Scientist”). Or, at times a data strategist or a consultant like myself plays that role. Call us “bats” caught in between the beasts and the birds in an Aesop’s tale, as we need to be marginal people who don’t really belong to one specific world 100 percent. At times, it is a lonely place as we are understood by none, and often we are blamed for representing “the other side.” It is hard enough to be an expert in data and analytics, and we now have to master the artistry of diplomacy. But that is the reality, and I have seen plenty of evidence as to why people whose main job it is to harness meanings out of data must act as translators, as well.

IT is a very special function in modern organizations, regardless of their business models. Systems must run smoothly without errors, and all employees and outside collaborators must constantly be in connection through all imaginable devices and operating systems. Data must be securely stored and backed up regularly, and permissions to access them must be granted based on complex rules tied to job levels and functions. Then there are constant requests to install and maintain new and strange software and technologies, which should be patched and updated diligently. And God forbid anything fails to work, even for a few seconds on a weekend; all hell will break loose. Simply, the end-users—many of them in positions of dealing with customers and clients directly—do not care about IT when things run smoothly, as they take them all for granted. But when they don’t, you know the consequences. Thankless job? You bet. It is like a utility company never getting praise when the lights are on, but everyone yelling and screaming if the service is disrupted, even for a natural cause.

On the other side of the world, there are marketers, salespeople and account executives who deal with customers, clients and their bosses, and who would treat IT like servants, not partners, when things do not “seem” to work properly or when “their” sales projections are not met. The craziest part is that most customers, clients and bosses state their goals and complaints in the most ambiguous terms, as in “This ad doesn’t look slick enough,” “This copy doesn’t talk to me,” “This app doesn’t stick” or “We need to find the right audience.” What the IT folks often do not grasp is that (1) it really stinks when you get yelled at by customers and clients for any reason, and (2) not all business goals are easily translatable to logical statements. And this is when all data elements and systems are functioning within normal parameters.

Without a proper translator, marketers often self-prescribe solutions that call for data work and analytics. Often, they think that all the problems will go away if they have unlimited access to every piece of data ever collected. So they ask for exactly that. IT will respond that such a request will put a terrible burden on the system, which has to support not just marketing but also other operations. Eventually, they may meet in the middle, and the marketer will have access to more data than was ever possible in the past. Then the marketers realize that their business issues do not go away just because they have more data in their hands. In fact, their job seems to have gotten even more complicated. They think that it is because data elements are too difficult to understand, and they start blaming the data dictionary, or lack thereof. They start using words like Data Governance and Quality Control, which may sound almost offensive to diligent IT personnel. IT will respond that they showed every useful bit of data they are allowed to share without breaking the security protocol, and the data dictionaries are all up to date. Marketers say the data dictionaries are hard to understand, and they are filled with too many similar variables and seemingly conflicting information. IT now says they need yet another tool set to properly implement data governance protocols and deploy them. Heck, I have seen cases where some heads of IT went for a complete re-platforming of their systems, as if that would answer all the marketing questions. Now, does this sound familiar so far? Does it sound like your own experience, like something out of a “Dilbert” comic strip? It is because you are not alone in all this.

Allow me to be a little more specific with an example. Marketers often talk about “High-Value Customers.” To people who deal with 1s and 0s, that means less than nothing. What does that even mean? Because “high-value customers” could be:

  • High-dollar spenders—But what if they do not purchase often?
  • Frequent shoppers—But what if they don’t spend much at all?
  • Recent customers—Oh, those coveted “hotline” names … but will they stay that way, even for another few months?
  • Tenured customers—But are they loyal to your business, now?
  • Customers with high loyalty points—Or are they just racking up points, willing to do whatever it takes to accumulate more?
  • High activity—Such as point redemptions and other non-monetary activities, but what if all those activities do not generate profit?
  • Profitable customers—The nice ones who don’t need much hand-holding. And where do we get the “cost” side of the equation on a personal level?
  • Customers who purchase extra items—Such as cruisers who drink a lot on board, or diners who order many special items, as suggested.
  • Etc., etc …

Now it gets more complex, as these definitions must be represented in numbers and figures. Depending on the industry (retail, airlines, hotels, cruise ships, credit cards, investments, utilities, non-profit or business services), the variables employed to define seemingly straightforward “high-value customers” would be vastly different. But let’s say that we pick an airline as an example. Let me ask you this: How frequent is frequent enough for anyone to be called a frequent flyer?

Let’s just assume that we are going through an exercise of defining a frequent flyer for an airline company, not for any other travel-related businesses or even travel agencies (that would deal with lots of non-flyers). Granted that we have access to all necessary data, we may consider using:

1. Number of Miles—But for how many years? If we go back too far, shouldn’t we have to examine further if the customer is still active with the airline in question? And what does “active” mean to you?

2. Dollars Spent—Again for how long? In what currency? Converted into U.S. dollars at what point in time?

3. Number of Full-Price Ticket Purchases—OK, do we get to see all the ticket codes that define full price? What about customers who purchased tickets through partners and agencies vs. direct buyers through the airline’s website? Do they share a common coding system?

4. Days Between Travel—What date shall we use? Booking date, payment date or travel date? What time zones should we use for consistency? If UTC/GMT is to be used, how will we know who is booking trips during business hours vs. evening hours in their own time zone?

After considerable hours of debate, let’s say that we reached a conclusion that all involved parties could live with. Then we find out that the databases from the IT department are all on “event” levels (such as clicks, views, bookings, payments, boarding, redemption, etc.), and we would have to realign and summarize the data—in terms of miles, dollars and trips—on an individual customer level to create a definition of “frequent flyers.”

In other words, we would need to see the data from the customer-centric point of view, just to begin the discussion about frequent flyers, not to mention how to communicate with each customer in the future. Now, is that a job for IT or marketing? Who will put the bell on the cat’s neck? (Hint: Not the dog.) Well, it depends. But this definitely is not a traditional IT function, nor is it a standalone analytical project. It is something in between, requiring a translator.
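To make that realignment concrete, here is a minimal sketch in Python with pandas, using entirely made-up column names, records and thresholds. It shows the general idea of rolling event-level records up to one row per customer and applying one possible (and debatable) “frequent flyer” definition; it is not anyone’s production logic.

```python
import pandas as pd

# Hypothetical event-level extract: one row per booking or boarding event.
events = pd.DataFrame({
    "customer_id": ["C001", "C001", "C002", "C003", "C003", "C003"],
    "event_type":  ["booking", "boarding", "booking", "booking", "boarding", "boarding"],
    "event_date":  pd.to_datetime(["2024-01-15", "2024-02-01", "2023-11-20",
                                   "2024-03-05", "2024-03-20", "2024-04-02"]),
    "miles":       [1200, 1200, 800, 2500, 2500, 3100],
    "usd_amount":  [350.0, 0.0, 180.0, 720.0, 0.0, 0.0],
})

# Count miles only on boarding events so booked-but-not-flown trips don't inflate totals.
events["flown_miles"] = events["miles"].where(events["event_type"] == "boarding", 0)

# Roll events up to one row per customer: miles flown, dollars spent,
# number of trips (boarding events) and the most recent activity date.
customers = (
    events.groupby("customer_id")
    .agg(
        total_miles=("flown_miles", "sum"),
        total_spend=("usd_amount", "sum"),
        trips=("event_type", lambda s: int((s == "boarding").sum())),
        last_activity=("event_date", "max"),
    )
    .reset_index()
)

# One possible "frequent flyer" definition: at least two trips and some
# activity within the last 12 months. The thresholds are a business call.
cutoff = pd.Timestamp.today() - pd.DateOffset(months=12)
customers["frequent_flyer"] = (customers["trips"] >= 2) & (customers["last_activity"] >= cutoff)
print(customers)
```

Every line of that debate about dates, currencies and ticket codes ends up encoded in a few summarization rules like these, which is exactly why the definition work has to happen before the coding.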

Customer-Centric Database, Revisited
I have been emphasizing the importance of a customer-centric view throughout this series, and I also shared some details regarding databases designed for marketing functions (refer to: “Cheat Sheet: Is Your Database Marketing Ready?”). But allow me to reiterate this point.

In the age of abundant and ubiquitous data, omnichannel marketing communication—optimized based on customers’ past transaction history, product preferences, and demographic and behavioral personas—should be an effortless routine. The reality is far from it for many organizations, as it is very common that much of the vital information is locked in silos without being properly consolidated or governed by a standard set of business rules. It is not that creating such a marketing-oriented database (or data-mart) is solely the IT department’s responsibility, but having a dedicated information source for efficient personalization should be an organizational priority in modern days.

Most databases nowadays are optimized for data collection, storage and rapid retrieval, and such design in general does not provide a customer-centric view—which is essential for any type of personalized communication via all conceivable channels and devices of the present and future. Using brand-, division-, product-, channel- or device-centric datasets is often the biggest obstacle in the journey to an optimal customer experience, as those describe events and transactions, not individuals. Further, bits and pieces of information must be transformed into answers to questions through advanced analytics, including statistical models.

In short, all analytical efforts must be geared toward meeting business objectives, and databases must be optimized for analytics (refer to “Chicken or the Egg? Data or Analytics?”). Unfortunately, the situation is completely reversed in many organizations, where analytical maneuvering is limited due to inadequate source data, and decision-making processes are dictated by limitations of available analytics. Visible symptoms of such cases are, to list a few, elongated project cycle time, decreasing response rates, ineffective customer communication, saturation of data sources due to overexposure, and—as I was emphasizing in this article—communication breakdown among divisions and team members. I can even go as far as to say that the lack of a properly designed analytical environment is the No. 1 cause of miscommunications between IT and marketing.

Without a doubt, key pieces of data must reside in the centralized data depository—generally governed by IT—for effective marketing. But that is only the beginning and still is just a part of the data collection process. Collected data must be consolidated around the solid definition of a “customer,” and all product-, transaction-, event- and channel-level information should be transformed into descriptors of customers, via data standardization, categorization, transformation and summarization. Then the data may be further enhanced via third-party data acquisition and statistical modeling, using all available data. In other words, raw data must be refined through these steps to be useful in marketing and other customer interactions, online or offline (refer to “‘Big Data’ Is Like Mining Gold for a Watch—Gold Can’t Tell Time“). It does not matter how well the original transaction- or event-level data are stored in the main database without visible errors, or what kind of state-of-the-art communication tool sets a company is equipped with. Trying to use raw data for a near real-time personalization engine is like putting unrefined oil into a high-performance sports car.

This whole data refinement process may sound like a daunting task, but it is not nearly as painful as analytical efforts to derive meanings out of unstructured, unconsolidated and uncategorized data that are scattered all over the organization. A customer-centric marketing database (call it a data-mart if “database” sounds too much like it should solely belong to IT) created with standard business rules and uniform variable sets would, in turn, provide an “analytics-ready” environment, where statistical modeling and other advanced analytics efforts would gain tremendous momentum. In the end, the decision-making process would become much more efficient as analytics would provide answers to questions, not just bits and pieces of fragmented data, to the ultimate beneficiaries of data. And answers to questions do not require an enormous data dictionary, either; fast-acting marketing machines do not have time to look up dictionaries, anyway.

Data Roadmap—Phased Approach
For the effort to build a consolidated marketing data platform that is analytics-ready (hence, marketing-ready), I always recommend a phased approach, as (1) the inevitable complexity of a data consolidation project will be contained and managed more efficiently in carefully defined phases, and (2) each phase will require different types of expertise, tool sets and technologies. Nevertheless, the overall project must be managed by an internal champion, along with a group of experts who possess long-term vision and tactical knowledge in both database and analytics technologies. That means this effort must reside above IT and marketing, and it should be seen as a strategic effort for the entire organization. If the company has already hired a Chief Data Officer, I would say that this should be one of the top priorities for that position. If not, outsourcing would be a good option, as an impartial decision-maker, who would play the role of a referee, may have to come from the outside.

The following are the major steps:

  1. Formulate Questions: “All of the above” is not a good way to start a complex project. In order to come up with the most effective way to build a centralized data depository, we first need to understand what questions must be answered by it. Too many database projects call for cars that must fly, as well.
  2. Data Inventory: Every organization has more data than it expected, and not all goldmines are in plain sight. All the gatekeepers of existing databases should be interviewed, and any data that could be valuable for customer descriptions or behavioral predictions should be considered, starting with product, transaction, promotion and response data, stemming from all divisions and marketing channels.
  3. Data Hygiene and Standardization: All available data fields should be examined and cleaned up, where some data may be discarded or modified. Free-form fields deserve special attention, as categorization and tagging are among the key steps to opening up new intelligence.
  4. Customer Definition: Any existing Customer ID systems (such as loyalty program ID, account number, etc.) will be examined. These may be further enhanced with available PII (personally identifiable information), as there could be inconsistencies among different systems, and customers often move their residency or use multiple email addresses, creating duplicate identities. A consistent and reliable Customer ID system becomes the backbone of a customer-centric database (a minimal sketch of this idea follows this list).
  5. Data Consolidation: Data from different silos and divisions will be merged together based on the master Customer ID. A customer-centric database begins to take shape here. The database update process should be thoroughly tested, as “incremental” updates are often found to be more difficult than the initial build. The job is simply not done until after a few successful iterations of updates.
  6. Data Transformation: Once a solid Customer ID system is in place, all transaction- and event-level data will be transformed to “descriptors” of individual customers, via summarization by categories and creation of analytical variables. For example, all product information will be aligned for each customer, and transaction data will be converted into personal-level monetary summaries and activities, in both static and time-series formats. Promotion and response history data will go through similar processes, yielding individual-level ROI metrics by channel and time period. This is the single-most critical step in all of this, requiring deep knowledge in business, data and analytics, as the stage is being set for all future analytics and reporting efforts. Due to variety and uniqueness of business goals in different industries, a one-size-fits-all approach will not work, either.
  7. Analytical Projects: Test projects will be selected and the entire process will be done on the new platform. Ad-hoc reporting and complex modeling projects will be conducted, and the results will be graded on timing, accuracy, consistency and user-friendliness. An iterative approach is required, as it is impossible to foresee all possible user requests and related complexities upfront. A database should be treated as a living, breathing organism, not something rigid and inflexible. Marketers will “break-in” the database as they use it more routinely.
  8. Applying the Knowledge: The outcomes of analytical projects will be applied to the entire customer base, and live campaigns will be run based on them. Often, major breakdowns happen at the large-scale deployment stages; especially when dealing with millions of customers and complex mathematical formulae at the same time. A model-ready database will definitely minimize the risk (hence, the term “in-database scoring”), but the process will still require some fine-tuning. To proliferate gained knowledge throughout the organization, some model scores—which pack deep intelligence in small sizes—may be transferred back to the main databases managed by IT. Imagine model scores driving operational decisions—live, on the ground.
  9. Result Analysis: Good marketing intelligence engines must be equipped with feedback mechanisms, effectively closing the “loop” where each iteration of marketing efforts improves its effectiveness with accumulated knowledge on a customer level. It is very unfortunate that many marketers just move through the tracks set up by their predecessors, mainly because existing database environments are not even equipped to link necessary data elements on a customer level. Too many back-end analyses are just event-, offer- or channel-driven, not customer-centric. Can you easily tell which customer is over-, under- or adequately promoted, based on a personal-level promotion-and-response ratio? With a customer-centric view established, you can.
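As a rough illustration of Step 4, here is a minimal sketch in Python with pandas, with hypothetical field names and invented records, of how inconsistent PII from two silos might be normalized into a single match key that anchors a master Customer ID. Real-world identity resolution involves far more rules and far messier data; this only shows the principle.

```python
import re
import pandas as pd

# Hypothetical account records from two silos; names, emails and ZIPs are
# entered inconsistently, so the same person shows up under different IDs.
accounts = pd.DataFrame({
    "source":    ["loyalty", "ecommerce", "loyalty"],
    "source_id": ["L-1001", "WEB-88", "L-2040"],
    "email":     ["Jane.Doe@Example.com ", "jane.doe@example.com", None],
    "last_name": ["Doe", "DOE", "Smith"],
    "zip_code":  ["07030-1234", "07030", "19801"],
})

def normalize_email(value):
    """Lowercase and trim; return None for blanks so they never match each other."""
    if not isinstance(value, str) or not value.strip():
        return None
    return value.strip().lower()

def normalize_zip(value):
    """Keep the 5-digit ZIP only."""
    if not isinstance(value, str):
        return None
    digits = re.sub(r"\D", "", value)
    return digits[:5] if len(digits) >= 5 else None

accounts["match_key"] = accounts["email"].map(normalize_email)
# Fallback key when email is missing: last name + 5-digit ZIP (coarser, riskier).
fallback = accounts["last_name"].str.lower().str.strip() + "|" + accounts["zip_code"].map(normalize_zip)
accounts["match_key"] = accounts["match_key"].fillna(fallback)

# Assign one master Customer ID per match key; this becomes the backbone
# that every silo's records hang off of.
accounts["master_id"] = accounts.groupby("match_key").ngroup().map(lambda n: f"CUST{n:06d}")
print(accounts[["source", "source_id", "match_key", "master_id"]])
```

In this toy example the two “Jane Doe” records collapse into one master ID, which is the whole point: every later step of consolidation and transformation hangs off that key.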

These are just high-level summaries of key steps, and each step should be managed as independent projects within a large-scale initiative with common goals. Some steps may run concurrently to reduce the overall timeline, and tactical knowledge in all required technologies and tool sets is the key for the successful implementation of centralized marketing intelligence.

Who Will Do the Work?
Then, who will be in charge of all this and who will actually do the work? As I mentioned earlier, a job of this magnitude requires a champion, and a CDO may be a good fit. But each of these steps will require different skill sets, so some outsourcing may be inevitable (more on how to pick an outsourcing partner in future articles).

But what should not happen is the IT team or the analytics team solely dictating the whole process. Creating a central depository of marketing intelligence is something that sits between IT and marketing, and the decisions must be made with business goals in mind, not just the limitations and challenges that IT faces. If the CDO or the champion of this type of initiative starts representing IT issues before overall business goals, then the project is doomed from the beginning. Again, it is not about touching the core database of the company, but realigning existing data assets to create new intelligence. Raw data (no matter how clean they are at the collection stage) are like unrefined raw materials to the users. What the decision-makers need are simple answers to their questions, not hundreds of data pieces.

From the user’s point of view, data should be:

  • Easy to understand and use (intuitive to non-mathematicians)
  • Bite-size (i.e., small answers, not mounds of raw data)
  • Useful and effective (consistently accurate)
  • Broad (answers should be available most of the time, not just “sometimes”)
  • Readily available (data should be easily accessible via users’ favorite devices/channels)

And getting to this point is the job of a translator who sits in between marketing and IT. Call them data scientists or data strategists, if you like. But they do not belong to just marketing or IT, even though they have to understand both sides really well. Do not be rigid, insisting that all pilots must belong to the Air Force; some pilots do belong to the Navy.

Lastly, let me add this at the risk of sounding like I am siding with technologists. Marketers, please don’t be that bad patient who shows up at a doctor’s office with a specific prescription, as in “Don’t ask me why, but just give me these pills, now.” I’ve even met an executive who wanted a neural-net model for his business without telling me why. I just said to myself, “Hmm, he must have been to one of those analytics conferences recently.” Then, after listening to his “business” issues, I prescribed an entirely different solution package.

So, instead of blurting out requests for pieces of data variables or queries using cool-sounding, semi-technical terms, state the business issues and challenges that you are facing as clearly as possible. IT and analytics specialists will prescribe the right solution for you if they understand the ultimate goals better. Too often, requesters determine the solutions they want without any understanding of underlying technical issues. Don’t forget that the end-users of any technology are only exposed to symptoms, not the causes.

And if Mr. Spock doesn’t seem to understand your issues and keeps saying that your statements are illogical, then call in a translator, even if you have to hire him for just one day. I know this all too well, because after all, this one phrase summarizes my entire career: “A bridge person between the marketing world and the IT world.” Although it ain’t easy to live a life as a marginal person.

Channel Collaboration or Web Cannibalization?

Multichannel marketers frequently worry that online is competing with, or “cannibalizing,” sales in other channels. It seems like a reasonable problem to consider for those responsible for the P&L of the retail business; the same goes for the general managers responsible for the store-level P&L.

I like to do something that we “digital natives” (professionals whose careers have been driven only by digital) miss all too often: We talk to the retail people in the stores, including customers, store managers, general managers, and sales and service staff. Imagine that … left-brain dominant Data Athletes who want to talk to people! Actually, a true Data Athlete will always engage the stakeholders to inform the analysis with tacit knowledge.

Every time we do this, we learn something about the customer that we quite frankly could not have gleaned from website analytics, transactional data or third-party data alone. We learn how different kinds of customers engage with the product, and what their experiences are like in an environment that, to this day, is far more immersive than anything we can create online. It’s nothing short of fascinating for the left-brainers. Moreover, that access and connection to the field does something powerful when we turn back to mining the data mass that grows daily. It creates context that inspires better analysis and greater performance.

This best practice may seem obvious, but it is missed so often. It is just too easy for a left-brain-dominant analyst to get “sucked into the data” first. The same thing happens in an online-only environment. I can’t count how many times I have sat with and coached truly brilliant Web analysts inside organizations who are talking through a data-backed hypothesis from Web analytics data, observing and measuring behaviors and drawing inferences … and they haven’t looked at the specific screens and treatments on the website or mobile app where those experiences are happening. They are disconnected from the consumer experience. If you look in your organization, odds are you’ll find examples of this kind of disconnect.

So Does The Web Compete with Retail Stores? Well, that depends.
While many businesses are seeing the same shift to digital consumption and engagement, especially on mobile devices, the evidence is clear that it’s a mistake to assume that you have a definitive answer. In fact, it is virtually always a nuanced answer that informs strategy and can help better-focus your investments in online and omnichannel marketing approaches.

In order to answer this question, you need a singular view of the customer. Sounds easy, I know. So here’s the first test of whether you are ready to answer that question:

How many customers do you have?

If you don’t know with precision, you’re not ready to determine if the Web is competing or “cannibalizing” retail sales.

More often than not, what you’ll hear is the number of transactions, the number of visitors (from Web analytics) or the number of email addresses or postal addresses on file—or some other “proxy” that’s considered relevant.

The challenge is that these proxy values for customer count mask a deeper problem. Without a well-thought-out data blending approach that converts transaction files into an actionable customer profile, we can’t begin to tell who bought what, and how many times.

Once we have this covered, we’re now able to begin constructing metrics and developing counts of orders by customer, over time periods.

Summarization is Key
If you want to act on the data, you’ll likely need to develop a summarization routine, one that breaks out order counts and order values. This isn’t trivial. Leaving this step out creates a material amount of work slicing the data later.

A few good examples of how you would summarize the data to answer the question by channel include totals (see the sketch after this list):

  • by month
  • by quarter
  • by year
  • last year
  • prior quarter
  • by customer lifetime
  • and many more
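For illustration only, here is a minimal sketch in Python with pandas, with made-up order data, of the kind of summarization routine described above: order counts and order values by customer, channel and quarter, plus a quick count of the overlap population that buys in both channels.

```python
import pandas as pd

# Hypothetical order-level file after customer IDs have been consolidated.
orders = pd.DataFrame({
    "master_id":   ["CUST01", "CUST01", "CUST02", "CUST02", "CUST03"],
    "channel":     ["web", "store", "web", "web", "store"],
    "order_date":  pd.to_datetime(["2024-01-10", "2024-02-14", "2023-12-02",
                                   "2024-03-03", "2024-03-19"]),
    "order_value": [80.0, 120.0, 45.0, 60.0, 200.0],
})
orders["quarter"] = orders["order_date"].dt.to_period("Q").astype(str)

# Per-customer totals by channel and quarter: order counts and order values.
summary = orders.pivot_table(
    index="master_id",
    columns=["channel", "quarter"],
    values="order_value",
    aggfunc=["count", "sum"],
    fill_value=0,
)

# The overlap population: customers with at least one order in BOTH channels.
by_channel = orders.groupby(["master_id", "channel"]).size().unstack(fill_value=0)
overlap = by_channel[(by_channel.get("web", 0) > 0) & (by_channel.get("store", 0) > 0)]

print(summary)
print(f"Overlap customers: {len(overlap)} of {orders['master_id'].nunique()}")
```

Once these snapshots exist for several periods, the “is preference shifting?” question becomes a simple comparison rather than a fresh data-wrangling project each time.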

Here’s The Key Takeaway: It’s not just one or the other.
Your customers buy across multiple channels. Across many brands and many datasets, we’ve always seen a different picture of the breakout between online and retail store transactions.

But you’re actually measuring the overlap and should focus your analysis on that overlap population. To go further, you’ll require summarization “snapshots” of the data so you can determine if the channel preference has changed over time.

The Bottom Line
While no one can say that the Web does or doesn’t definitively “cannibalize sales,” the evidence is overwhelming that buyers want to use the channel that is best for them for the specific product or service, at the time that works for them.

This being the case, it is almost inevitable that you will see omnichannel behaviors once your data is prepared and organized effectively enough to reveal that shift in behavior.

Oftentimes, that shift can effectively equate to buyers spending more across channels, as specific products may sell better in person. It’s hard to feel the silky qualities of a cashmere scarf online, but you might reorder razor blades only online.

The analysis should hardly stop at channel shift and channel preference. Layering in promotion consumption can tell you whether a buyer waits for the promotion online, or is more likely to buy “full-price” in a retail store. We’ve seen both of these frequently, but not always. Every data set is different.

Start by creating the most actionable customer file you can, integrating transactional, behavioral and lifestyle data, and the depth to which you can understand how customers choose among the channels you deliver becomes increasingly rich and actionable. Most of all, remember: it’s better to shift the sale to an alternative channel the customer prefers than to lose it to a competitor who did a better job.

Not All Databases Are Created Equal

Not all databases are created equal. No kidding. That is like saying that not all cars are the same, or not all buildings are the same. But somehow, “judging” databases isn’t so easy. First off, there is no tangible “tire” that you can kick when evaluating databases or data sources. Actually, kicking the tire is quite useless, even when you are inspecting an automobile. Can you really gauge the car’s handling, balance, fuel efficiency, comfort, speed, capacity or reliability based on how it feels when you kick “one” of the tires? I can guarantee that your toes will hurt if you kick it hard enough, and even then you won’t be able to tell the tire pressure within 20 psi. If you really want to evaluate an automobile, you will have to sign some papers and take it out for a spin (well, more than one spin, but you know what I mean). Then, how do we take a database out for a spin? That’s when the tool sets come into play.

However, even when the database in question is attached to analytical, visualization, CRM or drill-down tools, it is not so easy to evaluate it completely, as such practice reveals only a few aspects of a database, hardly all of them. That is because such tools are like window treatments of a building, through which you may look into the database. Imagine a building inspector inspecting a building without ever entering it. Would you respect the opinion of the inspector who just parks his car outside the building, looks into the building through one or two windows, and says, “Hey, we’re good to go”? No way, no sir. No one should judge a book by its cover.

In the age of Big Data (you should know by now that I am not too fond of that term), everything digitized is considered data. And data reside in databases. And databases are supposed to be designed to serve specific purposes, just like buildings and cars are. Although many modern databases are just mindless piles of accumulated data, granted that the database design is decent and functional, we can still imagine many different types of databases, depending on their purposes and contents.

Now, most of the Big Data discussions these days are about the platform, environment, or tool sets. I’m sure you heard or read enough about those, so let me boldly skip all that and their related techie words, such as Hadoop, MongoDB, Pig, Python, MapReduce, Java, SQL, PHP, C++, SAS or anything related to that elusive “cloud.” Instead, allow me to show you the way to evaluate databases—or data sources—from a business point of view.

For businesspeople and decision-makers, it is not about NoSQL vs. RDB; it is just about the usefulness of the data. And the usefulness comes from the overall content and database management practices, not just platforms, tool sets and buzzwords. Yes, tool sets are important, but concert-goers do not care much about the types and brands of musical instruments that are being used; they just care if the music is entertaining or not. Would you be impressed with a mediocre guitarist just because he uses the same brand of guitar that his guitar hero uses? Nope. Likewise, the usefulness of a database is not about the tool sets.

In my past column, titled “Big Data Must Get Smaller,” I explained that there are three major types of data, with which marketers can holistically describe their target audience: (1) Descriptive Data, (2) Transaction/Behavioral Data, and (3) Attitudinal Data. In short, if you have access to all three dimensions of the data spectrum, you will have a more complete portrait of customers and prospects. Because I already went through that subject in-depth, let me just say that such types of data are not the basis of database evaluation here, though the contents should be on top of the checklist to meet business objectives.

In addition, throughout this series, I have been repeatedly emphasizing that the database and analytics management philosophy must originate from business goals. Basically, the business objective must dictate the course for analytics, and databases must be designed and optimized to support such analytical activities. Decision-makers—and all involved parties, for that matter—suffer a great deal when that hierarchy is reversed. And unfortunately, that is the case in many organizations today. Therefore, let me emphasize that the evaluation criteria that I am about to introduce here are all about usefulness for decision-making processes and supporting analytical activities, including predictive analytics.

Let’s start digging into key evaluation criteria for databases. This list would be quite useful when examining internal and external data sources. Even databases managed by professional compilers can be examined through these criteria. The checklist could also be applicable to investors who are about to acquire a company with data assets (as in, “Kick the tire before you buy it.”).

1. Depth
Let’s start with the most obvious one. What kind of information is stored and maintained in the database? What are the dominant data variables in the database, and what is so unique about them? Variety of information matters for sure, and uniqueness is often related to specific business purposes for which databases are designed and created, along the lines of business data, international data, specific types of behavioral data like mobile data, categorical purchase data, lifestyle data, survey data, movement data, etc. Then again, mindless compilation of random data may not be useful for any business, regardless of the size.

Generally, data dictionaries (lack of one is a sure sign of trouble) reveal the depth of the database, but we need to dig deeper, as transaction and behavioral data are much more potent predictors, and harder to manage, in comparison to demographic and firmographic data, which are very much commoditized already. Likewise, lifestyle variables that are derived from surveys that may have been conducted a long time ago are far less valuable than actual purchase history data, as what people say they do and what they actually do are two completely different things. (For more details on the types of data, refer to the second half of “Big Data Must Get Smaller.”)

Innovative ideas should not be overlooked, as data packaging is often very important in the age of information overflow. If someone or some company transformed many data points into user-friendly formats using modeling or other statistical techniques (imagine pre-developed categorical models targeting a variety of human behaviors, or pre-packaged segmentation or clustering tools), such effort deserves extra points, for sure. As I emphasized numerous times in this series, data must be refined to provide answers to decision-makers. That is why the sheer size of the database isn’t so impressive, and the depth of the database is not just about the length of the variable list and the number of bytes that go along with it. So, data collectors, impress us—because we’ve seen a lot.

2. Width
No matter how deep the information goes, if the coverage is not wide enough, the database becomes useless. Imagine well-organized, buyer-level POS (point-of-sale) data coming from actual stores in “real-time” (though I am sick of this word, as it is also overused). The data go down to SKU-level details and payment methods. Now imagine that the data in question are collected in only two stores—one in Michigan, and the other in Delaware. This, by the way, is not a completely made-up story, and I faced similar cases in the past. Needless to say, we had to make many assumptions that we didn’t want to make in order to make the data useful, somehow. And I must say that it was far from ideal.

Even in the age when data are collected everywhere by every device, no dataset is ever complete (refer to “Missing Data Can Be Meaningful“). The limitations are everywhere. It could be about brand, business footprint, consumer privacy, data ownership, collection methods, technical limitations, distribution of collection devices, and the list goes on. Yes, Apple Pay is making a big splash in the news these days. But would you believe that the data collected only through Apple iPhone can really show the overall consumer trend in the country? Maybe in the future, but not yet. If you can pick only one credit card type to analyze, such as American Express for example, would you think that the result of the study is free from any bias? No siree. We can easily assume that such analysis would skew toward the more affluent population. I am not saying that such analyses are useless. And in fact, they can be quite useful if we understand the limitations of data collection and the nature of the bias. But the point is that the coverage matters.

Further, even within multisource databases in the market, the coverage should be examined variable by variable, simply because some data points are really difficult to obtain even by professional data compilers. For example, any information that crosses between the business and the consumer world is sparsely populated in many cases, and the “occupation” variable remains mostly blank or unknown on the consumer side. Similarly, any data related to young children is difficult or even forbidden to collect, so a seemingly simple variable, such as “number of children,” is left unknown for many households. Automobile data used to be abundant on a household level in the past, but a series of laws made sure that access to such data is now restricted for many users. Again, don’t be impressed with the existence of some variables in the data menu, but look into them to see “how much” is actually available.
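One simple way to put “how much is available” into numbers is a fill-rate check per variable. Here is a minimal sketch in Python with pandas, using invented fields and values, of that idea.

```python
import pandas as pd
import numpy as np

# Hypothetical slice of a compiled consumer file; some fields are sparsely populated.
df = pd.DataFrame({
    "household_id": range(1, 9),
    "occupation":   ["teacher", None, None, None, "nurse", None, None, None],
    "num_children": [None, 2, None, None, None, None, 1, None],
    "home_value":   [310000, 275000, np.nan, 420000, 199000, 350000, np.nan, 280000],
})

# Fill rate per variable: the share of records where the field is actually populated.
fill_rate = df.notna().mean().sort_values()
print(fill_rate.map("{:.0%}".format))

# A variable that exists in the data dictionary but is only 25 percent populated
# answers far fewer questions than the data menu suggests.
```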

3. Accuracy
In any scientific analysis, a “false positive” is a dangerous enemy. In fact, it is worse than not having the information at all. Many folks just assume that any data coming out of a computer are accurate (as in, “Hey, the computer says so!”). But data are not completely free from human errors.

Sheer accuracy of information is hard to measure, especially when the data sources are unique and rare. And the errors can happen in any stage, from data collection to imputation. If there are other known sources, comparing data from multiple sources is one way to ensure accuracy. Watching out for fluctuations in distributions of important variables from update to update is another good practice.

Nonetheless, the overall quality of the data is not just up to the person or department who manages the database. Yes, in this business, the last person who touches the data is responsible for all the mistakes that were made to it up to that point. However, when the garbage goes in, the garbage comes out. So, when there are errors, everyone who touched the database at any point must share in the burden of guilt.

Recently, I was part of a project that involved data collected from retail stores. We ran all kinds of reports and tallies to check the data, and edited many data values out when we encountered obvious errors. The funniest one that I saw was the first name “Asian” and the last name “Tourist.” As an openly Asian-American person, I was semi-glad that they didn’t put in “Oriental Tourist” (though I still can’t figure out who decided that word is for objects, but not people). We also found names like “No info” or “Not given.” Heck, I saw in the news that a refugee from Afghanistan (he was a translator for the U.S. troops) obtained a new first name, “Fnu,” as he was granted an entry visa. That would be short for “First Name Unknown,” now the first name in his new passport. Welcome to America, Fnu. Compared to that, “Andolini” becoming “Corleone” on Ellis Island is almost cute.

Data entry errors are everywhere. When I used to deal with data files from banks, I found that many last names were “Ira.” Well, it turned out that it wasn’t really the customers’ last names, but they all happened to have opened “IRA” accounts. Similarly, movie phone numbers like 777-555-1234 are very common. And fictitious names, such as “Mickey Mouse,” or profanities that are not fit to print are abundant, as well. At least fake email addresses can be tested and eliminated easily, and erroneous addresses can be corrected by time-tested routines, too. So, yes, maintaining a clean database is not so easy when people freely enter whatever they feel like. But it is not an impossible task, either.

We can also train employees regarding data entry principles, to a certain degree. (As in, “Do not enter your own email address,” “Do not use bad words,” etc.). But what about user-generated data? Search and kill is the only way to do it, and the job would never end. And the meta-table for fictitious names would grow longer and longer. Maybe we should just add “Thor” and “Sponge Bob” to that Mickey Mouse list, while we’re at it. Yet, dealing with this type of “text” data is the easy part. If the database manager in charge is not lazy, and if there is a bit of a budget allowed for data hygiene routines, one can avoid sending emails to “Dear Asian Tourist.”

Numeric errors are much harder to catch, as numbers do not look wrong to human eyes. That is when comparison to other known sources becomes important. If such examination is not possible on a granular level, then median value and distribution curves should be checked against historical transaction data or known public data sources, such as U.S. Census Data in the case of demographic information.

When it’s about the company’s own data, follow your instincts and get rid of data that look too good or too bad to be true. We all can afford to lose a few records in our databases, and there is nothing wrong with deleting the “outliers” with extreme values. Erroneous names, like “No Information,” may be attached to a seven-figure lifetime spending sum, and you know that can’t be right.
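Here is a minimal sketch in Python with pandas, with invented records, block-list entries and thresholds, of the kind of sanity check described above: flag placeholder or fictitious names and spending values that look too extreme to be true, then review them before they drive any decision.

```python
import pandas as pd

# Hypothetical customer summary with a few suspicious records.
customers = pd.DataFrame({
    "first_name": ["Mary", "No Information", "James", "Mickey", "Ann"],
    "last_name":  ["Lopez", "Not Given", "Kim", "Mouse", "Ira"],
    "lifetime_spend": [1250.0, 8750000.0, 430.0, 95.0, 2100.0],
})

# A small (and never complete) block list of placeholder and fictitious names.
junk_names = {"no information", "not given", "unknown", "test", "mickey mouse", "asian tourist"}
full_name = (customers["first_name"] + " " + customers["last_name"]).str.lower()
is_junk = full_name.isin(junk_names) | customers["first_name"].str.lower().isin(junk_names)

# Flag extreme spend values as well, e.g., anything beyond the 99th percentile.
cap = customers["lifetime_spend"].quantile(0.99)
is_extreme = customers["lifetime_spend"] > cap

suspects = customers[is_junk | is_extreme]
print(suspects)
```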

The main takeaways are: (1) Never trust the data just because someone bothered to store them in computers, and (2) Constantly look for bad data in reports and listings, at times using old-fashioned eye-balling methods. Computers do not know what is “bad,” until we specifically tell them what bad data are. So, don’t give up, and keep at it. And if it’s about someone else’s data, insist on data tallies and data hygiene stats.

4. Recency
Outdated data are really bad for prediction or analysis, and that is a different kind of badness. Many call it a “Data Atrophy” issue, as no matter how fresh and accurate a data point may be today, it will surely deteriorate over time. Yes, data have a finite shelf-life, too. Let’s say that you obtained a piece of information called “Golf Interest” on an individual level. That information could be coming from a survey conducted a long time ago, or some golf equipment purchase data from a while ago. In any case, someone who is attached to that flag may have stopped shopping for new golf equipment, as he doesn’t play much anymore. Without a proper database update and a constant feed of fresh data, irrelevant data will continue to drive our decisions.

The crazy thing is that, the harder it is to obtain certain types of data—such as transaction or behavioral data—the faster they will deteriorate. By nature, transaction or behavioral data are time-sensitive. That is why it is important to install time parameters in databases for behavioral data. If someone purchased a new golf driver, when did he do that? Surely, having bought a golf driver in 2009 (“Hey, time for a new driver!”) is different from having purchased it last May.

So-called “Hot Line Names” literally cease to be hot after two to three months, or in some cases much sooner. The evaporation period may be different for different product types, as one may stay in the market longer for an automobile than for a new printer. Part of the job of a data scientist is to defer the expiration date of data, finding leads or prospects who are still “warm,” or even “lukewarm,” with available valid data. But no matter how much statistical work goes into making the data “look” fresh, eventually the models will cease to be effective.
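As a small illustration of installing time parameters, here is a sketch in Python with pandas, with made-up dates and cutoffs, that converts a last-purchase date into recency buckets. The 3-, 12- and 36-month cutoffs are purely illustrative; the right windows depend on the product category.

```python
import pandas as pd

# Hypothetical last-purchase dates attached to each customer record.
customers = pd.DataFrame({
    "customer_id": ["C01", "C02", "C03", "C04"],
    "last_purchase": pd.to_datetime(["2025-05-20", "2025-01-04", "2023-08-15", "2019-06-30"]),
})

today = pd.Timestamp("2025-06-30")  # fix "today" so the flags are reproducible
customers["months_since"] = (today - customers["last_purchase"]).dt.days / 30.44

# Simple recency buckets; the cutoffs are illustrative, not gospel.
customers["recency_flag"] = pd.cut(
    customers["months_since"],
    bins=[-1, 3, 12, 36, float("inf")],
    labels=["hotline", "warm", "lukewarm", "expired"],
)
print(customers)
```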

For decision-makers who do not make real-time decisions, a real-time database update could be an expensive solution. But the databases must be updated constantly (I mean daily, weekly, monthly or even quarterly). Otherwise, someone will eventually end up making a wrong decision based on outdated data.

5. Consistency
No matter how much effort goes into keeping the database fresh, not all data variables will be updated or filled in consistently. And that is the reality. The interesting thing is that, especially when using them for advanced analytics, we can still produce decent predictions if the data are consistent. It may sound crazy, but even not-so-accurate data can be used in predictive analytics, if they are “consistently” wrong. Modeling is about developing an algorithm that differentiates targets from non-targets, and if the descriptive variables are “consistently” off (or outdated, like census data from five years ago) on both sides, the model can still perform.

Conversely, if there is a huge influx of a new type of data, or any drastic change in data collection or in a business model that supports such data collection, all bets are off. We may end up predicting such changes in business models or in methodologies, not the differences in consumer behavior. And that is one of the worst kinds of errors in the predictive business.

Last month, I talked about dealing with missing data (refer to “Missing Data Can Be Meaningful“), and I mentioned that data can be inferred via various statistical techniques. And such data imputation is OK, as long as it returns consistent values. I have seen so many so-called professionals messing up popular models, like “Household Income,” from update to update. If the inferred values jump dramatically due to changes in the source data, there is no amount of effort that can save the targeting models that employed such variables, short of re-developing them.

That is why a time-series comparison of important variables in databases is so important. Any changes of more than 5 percent in distribution of variables when compared to the previous update should be investigated immediately. If you are dealing with external data vendors, insist on having a distribution report of key variables for every update. Consistency of data is more important in predictive analytics than sheer accuracy of data.
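A simple drift check along those lines might look like the following sketch, written in Python with pandas and using made-up distributions. It compares the share of each value band between two updates and flags anything that moved more than five percentage points.

```python
import pandas as pd

# Hypothetical distribution of an inferred "Household Income" band in two updates.
previous = pd.Series({"<50K": 0.28, "50-100K": 0.41, "100-150K": 0.21, "150K+": 0.10})
current  = pd.Series({"<50K": 0.22, "50-100K": 0.40, "100-150K": 0.27, "150K+": 0.11})

# Percentage-point shift per band; anything moving more than 5 points gets flagged.
shift = (current - previous).abs()
flagged = shift[shift > 0.05]

if not flagged.empty:
    print("Investigate before releasing the update:")
    print(flagged.map("{:.1%} shift".format))
```

Run against every key variable at every update, a report like this is cheap insurance for the models that depend on those variables.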

6. Connectivity
As I mentioned earlier, there are many types of data. And the predictive power of data multiplies as different types of data get to be used together. For instance, demographic data, which is quite commoditized, still plays an important role in predictive modeling, even when dominant predictors are behavioral data. It is partly because no one dataset is complete, and because different types of data play different roles in algorithms.

The trouble is that many modern datasets do not share any common matching keys. On the demographic side, we can easily imagine using PII (Personally Identifiable Information), such as name, address, phone number or email address for matching. Now, if we want to add some transaction data to the mix, we would need some match “key” (or a magic decoder ring) by which we can link it to the base records. Unfortunately, many modern databases completely lack PII, right from the data collection stage. The result is that such a data source would remain in a silo. It is not like all is lost in such a situation, as they can still be used for trend analysis. But to employ multisource data for one-to-one targeting, we really need to establish the connection among various data worlds.

Even if the connection cannot be made to household, individual or email levels, I would not give up entirely, as we can still target based on IP addresses, which may lead us to some geographic denominations, such as ZIP codes. I’d take ZIP-level targeting anytime over no targeting at all, even though there are many analytical and summarization steps required for that (more on that subject in future articles).

Not having PII or any hard matchkey is not a complete deal-breaker, but the maneuvering space for analysts and marketers decreases significantly without it. That is why the existence of PII, or even ZIP codes, is the first thing that I check when looking into a new data source. I would like to free them from isolation.
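
To make the idea concrete, here is a simple sketch of that kind of match-level “waterfall,” written in Python with pandas; the field names and the matching hierarchy are purely illustrative assumptions, not a recommended standard:

```python
# A rough sketch of a match-key "waterfall": link an external record to the base file
# by the strongest key available, falling back to ZIP-level targeting when no PII
# match is possible. Field names and the hierarchy are purely illustrative.
import pandas as pd

def best_match_level(row: pd.Series) -> str:
    if pd.notna(row.get("email")):
        return "email"          # individual-level match
    if pd.notna(row.get("last_name")) and pd.notna(row.get("zip")):
        return "name_zip"       # household-level match
    if pd.notna(row.get("zip")):
        return "zip_only"       # geographic (ZIP-level) targeting only
    return "unmatched"          # stays in its silo

external = pd.DataFrame({
    "email":     ["a@example.com", None, None, None],
    "last_name": ["Lee", "Kim", None, None],
    "zip":       ["19103", "19104", "19106", None],
})
external["match_level"] = external.apply(best_match_level, axis=1)
print(external)
```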

7. Delivery Mechanisms
Users judge databases based on the visualization or reporting tool sets that are attached to the database. As I mentioned earlier, that is like judging the entire building based just on the window treatments. But for many users, that is the reality. After all, how would a casual user without a programming or statistical background even “see” the data? Through tool sets, of course.

But that is only one end of it. There are so many types of platforms and devices, and the data must flow through them all. The important point is that data are useless if they are not in the hands of decision-makers, through the device of their choice, at the right time. Such flow can be actualized via API feeds, FTP, or good, old-fashioned batch installments, and no database should stay too far away from the decision-makers. In my earlier column, I emphasized that data players must be good at (1) Collection, (2) Refinement, and (3) Delivery (refer to “Big Data is Like Mining Gold for a Watch—Gold Can’t Tell Time“). Delivering the answers to inquirers properly closes one iteration of the information flow. And the data must continue to flow to the users.

8. User-Friendliness
Even when state-of-the-art (I apologize for using this cliché) visualization, reporting or drill-down tool sets are attached to the database, if the data variables are too complicated or not intuitive, users will get frustrated and eventually move away from it. If that happens after pouring a sick amount of money into any data initiative, that would be a shame. But it happens all the time. In fact, I am not going to name names here, but I saw a ridiculously hard-to-understand data dictionary from a major data broker in the U.S.; it looked like the data layout was designed by robots, for robots. Please. Data scientists must try to humanize the data.

This whole Big Data movement has a momentum now, and in the interest of not killing it, data players must make every aspect of this data business easy for the users, not harder. Simpler data fields, intuitive variable names, meaningful value sets, pre-packaged variables in forms of answers, and completeness of a data dictionary are not too much to ask after the hard work of developing and maintaining the database.

This is why I insist that data scientists and professionals must be businesspeople first. The developers should never forget that end-users are not trained data experts. And guess what? Even professional analysts would appreciate intuitive variable sets and complete data dictionaries. So, pretty please, with sugar on top, make things easy and simple.

9. Cost
I saved this important item for last for a good reason. Yes, the dollar sign is a very important factor in all business decisions, but it should not be the sole deciding factor when it comes to databases. That means CFOs should not dictate the decisions regarding data or databases without considering the input from CMOs, CTOs, CIOs or CDOs who should be, in turn, concerned about all the other criteria listed in this article.

Playing with the data costs money. And, at times, a lot of money. When you add up all the costs for hardware, software, platforms, tool sets, maintenance and, most importantly, the man-hours for database development and maintenance, the sum becomes very large very fast, even in the age of the open-source environment and cloud computing. That is why many companies outsource the database work to share the financial burden of having to create infrastructures. But even in that case, the quality of the database should be evaluated based on all criteria, not just the price tag. In other words, don’t just pick the lowest bidder and hope to God that it will be alright.

When you purchase external data, you can also apply these evaluation criteria. A test-match job with a data vendor will reveal lots of the details listed here; metrics such as match rate and variable fill-rate, along with a complete data dictionary, should be carefully examined. In short, what good is a lower unit price per 1,000 records, if the match rate is horrendous and even the matched data are filled with missing or sub-par inferred values? Also consider that, once you commit to an external vendor and start building models and an analytical framework around their data, it becomes very difficult to switch vendors later on.

When shopping for external data, consider the following when it comes to pricing options:

  • Number of variables to be acquired: Don’t just go for the full option. Pick the ones that you need (involve analysts), unless you get a fantastic deal for an all-inclusive option. Generally, most vendors provide multiple-packaging options.
  • Number of records: Processed vs. Matched. Some vendors charge based on “processed” records, not just matched records. Depending on the match rate, it can make a big difference in total cost (see the quick arithmetic sketch after this list).
  • Installment/update frequency: Real-time, weekly, monthly, quarterly, etc. Think carefully about how often you would need to refresh “demographic” data, which doesn’t change as rapidly as transaction data, and how big the incremental universe would be for each update. Obviously, a real-time API feed can be costly.
  • Delivery method: API vs. Batch Delivery, for example. Price, as well as the data menu, change quite a bit based on the delivery options.
  • Availability of a full-licensing option: When the internal database becomes really big, full installment becomes a good option. But you would need the internal capability for a match-and-append process that involves “soft-match,” using similar names and addresses (imagine good-old name and address merge routines). It becomes a bit of a commitment, as the match-and-append becomes a part of the internal database update process.
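
Here is a quick back-of-the-envelope sketch, in Python, of the “processed vs. matched” billing point above; every figure is hypothetical:

```python
# Hypothetical comparison of billing on processed vs. matched records.
# Plug in your own volumes, match rate and unit prices.
records_sent = 1_000_000
match_rate = 0.55                 # assume 55% of records match the vendor's file
price_per_1000 = 12.00            # assumed dollars per 1,000 records

cost_billed_on_processed = records_sent / 1000 * price_per_1000
cost_billed_on_matched = records_sent * match_rate / 1000 * price_per_1000

print(f"Billed on processed records: ${cost_billed_on_processed:,.2f}")
print(f"Billed on matched records:   ${cost_billed_on_matched:,.2f}")
# At a 55% match rate, the same unit price nearly doubles the effective cost
# per usable (matched) record when billing is based on processed volume.
```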

Business First
Evaluating a database is a project in itself, and these nine evaluation criteria will serve as a good guideline. Depending on the business, of course, more conditions could be added to the list. And that leads to the final point, which I did not even include in the list: The database (or all data, for that matter) should be useful in meeting the business goals.

I have been saying that “Big Data Must Get Smaller,” and this whole Big Data movement should be about (1) Cutting down on the noise, and (2) Providing answers to decision-makers. If the data sources in question do not serve the business goals, cut them out of the plan, or cut loose the vendor if they are from external sources. It would be an easy decision if you “know” that the database in question is filled with dirty, sporadic and outdated data that cost lots of money to maintain.

But if that database is needed for your business to grow, clean it, update it, expand it and restructure it to harness better answers from it. Just like the way you’d maintain your cherished automobile to get more mileage out of it. Not all databases are created equal for sure, and some are definitely more equal than others. You just have to open your eyes to see the differences.

Missing Data Can Be Meaningful

No matter how big the Big Data gets, we will never know everything about everything. Well, according to the super-duper computer called “Deep Thought” in the movie “The Hitchhiker’s Guide to the Galaxy” (don’t bother to watch it if you don’t care for the British sense of humour), the answer to “The Ultimate Question of Life, the Universe, and Everything” is “42.” Coincidentally, that is also my favorite number to bet on (I have my reasons), but I highly doubt that even that huge fictitious computer with unlimited access to “everything” provided that numeric answer with conviction after 7½ million years of computing and checking. At best, that “42” is an estimated figure of a sort, based on some fancy algorithm. And in the movie, even Deep Thought pointed out that “the answer is meaningless, because the beings who instructed it never actually knew what the Question was.” Ha! Isn’t that what I have been saying all along? For any type of analytics to be meaningful, one must properly define the question first. And what to do with the answer that comes out of an algorithm is entirely up to us humans, or in the business world, the decision-makers. (Who are probably human.)

Analytics is about making the best of what we know. Good analysts do not wait for a perfect dataset (it will never come by, anyway). And businesspeople have no patience to wait for anything. Big Data is big because we digitize everything, and everything that is digitized is stored somewhere in forms of data. For example, even if we collect mobile device usage data from just pockets of the population with certain brands of mobile services in a particular area, the sheer size of the resultant dataset becomes really big, really fast. And most unstructured databases are designed to collect and store what is known. If you flip that around to see if you know every little behavior through mobile devices for “everyone,” you will be shocked to see how small the size of the population associated with meaningful data really is. Let’s imagine that we can describe human beings with 1,000 variables coming from all sorts of sources, out of 200 million people. How many would have even 10 percent of the 1,000 variables filled with some useful information? Not many, and definitely not 100 percent. Well, we have more data than ever in the history of mankind, but still not for every case for everyone.

In my previous columns, I pointed out that decision-making is about ranking different options, and that to rank anything properly, we must employ predictive analytics (refer to “It’s All About Ranking“). And for ranking based on the scores resulting from predictive models to be effective, the datasets must be summarized to the level that is to be ranked (e.g., individuals, households, companies, emails, etc.). That is why transaction or event-level datasets must be transformed to “buyer-centric” portraits before any modeling activity begins. Again, it is not about the transaction or the products, but about the buyers, if you are doing all this to do business with people.

The trouble with buyer- or individual-centric databases is that such transformation of the data structure creates lots of holes. Even if you have meticulously collected every transaction record that matters (and that will be the day), if someone did not buy a certain item, any variable that is created based on the purchase record of that particular item will have nothing to report for that person. Likewise, if you have a whole series of variables to differentiate online and offline channel behaviors, what would the online portion contain if the consumer in question never bought anything through the Web? Absolutely nothing. But in the business of predictive analytics, what did not happen is as important as what happened. Even a simple concept of “response” is only meaningful when compared to “non-response,” and the difference between the two groups becomes the basis for the “response” model algorithm.

Capturing the Meanings Behind Missing Data
Missing data are all around us. And there are many reasons why they are missing, too. It could be that there is nothing to report, as in aforementioned examples. Or, there could be errors in data collection—and there are lots of those, too. Maybe you don’t have access to certain pockets of data due to corporate, legal, confidentiality or privacy reasons. Or, maybe records did not match properly when you tried to merge disparate datasets or append external data. These things happen all the time. And, in fact, I have never seen any dataset without a missing value since I left school (and that was a long time ago). In school, the professors just made up fictitious datasets to emphasize certain phenomena as examples. In real life, databases have more holes than Swiss cheese. In marketing databases? Forget about it. We all make do with what we know, even in this day and age.

Then, let’s ask some philosophical questions here:

  • If missing data are inevitable, what do we do about it?
  • How would we record them in databases?
  • Should we just leave them alone?
  • Or should we try to fill in the gaps?
  • If so, how?

The answer to all this is definitely not 42, but I’ll tell you this: Even missing data have meanings, and not all missing data are created equal, either.

Furthermore, missing data often contain interesting stories behind them. For example, certain demographic variables may be missing only for extremely wealthy people and very poor people, as their residency data are generally not exposed (for different reasons, of course). And that, in itself, is a story. Likewise, some data may be missing in certain geographic areas or for certain age groups. Collection of certain types of data may be illegal in some states. “Not” having any data on online shopping behavior or mobile activity may mean something interesting for your business, if we dig deeper into it without falling into the trap of predicting legal or corporate boundaries, instead of predicting consumer behaviors.

In terms of how to deal with missing data, let’s start with numeric data, such as dollars, days, counters, etc. Some numeric data simply may not be there, if there is no associated transaction to report. Now, if they are about “total dollar spending” and “number of transactions” in a certain category, for example, they can be initiated as zero and remain as zero in cases like this. The counter simply did not start clicking, and it can be reported as zero if nothing happened.

Some numbers are incalculable, though. If you are calculating “Average Amount per Online Transaction,” and there is no online transaction for a particular customer, that is a mathematical singularity—we can’t divide anything by zero. In such cases, the average amount should be recorded as “.”, blank, or any value that represents a pure missing value. But it should never be recorded as zero. And that is the key in dealing with missing numeric information: Zero should be reserved for real zeros, and nothing else.

I have seen too many cases where missing numeric values are filled with zeros, and I must say that such a practice is definitely frowned upon. If you have to pick just one takeaway from this article, that’s it. Like I emphasized, not all missing values are the same, and zero is not the way you record them. Zeros should never represent a lack of information.
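
A minimal Python sketch of that rule, with pandas: a known “nothing happened” counter stays at zero, while an incalculable average is stored as a true missing value. The column names are illustrative only:

```python
# Zero is reserved for real zeros; an undefined average stays as a true missing value.
import numpy as np
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2],
    "online_orders": [4, 0],        # real zero: customer 2 never ordered online
    "online_dollars": [236.0, 0.0],
})

# "Average Amount per Online Transaction" is undefined when there are no transactions
customers["avg_online_order"] = np.nan          # default: true missing, NOT zero
has_orders = customers["online_orders"] > 0
customers.loc[has_orders, "avg_online_order"] = (
    customers.loc[has_orders, "online_dollars"] / customers.loc[has_orders, "online_orders"]
)
print(customers)
```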

Take the example of a popular demographic variable, “Number of Children in the Household.” This is a very predictive variable—not just for the purchase behavior of children’s products, but for many other things. Now, it is a simple number, but it should never be treated as a simple variable—as, in this case, lack of information is not evidence of non-existence. Let’s say that you are purchasing this data from a third-party data compiler (or a data broker). If you don’t see a positive number in that field, it could be because:

  1. The household in question really does not have a child;
  2. Even the data-collector doesn’t have the information; or
  3. The data collector has the information, but the household record did not match to the vendor’s record, for some reason.

If that field contains a number like 1, 2 or 3, that’s easy, as it will represent the number of children in that household. But zero should be reserved for cases where the data collector has a positive confirmation that the household in question indeed does not have any children. If it is unknown, it should be marked as blank or “.” (many statistical software packages, such as SAS, record missing values this way), or as “U” (though an alpha character should not be in a numeric field).

If it is a case of non-match to the external data source, then there should be a separate indicator for it. The fact that the record did not match to a professional data compiler’s list may mean something. And I’ve seen cases where such non-match indicators made it into model algorithms along with other valid data, as in the case where missing indicators for income display the same directional tendency as high-income households.

Now, if the data compiler in question boldly inputs zeros for the cases of unknowns? Take a deep breath, fire the vendor, and don’t deal with the company again, as it is a sign that its representatives do not know what they are doing in the data business. I have done so in the past, and you can do it, too. (More on how to shop for external data in future articles.)

For non-numeric categorical data, similar rules apply. Some values could be truly “blank,” and those should be treated separately from “Unknown,” or “Not Available.” As a practice, let’s list all kinds of possible missing values in codes, texts or other character fields:

  • ” “—blank or “null”
  • “N/A,” “Not Available,” or “Not Applicable”
  • “Unknown”
  • “Other”—If it is originating from some type of multiple choice survey or pull-down menu
  • “Not Answered” or “Not Provided”—This indicates that the subjects were asked, but they refused to answer. Very different from “Unknown.”
  • “0”—In this case, the answer can be expressed in numbers. Again, only for known zeros.
  • “Non-match”—Not matched to other internal or external data sources
  • Etc.

It is entirely possible that all these values may be highly correlated to each other and move along the same predictive direction. However, there are many cases where they do not. And if they are combined into just one value, such as zero or blank, we will never be able to detect such nuances. In fact, I’ve seen many cases where one or more of these missing indicators move together with other “known” values in models. Again, missing data have meanings, too.
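
As an illustration of keeping those flavors separate, here is a short Python sketch; the codes follow the list above, while the field name and the indicator approach are just one possible convention:

```python
# Keep each missing "flavor" as its own category instead of collapsing them into
# one blank or zero, so a model can pick up on each one separately.
import pandas as pd

codes = pd.Series(
    ["Unknown", "Not Answered", "0", "Non-match", "Other", "Unknown", "N/A"],
    name="children_code",
)

# One indicator variable per code value
indicators = pd.get_dummies(codes, prefix="children")
print(indicators.sum())   # how often each code appears across the file
```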

Filling in the Gaps
Nonetheless, missing data do not have to be left as missing, blank or unknown all the time. With statistical modeling techniques, we can fill in the gaps with projected values. You didn’t think that all those data compilers really knew the income level of every household in the country, did you? It is not a big secret that many of those figures are modeled with other available data.

Such inferred statistics are everywhere. Popular variables, such as householder age, home owner/renter indicator, housing value, household income or—in the case of business data—the number of employees and sales volume contain modeled values. And there is nothing wrong with that, in the world where no one really knows everything about everything. If you understand the limitations of modeling techniques, it is quite alright to employ modeled values—which are much better alternatives to highly educated guesses—in decision-making processes. We just need to be a little careful, as models often fail to predict extreme values, such as household incomes over $500,000/year, or specific figures, such as incomes of $87,500. But “ranges” of household income, for example, can be predicted at a high confidence level, though it technically requires many separate algorithms and carefully constructed input variables in various phases. But such technicality is an issue that professional number crunchers should deal with, like in any other predictive businesses. Decision-makers should just be aware of the reality of real and inferred data.

Such imputation practices can be applied to any data source, not just the databases compiled by professional data brokers. Statisticians often impute values when they encounter missing values, and there are many different methods of imputation. I haven’t met two statisticians who completely agree with each other when it comes to imputation methodologies, though. That is why it is important for an organization to have a unified rule for each variable regarding its imputation method (or lack thereof). When multiple analysts employ different methods, that often becomes the very source of inconsistent or erroneous results at the application stage. It is always more prudent to have the calculation done upfront, and to store the inferred values in a consistent manner in the main database.

In terms of how that is done, there could be a long debate among the mathematical geeks. Will it be a simple average of non-missing values? If such a method is to be employed, what is the minimum required fill-rate of the variable in question? Surely, you do not want to project 95 percent of the population with 5 percent known values? Or will the missing values be replaced with modeled values, as in previous examples? If so, what would be the source of target data? What about potential biases that may exist because of data collection practices and their limitations? What should be the target definition? In what kind of ranges? Or should the target definition remain as a continuous figure? How would you differentiate modeled and real values in the database? Would you embed indicators for inferred values? Or would you forego such flags in the name of speed and convenience for users?

The important matter is not the rules or methodologies, but the consistency of them throughout the organization and the databases. That way, all users and analysts will have the same starting point, no matter what the analytical purposes are. There could be a long debate in terms of what methodology should be employed and deployed. But once the dust settles, all data fields should be treated by pre-determined rules during the database update processes, avoiding costly errors in the downstream. All too often, inconsistent imputation methods lead to inconsistent results.
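For example, a single organization-wide rule applied during the database update might look something like the Python sketch below; the median rule, the indicator flag and the field names are assumptions for illustration, not an endorsement of any particular method:

```python
# A minimal sketch of one pre-determined imputation rule applied at update time,
# with an explicit flag for inferred values. Rule and field names are illustrative.
import pandas as pd

def impute_income(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["income_imputed"] = df["household_income"].isna()     # keep a flag for inferred values
    fill_value = df["household_income"].median()             # the agreed-upon rule
    df["household_income"] = df["household_income"].fillna(fill_value)
    return df

base = pd.DataFrame({"household_income": [85000.0, None, 42000.0, None]})
print(impute_income(base))
```
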

If, by some chance, individual statisticians end up with freedom to come up with their own ways to fill in the blanks, then the model-scoring code in question must include missing value imputation algorithms without an exception, granted that such practice will elongate the model application processes and significantly increase chances for errors. It is also important that non-statistical users should be educated about the basics of missing data and associated imputation methods, so that everyone who has access to the database shares a common understanding of what they are dealing with. That list includes external data providers and partners, and it is strongly recommended that data dictionaries must include employed imputation rules wherever applicable.

Keep an Eye on the Missing Rate
Often, we only find out that the missing rate of certain variables has gone out of control after models become ineffective and campaigns start to yield disappointing results. Conversely, it can be stated that fluctuations in missing data ratios greatly affect the predictive power of models or any related statistical work. It goes without saying that a consistent influx of fresh data matters more than the construction and the quality of models and algorithms. It is a classic garbage-in-garbage-out scenario, and that is why good data governance practices must include a time-series comparison of the missing rate of every critical variable in the database. If, all of a sudden, an important predictor’s fill-rate drops below a certain point, no analyst in this world can sustain the predictive power of the model algorithm, unless it is rebuilt with a whole new set of variables. The shelf life of models is definitely finite, but nothing deteriorates the effectiveness of models faster than inconsistent data. And a fluctuating missing rate is a good indicator of such inconsistency.

Likewise, if the model score distribution starts to deviate from the original model curve from the development and validation samples, it is prudent to check the missing rate of every variable used in the model. Any sudden changes in model score distribution are a good indicator that something undesirable is going on in the database (more on model quality control in future columns).
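
One common way to quantify such a deviation is a population stability index (PSI) computed against the development sample; the sketch below, in Python, is a generic industry convention rather than anything specific to this column, and the bin count and red-flag threshold are illustrative:

```python
# A rough sketch of monitoring model score drift with a population stability index.
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI between development-sample scores and current scores."""
    cuts = np.percentile(expected, np.linspace(0, 100, bins + 1))
    cuts[0], cuts[-1] = -np.inf, np.inf                     # catch out-of-range scores
    exp_pct = np.histogram(expected, bins=cuts)[0] / len(expected)
    act_pct = np.histogram(actual, bins=cuts)[0] / len(actual)
    exp_pct = np.clip(exp_pct, 1e-6, None)                  # avoid log(0)
    act_pct = np.clip(act_pct, 1e-6, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

rng = np.random.default_rng(42)
dev_scores = rng.normal(0.5, 0.10, 10_000)                  # development/validation sample
new_scores = rng.normal(0.55, 0.12, 10_000)                 # a shifted current distribution
print(f"PSI = {psi(dev_scores, new_scores):.3f}")           # > ~0.25 is often treated as a red flag
```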

These few guidelines regarding the treatment of missing data will add more flavors to statistical models and analytics in general. In turn, proper handling of missing data will prolong the predictive power of models, as well. Missing data have hidden meanings, but they are revealed only when they are treated properly. And we need to do that until the day we get to know everything about everything. Unless you are just happy with that answer of “42.”

Beyond RFM Data

In the world of predictive analytics, the transaction data is the king of the hill. The master of the domain. The protector of the realm. Why? Because they are hands-down the most powerful predictors. If I may borrow the term that my mentor coined for our cooperative venture more than a decade ago (before anyone even uttered the word “Big Data”), “The past behavior is the best predictor of the future behavior.” Indeed. Back then, we had built a platform that nowadays could easily have qualified as Big Data. The platform predicted people’s future behaviors on a massive scale, and it worked really well, so I still stand by that statement.

How so? At the risk of sounding like a pompous mathematical smartypants (I’m really not), it is because people do not change that much, or if they do, not so rapidly. Every move you make is on some predictive curve. What you have been buying, clicking, browsing, smelling or coveting somehow leads to the next move. Well, not all the time. (Maybe you just like to “look” at pretty shoes?) But with enough data, we can calculate with some confidence the probability that you would be an outdoors type, or a golfer, or a relaxing type on a cruise ship, or a risk-averse investor, or a wine enthusiast, or into fashion, or a passionate gardener, or a sci-fi geek, or a professional wrestling fan. Beyond the affinity scores listed here, we can predict the future value of each customer or prospect and possible attrition points, as well. And behind all those predictive models (and I have seen countless algorithms), the leading predictors are mostly transaction data, if you are lucky enough to get your hands on them. In the age of ubiquitous data and at the dawn of the “Internet of Things,” more marketers will be in that lucky group if they are diligent about data collection and refinement. Yes, in the near future, even a refrigerator will be able to order groceries, but don’t forget that only the collection mechanism will be different there. We still have to collect, refine and analyze the transaction data.

Last month, I talked about three major types of data (refer to “Big Data Must Get Smaller“), which are:
1. Descriptive Data
2. Behavioral Data (mostly Transaction Data)
3. Attitudinal Data.

If you gain access to all three elements with decent coverage, you will have tremendous predictive power when it comes to human behaviors. Unfortunately, it is really difficult to accumulate attitudinal data on a large scale with individual-level details (i.e., knowing who’s behind all those sentiments). Behavioral data, mostly in forms of transaction data, are also not easy to collect and maintain (non-transaction behavioral data are even bigger and harder to handle), but I’d say it is definitely worth the effort, as most of what we call Big Data fall under this category. Conversely, one can just purchase descriptive data, which are what we generally call demographic or firmographic data, from data compilers or brokers. The sellers (there are many) will even do the data-append processing for you and they may also throw in a few free profile reports with it.

Now, when we start talking about the transaction data, many marketers will respond “Oh, you mean RFM data?” Well, that is not completely off-base, because “Recency, Frequency and Monetary” data certainly occupy important positions in the family of transaction data. But they hardly are the whole thing, and the term is misused as frequently as “Big Data.” Transaction data are so much more than simple RFM variables.

RFM Data Is Just a Good Start
The term RFM should be used more as a checklist for marketers, not as design guidelines—or limitations, in many cases—for data professionals. How recently did this particular customer purchase our product, how frequently did she do that, and how much money did she spend with us? Answering these questions is a good start, but stopping there would seriously limit the potential of transaction data. Further, this line of questioning would lead the interrogation efforts to simple “filtering,” as in: “Select all customers who purchased anything with a price tag over $100 more than once in the past 12 months.” Many data users may think that this query is somewhat complex, but it really is just a one-dimensional view of the universe. And unfortunately, no customer is one-dimensional. This query is also just one slice of truth from the marketer’s point of view, not the customer’s. If you want to get really deep, the view must be “buyer-centric,” not product-, channel-, division-, seller- or company-centric. And the database structure should reflect that view (refer to “It’s All About Ranking,” where the concept of the “Analytical Sandbox” is introduced).

Transaction data by definition describe the transactions, not the buyers. If you would like to describe a buyer or if you are trying to predict the buyer’s future behavior, you need to convert the transaction data into “descriptors of the buyers” first. What is the difference? It is the same data looked at through a different window—front vs. side window—but the effect is huge.

Even if we think about just one simple transaction with one item, instead of describing the shopping basket as “transaction happened on July 3, 2014, containing Coldplay’s latest CD ‘Ghost Stories’ priced at $11.88,” a buyer-centric description would read: “A recent CD buyer in the Rock genre with an average spending level in the music category under $20.” The trick is to describe the buyer, not the product or the transaction. If that customer has many orders and items in his purchase history (let’s say he downloaded a few songs to his portable devices, as well), the description of the buyer becomes much richer. If you collect all of his past purchase history, it gets even more colorful, as in: “A recent music CD or MP3 buyer in the rock, classical and jazz genres, with 24-month purchases totaling 13 orders containing 16 items, total spending in the $100-$150 range and an $11 average order size.” Of course, you would store all this using many different variables (such as genre indicators, number of orders, number of items, total dollars spent during the past 24 months, average order amount and number of weeks since the last purchase in the music category). But the point is that the story comes out this way when you change the perspective.

Creating a Buyer-Centric Portrait
The whole process of creating a buyer-centric portrait starts with data summarization (or de-normalization). A typical structure of the table (or database) that needs to capture every transaction detail, such as transaction date and amount, would require an entry for every transaction, and database designers call that the “normal” state. As I explained in my previous article (refer to “It’s All About Ranking”), if you would like to rank customers in terms of their value, the data records must be on a customer level, as well. If you are ranking households or companies, you would then need to summarize the data on those levels, too.

Now, this summarization (or de-normalization) is not a process of eliminating duplicate entries of names, as you wouldn’t want to throw away any transaction details. If there are multiple orders per person, what is the total number of orders? What is the total amount of spending on an individual level? What would be average spending level per transaction, or per year? If you are allowed to have only one line of entry per person, how would you summarize the purchase dates, as you cannot just add them up? In that case, you can start with the first and last transaction date of each customer. Now, when you have the first and last transaction date for every customer, what would be the tenure of each customer and what would be the number of days since the last purchase? How many days, on average, are there in between orders then? Yes, all these figures are related to basic RFM metrics, but they are far more colorful this way.

The attached exhibit displays a very simple example of a before and after picture of such summarization process. On the left-hand side, there resides a typical order table containing customer ID, order number, order date and transaction amount. If a customer has multiple orders in a given period, an equal number of lines are required to record the transaction details. In real life, other order level information, such as payment method (very predictive, by the way), tax amount, discount or coupon amount and, if applicable, shipping amount would be on this table, as well.

On the right-hand side of the chart, you will find there is only one line per customer. As I mentioned in my previous columns, establishing a consistent and accurate customer ID cannot be neglected—for this reason alone. How would you rely on the summary data if one person may have multiple IDs? The customer may have moved to a new address, or shopped from multiple stores or sites, or there could have been errors in data collection. Relying on email address is a big no-no, as we all carry many email addresses. That is why the first step of building a functional marketing database is to go through the data hygiene and consolidation process. (There are many data processing vendors and software packages for it.) Once a persistent customer (or individual) ID system is in place, you can add up the numbers to create customer-level statistics, such as total orders, total dollars, and first and last order dates, as you see in the chart.
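
In practice, that summarization step can be as simple as a group-by, sketched here in Python with pandas; the table mirrors the exhibit only loosely, and all column names, dates and amounts are made up for illustration:

```python
# A minimal sketch of the order-table-to-customer-level summary described above.
import pandas as pd

orders = pd.DataFrame({
    "customer_id": ["C001", "C001", "C002", "C001", "C003"],
    "order_date":  pd.to_datetime(
        ["2014-01-15", "2014-03-02", "2014-02-20", "2014-06-11", "2014-05-30"]),
    "amount":      [25.00, 80.50, 15.99, 42.25, 120.00],
})

today = pd.Timestamp("2014-07-01")   # "as of" the database update date

summary = orders.groupby("customer_id").agg(
    total_orders=("amount", "size"),
    total_dollars=("amount", "sum"),
    first_order=("order_date", "min"),
    last_order=("order_date", "max"),
)
summary["avg_order_amount"] = summary["total_dollars"] / summary["total_orders"]
summary["days_since_last_order"] = (today - summary["last_order"]).dt.days
summary["tenure_days"] = (today - summary["first_order"]).dt.days
print(summary)
```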

Remember R, F, M, P and C
The real fun begins when you combine these numeric summary figures with product, channel and other important categorical variables. Because product (or service) and channel are the most distinctive dividers of customer behaviors, let’s just add P and C to the famous RFM (remember, we are using RFM just as a checklist here), and call it R, F, M, P and C.

Product (rather, product category) is an important separator, as people often show completely different spending behavior for different types of products. For example, you can send me fancy-shmancy fashion catalogs all you want, but I won’t look at them with any intention of purchasing, as most men will look at the models and not what they are wearing. So my active purchase history in the sports, home electronics or music categories won’t mean anything in the fashion category. In other words, those so-called “hotline” names should be treated differently for different categories.

Channel information is also important, as there are active online buyers who would never buy certain items, such as apparel or home furnishing products, without physically touching them first. For example, even in the same categories, I would buy guitar strings or golf balls online. But I would not purchase a guitar or a driver without trying them out first. Now, when I say channel, I mean the channel that the customer used to make the purchase, not the channel through which the marketer chose to communicate with him. Channel information should be treated as a two-way street, as no marketer “owns” a customer through a particular channel (refer to “The Future of Online is Offline“).

As an exercise, let’s go back to the basic RFM data and create some actual variables. For “each” customer, we can start with basic RFM measures, as exhibited in the chart:

· Number of Transactions
· Total Dollar Amount
· Number of Days (or Weeks) since the Last Transaction
· Number of Days (or Weeks) since the First Transaction

Notice that the days are counted from today’s point of view (practically the day the database is updated), as the actual date’s significance changes as time goes by (e.g., a day in February would feel different when looked back on from April vs. November). “Recency” is a relative concept; therefore, we should relativize the time measurements to express it.

From these basic figures, we can derive other related variables, such as:

· Average Dollar Amount per Customer
· Average Dollar Amount per Transaction
· Average Dollar Amount per Year
· Lifetime Highest Amount per Item
· Lifetime Lowest Amount per Transaction
· Average Number of Days Between Transactions
· Etc., etc…

Now, imagine you have all these measurements by channel—such as retail, Web, catalog, phone or mail-in—and separately by product category. If you imagine a gigantic spreadsheet, the summarized table would have fewer rows, but a seemingly endless number of columns. I will discuss categorical and non-numeric variables in future articles. But for this exercise, let’s just imagine having these sets of variables for all major product categories. The result is that the recency factor now becomes more like “Weeks since Last Online Order”—not just any order. Frequency measurements would be more like “Number of Transactions in the Dietary Supplement Category”—not just for any product. Monetary values can be expressed as “Average Spending Level in the Outdoor Sports Category through the Online Channel”—not just the customer’s average dollar amount, in general.
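
A rough sketch of that slicing, again in Python with pandas; the channels, categories and the flattened naming convention are illustrative assumptions:

```python
# Slice the same RFM-style measures by channel and product category, so "frequency"
# becomes transactions per category per channel, and so on. All names are made up.
import pandas as pd

orders = pd.DataFrame({
    "customer_id": ["C001", "C001", "C002", "C002", "C001"],
    "channel":     ["web", "retail", "web", "web", "web"],
    "category":    ["music", "outdoor", "music", "supplements", "music"],
    "amount":      [11.88, 150.00, 9.99, 35.00, 14.50],
})

wide = orders.pivot_table(
    index="customer_id",
    columns=["channel", "category"],
    values="amount",
    aggfunc=["count", "sum"],            # frequency and monetary, per slice
    fill_value=0,
)
# Flatten the MultiIndex into readable variable names, e.g. count_web_music
wide.columns = ["_".join(col) for col in wide.columns]
print(wide)
```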

Why stop there? We may slice and dice the data by offer type, customer status, payment method or time interval (e.g., lifetime, 24-month, 48-month, etc.) as well. I am not saying that all the RFM variables should be cut out this way, but having “Number of Transactions by Payment Method,” for example, could be very revealing about the customer, as everybody uses multiple payment methods, while some may never use a debit card for a large purchase. All these little measurements become building blocks in predictive modeling. Now, too many variables can also be troublesome. And knowing the balance (i.e., knowing where to stop) comes from experience and preliminary analysis. That is when experts and analysts should be consulted for this type of uniform variable creation. Nevertheless, the point is that RFM variables are not just three simple measures that happen to be a part of the larger transaction data menu. And we didn’t even touch non-transaction-based behavioral elements, such as clicks, views, miles or minutes.

The Time Factor
So, if such data summarization is so useful for analytics and modeling, should we always include everything that has been collected since the inception of the database? The answer is yes and no. Sorry for being cryptic here, but it really depends on what your product is all about; how the buyers would relate to it; and what you, as a marketer, are trying to achieve. As for going back forever, there is a danger in that kind of data hoarding, as “Life-to-Date” data always favors tenured customers over new customers who have a relatively short history. In reality, many new customers may have more potential in terms of value than a tenured customer with lots of transaction records from a long time ago, but with no recent activity. That is why we need to create a level playing field in terms of time limit.

If a “Life-to-Date” summary is not ideal for predictive analytics, then where should you place the cutoff line? If you are selling cars or home furnishing products, we may need to look at a 4- to 5-year history. If your products are consumables with relatively short purchase cycles, then a 1-year examination would be enough. If your product is seasonal in nature—like gardening, vacation or heavily holiday-related items, then you may have to look at a minimum of two consecutive years of history to capture seasonal patterns. If you have mixed seasonality or longevity of products (e.g., selling golf balls and golf clubs sets through the same store or site), then you may have to summarize the data with multiple timelines, where the above metrics would be separated by 12 months, 24 months, 48 months, etc. If you have lifetime value models or any time-series models in the plan, then you may have to break the timeline down even more finely. Again, this is where you may need professional guidance, but marketers’ input is equally important.
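
To show what summarizing the data with multiple timelines might look like mechanically, here is a small Python sketch; the 12-, 24- and 48-month windows, the cutoff date and the field names are just examples:

```python
# Compute the same summary measures over several time windows so no customer is
# unfairly favored by a "Life-to-Date" view. Window lengths and names are illustrative.
import pandas as pd

orders = pd.DataFrame({
    "customer_id": ["C001", "C001", "C002", "C003"],
    "order_date":  pd.to_datetime(["2010-05-01", "2014-03-15", "2013-11-20", "2014-06-01"]),
    "amount":      [500.00, 80.00, 45.00, 120.00],
})
as_of = pd.Timestamp("2014-07-01")

frames = []
for months in (12, 24, 48):
    window_start = as_of - pd.DateOffset(months=months)
    recent = orders[orders["order_date"] >= window_start]
    stats = recent.groupby("customer_id")["amount"].agg(["count", "sum"])
    stats.columns = [f"orders_{months}m", f"dollars_{months}m"]
    frames.append(stats)

summary = pd.concat(frames, axis=1).fillna(0)
print(summary)   # the 2010 order drops out of every window, leveling the playing field
```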

Analytical Sandbox
Lastly, who should be doing all of this data summary work? I talked about the concept of the “Analytical Sandbox,” where all types of data conversion, hygiene, transformation, categorization and summarization are done in a consistent manner, and analytical activities, such as sampling, profiling, modeling and scoring are done with proper toolsets like SAS, R or SPSS (refer to “It’s All About Ranking“). The short and final answer is this: Do not leave that to analysts or statisticians. They are the main players in that playground, not the architects or developers of it. If you are serious about employing analytics for your business, plan to build the Analytical Sandbox along with the team of analysts.

My goal as a database designer has always been serving the analysts and statisticians with “model-ready” datasets on silver platters. My promise to them has been that the modelers would spend no time fixing the data. Instead, they would be spending their valuable time thinking about the targets and statistical methodologies to fulfill the marketing goals. After all, answers that we seek come out of those mighty—but often elusive—algorithms, and the algorithms are made of data variables. So, in the interest of getting the proper answers fast, we must build lots of building blocks first. And no, simple RFM variables won’t cut it.

When Viral Marketing Goes Too Far

A couple of years ago, our local newspaper, The Philadelphia Inquirer, ran a disturbing story about how a mortgage loan company in Phoenix had sent spam advertising messages which appeared on the screens of thousands of wireless phone customers. Not only were the messages not requested, but these customers had to pay to retrieve them.

In the United States, phone numbers are allocated to wireless companies in blocks of 10,000, all beginning with the same three-digit prefix following the area code. The text messaging address for each mobile phone is derived from the phone number assigned to each customer’s handset and the wireless company’s name. This means that an advertiser can simply choose any three-digit prefix in an area code and send a message to 10,000 people by changing the last four digits after the prefix.

One industry analyst noted that this is just the tip of the iceberg. This type of spam is cheap and easy for advertisers to use. Wireless text messaging is widely used in the U.S.; and, while some carriers are taking precautions to protect their customers from text message advertising, so far neither the direct marketing industry nor the federal government has been able to control this form of spam. As the president of the mortgage company noted, the advertising had brought in new clients and “There still isn’t any rule against emailing.” Online, the concept of “permission marketing” is similarly tossed aside each day with the receipt of unsolicited promotional emails.

We call this indiscriminate solicitation of prospective customers one variation of the “Casanova Complex” customer acquisition model, reflective of the 18th century Italian adventurer, perhaps best known for his many female “conquests.” In the haste to bring in customers, companies can often forget to court the right customers, those who represent the best long-term revenue potential, or who won’t overtax the company’s customer service and support structure.

If offline instances of the Casanova Complex are a disease, then it is an epidemic among Internet companies. Many online retail sites have engaged in sweepstakes and other customer generation programs. Their objectives, they say, are to create “viral” promotions which create excitement for their sites and build their databases of available names both inexpensively and quickly. In one instance, a portal site which runs more than 1,000 websites featuring links to other sites signed up 50,000 registrants in a “Win Up to $4,000” game. Another sweepstakes program secured 126,000 registrants. An online travel products retailer, offering 1 million air miles to the winner, generated more than 60,000 names in 90 days, almost all of whom were new to the site.

The big issue for any of these sites is—do these promotions and schemes draw attractive customers who can then be cultivated over time through the various marketing tools available today? And, once these customers are on board, are companies doing enough of the right things to keep them? Or is this just another extrapolation of the Casanova Complex? As one site marketing executive said: “This is a great, low-cost way for us to acquire new names. The jury’s still out on how many of those new people will come back.” Companies involved in developing or using promotional tools like sweepstakes, unsolicited email, or wireless spam seem inclined, though, at least for the moment, to believe that these possibilities generally don’t apply to them.

For traditional offline companies, the Internet may be “commoditizing” their industry or undermining customer relationships. Many brick and mortar CEOs say a key corporate goal is to transition more of their offline customers to online, self-transactional usage. Why? Because an online transaction costs dramatically less than a brick-and-mortar transaction, there is less risk for service error, and the company can more effectively capture and leverage information from an online transaction, to cite a few advantages. Certainly, the transactional advantages of e-commerce are very appealing. But what about the effects on loyalty—especially for new customers?

One of the important ways both online and offline companies can discipline themselves to avoid the Casanova Complex is to apply personalization in all contact with customers, both new and established. This, at least, gives companies a better chance of establishing the basis of a value-based, viral relationship with these customers.

While it’s been estimated that more than 80 percent of e-commerce sites have customer and visitor email personalization capabilities, less than 10 percent of the sites used personalization in follow-on marketing campaigns. For websites favoring incentive devices like sweepstakes and frontal-assault “push” email programs to attract potential customers, personalized communication is perhaps the best opportunity to demonstrate ongoing interest in customers—especially new ones.

Personalization is at the heart of the “relationship” in successful online CRM programs. Ultimately, it’s what makes any CRM effort viral.

Big Data Must Get Smaller

Like many folks who worked in the data business for a long time, I don’t even like the words “Big Data.” Yeah, data is big now, I get it. But so what? Faster and bigger have been the theme in the computing business since the first calculator was invented. In fact, I don’t appreciate the common definition of Big Data that is often expressed in the three Vs: volume, velocity and variety. So, if any kind of data are big and fast, it’s all good? I don’t think so. If you have lots of “dumb” data all over the place, how does that help you? Well, as much as all the clutter that’s been piled on in your basement since 1971. It may yield some profit on an online auction site one day. Who knows? Maybe some collector will pay good money for some obscure Coltrane or Moody Blues albums that you never even touched since your last turntable (Ooh, what is that?) died on you. Those oversized album jackets were really cool though, weren’t they?

Seriously, the word “Big” only emphasizes the size element, and that is a sure way to miss the essence of the data business. And many folks are missing even that little point by calling all decision-making activities that involve even small-sized data “Big Data.” It is entirely possible that this data stuff seems all new to someone, but the data-based decision-making process has been with us for a very long time. If you use that “B” word to differentiate old-fashioned data analytics of yesteryear from the ridiculously large datasets of the present day, yes, that is a proper usage of it. But we all know most people do not mean it that way. One side benefit of this bloated and hyped-up buzzword is that data professionals like myself no longer have to spend 20 minutes explaining what we do for a living; simply uttering the words “Big Data” will do, though that is a lot like a grandmother declaring that all her grandchildren work on computers for a living. Better yet, that magic “B” word sometimes opens doors to new business opportunities (or at least a chance to grab a microphone in non-data-related meetings and conferences) that data geeks of the past never dreamed of.

So, I guess it is not all that bad. But lest we forget, all hype leads to overinvestment, all overinvestment leads to disappointment, and all disappointment leads to the purging of related personnel and vendors that bear that hyped-up dirty word in their titles or division names. If this Big Data stuff does not yield significant profit (or reduction in cost), I am certain that those investment bubbles will burst soon enough. Yes, some data folks may be lucky enough to milk it for another two or three years, but brace for impact if all those collected data do not lead to some serious dollar signs. I know storage and processing costs have decreased significantly in recent years, but they ain’t totally free, and related man-hours aren’t exactly cheap, either. Also, if this whole data business is a new concept to an organization, any money spent on the promise of Big Data easily becomes a liability for the reluctant bunch.

This is why I open up my speeches and lectures with this question: “Have you made any money with this Big Data stuff yet?” Surely, you didn’t spend all that money to provide faster toys and nicer playgrounds to IT folks? Maybe the head of IT had some fun with it, but let’s ask that question of CFOs, not CTOs, CIOs or CDOs. I know some colleagues (i.e., fellow data geeks) who are already thinking about a new name for this—”decision-making activities, based on data and analytics”—because many of us will still be doing that “data stuff” even after the judgment day when Big Data ceases to be cool. Yeah, that Gangnam Style dance was fun for a while, but who still jumps around like a horse?

Now, if you ask me (though nobody has yet), I’d say Big Data should have been called “Smart Data,” “Intelligent Data” or something to that extent. Because data must provide insights. Answers to questions. Guidance to decision-makers. To data professionals, piles of data—especially the ones that are fragmented, unstructured and unformatted, no matter what kind of fancy names the operating system and underlying database technology may bear—are just a good start. For non-data-professionals, unrefined data—whether they are big or small—would remain distant and obscure. Offering mounds of raw data to end-users is like providing a painting kit when someone wants a picture on the wall. Bragging about the size of the data with impressive-sounding new measurements that end with “bytes” is like counting grains of rice in California in front of a hungry man.

Big Data must get smaller. People want yes/no answers to their specific questions. If such clarity is not possible, probability figures to such questions should be provided; as in, “There’s an 80 percent chance of thunderstorms on the day of the company golf outing,” “An above-average chance to close a deal with a certain prospect” or “Potential value of a customer who is repeatedly complaining about something on the phone.” It is about easy-to-understand answers to business questions, not a quintillion bytes of data stored in some obscure cloud somewhere. As I stated at the end of my last column, the Big Data movement should be about (1) Getting rid of the noise, and (2) Providing simple answers to decision-makers. And getting to such answers is indeed the process of making data smaller and smaller.

In my past columns, I talked about the benefits of statistical models in the age of Big Data, as they are the best way to compact big and complex information into simple answers (refer to “Why Model?”). Models built to predict (or point out) who is more likely to be into outdoor sports, to be a risk-averse investor, to go on a cruise vacation, to be a member of a discount club, to buy children’s products, to be a bigtime donor or to be a NASCAR fan, are all providing specific answers to specific questions, while each model score is the result of a serious reduction of information, often compressing thousands of variables into one answer. That simplification process in itself provides incredible value to decision-makers, as most wouldn’t know where to cut out unnecessary information to answer specific questions. Using mathematical techniques, we can cut down the noise with conviction.

In model development, “Variable Reduction” is the first major step after the target variable is determined (refer to “The Art of Targeting”). It is often the most rigorous and laborious exercise in the whole model development process, and it is where the character of a model is often determined, as each statistician has his or her own approach to it. Now, I am not about to initiate a debate about the best statistical method for variable reduction (I haven’t met two statisticians who completely agree with each other in terms of methodologies), but I happen to know that many effective statistical analysts separate variables by data type and treat them differently. In other words, not all data variables are created equal. So, what are the major types of data that database designers and decision-makers (i.e., non-mathematical types) should be aware of?
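
I am not suggesting this is how any particular statistician does it, but as a rough, assumption-laden sketch in Python, a first pass at variable reduction might simply drop near-constant columns and one member of every highly correlated pair (the 0.95 threshold below is arbitrary):

```python
import numpy as np
import pandas as pd

def reduce_variables(df: pd.DataFrame, corr_threshold: float = 0.95) -> pd.DataFrame:
    """Crude first-pass variable reduction on a numeric DataFrame."""
    # 1. Drop columns with (almost) no variation -- they cannot predict anything.
    keep = df.loc[:, df.nunique() > 1]

    # 2. For every highly correlated pair, drop one of the two columns.
    corr = keep.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > corr_threshold).any()]
    return keep.drop(columns=to_drop)
```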

In the business of predictive analytics for marketing, the following three types of data make up three dimensions of a target individual’s portrait:

  1. Descriptive Data
  2. Transaction Data / Behavioral Data
  3. Attitudinal Data

In other words, if we get to know all three aspects of a person, it becomes much easier to predict what the person is about and/or what the person will do. Why do we need all three dimensions? If an individual has a high income and lives in a highly valued home (a demographic element, which is descriptive), and if he is an avid golfer (a behavioral element, often derived from his purchase history), can we just assume that he is politically conservative (an attitudinal element)? Well, not really, and not all the time. Sometimes we have to stop and ask what the person’s attitude and outlook on life are all about. Now, because it is not practical to ask everyone in the country about every subject, we often build models to predict the attitudinal aspect with available data. If you got a phone call from a political party that “assumed” your political stance, that incident was probably not random or accidental. As I have emphasized many times, analytics is about making the best of what is available, as there is no such thing as a complete dataset, even in this age of ubiquitous data. Nonetheless, each of these three dimensions of the data spectrum occupies a unique and distinct place in the business of predictive analytics.

So, in the interest of obtaining, maintaining and utilizing all possible types of data (or, conversely, reducing the size of data with conviction by knowing what to ignore), let us dig a little deeper:

Descriptive Data
Generally, demographic data—such as people’s income, age, number of children, housing size, dwelling type, occupation, etc.—fall under this category. For B-to-B applications, “firmographic” data—such as number of employees, sales volume, year started, industry type, etc.—would be considered descriptive data. These data are about what the targets “look like,” and, generally, they are frozen in the present time. Many prominent data compilers (or data brokers, as the U.S. government calls them) collect, compile and refine the data and make hundreds of variables available to users in various industry sectors. They also fill in the blanks using predictive modeling techniques. In other words, the compilers may not know the income range of every household, but using statistical techniques and other available data—such as age, home ownership, housing value and many other variables—they provide their best estimates where values are missing. People often have an allergic reaction to such data compilation practices, citing privacy concerns, but these types of data are not about looking up one person at a time; they are about analyzing and targeting groups (or segments) of individuals and households. In terms of predictive power, they are quite effective, and the results are very consistent. The best part is that most of the variables are available for every household in the country, whether actual or inferred.
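
As a toy illustration of that “filling in the blanks” idea (the tiny table, the column names and the model choice here are all hypothetical; actual compilers use far more data and their own proprietary methods), the mechanics look roughly like this in Python:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical household table with some known and some missing incomes.
households = pd.DataFrame({
    "age":           [34, 58, 42, 29, 61, 47],
    "home_owner":    [1, 1, 0, 0, 1, 1],
    "housing_value": [280_000, 450_000, 0, 0, 520_000, 310_000],
    "income":        [72_000, None, 48_000, 39_000, None, 85_000],
})

known = households[households["income"].notna()]
unknown = households[households["income"].isna()]
features = ["age", "home_owner", "housing_value"]

# Learn the relationship between income and the other descriptors...
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(known[features], known["income"])

# ...and fill in a best estimate wherever the actual value is missing.
households.loc[unknown.index, "income"] = model.predict(unknown[features])
print(households)
```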

Other types of descriptive data include geo-demographic data, and the Census data from the U.S. Census Bureau fall under this category. These datasets are organized by geographic denominations such as Census Block Group, Census Tract, County or ZIP Code Tabulation Area (ZCTA, much like postal ZIP codes, but not exactly the same). Although they are not available on an individual or household level, the Census data are very useful in predictive modeling, as every target record can be enhanced with them, even when name and address are not available, and the data themselves are very stable. The downside is that, while the datasets are free through the Census Bureau, the raw datasets contain more than 40,000 variables. Plus, due to budget cuts and changes in survey methods over the past decade, the sample size (yes, they sample) decreased significantly, rendering some variables useless at lower geographic denominations, such as the Census Block Group. There are professional data companies that have narrowed down the list of variables to manageable sizes (300 to 400 variables) and filled in the missing values. Because they are geo-level data, the variables come in the form of percentages, averages or median values of elements such as gender, race, age, language, occupation, education level, real estate value, etc. (as in percent male, percent Asian, percent white-collar professionals, average income, median school years, median rent, etc.).

There are many instances where marketers cannot pinpoint the identity of a person due to privacy issues or challenges in data collection, and the Census data then serve as an effective substitute for individual- or household-level demographic data. In predictive analytics, duller variables that are available nearly all the time are often more valuable than precise information with limited availability.

Transaction Data/Behavioral Data
While descriptive data are about what the targets look like, behavioral data are about what they actually did. Often, behavioral data come in the form of transactions, so many just call them transaction data. What marketers commonly refer to as RFM (Recency, Frequency and Monetary) data fall under this category. In terms of predictive power, they are truly at the top of the food chain. Yes, we can build models to guess who potential golfers are with demographic data, such as age, gender, income, occupation, housing value and other neighborhood-level information, but if you get to “know” that someone buys a box of golf balls every six weeks or so, why guess? Further, models built with transaction data can even predict the nature of future purchases, in terms of monetary value and frequency intervals. Unfortunately, many who have access to RFM data are using them only for rudimentary filtering, as in “select everyone who spent more than $200 in the gift category during the past 12 months,” or something like that. But we can do so much more with rich transaction data in every stage of the marketing life cycle: prospecting, cultivating, retaining and winning back.
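
For the curious, deriving basic RFM measures from a raw transaction table takes only a few lines; the toy table and column names below are made up for illustration:

```python
import pandas as pd

# Toy transaction table (one row per purchase).
txn = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "txn_date": pd.to_datetime(["2024-01-05", "2024-03-20", "2023-11-02",
                                "2024-02-14", "2024-04-01", "2024-04-10"]),
    "amount": [120.0, 45.0, 300.0, 60.0, 75.0, 20.0],
})

as_of = pd.Timestamp("2024-05-01")

# Recency, Frequency and Monetary value, summarized per customer.
rfm = txn.groupby("customer_id").agg(
    recency_days=("txn_date", lambda d: (as_of - d.max()).days),
    frequency=("txn_date", "count"),
    monetary=("amount", "sum"),
)
print(rfm)
```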

Other types of behavioral data include non-transaction data, such as click data, page views, abandoned shopping baskets or movement data. This type of behavioral data is getting a lot of attention, as it is truly “big.” Such data had been out of reach for many decision-makers before the emergence of new technologies to capture and store them. In terms of predictive power, however, they are not as strong as real transaction data. Non-transaction data may provide directional guidance, as they are what some data geeks call “a-camera-on-everyone’s-shoulder” data. But we all know that there is a clear dividing line between people’s intentions and their commitments. And it can be very costly to follow every breath you take, every move you make and every step you take. Due to their distinct characteristics, transaction data and non-transaction data must be managed separately. And if they are used together in models, they should be clearly labeled, so the analysts will never treat them the same way by accident. You really don’t want to mix intentions and commitments.

The trouble with behavioral data is that (1) they are difficult to compile and manage, (2) they get big (sometimes really big), (3) they are generally confined within divisions or companies, and (4) they are not easy to analyze. In fact, most of the examples that I have used in this series are about transaction data. Now, No. 3 here can be really troublesome, as it equates to availability (or the lack thereof). Yes, you may know everything that happened with your customers, but do you know where else they are shopping? Fortunately, there are co-op companies that can answer that question, as they compile transaction data across multiple merchants and sources. And combined data can be exponentially more powerful than data in silos. Now, because transaction data are not always available for every person in a database, analysts often combine behavioral data and descriptive data in their models. Transaction data usually become the dominant predictors in such cases, while descriptive data play supporting roles, filling in the gaps and smoothing out the predictive curves.

As I have stated repeatedly, predictive analytics in marketing is all about finding out (1) whom to engage, and (2) if you decide to engage someone, what to offer to that person. Using carefully collected transaction data for most of their customers, some supermarket chains have achieved 100 percent customization rates for their coupon books. That means no two coupon books are exactly the same, which is quite an impressive accomplishment. That is all transaction data in action, and it is a great example of “Big Data” (or rather, “Smart Data”).

Attitudinal Data
In the past, attitudinal data came from surveys, primary research and focus groups. Now, basically all social media channels function as gigantic focus groups. Through virtual places such as Facebook, Twitter and other social media networks, people freely volunteer what they think and feel about certain products and services, and many marketers are learning how to “listen” to them. Sentiment analysis falls under this category of analytics, and many automatically think of this type of analytics when they hear “Big Data.”

The trouble with social data is:

  1. We often do not know who’s behind the statements in question, and
  2. They are in silos, and it is not easy to combine such data with transaction or demographic data, due to the lack of identifiable sources.

Yes, we can see that a certain political candidate is trending high after an impressive speech, but how would we connect that piece of information to those who will actually donate money to the candidate’s causes? If we can find out “where” the target is via an IP address and related ZIP codes, we may be able to connect the voter to geo-demographic data, such as the Census. But, generally, personally identifiable information (PII) is only accessible by the data compilers, if they even bothered to collect it.

Therefore, most such studies stay on a macro level, citing trends and directions, and the analysts in that field are quite different from the micro-level analysts who deal with behavioral and descriptive data. The former provide important insights regarding the “why” part of the equation, which is often the hardest thing to predict, while the latter provide answers to “who, what, where and when.” (“Who” is the easiest to answer, and “when” is the hardest.) The “why” part may dictate the product-development side of the decision-making process at the conceptual stage (as in, “Why would customers care for a new type of dishwasher?”), while “who, what, where and when” are more about selling the developed products (as in, “Let’s sell those dishwashers in the most effective ways.”). So, it can be argued that these different types of data call for different types of analytics at different stages of the decision-making process.

Obviously, there are more types of data out there. But for marketing applications dealing with humans, these three types of data complete the buyers’ portraits. Now, depending on what marketers are trying to do with the data, they can prioritize where to invest first and what to ignore (for now). If they are early in the marketing cycle, trying to develop a new product for the future, they need to understand why people want something and behave in certain ways. If signing up as many new customers as possible is the immediate goal, finding out who and where the ideal prospects are becomes the most pressing task. If maximizing customer value is the ongoing objective, then you’d better start analyzing transaction data more seriously. If preventing attrition is the goal, then you will have to line up the transaction data in time-series format for further analysis.

The business goals must dictate the analytics, the analytics call for specific types of data to meet the goals, and the supporting datasets should be in “analytics-ready” formats. Not the other way around, where businesses are dictated by the limitations of analytics, and analytics are hampered by inadequate, cluttered data. That type of business-oriented hierarchy should be the main theme of effective data management, and with clear goals and a proper data strategy, you will know where to invest first and what data to ignore as a decision-maker, not necessarily as a mathematical analyst. And that is the first step toward making Big Data smaller. Don’t be impressed by the sheer size of the data; size often blurs the big picture, and not all data are created equal.

It’s All About Ranking

The decision-making process is really all about ranking. As a marketer, to whom should you be talking first? What product should you offer through what channel? As a businessperson, whom should you hire among all the candidates? As an investor, what stocks or bonds should you purchase? As a vacationer, where should you visit first?

Yes, “choice” is the keyword in all of these questions. And if you picked Paris over other places as an answer to the last question, you just made a choice based on some ranking order in your mind. The world is big, and there could have been many factors that contributed to that decision, such as culture, art, cuisine, attractions, weather, hotels, airlines, prices, deals, distance, convenience, language, etc., and I am pretty sure that not all factors carried the same weight for you. For example, if you put more weight on “cuisine,” I can see why London would lose a few points to Paris in that ranking order.

As a citizen, for whom should I vote? That’s the choice based on your ranking among candidates, too. Call me overly analytical (and I am), but I see the difference in political stances as differences in “weights” for many political (and sometimes not-so-political) factors, such as economy, foreign policy, defense, education, tax policy, entitlement programs, environmental issues, social issues, religious views, local policies, etc. Every voter puts different weights on these factors, and the sum of them becomes the score for each candidate in their minds. No one thinks that education is not important, but among all these factors, how much weight should it receive? Well, that is different for everybody; hence, the political differences.

I didn’t bring this up to start a political debate, but rather to point out that the decision-making process is based on ranking, and the ranking scores are made up of many factors with different weights. And that, in a nutshell, is how statistical models are designed (so, does that mean the models are “nuts”?). Analysts call those factors “independent variables,” which describe the target.

In my past columns, I talked about the importance of statistical models in the age of Big Data (refer to “Why Model?”), and why marketing databases must be “model-ready” (refer to “Chicken or the Egg? Data or Analytics?”). Now let’s dig a little deeper into the design of the “model-ready” marketing databases. And surprise! That is also all about “ranking.”

Let’s step back into the marketing world, where folks are not easily offended by the subject matter. If I gave you a spreadsheet that contains thousands of leads for your business, you wouldn’t easily be able to tell which ones are the “Glengarry Glen Ross” leads that came from downtown, along with those infamous steak knives. What choice would you have then? Call everyone on the list? I guess you could start picking names out of a hat. If you think a little more about it, you might filter the list by first name, as names may reflect the decade in which their owners were born. Or start calling folks who live in towns that sound affluent. Heck, you could start calling them in alphabetical order, but the point is that you would “sort” the list somehow.

Now, if the list came with some other valuable information, such as income, age, gender, education level, socio-economic status, housing type, number of children, etc., you would be able to pick and choose which variables to use to sort the list. You might start calling the high-income folks first. Not all product sales are positively related to income, but it is an easy way to start the process. Then, you would throw in other variables to break the ties in rich areas. I don’t know what you’re selling, but maybe you would want folks who live in a single-family house with kids. And sometimes, your “gut” feeling may lead you to the right place. But only sometimes. And only when the size of the list is not in the millions.

If the list was not for prospecting calls, but for a CRM application where you also need to analyze past transaction and interaction history, the list of the factors (or variables) that you need to consider would be literally nauseating. Imagine the list contains all kinds of dollars, dates, products, channels and other related numbers and figures in a seemingly endless series of columns. You’d have to scroll to the right for quite some time just to see what’s included in the chart.

In situations like that, how nice would it be if some analyst threw in just two model scores, say, for responsiveness to your product and for the potential value of each customer? The analyst may have considered hundreds (or thousands) of variables to derive such scores for you, and all you need to know is that the higher the score, the more likely the lead will be responsive or carry higher potential value. For your convenience, the analyst may have converted all those numbers with many decimal places into easy-to-understand 1-10 or 1-20 scales. That would be nice, wouldn’t it? Now you can just start calling the folks in model group No. 1.
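
As one possible illustration of that last conversion (certainly not the only way to do it), raw scores can be bucketed into a 1-10 scale in Python, with group No. 1 holding the highest scores:

```python
import numpy as np
import pandas as pd

# Raw model scores with many decimal places (randomly generated here for illustration).
scores = pd.Series(np.random.default_rng(0).random(1000), name="responsiveness_score")

# Cut into 10 equal-sized groups; label them 10 (lowest scores) down to 1 (highest).
model_group = pd.qcut(scores, q=10, labels=range(10, 0, -1)).astype(int)
print(model_group.value_counts().sort_index())
```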

But let me throw in a curveball here. Let’s go back to the list with all those transaction data attached, but without the model scores. You may say, “Hey, that’s OK, because I’ve been doing alright without any help from a statistician so far, and I’ll just use the past dollar amount as their primary value and sort the list by it.” And that is a fine plan, in many cases. Then, when you look deeper into the list, you find out there are multiple entries for the same name all over the place. How can you sort the list of leads if the list is not even on an individual level? Welcome to the world of relational databases, where every transaction deserves an entry in a table.

Relational databases are optimized to store every transaction and retrieve it efficiently. In a relational database, tables are connected by match keys, and many times, tables are connected in what we call “1-to-many” relationships. Imagine a shopping basket. There is a buyer, and we need to record the buyer’s ID number, name, address, account number, status, etc. Each buyer may have multiple transactions, and for each transaction, we now have to record the date, dollar amount, payment method, etc. Further, if the buyer puts multiple items in the shopping basket, that transaction, in turn, is in yet another 1-to-many relationship with the item table. You see, in order to record everything that just happened, this relational structure is very useful. If you are the person who has to create the shipping package, yes, you need to know all the item details, the transaction value and the buyer’s information, including the shipping and billing addresses. Database designers love this completeness so much, they even call this structure the “normal” form.
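
To picture those 1-to-many relationships, here is a toy version of that normalized structure in Python; the tables and field names are invented purely for illustration:

```python
import pandas as pd

# One buyer -> many transactions -> many items: the "normal" relational shape.
buyers = pd.DataFrame({"buyer_id": [1], "name": ["Jane Doe"]})
transactions = pd.DataFrame({
    "txn_id": [10, 11],
    "buyer_id": [1, 1],
    "txn_date": pd.to_datetime(["2024-03-01", "2024-04-02"]),
})
items = pd.DataFrame({
    "txn_id": [10, 10, 11],
    "sku": ["golf-balls", "tees", "glove"],
    "price": [25.0, 5.0, 18.0],
})

# Joining everything back up yields one row per item purchased -- great for
# shipping a package, not so great for describing the buyer.
full = buyers.merge(transactions, on="buyer_id").merge(items, on="txn_id")
print(full)
```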

But the trouble with the relational structure is that each line describes transactions or items, not the buyers. Sure, one can “filter” people out by interrogating every line in the transaction table, as in “select buyers who had any transaction over $100 in the past 12 months.” That is what I call rudimentary filtering. But once we start asking complex questions such as, “What is the buyer’s average transaction amount for the past 12 months in the outdoor sports category, and what is the overall future value of the customer through online channels?” then you will need what we call “buyer-centric” portraits, not transaction- or item-centric records. Better yet, if I ask you to rank every customer in order of such future value, well, good luck doing that when all the tables describe transactions, not people. That would be exactly like the case where you have multiple lines for one individual when you need to sort the leads from high value to low.

So, how do we remedy this? We summarize the database on an individual level if we want to sort the leads on an individual level. If the goal is to rank households, email addresses, companies, business sites or products, then the summarization should be done on those levels, too. Database designers call this the “de-normalization” process, and the tables tend to get “wide” along the way, but it is a necessary step in order to rank the entities properly.
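
A minimal sketch of that de-normalization step, assuming a toy transaction table with invented column names, might look like this in Python:

```python
import pandas as pd

# Transaction-centric table: one row per transaction.
txn = pd.DataFrame({
    "buyer_id": [101, 101, 102, 102, 103],
    "channel":  ["online", "retail", "online", "online", "retail"],
    "category": ["outdoor", "outdoor", "gift", "outdoor", "gift"],
    "amount":   [80.0, 150.0, 40.0, 65.0, 25.0],
})

# "De-normalize" into a buyer-centric portrait: one row per buyer, wide columns.
buyer = txn.pivot_table(
    index="buyer_id",
    columns=["channel", "category"],
    values="amount",
    aggfunc=["sum", "count"],
    fill_value=0,
)
buyer.columns = ["_".join(map(str, c)) for c in buyer.columns]  # flatten the wide column names
print(buyer)
```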

Now, the starting point for all such summarization is a proper identification number for each of those levels. It won’t be possible to summarize any table on a household level without a reliable household ID. One may think that such things are a given, but I would have to disagree. I’ve seen so many so-called “state-of-the-art” (another cliché that makes me nauseous) databases that do not have consistent IDs of any kind. If your database managers say they are using “plain name” or “email address” fields for matching or summarization, be afraid. Be very afraid. For starters, you know how many email addresses one person may have. To add to that, consider how many people move around each year.

Things get worse in regard to ranking by model scores when it comes to “unstructured” databases. We see more and more of those, as the data sources are getting into uncharted territories, and the size of the databases is growing exponentially. There, all these bits and pieces of data are sitting on mysterious “clouds” as entries on their own. Here again, it is one thing to select or filter based on collected data, but ranking based on some statistical modeling is simply not possible in such a structure (or lack thereof). Just ask the database managers how many 24-month active customers they really have, considering a great many people move in that time period and change their addresses, creating multiple entries. If you get an answer like “2 million-ish,” well, that’s another scary moment. (Refer to “Cheat Sheet: Is Your Database Marketing Ready?”)

In order to develop models using variables that are descriptors of customers, not transactions, we must convert those relational or unstructured data into a structure that matches the level by which you would like to rank the records. Even temporarily. As databases are getting bigger and bigger and storage is getting cheaper and cheaper, I’d say that the temporary time period could be, well, indefinite. And because the word “data-mart” is overused and confusing to many, let me just call that place the “Analytical Sandbox.” Sandboxes are fun, and yes, all kinds of fun stuff for marketers and analysts happens there.

The Analytical Sandbox is where samples are created for model development; actual models are built; models are scored for every record—no matter how many there are—without hiccups; targets are easily sorted and selected by model scores; reports are created in meaningful and consistent ways (consistency is even more important than sheer accuracy in what we do); and analytical languages such as SAS, SPSS or R are spoken without being frowned upon by other computing folks. Here, analysts will spend their time pondering target definitions and methodologies, not database structures and incomplete data fields. Have you heard of the fancy term “in-database scoring”? This is where that happens, too.

And what comes out of the Analytical Sandbox and goes back into the world of relational or unstructured databases—IT folks often ask this question—is going to be very simple. Instead of having to move mountains of data back and forth, all the variables will be in the form of model scores, providing answers to marketing questions, without any missing values (by definition, every record can be scored by models). While the scores pack tons of information into them, their sizes can be as small as a couple of bytes each, or even less. Even if you carry over a few hundred affinity scores for 100 million people (or any other type of entity), I wouldn’t call the resultant file large, as it would be about the size of a few video files, really.

In my future columns, I will explain how to create model-ready (and human-ready) variables using all kinds of numeric, character or free-form data. In Exhibit A, you will see what we call traditional analytical activities colored in dark blue on the right-hand side. In order to make those processes really hum, we must follow all the steps on the left-hand side of that big cylinder in the middle. This is where garbage-in-garbage-out situations are prevented: all the data get collected in a uniform fashion; properly converted, edited and standardized by uniform rules; categorized based on preset meta-tables; consolidated with consistent IDs; summarized to the desired levels; and turned into meaningful variables for more advanced analytics.

Even more than statistical methodologies, consistent and creative variables in the form of “descriptors” of the target audience make or break the marketing plan. Many people think that purchasing expensive analytical software will provide all the answers. But lest we forget, fancy software only covers the right-hand side of Exhibit A, not all of it. Creating a consistent template for all useful information in a uniform fashion is the key to maximizing the power of analytics. If you look into any modeling bakeoff in the industry, you will see that the differences between methodologies are measured in fractions. Conversely, inconsistent and incomplete data create disasters in the real world. And in many cases, companies can’t even attempt advanced analytics while sitting on mountains of data, due to structural inadequacies.

I firmly believe the Big Data movement should be about

  1. getting rid of the noise, and
  2. providing simple answers to decision-makers.

Bragging about the size and speed elements alone will not bring us to the next level, which is to “humanize” the data. At the end of the day (another cliché that I hate), it is all about supporting the decision-making processes, and the decision-making process is all about ranking different options. So, in the interest of keeping it simple, let’s start by creating an analytical haven where all those rankings become easy (call it that, in case you think “sandbox” sounds too juvenile).

Cheat Sheet: Is Your Database Marketing Ready?

Many data-related projects end up as big disappointments. And, in many cases, it is because they did not have any design philosophy behind them. Because many folks are more familiar with buildings and cars than geeky databases, allow me to use them as examples here.

Imagine someone started constructing a building without a clear purpose. What is it going to be? An office building or a residence? If residential, for how many people? For a family, or for 200 college kids? Are they going to just eat and sleep in there, or are they going to engage in other activities in it? What is the budget for development and ongoing maintenance?

If someone starts building a house without answering these basic questions, well, it is safe to say that the guy who commissioned such a project is not in the right state of mind. Then again, he may be a filthy rich rock star with some crazy ideas. But let us just say that is an exceptional case. Nonetheless, surprisingly, a great many database projects start out exactly this way.

Just like a house is not just a sum of bricks, mortar and metal, a database is not just a sum of data, and there has to be design philosophy behind it. And yet, many companies think that putting all available data in one place is just good enough. Call it a movie without a director or a building without an architect; you know and I know that such a project cannot end well.

Even when a professional database designer gets involved, too often the project goes out of control, as the business requirement document ends up being a summary of everyone’s wish lists, without any prioritization or filtering. It becomes, again, a movie without a director. The goal becomes something like “a database that stores all conceivable marketing, accounting and payment activities, handling both prospecting and customer relationship management through all conceivable channels, including face-to-face sales and lead management for big accounts. And it should include both domestic and international activities, and the update has to be done in real time.”

Really. Someone in that organization must have attended a database marketing conference recently to get all of that listed. It might be simpler and cheaper to build a 2-ton truck that flies. But before we commission something like this from the get-go, shall we discuss why the truck has to fly, too? For one, if you want real-time updates, do you have a business case for them? (As in, someone in the field must make real-time decisions with real-time data.) Or do you just fancy a large object moving really fast?

Companies that primarily sell database tools often do not help the matter, either. Some promise that the tool sets will categorize all kinds of input data, based on some auto-generated meta-tables. (Really?) The tool will clean the data automatically. (Is it a self-cleaning oven?) The tool will establish key links (by what?), build models on its own (with what target data?), deploy campaigns (every Monday?), and conduct result analysis (with responses from all channels?).

All these capabilities sound really wonderful, but does that system set long- and short-term marketing goals for you, too? Does it understand the subtle nuances in human behaviors and intentions?

Sorry for being a skeptic here. But in such cases, I think someone watched “Star Trek” too much. I have never seen a company that does not regret spending seven figures on a tool set that was supposed to do everything. Do you wonder why? It is not because such activities cannot be automated, but because:

  1. Machines do not think for us (not quite yet); and
  2. Such a system is often very expensive, as it needs to cover all contingencies (the opposite of “goal-oriented” cheaper options).

So it becomes nearly impossible to justify the cost with incremental improvements in marketing efficiency. Even if the response rates double, all related marketing costs go down by a quarter, and revenue jumps by 200 percent, there are not many companies that can easily justify that kind of up-front spending.

Worse yet, imagine that you just paid 10 times more for some factory-made suit than you would have paid for a custom-made Italian suit. Since when is an automated, cookie-cutter answer more desirable than a custom-tailored one? Ever since computing and storage costs started to go down significantly, and even more so in this age of Big Data, with its “everything, all the time” mentality.

But let me ask you again: Do you really have a marketing database?

Let us just say that I am a car designer. A potential customer who has been doing a lot of research on the technology front presents me with a spec for a vehicle that is as big as a tractor-trailer and as quick as a passenger car. I guess that someone really needs to move lots of stuff, really fast. Now, let us assume that it will cost about $8 million or more to build a car like that, and that estimate is without the rocket booster (ah, my heart breaks). If my business model is to take a percentage out of that budget, I would say, “Yeah sure, we can build a car like that for you. When can we start?”

But let us stop for a moment and ask why the client would “need” (not “want”) a car like that in the first place. After some user interviews and prioritization, we may collectively conclude that a fleet of full-size vans can satisfy 98 percent of the business needs, saving about $7 million. If that client absolutely and positively has to get to that extra 2 percent to satisfy every possible contingency in his business and spend that money, well, that is his prerogative, is it not? But I have to ask the business questions first before initiating that inevitable long and winding journey without a roadmap.

Knowing exactly what the database is supposed to be doing must be the starting point. Not “let’s just gather everything in one place and hope to God that some user will figure something out eventually.” Also, let’s not forget that constantly adding new goals in any phase of the project will inevitably complicate the matter and increase the cost.

Conversely, repurposing a database designed for some other goal will cause lots of trouble down the line. Yeah, sure, it is possible to move 100 people from A to B with a 2-seater sports car, if you are willing to make lots of quick trips and collect some speeding tickets along the way. But that would not be my first recommendation. Instead, let us look at some real possibilities.

Databases support many different types of activities. So let us name a few:

  • Order fulfillment
  • Inventory management and accounting
  • Contact management for sales
  • Dashboard and report generation
  • Queries and selections
  • Campaign management
  • Response analysis
  • Trend analysis
  • Predictive modeling and scoring
  • Etc., etc.

The list goes on, and some of these databases may already be doing a fine job in many of those areas. But can we safely call them “marketing” databases? Or are marketers simply tapping into the central data repository somehow, making do with lots of blood, sweat and tears?

As an exercise, let me ask a few questions to see if your organization has a functioning marketing database for CRM purposes:

  • What is the average order size per year for customers with tenure of more than one year? —You may have all the transaction data, but maybe not on an individual level in order to know the average.
  • What is the number of active and dormant customers based on the last transaction date? —You will be surprised to find out that many companies do not know exactly how many customers they really have. Beep! 1 million-“ish” is not a good answer.
  • What is the average number of days between activities for each channel for each customer? —With basic transaction data summarized “properly,” this is not a difficult question to answer (see the sketch after this list). But it’s very difficult if there are divisional “channel-centric” databases scattered all over.
  • What is the average number of touches through all channels that you employ before your customer reaches the projected value potential? —This is a hard one. Without all the transaction and contact history by all channels in a “closed-loop” structure, one cannot even begin to formulate an answer for this one. And the “value potential” is a result of statistical modeling, is it not?
  • What are typical gateway products, and how are they correlated to other product purchases? —This may sound like a product question, but without knowing each customer’s purchase history lined up properly with fully standardized product categories, it may take a while to figure this one out.
  • Are basic RFM data—such as dollars, transactions, dates and intervals—routinely being used in predictive models? —The answer is a firm “no,” if the statisticians are spending the majority of their time fixing the data; and “not even close,” if you are still just using RFM data for rudimentary filtering.
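
As a small illustration of the third question above, here is one possible way to compute the average number of days between activities per customer per channel, assuming a basic activity log (the table and column names are hypothetical):

```python
import pandas as pd

# Hypothetical activity log: one row per contact or transaction.
activity = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "channel":     ["email", "email", "web", "email", "email"],
    "activity_date": pd.to_datetime(["2024-01-01", "2024-01-15", "2024-02-01",
                                     "2024-01-10", "2024-03-10"]),
})

# Average gap, in days, between consecutive activities for each customer and channel.
avg_gap = (
    activity.sort_values("activity_date")
            .groupby(["customer_id", "channel"])["activity_date"]
            .apply(lambda d: d.diff().dt.days.mean())
            .rename("avg_days_between_activities")
)
print(avg_gap)
```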

Now, if your answer is “Well, with some data summarization and inner/outer joins here and there—though we don’t have all transaction records from last year, and if we can get all the campaign histories from all seven vendors who managed our marketing campaigns, except for emails—maybe?”, then I am sorry to inform you that you do not have a marketing database. Not even if you could eventually get to the answer after some programmer spends two weeks drawing a 7-page flow chart.

Often, I get extra comments like “But we have a relational database!” Or, “We stored every transaction for the past 10 years in Hadoop and we can retrieve any one of them in less than a second!” To these comments, I would say “Congratulations, your car has four wheels, right?”

To answer the important marketing questions, the database should be organized in a “buyer-centric” format. Going back to the database philosophy question, the fundamental design of a database changes based on its main purpose, much like the way a sports sedan and an SUV that share the same wheelbase and engine end up shaped differently.

Marketing is about people. And, at the center of the marketing database, there have to be people. Every data element in the base should be “describing” those people.

Unfortunately, most relational databases are transaction-, channel- or product-centric, describing events and transactions—but not the people. Unstructured databases that are tuned primarily for massive storage and rapid retrieval may just have pieces of data all over the place, necessitating serious rearrangement to answer some of the most basic business questions.

So, the question still stands: Is your database marketing ready? Because if it is, it would have taken you no time to answer the questions listed above and say, “Yeah, I got this. Anything else?”

Now, imagine the difference between marketers who get to the answers with a few clicks vs. the ones who have no clue where to begin, even when sitting on mounds of data. The difference between the two is not the size of the investment, but the design philosophy.

I just hope that you did not buy a sports car when you needed a truck.