Don’t Let Old Habits Dictate Your Marketing Thoughts

When marketers play with data, we often get confined by the limitations of the datasets available to us or, worse, of the tool sets through which we access that data. Some bad habits live on in an organization for multiple generations, as we all get trained in marketing thoughts, at the beginning of our careers, by others who have been doing similar jobs.

(Image credit: Pixabay, by Mohamed Hassan)

When a few iterations of such training go on through a series of onboarding processes, the original intents of data, reporting and analytics get diluted, and the organization ends up using those inherited marketing thoughts just to go through the motions of producing lots of reports that no one cares about or benefits from. I’ve heard some radical claims that the majority of decision-makers today wouldn’t even miss more than half of the automatically generated reports they receive.

We shouldn’t really look at a single report or initiate data-related projects without setting a clear goal first. Often, the most important role of a consultant is to remind clients “why” they should do anything in the first place.

For example, why should we all watch clickthrough rates every day, often locked in a set frame of time parameters? As in, compared to the same time last year, the clickthrough rate went down by 0.8%! The horror! Why do marketers make a big fuss about it, when the clickthrough rate is just one of many indicators, not even the most effective one at that, of actual purchases? Because someone in the past set the KPI reports up that way?

In other words, sometimes marketers, and the analysts who help them, need to be reminded that the goal is to sell more things and retain customers, not to live and die by open and clickthrough rates. I am not flatly dismissing those important metrics at all; I’m just pointing out that we need a goal-oriented mindset when dealing with data and analytics. Otherwise, we end up in a maze of metrics and activities that do not really help us achieve organizational goals.

What are those ultimate goals? Not that I want to be a smart ass who would answer “From Earth” to the innocuous question “Where are you from?”, but let’s really go up to that high level for a moment: we play with data (1) to increase revenue, or (2) to decrease cost. And since Profit = Revenue - Cost, we can reduce this whole thing to just one goal: increase the profit.

Why am I pointing out the obvious? Because I’ve seen too many data players who just go through the motions without questioning the original intent of the activity or its key metrics, blindly believing that all that hard work will somehow lead to success. Unfortunately, that is far from the truth.

If you ran down the aisle of an airplane midflight, would you get to the destination any sooner? Definitely not. In fact, the captain might even return to the originating airport to drop such a crazy person off, further elongating the journey.

You may think this analogy is silly, but in the world of data and analytics, such detours happen all the time. All because no one questioned how and why an activity set in motion in the distant past would continue to help achieve long- and short-term organizational goals, especially when goals must be constantly adjusted in ever-changing business environments. Nothing in scientific activities, no marketing thoughts, should be carved in stone.

That is why the first question from a seasoned consultant should be what the organization’s long- and short-term goals are. Okay, we can all easily agree that we are in this data and analytics game to increase profit, but what are the specific goals, and what are the immediate pain points? Of course, like any good doctor, a consultant must remedy immediate pain points first. But what do we call doctors who make the patient’s condition worse just to relieve immediate pain? We call them quacks.

Bringing this discussion back to the world of marketing, having clear long- and short-term goals for every data and analytical activity is a must. If you do that, you may never need an expensive consultant just to remind you that you are wasting resources digging in the wrong places. Clear business goals beget proper problem statements (not just lists of symptoms and wishes), which beget appropriate measurement metrics, which in turn lead us to the proper digging points in terms of data and methodology, minimizing the time and energy wasted on the way to predetermined goals. In short, we can avoid lots of mishaps and detours just by remembering the original intents of data and analytics endeavors.

Don’t Do It Just Because You Can

Don’t do it just because you can. No kidding. By the way, I could have gone with Ben Parker’s “With great power comes great responsibility” line, but I didn’t, as it has become an over-quoted cliché. Plus, I’m not much of a fan of “Spiderman.” Actually, I’m kidding this time. (Not about the “Spiderman” part, as I’m more of a fan of “Thor.”) But the real reason is that any geek with moderate coding skills, or any overzealous marketer with access to some data, can do real damage to real human beings without any superpowers to speak of. I wouldn’t go so far as to call it permanent damage, but I must say that some marketing messages and practices are really annoying and invasive. Enough to classify them as “junk mail” or “spam.” Yeah, I said that, knowing full well that those words are forbidden in the industry in which I built my career.

All jokes aside, I received a call from my mother a few years ago asking me if an “urgent” letter saying her car warranty would expire if she did not act “right now” (along with a few exclamation marks) was something to which she must respond immediately. Many of us by now are impervious to such fake urgencies or outrageous claims (like “You’ve just won $10,000,000!!!”). But I then realized that there still are plenty of folks who would spend their hard-earned dollars based on such misleading messages. What really made me mad, other than the fact that my own mother was involved, was that someone must have targeted her based on her age, ethnicity, housing value and, of course, the make and model of her automobile. I’ve been doing this job for too long to be unaware of the data variables and techniques that must have played a part for my mother to receive a series of such letters. Basically, some jerk must have created a segment that could be named “old and gullible.” Without a doubt, this is a classic example of what should not be done just because one can.

One might dismiss it as an isolated case of a questionable practice by questionable individuals with questionable moral integrity, but can we honestly say that? I, who know the ins and outs of direct marketing practices quite well, have fallen into traps more than a few times, where a supposedly one-time order mysteriously turned into a continuity program without my consent, followed by an extremely cumbersome cancellation process. Further, when I receive calls or emails from shady merchants with dubious offers, I can very well assume my information changed hands in very suspicious ways, if not through outright illegal routes.

Even without the criminal elements, as data become more ubiquitous and targeting techniques become more precise, an accumulation of seemingly inoffensive actions by innocuous data geeks can cause a big ripple in the offline (i.e., “real”) world. I am sure many of my fellow marketers remember the news about a reputable retail chain a few years ago that accurately predicted pregnancy in households based on product purchase patterns and sent customized marketing messages featuring pregnancy-related products accordingly. It subsequently became a big controversy, as such a targeted message was how one particular head of household found out his teenage daughter was indeed pregnant. An unintended consequence? You bet.

I actually saw the instigating statisticians present at a predictive analytics conference before the whole incident hit the wire. At the time, the presenters were unaware of the consequences of their actions, so they proudly shared the methodologies they employed with the audience. But when I heard what they were actually trying to predict, I immediately turned my head to look at the lead statistician on my then-analytical team sitting next to me, and saw on her face the same concerned look that I must have had on mine. And our concern was definitely not about the techniques, as we knew how to do the same when provided with similar sets of data. It was about the human consequences that such a prediction could bring, not just to the eventual targets, but also to the predictors and their fellow analysts in the industry, who would all be lumped together as evil scientists by outsiders. In predictive analytics, there is a price for being wrong; and at times, there is a price to pay for being right, too. Like I said, we shouldn’t do things just because we can.

Analysts do not have superpowers individually, but when technology and ample amounts of data are conjoined, the results can be quite influential and powerful, much like the way bombs can be built with common materials available at any hardware store. Ironically, I have been evangelizing all this time that data and technology should be wielded together to make big and dumb data smaller and smarter. But providing answers to decision-makers in ready-to-use formats, hence “humanizing” the data, may have its downside, too. Simply put, “easy to use” can easily become “easy to abuse.” After all, humans are fallible creatures with ample amounts of greed and ambition. Even without any obvious bad intentions, it is sometimes very difficult to contemplate all angles, especially regarding those sensitive and squeamish humans.

I talked about the social consequences of the data business last month (refer to “How to Be a Good Data Scientist”), and that is why I emphasized that anyone who is about to get into this data field must possess a deep understanding of both technology and human nature. That little sensor in your stomach that tells you “Oh, I have a bad feeling about this” may not come to everyone naturally, but we all need to be equipped with such safeguards, like angels on our shoulders.

Hindsight is always 20/20, but apparently, those smart analysts who did that pregnancy prediction only thought about the techniques and the bottom line, but did not consider all the human factors. And they should have. Or, if not them, their manager should have. Or their partners in the marketing department should have. Or their public relations people should have. Heck, “someone” in their organization should have, alright? Just like we do not casually approach a woman on the street who “seems” pregnant and say “You must be pregnant.” Only socially inept people would do that.

People consider certain matters extremely private, in case some data geeks didn’t realize that. If I might add, the same goes for ailments such as erectile dysfunction or constipation, or any other personal business related to body parts that are considered private. Unless you are a doctor in an examining room, don’t say things like “You look old, so you must have a hard time having sex, right?” It is already bad enough that we can’t even watch golf tournaments on TV without those commercials that assume golf fans need help in that department. (By the way, having “two” bathtubs “outside” the house at dusk doesn’t make any sense either, when the effect of the drug can last for hours, for heaven’s sake. Maybe the man lost interest because the tubs were too damn heavy?)

While it may vary from culture to culture, we all have some understanding of social boundaries in casual settings. When you are talking to a complete stranger on a plane ride, for example, you know exactly how much information you would feel comfortable sharing with that person. And when someone crosses the line, we call that person inappropriate, or “creepy.” Unfortunately, that creepy line is set differently for each person we encounter (I am sure people like George Clooney or Scarlett Johansson have a really high threshold for what might be considered creepy), but I think we can all agree that such a shady area can at least be loosely defined. Therefore, when we deal with large amounts of data affecting a great many people, imagine a rather large common area of such creepiness/shadiness, and do not ever cross it. In other words, when in doubt, don’t go for it.

Now, as a lifelong database marketer, I am not advocating for over-the-top privacy zealotry either, as most such zealots do not understand the nature of data work and can’t tell the difference between informed (and mutually beneficial) messages and Big Brother-like nosiness. This targeting business is never about looking up an individual’s record one at a time, but about finding correlations between users and products and doing some good matchmaking in mass numbers. In other words, we don’t care what questionable sites anyone visits, and honest data players would not steal or abuse information with bad intent. I have heard about waiters who steal credit card numbers from their customers with swiping devices, but would you condemn the entire restaurant industry for that? Yes, there are thieves in any part of society, but not all data players are hackers, just like not all waiters are thieves. Statistically speaking, much like flying being the safest form of travel, I can even argue that handing over your physical credit card to a stranger is more dangerous than entering the credit card number on a website. It just looks much worse when things go wrong online, as such incidents affect a great many people all at once, just like when a plane crashes.

Years back, I used to frequent a Japanese restaurant near my office. The owner, who doubled as the head sushi chef, was not a nosy type, so he waited for more than a year to ask me what I did for a living. He had never heard anything about database marketing, direct marketing or CRM (no “Big Data” on the horizon at that time), so I had to find a simple way to explain what I do. As a sushi chef with some local reputation, I presumed that he would know the personal preferences of many frequently visiting customers (or “high-value customers,” as marketers call them). He may know exactly who likes what kind of fish and which cuts, who doesn’t like raw shellfish, who is allergic to what, who has less of a tolerance for wasabi, or who would indulge in exotic fish roes. When I asked him, his answer was a simple “yes.” Any diligent sushi chef would care for his or her customers that much. And I said, “Now imagine that you can provide such customized services to millions of people, with the help of computers and collected data.” He immediately understood the benefits of using data and analytics, and murmured, “Ah so …”

Now let’s turn the tables for a second here. From the customer’s point of view, yes, it is very convenient for me that my favorite sushi chef knows exactly how I like my sushi. The same goes for the local coffee barista who knows how you take your coffee every morning. Such knowledge is clearly mutually beneficial. But what if those business owners or service providers started asking about my personal finances, or about my grown daughter, in a “creepy” way? I wouldn’t care if they carried the best yellowtail in town or served the best cup of coffee in the world. I would cease all interaction with them immediately. Sorry, they’ve just crossed that creepy line.

Years ago, I had more than a few chances to sit closely with Lester Wunderman, widely known as “The Father of Direct Marketing,” as the venture called I-Behavior, in which I participated as one of the founders, actually originated from an idea on a napkin from Lester and his friends. Having previously worked at an agency that still bears his name, and having only seen him behind a podium until I was introduced to him on one cool autumn afternoon in 1999, meeting him at a small round table and exchanging ideas with the master was like an unknown guitar enthusiast having a jam session with Eric Clapton. What was most amazing was that, at the beginning of the dot.com boom, he was completely unfazed by all those new ideas that were flying around at the time, and he precisely pointed out why most of them would not succeed at all. I do not need to recite early 21st-century history to point out that his prediction was indeed accurate. When everyone was chasing the latest bit of technology for quick bucks, he was at least a decade ahead of all of those young bucks, already thinking about the human side of the equation. Now, I would not reveal his age out of respect, but let’s just say that almost all of the people in his age group would describe the occupations of their offspring as “Oh, she just works on a computer all the time …” I can only wish that I will remain that sharp when I am his age.

One day, Wunderman very casually shared a draft of the “Consumer Bill of Rights for Online Engagement” with a small group of people who happened to be in his office. I was one of the lucky souls who heard about his idea firsthand, and I remember feeling that he was spot-on with every point, as usual. I read it again recently, just as this Big Data hype is reaching its peak, much as the dot.com boom once moved with a force that seemed poised to change the world. In many ways, such tidal waves do end up changing the world. But lest we forget, such shifts inevitably affect living, breathing human beings along the way. And for any movement guided by technology to sustain its velocity, the people at the helm of the enabling technology must stay sensitive to the needs of the rest of the human collective. In short, there is not much to gain by annoying and frustrating the masses.

Allow me to share Lester Wunderman’s “Consumer Bill of Rights for Online Engagement” verbatim, as it appeared in the second edition of his book “Being Direct”:

  1. Tell me clearly who you are and why you are contacting me.
  2. Tell me clearly what you are—or are not—going to do with the information I give.
  3. Don’t pretend that you know me personally. You don’t know me; you know some things about me.
  4. Don’t assume that we have a relationship.
  5. Don’t assume that I want to have a relationship with you.
  6. Make it easy for me to say “yes” and “no.”
  7. When I say “no,” accept that I mean not this, not now.
  8. Help me budget not only my money, but also my TIME.
  9. My time is valuable, don’t waste it.
  10. Make my shopping experience easier.
  11. Don’t communicate with me just because you can.
  12. If you do all of that, maybe we will then have the basis for a relationship!

So, after more than 15 years of the so-called digital revolution, how many of these rights are we violating almost routinely? Based on the look of my inboxes and the sites that I visit, quite a few, and all the time. As I mentioned in my earlier article, “The Future of Online is Offline,” I really get offended when even seasoned marketers use terms like “online person.” I do not become an online person simply because I happened to stumble onto some stupid website and forgot to uncheck some pre-checked boxes. I am not some casual object at which the email division of a company can shoot to meet its top-down sales projections.

Oh, and good luck with that kind of mindless mass emailing; your base will soon be saturated, and you will learn that irrelevant messages are bad for the senders, too. Proof? How is it that the conversion rate of a typical campaign has not increased dramatically during the past 40 years or so? Forget about open or clickthrough rates; pay attention to the good old conversion rate. You know, the one that measures actual sales. Don’t we have superior databases and technologies now? Why is anyone still bragging about mailing “more” in this century? Have you heard about “targeted” or “personalized” messages? Aren’t there lots and lots of toolsets for that?

As the technology advances, it becomes that much easier and faster to offend people. If the majority of data handlers continue to abuse their power, stemming from the data in their custody, the communication channels will soon run dry. Or worse, if abusive practices continue, the whole channel could be shut down by some legislation, as we have witnessed in the downfall of the outbound telemarketing channel. Unfortunately, a few bad apples will make things a lot worse a lot faster, but I see that even reputable companies do things just because they can. All the time, repeatedly.

Furthermore, in this day and age of abundant data, not offending someone or not violating rules isn’t good enough. In fact, to paraphrase comedian Chris Rock, only losers brag about doing things that they are supposed to do in the first place. The direct marketing industry has long bragged about the self-governing nature of its tightly knit (and often incestuous) network, but as tools get cheaper and sharper by the day, we all need to be even more careful wielding this data weaponry. Because someday soon, we as consumers will be seeing messages everywhere around us, maybe even projected directly onto our retinas, not just in our inboxes. Personal touch? Yes, in the creepiest way, if done wrong.

Visionaries like Lester Wunderman were concerned about the abusive nature of online communication from the very beginning. We should all read his words again, and think twice about social and human consequences of our actions. Google from its inception encapsulated a similar idea by simply stating its organizational objective as “Don’t be evil.” That does not mean that it will stop pursuing profit or cease to collect data. I think it means that Google will always try to be mindful about the influences of its actions on real people, who may not be in positions to control the data, but instead are on the side of being the subject of data collection.

I am not saying all of this out of some romantic altruism; rather, I am emphasizing the human side of the data business to preserve the forward momentum of the Big Data movement, even though I do not care for its name. Because I still believe, even from a consumer’s point of view, that a great amount of efficiency can be achieved by using data and technology properly. No one can deny that modern life in general is much more convenient thanks to them. We do not often get lost on the streets, we can translate foreign languages on the fly, and we can talk to people on the other side of the globe while looking at their faces. We are much better informed about the products and services we care about, and we can look up and order anything we want while walking down the street. And heck, we get suggestions before we even think about what we need.

But we can think of many negative effects of data, as well. It goes without saying that the data handlers must protect the data from falling into the wrong hands, which may have criminal intentions. Absolutely. That is like banks having to protect their vaults. Going a few steps further, if marketers want to retain the privilege of having ample amounts of consumer information and use such knowledge for their benefit, do not ever cross that creepy line. If the Consumer’s Bill of Rights is too much for you to retain, just remember this one line: “Don’t be creepy.”

Smart Data – Not Big Data

As a concerned data professional, I am already plotting an exit strategy from this Big Data hype. Because like any bubble, it will surely burst. That inevitable doomsday could be a couple of years away, but I can feel it coming. At the risk of sounding too much like Yoda the Jedi Grand Master: hype leads to over-investment, over-investment leads to disappointment, and disappointment leads to blame. Yes, in a few years, lots of blame will go around, and lots of heads will roll.

So, why would I stay on the troubled side? Well, because, for now, this Big Data thing is creating lots of opportunities, too. I am writing this on my way back from Seoul, Korea, where I presented this Big Data idea nine times in just two short weeks, trotting from large venues to small gatherings. Just a few years back, I used to have a hard time explaining what I do for a living. Now, I just have to say, “Hey, I do this Big Data thing,” and the doors start to open. In my experience, this is the best “Open Sesame” moment for all data specialists. But it will last only if we play it right.

Nonetheless, I also know that I will somehow continue to make a living setting data strategies, fixing bad data, designing databases and leading analytical activities, even after the hype cools down. Just with a different title, under a different banner. I’ve seen buzzwords come and go, and this data business has been carried on by the people who cut through each hype (and the gargantuan amount of BS that comes along with it) and create real revenue-generating opportunities. At the end of the day (I apologize for using this cliché), it is all about the bottom line, whether it comes from a revenue increase or a cost reduction. It is never about the buzzwords that may have created the business opportunities in the first place; it has always been about the substance that turned those opportunities into money-making machines. And substance needs no fancy title or buzzword attached to it.

Have you heard Google or Amazon calling themselves “Big Data” companies? They are the ones with sick amounts of data, but they also know that it is not about the sheer amount of data; it is all about the user experience. “Wannabes” who are not able to understand the core values often hang onto buzzwords and hype, as if Big Data, cloud computing or the coding language du jour will come and save the day. But they are just words.

Even the name “Big Data” is all wrong, as it implies that bigger is always better. The 3 Vs of Big Data (volume, velocity and variety) are also misleading. That could be a meaningful distinction for existing data players, but for decision-makers, it creates the notion that size and speed are the ultimate quest. For the users, small is better. They don’t have time to analyze big sets of data. They need small answers in fun-size packages. Plus, why are big and fast new? Since the invention of modern computers, has there been any year when processing speed did not get faster and storage capacity did not get bigger?

Lest we forget, it is the software industry that came up with this Big Data thing. It was created as a marketing tagline. We should have read it as, “Yes, we can now process really large amounts of data, too,” not as, “Big Data will make all your dreams come true.” If you are in the business of selling toolsets, of course, that is how you present your product. If guitar companies keep emphasizing how hard it is to be a decent guitar player, would that help their businesses? It is a lot more effective to say, “Hey, this is the same guitar that your guitar hero plays!” But you don’t become Jeff Beck just because you bought a white Fender Stratocaster with a rosewood neck. The real hard work begins “after” you purchase a decent guitar. However, this obvious connection is often lost in the data business. Toolsets never provide solutions on their own. They may make your life easier, but you’d still have to formulate the question in a logical fashion, and still have to make decisions based on provided data. And harnessing meanings out of mounds of data requires training of your mind, much like the way musicians practice incessantly.

So, before businesspeople even consider venturing into this Big Data hype, they should ask themselves, “Why data?” What are the burning questions that you are trying to solve with the data? If you can’t answer this simple question, then don’t jump into it. Forget about it. Don’t get into it just because everyone else seems to be getting into it. Yeah, it’s a big party, but why are you going there? Besides, if you formulate the question properly, you will often find that you don’t need Big Data at all. In fact, Big Data can be a terrible detour if your question can be answered by “small” data. But that happens all the time, because people approach their business questions through the processes set by the toolsets. Big Data should be about the business, not about the IT or the data.

Smart Data, Not Big Data
So, how do we get over this hype? All too often, perception rules, and a replacement word becomes necessary to summarize the essence of a concept for the general public. In my opinion, “Big Data” should have been “Smart Data.” Piles of unorganized, dumb data aren’t worth a damn thing. Imagine a warehouse full of boxes with no labels, collecting dust since 1943. Would you be impressed with the sheer size of the warehouse? Great, the ark that Indiana Jones procured (or did he?) may be stored in there somewhere. But if no one knows where it is, or what to do with it even if it can be located, who cares?

Then, how do data get smarter? Smart data are bite-sized answers to questions. A thousand variables could have been considered to provide the weather forecast that calls for a “70 percent chance of scattered showers in the afternoon,” but that one line that we hear is the smart piece of data. Not the list of all the variables that went into the formula that created that answer. Emphasizing the raw data would be like giving paints and brushes to a person who wants a picture on the wall. As in, “Hey, here are all the ingredients, so why don’t you paint the picture and hang it on the wall?” Unfortunately, that is how the Big Data movement looks now. And too often, even the ingredients aren’t all that great.

I visit many companies only to find that the databases in question are just messy piles of unorganized and unstructured data. And please do not assume that such disarray is good for my business. I’d rather spend my time harnessing meanings out of data and creating value, not cleaning up someone else’s mess all the time. Really smart data are small, concise, clean and organized. Big Data should only be seen in “behind the scenes” types of documentaries for maniacs, not for everyday decision-makers.

I have already been saying for some time that Big Data must get smaller (refer to “Big Data Must Get Smaller”), and I will repeat it until it becomes a movement of its own. The Big Data movement must be about:

  1. Cutting down the noise
  2. Providing the answers

There is too much noise in the data, and cutting it out is the first step toward making the data smaller and smarter. The trouble is that the definition of “noise” is not static. The rock music that I grew up with was certainly noise to my parents’ generation. In turn, some of the music that my kids listen to is pure noise to me. Likewise, “product color,” which is essential for a database designed for an inventory management system, may or may not be noise if the goal is to sell more apparel items. In such cases, the more important variables could be style, brand, price range and target gender, while color could be peripheral information at best, or even noise (as in, “Uh, she isn’t going to buy just red shoes all the time, is she?”). How do we then determine the difference? First, set clear goals (as in, “Why are we playing with the data to begin with?”), define the goals using logical expressions, and let mathematics take care of it. Then you can drop the noise with conviction, even if it may look important to human minds.
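
To make that concrete, here is a minimal sketch of goal-driven noise cutting in Python, assuming a pandas DataFrame of past transactions. The file name, the binary goal column "purchased_apparel" and the 0.05 cutoff are all hypothetical placeholders, not prescriptions.

    import pandas as pd

    df = pd.read_csv("transactions.csv")          # hypothetical file
    target = "purchased_apparel"                  # assumed binary goal column
    candidates = [c for c in df.columns if c != target]

    # One-hot encode categorical candidates, then rank every column by its
    # absolute correlation with the stated goal.
    encoded = pd.get_dummies(df[candidates]).astype(float)
    scores = encoded.corrwith(df[target]).abs().sort_values(ascending=False)

    # Keep only variables carrying minimal signal toward the goal; the 0.05
    # cutoff is an arbitrary illustration, not a standard.
    keep = scores[scores > 0.05].index.tolist()
    print(f"Kept {len(keep)} of {len(scores)} variables; the rest is noise")

The point is not the particular statistic; it is that the goal, not the inventory system or the toolset, decides which variables survive.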

If we continue down that mathematical path, we reach the second part, which is providing answers to the question. And the smart answers come in the form of yes/no flags, probability figures or some type of score. As in the weather forecast example, the question would be the chance of rain on a certain day, and the answer would be “70 percent.” Statistical modeling is not easy or simple, but it is the essential part of making the data smarter, as models are the most effective way to summarize complex and abundant data into compact forms (refer to “Why Model?”).

Most people do not have degrees in mathematics or statistics, but they all know what to do with a piece of information such as a “70 percent chance of rain” on the day of a company outing. Some may complain that it is not a definite yes/no answer, but all would agree that providing information in this form is more humane than dumping all the raw data onto users. Sales folks are not necessarily mathematicians, but they would certainly appreciate a score attached to each lead, as in “more or less likely to close.” No, that is not a definite answer either, but now salespeople can start calling the leads in order of relative importance to them.
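
As an illustration of what such a score might look like in practice, here is a minimal lead-scoring sketch using logistic regression. The file names and feature columns are hypothetical assumptions, and a real project would involve far more validation than shown here.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    past = pd.read_csv("past_leads.csv")     # hypothetical: leads with known outcomes
    fresh = pd.read_csv("new_leads.csv")     # hypothetical: leads to be scored
    features = ["visits", "emails_opened", "days_since_contact"]  # assumed columns

    # Fit on history, where "closed" is 1 if the lead turned into a sale.
    model = LogisticRegression().fit(past[features], past["closed"])

    # The deliverable is one small number per lead, not the raw data.
    fresh["score"] = model.predict_proba(fresh[features])[:, 1]
    print(fresh.sort_values("score", ascending=False).head(10))

The sorted list at the end is the “smart data”: one humanized number per lead, ready for the call sheet.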

So, all the Big Data players and data scientists must try to “humanize” the data, instead of bragging about the size of the data, making things more complex, and providing irrelevant pieces of raw data to users. Make things simpler, not more complex. Some may think that complexity is their job security, but I strongly disagree. That is a sure way to bring this Big Data movement down to the ground. We are already living in a complex world, and we certainly do not need more complications around us (more on “How to Be a Good Data Scientist” in a future article).

It’s About the Users, Too
On the flip side, the decision-makers must change their attitude about the data, as well.

1. Define the goals first: The main theme of this series has been that the Big Data movement is about the business, not IT or data. But I’ve seen too many business folks who so willingly take a hands-off approach to data. They just fund the database; do not define clear business goals for developers; and hope to God that someday, somehow, some genius will show up and clean up the mess for them. Guess what? That cavalry is never coming if you are not even praying properly. If you do not know what problems you want to solve with data, don’t even get started; you will get nowhere, really slowly, bleeding lots of money and time along the way.

2. Take the data seriously: You don’t have to be a scientist to have a scientific mind. It is not ideal if someone blindly subscribes to anything computers spew out (there is lots of inaccurate information in databases; refer to “Not All Databases Are Created Equal”). But too many people do not take data seriously and continue to follow their gut feelings. Even if the customer profile coming out of a serious analysis does not match your preconceived notions, do not blindly reject it; instead, treat it as a newly found gold mine. Gut feelings are even more overrated than Big Data.

3. Be logical: Illogical questions do not lead anywhere. There is no toolset that reads minds, at least not yet. Even if we get such amazing computers, as seen on “Star Trek” or in other science fiction movies, you would still have to ask questions in a logical fashion for them to be effective. I am not asking decision-makers to learn how to code (or to be like Mr. Spock or his loyal follower, Dr. Sheldon Cooper), but to have some basic understanding of logical expressions and to learn how analysts communicate with computers (see the brief example after this list). This is not a data-geek-vs.-non-geek world anymore; we all have to be a little geekier. Knowing Boolean expressions may not be as cool as being able to throw a curveball, but it is necessary to survive in the age of information overload.

4. Shoot for small successes: Start with a small proof of concept before fully investing in large data initiatives. Even with a small project, one gets to touch all the necessary steps to finish the job. Understanding the flow of information is as important as each specific step, as most breakdowns occur between steps, due to a lack of proper connections. There was the Gemini program before the Apollo missions. Learn how to dock spaceships in space before plotting the chart to the moon. Over-investments are often committed when the discussion is led by IT. Outsource even major components in the beginning, as the initial goal should be mastering the flow of things.

5. Be buyer-centric: No customer is bound by the channel of the marketer’s choice, and yet many businesses act exactly that way. No one is an online person just because she has not yet refused your email promotions (refer to “The Future of Online is Offline”). No buyer is one-dimensional. So get out of brand-, division-, product- or channel-centric mindsets. Even well-designed, buyer-centric marketing databases become ineffective if users are trapped in channel- or division-centric attitudes, as in “These email promotions must flow!” or “I own this product line!” The more data we collect, the more chances marketers gain to impress their customers and prospects. Do not waste those opportunities by imposing your own myopic views on them. The Big Data movement is not there to fortify marketers’ bad habits. Thanks to the size of the data and the speed of machines, we are now capable of disappointing a lot of people really fast.
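
Here is the brief example promised in Item 3: the kind of Boolean expression analysts feed to computers every day, written in Python with pandas. All column names are made up for illustration.

    import pandas as pd

    customers = pd.read_csv("customers.csv")    # hypothetical file

    # "Golfers in the Northeast who bought within the last year, excluding
    # anyone who opted out" -- expressed as one Boolean expression:
    segment = customers[
        (customers["golf_interest"] == 1)
        & (customers["region"] == "Northeast")
        & (customers["months_since_purchase"] <= 12)
        & ~(customers["opted_out"] == 1)
    ]
    print(len(segment), "customers qualify")

A decision-maker who can read a request like this back to an analyst, with its ANDs and NOTs intact, is already geeky enough.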

What Did This Hype Change?
So, what did this Big Data hype change? First off, it changed people’s attitudes about data. Some are no longer afraid of large amounts of information being thrown at them, and some have actually started using it in their decision-making processes. Many realized that we are surrounded by numbers everywhere, not just in marketing, but also in politics, media, national security, health care and the criminal justice system.

Conversely, some people became more afraid, often with good reason. But even more often, people react based on pure fear that their personal information is being actively exploited without their consent. While data geeks are rejoicing in the age of open source and cloud computing, many more are looking at this hype with deep suspicion, and they boldly refuse to store any personal data in those obscure “clouds.” Some people don’t even sign up for E-ZPass, voluntarily staying in the long lane to pay tolls the old, but untraceable, way.

Nevertheless, not all is lost in this hype. The data got really big, and types of data that were previously unavailable, such as mobile and social data, became available to many marketers. Focus groups are now the size of a company’s (or a subject’s) Twitter following. The collection rate of POS (point of service) data has been steadily increasing, and some data players have become virtuosi in using such fresh and abundant data to impress their customers (though some crossed that “creepy” line inadvertently). Different types of data are being used together now, and such merging activities will compound the predictive power even further. Analysts are dealing with less missing data, though no dataset will ever be totally complete. Developers in open source environments are now able to move really fast with new toolsets that run on any device. Simply put, things that our forefathers of direct marketing used to take six months to complete can be done in a few hours, and in the near future, maybe within a few seconds.

And that may be a good thing and a bad thing. If we do this right, without creating too many angry consumers and without burning holes in our budgets, we are in a position to achieve a great many things in terms of predicting the future and making everyone’s lives a little more convenient. If we screw it up badly, we will end up creating lots of angry customers by abusing sensitive data and, at the same time, wasting a whole lot of investors’ money. Then this Big Data thing will go down in history as a great money-eating hype.

We should never do things just because we can; data is a powerful tool that can hurt real people. Do not even get into it if you don’t have a clear goal in terms of what to do with the data; it is not some piece of furniture that you buy just because your neighbor bought one. Living with data is a lifestyle change, and it requires a long-term commitment; it is not some fad that you try once and give up. It is a continuous loop in which people’s responses to marketers’ data-based activities create even more data to be analyzed. And that is the only way it keeps getting better.

There Is No Big Data
And all that has nothing to do with “Big.” If done right, small data can do plenty. And in fact, most companies’ transaction data for the past few years would easily fit in an iPhone. It is about what to do with the data, and that goal must be set from a business point of view. This is not just a new playground for data geeks, who may care more for new hip technologies that sound cool in their little circle.

I recently went to Brazil to speak at a data conference called QIBRAS, and I was pleasantly surprised that its main theme was the quality of the data, not the size of the data. Well, at least somewhere in the world, people are approaching this whole thing without the “Big” hype. And if you look around, you will not find any successful data players calling this thing “Big Data.” They just deal with small and large data as part of their businesses. There is no buzzword, fanfare or big banner there. Because when something is just part of your everyday business, you don’t even care what you call it. You just do. And to those masters of data, there is no Big Data. If Google all of a sudden started calling itself a Big Data company, it would be so uncool, as that word would seriously limit it. Think about that.

Not All Databases Are Created Equal

Not all databases are created equal. No kidding. That is like saying that not all cars are the same, or not all buildings are the same. But somehow, “judging” databases isn’t so easy. First off, there is no tangible “tire” that you can kick when evaluating databases or data sources. Actually, kicking the tire is quite useless, even when you are inspecting an automobile. Can you really gauge the car’s handling, balance, fuel efficiency, comfort, speed, capacity or reliability based on how it feels when you kick “one” of the tires? I can guarantee that your toes will hurt if you kick it hard enough, and even then you won’t be able to tell the tire pressure within 20 psi. If you really want to evaluate an automobile, you will have to sign some papers and take it out for a spin (well, more than one spin, but you know what I mean). Then, how do we take a database out for a spin? That’s when the tool sets come into play.

However, even when the database in question is attached to analytical, visualization, CRM or drill-down tools, it is not so easy to evaluate it completely, as such practice reveals only a few aspects of a database, hardly all of them. That is because such tools are like window treatments of a building, through which you may look into the database. Imagine a building inspector inspecting a building without ever entering it. Would you respect the opinion of the inspector who just parks his car outside the building, looks into the building through one or two windows, and says, “Hey, we’re good to go”? No way, no sir. No one should judge a book by its cover.

In the age of Big Data (you should know by now that I am not too fond of that term), everything digitized is considered data. And data reside in databases. And databases are supposed to be designed to serve specific purposes, just as buildings and cars are. Although many modern databases are just mindless piles of accumulated data, granted a decent and functional design, we can still imagine many different types of databases, depending on their purposes and contents.

Now, most of the Big Data discussions these days are about the platform, environment, or tool sets. I’m sure you heard or read enough about those, so let me boldly skip all that and their related techie words, such as Hadoop, MongoDB, Pig, Python, MapReduce, Java, SQL, PHP, C++, SAS or anything related to that elusive “cloud.” Instead, allow me to show you the way to evaluate databases—or data sources—from a business point of view.

For businesspeople and decision-makers, it is not about NoSQL vs. RDB; it is just about the usefulness of the data. And the usefulness comes from the overall content and database management practices, not just platforms, tool sets and buzzwords. Yes, tool sets are important, but concert-goers do not care much about the types and brands of musical instruments that are being used; they just care if the music is entertaining or not. Would you be impressed with a mediocre guitarist just because he uses the same brand of guitar that his guitar hero uses? Nope. Likewise, the usefulness of a database is not about the tool sets.

In my past column, titled “Big Data Must Get Smaller,” I explained that there are three major types of data, with which marketers can holistically describe their target audience: (1) Descriptive Data, (2) Transaction/Behavioral Data, and (3) Attitudinal Data. In short, if you have access to all three dimensions of the data spectrum, you will have a more complete portrait of customers and prospects. Because I already went through that subject in-depth, let me just say that such types of data are not the basis of database evaluation here, though the contents should be on top of the checklist to meet business objectives.

In addition, throughout this series, I have been repeatedly emphasizing that the database and analytics management philosophy must originate from business goals. Basically, the business objective must dictate the course for analytics, and databases must be designed and optimized to support such analytical activities. Decision-makers—and all involved parties, for that matter—suffer a great deal when that hierarchy is reversed. And unfortunately, that is the case in many organizations today. Therefore, let me emphasize that the evaluation criteria that I am about to introduce here are all about usefulness for decision-making processes and supporting analytical activities, including predictive analytics.

Let’s start digging into key evaluation criteria for databases. This list would be quite useful when examining internal and external data sources. Even databases managed by professional compilers can be examined through these criteria. The checklist could also be applicable to investors who are about to acquire a company with data assets (as in, “Kick the tire before you buy it.”).

1. Depth
Let’s start with the most obvious one. What kind of information is stored and maintained in the database? What are the dominant data variables in the database, and what is so unique about them? Variety of information matters for sure, and uniqueness is often related to specific business purposes for which databases are designed and created, along the lines of business data, international data, specific types of behavioral data like mobile data, categorical purchase data, lifestyle data, survey data, movement data, etc. Then again, mindless compilation of random data may not be useful for any business, regardless of the size.

Generally, data dictionaries (the lack of one is a sure sign of trouble) reveal the depth of a database, but we need to dig deeper, as transaction and behavioral data are much more potent predictors, and harder to manage, than demographic and firmographic data, which are very much commoditized already. Likewise, lifestyle variables derived from surveys that may have been conducted a long time ago are far less valuable than actual purchase history data, as what people say they do and what they actually do are two completely different things. (For more details on the types of data, refer to the second half of “Big Data Must Get Smaller.”)

Innovative ideas should not be overlooked, as data packaging is often very important in the age of information overflow. If someone or some company transformed many data points into user-friendly formats using modeling or other statistical techniques (imagine pre-developed categorical models targeting a variety of human behaviors, or pre-packaged segmentation or clustering tools), such effort deserves extra points, for sure. As I emphasized numerous times in this series, data must be refined to provide answers to decision-makers. That is why the sheer size of the database isn’t so impressive, and the depth of the database is not just about the length of the variable list and the number of bytes that go along with it. So, data collectors, impress us—because we’ve seen a lot.

2. Width
No matter how deep the information goes, if the coverage is not wide enough, the database becomes useless. Imagine well-organized, buyer-level POS (point of service) data coming from actual stores in “real time” (though I am sick of this word, too, as it is also overused). The data go down to SKU-level details and payment methods. Now imagine that the data in question are collected in only two stores: one in Michigan, and the other in Delaware. This, by the way, is not a completely made-up story; I faced similar cases in the past. Needless to say, we had to make many assumptions that we didn’t want to make in order to render the data useful, somehow. And I must say that it was far from ideal.

Even in an age when data are collected everywhere by every device, no dataset is ever complete (refer to “Missing Data Can Be Meaningful”). The limitations are everywhere. They could be about brand, business footprint, consumer privacy, data ownership, collection methods, technical limitations or the distribution of collection devices, and the list goes on. Yes, Apple Pay is making a big splash in the news these days. But would you believe that data collected only through Apple iPhones can really show the overall consumer trend in the country? Maybe in the future, but not yet. If you could pick only one credit card type to analyze, such as American Express, would you think that the result of the study is free from any bias? No siree. We can easily assume that such analysis would skew toward the more affluent population. I am not saying that such analyses are useless; in fact, they can be quite useful if we understand the limitations of the data collection and the nature of the bias. But the point is that coverage matters.

Further, even within multisource databases in the market, the coverage should be examined variable by variable, simply because some data points are really difficult to obtain, even for professional data compilers. For example, any information that crosses between the business and consumer worlds is sparsely populated in many cases, and the “occupation” variable remains mostly blank or unknown on the consumer side. Similarly, any data related to young children are difficult, or even forbidden, to collect, so a seemingly simple variable, such as “number of children,” is left unknown for many households. Automobile data used to be abundant at the household level, but a series of laws made sure that access to such data is now forbidden for many users. Again, don’t be impressed by the mere existence of some variables on the data menu; look into them to see “how much” is actually available.
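
A simple fill-rate tally goes a long way here. Below is a minimal sketch in Python, assuming a pandas DataFrame; the file name and the placeholder codes counted as "missing" are illustrative assumptions.

    import pandas as pd

    df = pd.read_csv("compiled_database.csv")       # hypothetical file
    placeholders = ["", "UNKNOWN", "N/A"]           # assumed missing-value codes

    # Share of records actually populated, per variable, worst first.
    fill_rate = df.replace(placeholders, pd.NA).notna().mean().sort_values()
    print((fill_rate * 100).round(1))

Variables such as “occupation” or “number of children” tend to land at the bottom of a report like this, no matter how proudly they appear on the data menu.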

3. Accuracy
In any scientific analysis, a “false positive” is a dangerous enemy. In fact, it is worse than not having the information at all. Many folks just assume that any data coming out of a computer are accurate (as in, “Hey, the computer says so!”). But data are not completely free from human errors.

The sheer accuracy of information is hard to measure, especially when the data sources are unique and rare. And errors can creep in at any stage, from data collection to imputation. If other known sources exist, comparing data from multiple sources is one way to ensure accuracy. Watching out for fluctuations in the distributions of important variables from update to update is another good practice.
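
A minimal sketch of that second practice, assuming two hypothetical snapshots of the same database; the variable names and the 10 percent alert threshold are illustrative, not standards.

    import pandas as pd

    prev = pd.read_csv("update_2024_q1.csv")    # hypothetical snapshots of the
    curr = pd.read_csv("update_2024_q2.csv")    # same database, one update apart

    for col in ["age", "home_value", "lifetime_spend"]:    # assumed variables
        before, after = prev[col].median(), curr[col].median()
        shift = abs(after - before) / abs(before)
        if shift > 0.10:    # a 10% swing in the median deserves a closer look
            print(f"Check {col}: median moved {shift:.0%} between updates")

A sudden jump rarely means the population changed overnight; it usually means something broke upstream.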

Nonetheless, the overall quality of the data is not just up to the person or department that manages the database. Yes, in this business, the last person who touches the data is responsible for all the mistakes made to them up to that point. However, when garbage goes in, garbage comes out. So, when there are errors, everyone who touched the database at any point must share the burden of guilt.

Recently, I was part of a project that involved data collected from retail stores. We ran all kinds of reports and tallies to check the data, and edited out many data values when we encountered obvious errors. The funniest one that I saw was the first name “Asian” and the last name “Tourist.” As an openly Asian-American person, I was semi-glad that they didn’t put in “Oriental Tourist” (though I still can’t figure out who decided that word is for objects, but not people). We also found names like “No info” and “Not given.” Heck, I saw in the news that a refugee from Afghanistan (he was a translator for U.S. troops) obtained a new first name as he was granted an entry visa: “Fnu,” short for “First Name Unknown.” Welcome to America, Fnu. Compared to that, “Andolini” becoming “Corleone” on Ellis Island is almost cute.

Data entry errors are everywhere. When I used to deal with data files from banks, I found that many last names were “Ira.” Well, it turned out that it wasn’t really the customers’ last name; they all just happened to have opened “IRA” accounts. Similarly, movie-style phone numbers like 777-555-1234 are very common. And fictitious names, such as “Mickey Mouse,” or profanities that are not fit to print are abundant, as well. At least fake email addresses can be tested and eliminated easily, and erroneous addresses can be corrected by time-tested routines, too. So, yes, maintaining a clean database is not easy when people freely enter whatever they feel like. But it is not an impossible task, either.

We can also train employees regarding data entry principles, to a certain degree (as in, “Do not enter your own email address,” “Do not use bad words,” etc.). But what about user-generated data? Search and kill is the only way to deal with them, and the job never ends; the meta-table of fictitious names only grows longer and longer. Maybe we should just add “Thor” and “Sponge Bob” to that Mickey Mouse list while we’re at it. Yet dealing with this type of “text” data is the easy part. If the database manager in charge is not lazy, and if there is a bit of budget allowed for data hygiene routines, one can avoid sending emails to “Dear Asian Tourist.”
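
For illustration, a minimal "search and kill" pass might look like the sketch below; the meta-table of fake names and the phone-number pattern are deliberately tiny, illustrative stand-ins for real hygiene tables.

    import pandas as pd

    FAKE_NAMES = {"mickey mouse", "asian tourist", "no info", "not given",
                  "thor", "sponge bob"}    # the meta-table that never stops growing

    df = pd.read_csv("retail_customers.csv")    # hypothetical file
    full_name = (df["first_name"] + " " + df["last_name"]).str.lower()

    # Flag fictitious names and the movie-style 555 phone numbers.
    is_fake = full_name.isin(FAKE_NAMES) | df["phone"].astype(str).str.contains(
        r"555-\d{4}$", regex=True
    )
    print(f"Flagged {is_fake.sum()} suspicious records out of {len(df)}")
    df = df[~is_fake]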

Numeric errors are much harder to catch, as numbers do not look wrong to human eyes. That is when comparison to other known sources becomes important. If such examination is not possible on a granular level, then median value and distribution curves should be checked against historical transaction data or known public data sources, such as U.S. Census Data in the case of demographic information.
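As an illustration, here is a sketch of that kind of sanity check in Python, comparing quantiles of a numeric variable against a known benchmark. The file, the variable and the benchmark figures are all placeholders, not real Census numbers.

```python
import pandas as pd

df = pd.read_csv("demographic_append.csv")  # hypothetical appended file

# Placeholder benchmark quantiles for household income; in practice these
# would come from published sources for the relevant geography
benchmark = {0.25: 35_000, 0.50: 62_000, 0.75: 98_000}

for q, expected in benchmark.items():
    observed = df["household_income"].quantile(q)
    drift = (observed - expected) / expected
    if abs(drift) > 0.10:  # the tolerance is a judgment call
        print(f"Quantile {q}: {observed:,.0f} vs benchmark "
              f"{expected:,.0f} ({drift:+.1%}) -- investigate")
```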

When it comes to a company’s own data, follow your instincts and get rid of data that look too good or too bad to be true. We can all afford to lose a few records in our databases, and there is nothing wrong with deleting “outliers” with extreme values. An erroneous name, like “No Information,” may be attached to a seven-figure lifetime spending sum, and you know that can’t be right.

The main takeaways are: (1) Never trust the data just because someone bothered to store them in computers, and (2) Constantly look for bad data in reports and listings, at times using old-fashioned eyeballing methods. Computers do not know what is “bad” until we specifically tell them what bad data are. So, don’t give up, and keep at it. And if it’s about someone else’s data, insist on data tallies and data hygiene stats.

4. Recency
Outdated data are really bad for prediction or analysis, and that is a different kind of badness. Many call it a “Data Atrophy” issue: No matter how fresh and accurate a data point may be today, it will surely deteriorate over time. Yes, data have a finite shelf-life, too. Let’s say that you obtained a piece of information called “Golf Interest” on an individual level. That information could be coming from a survey conducted a long time ago, or from golf equipment purchase data from a while back. In any case, the person carrying that flag may have stopped shopping for new golf equipment, as he doesn’t play much anymore. Without a proper database update and a constant feed of fresh data, irrelevant data will continue to drive our decisions.

The crazy thing is that the harder it is to obtain certain types of data—such as transaction or behavioral data—the faster they deteriorate. By nature, transaction or behavioral data are time-sensitive. That is why it is important to install time parameters in databases for behavioral data. If someone purchased a new golf driver, when did he do that? Surely, having bought a golf driver in 2009 (“Hey, time for a new driver!”) is different from having purchased one last May.

So-called “Hot Line Names” literally cease to be hot after two to three months, or in some cases much sooner. The evaporation period may be different for different product types, as one may stay in the market longer for an automobile than for a new printer. Part of the job of a data scientist is to defer the expiration date of data, finding leads or prospects who are still “warm,” or even “lukewarm,” with available valid data. But no matter how much statistical work goes into making the data “look” fresh, eventually the models will cease to be effective.
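To illustrate, here is a sketch of such time parameters at work, in Python. The shelf lives below are made-up numbers; the real evaporation periods should come from your own response or purchase data.

```python
import pandas as pd

df = pd.read_csv("transactions.csv", parse_dates=["purchase_date"])

# Hypothetical evaporation periods, in days, by product category
SHELF_LIFE = {"automobile": 180, "golf_equipment": 365, "printer": 60}

age_days = (pd.Timestamp.today() - df["purchase_date"]).dt.days
limit = df["category"].map(SHELF_LIFE)

# Unknown categories fall through as "expired" by default
df["freshness"] = "expired"
df.loc[age_days <= limit, "freshness"] = "warm"
df.loc[age_days <= limit / 3, "freshness"] = "hot"
```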

For decision-makers who do not make real-time decisions, a real-time database update could be an expensive solution. But databases must still be updated regularly (daily, weekly, monthly or even quarterly). Otherwise, someone will eventually make a wrong decision based on outdated data.

5. Consistency
No matter how much effort goes into keeping the database fresh, not all data variables will be updated or filled in consistently. That is the reality. The interesting thing is that, especially in advanced analytics, we can still produce decent predictions if the data are consistent. It may sound crazy, but even not-so-accurate data can be used in predictive analytics, if they are “consistently” wrong. Modeling is about developing an algorithm that differentiates targets and non-targets, and if the descriptive variables are “consistently” off (or outdated, like census data from five years ago) on both sides, the model can still perform.

Conversely, if there is a huge influx of a new type of data, or any drastic change in data collection or in a business model that supports such data collection, all bets are off. We may end up predicting such changes in business models or in methodologies, not the differences in consumer behavior. And that is one of the worst kinds of errors in the predictive business.

Last month, I talked about dealing with missing data (refer to “Missing Data Can Be Meaningful”), and I mentioned that data can be inferred via various statistical techniques. And such data imputation is OK, as long as it returns consistent values. I have seen many so-called professionals mess up popular modeled variables, like “Household Income,” from update to update. If the inferred values jump dramatically due to changes in the source data, no amount of effort can save the targeting models that employed such variables, short of re-developing them.

That is why a time-series comparison of important variables in databases is so important. Any change of more than 5 percent in the distribution of a variable, compared to the previous update, should be investigated immediately. If you are dealing with external data vendors, insist on having a distribution report of key variables for every update. Consistency of data is more important in predictive analytics than sheer accuracy.
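If I were to sketch that update-over-update check in Python, it might look like the following. The file names, the variable list and even the 5 percent threshold are all stand-ins to be tuned per database.

```python
import pandas as pd

prev = pd.read_csv("update_prev.csv")   # hypothetical prior update
curr = pd.read_csv("update_curr.csv")   # hypothetical current update

def distribution(s: pd.Series) -> pd.Series:
    """Share of records per category, counting blanks as 'missing'."""
    return s.fillna("missing").value_counts(normalize=True)

for var in ["household_income_band", "homeowner_flag"]:  # key variables
    shift = distribution(curr[var]).sub(distribution(prev[var]),
                                        fill_value=0).abs()
    suspects = shift[shift > 0.05]  # the 5 percent rule of thumb
    if not suspects.empty:
        print(f"{var}: distribution moved -- investigate before accepting")
        print(suspects)
```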

6. Connectivity
As I mentioned earlier, there are many types of data. And the predictive power of data multiplies as different types of data get to be used together. For instance, demographic data, which is quite commoditized, still plays an important role in predictive modeling, even when dominant predictors are behavioral data. It is partly because no one dataset is complete, and because different types of data play different roles in algorithms.

The trouble is that many modern datasets do not share any common matching keys. On the demographic side, we can easily imagine using PII (Personally Identifiable Information), such as name, address, phone number or email address, for matching. Now, if we want to add some transaction data to the mix, we would need some match “key” (or a magic decoder ring) by which we can link it to the base records. Unfortunately, many modern databases completely lack PII, right from the data collection stage. The result is that such a data source remains in a silo. Not all is lost in such a situation, as the data can still be used for trend analysis. But to employ multisource data for one-to-one targeting, we really need to establish connections among the various data worlds.

Even if the connection cannot be made at the household, individual or email level, I would not give up entirely, as we can still target based on IP addresses, which may lead us to geographic units, such as ZIP codes. I’d take ZIP-level targeting anytime over no targeting at all, even though many analytical and summarization steps are required to get there (more on that subject in future articles).

Not having PII or any hard matchkey is not a complete deal-breaker, but the maneuvering space for analysts and marketers decreases significantly without it. That is why the existence of PII, or even ZIP codes, is the first thing that I check when looking into a new data source. I would like to free them from isolation.
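For the record, here is a rough sketch of the kind of “waterfall” matching I have in mind, in Python with pandas: try the strongest key first, then fall back to weaker ones, down to the ZIP level. The key hierarchy and the column names (including record_id) are hypothetical, and a production version would add real soft-matching logic.

```python
import pandas as pd

def waterfall_match(base: pd.DataFrame, ext: pd.DataFrame) -> pd.DataFrame:
    """Attach external data using the strongest key available per record."""
    # Hypothetical hierarchy, from strongest key to weakest
    keys = [["email"], ["phone"], ["last_name", "zip"], ["zip"]]
    pieces, remaining = [], base
    for key in keys:
        # Drop blank keys so missing values never "match" each other,
        # and de-dupe the external side to avoid record fanout
        hit = remaining.dropna(subset=key).merge(
            ext.dropna(subset=key).drop_duplicates(subset=key),
            on=key, how="inner")
        hit["match_level"] = "+".join(key)
        pieces.append(hit)
        # 'record_id' is assumed unique on the base file and absent from ext
        remaining = remaining[~remaining["record_id"].isin(hit["record_id"])]
    pieces.append(remaining.assign(match_level="unmatched"))
    return pd.concat(pieces, ignore_index=True)
```

Even the records that only match at the ZIP level come out of this with something usable, which beats leaving the external source in a silo.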

7. Delivery Mechanisms
Users judge databases based on the visualization or reporting tool sets that are attached to them. As I mentioned earlier, that is like judging an entire building by its window treatments. But for many users, that is the reality. After all, how would a casual user without a programming or statistical background even “see” the data? Through tool sets, of course.

But that is only one end of it. There are so many types of platforms and devices, and the data must flow through them all. The important point is that data are useless if they are not in the hands of decision-makers, through the device of their choice, at the right time. Such flow can be actualized via API feeds, FTP, or good, old-fashioned batch installments, and no database should sit too far away from the decision-makers. In my earlier column, I emphasized that data players must be good at (1) Collection, (2) Refinement, and (3) Delivery (refer to “Big Data is Like Mining Gold for a Watch—Gold Can’t Tell Time”). Delivering the answers to inquirers properly closes one iteration of the information flow. And the answers must continue to flow to the users.

8. User-Friendliness
Even when state-of-the-art (I apologize for the cliché) visualization, reporting or drill-down tool sets are attached to the database, if the data variables are too complicated or not intuitive, users will get frustrated and eventually move away from them. If that happens after pouring a sick amount of money into a data initiative, it is a shame. But it happens all the time. I am not going to name names here, but I once saw a ridiculously hard-to-understand data dictionary from a major data broker in the U.S.; it looked like the data layout was designed by robots, for robots. Please. Data scientists must try to humanize the data.

This whole Big Data movement has momentum now, and in the interest of not killing it, data players must make every aspect of this data business easier for the users, not harder. Simpler data fields, intuitive variable names, meaningful value sets, pre-packaged variables in the form of answers, and a complete data dictionary are not too much to ask after all the hard work of developing and maintaining the database.

This is why I insist that data scientists and professionals must be businesspeople first. The developers should never forget that end-users are not trained data experts. And guess what? Even professional analysts would appreciate intuitive variable sets and complete data dictionaries. So, pretty please, with sugar on top, make things easy and simple.
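If it helps, here is a tiny sketch, in Python, of what “humanizing” a cryptic vendor layout can look like. The field names and code values are made up, but anyone who has read a real data dictionary will recognize the species.

```python
import pandas as pd

df = pd.read_csv("vendor_layout.csv")  # hypothetical cryptic extract

# Decoder ring: cryptic vendor fields -> intuitive variable names
RENAME = {"HH_INC_CD": "income_band",
          "DWL_TYP": "dwelling_type",
          "PRSNC_CHLD": "children_present"}

# Meaningful value sets instead of opaque single-letter codes
DECODE = {"children_present": {"Y": "yes", "N": "no", "U": "unknown"}}

df = df.rename(columns=RENAME)
for col, mapping in DECODE.items():
    df[col] = df[col].map(mapping).fillna("unknown")
```

A few lines of translation like this, plus a data dictionary written for humans, can be the difference between a tool set that gets used and one that gets abandoned.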

9. Cost
I saved this important item for last for a good reason. Yes, the dollar sign is a very important factor in all business decisions, but it should not be the sole deciding factor when it comes to databases. That means CFOs should not dictate decisions regarding data or databases without considering input from CMOs, CTOs, CIOs or CDOs, who should, in turn, be concerned about all the other criteria listed in this article.

Playing with data costs money. And, at times, a lot of money. When you add up all the costs for hardware, software, platforms, tool sets, maintenance and, most importantly, the man-hours for database development and maintenance, the sum becomes very large very fast, even in the age of open-source environments and cloud computing. That is why many companies outsource the database work to share the financial burden of building the infrastructure. But even then, the quality of the database should be evaluated on all of these criteria, not just the price tag. In other words, don’t just pick the lowest bidder and hope to God that it will be alright.

When you purchase external data, you can apply these same evaluation criteria. A test-match job with a data vendor will reveal lots of the details listed here; metrics such as match rate and variable fill-rate, along with a complete data dictionary, should be carefully examined. In short, what good is a lower unit price per 1,000 records, if the match rate is horrendous and even the matched data are filled with missing or sub-par inferred values? Also consider that, once you commit to an external vendor and start building models and an analytical framework around their data, it becomes very difficult to switch vendors later on.

When shopping for external data, consider the following when it comes to pricing options:

  • Number of variables to be acquired: Don’t just go for the full option. Pick the ones that you need (involve your analysts), unless you get a fantastic deal for an all-inclusive option. Most vendors provide multiple packaging options.
  • Number of records: Processed vs. Matched. Some vendors charge based on “processed” records, not just matched records. Depending on the match rate, it can make a big difference in total cost (see the sketch after this list).
  • Installment/update frequency: Real-time, weekly, monthly, quarterly, etc. Think carefully about how often you would need to refresh “demographic” data, which doesn’t change as rapidly as transaction data, and how big the incremental universe would be for each update. Obviously, a real-time API feed can be costly.
  • Delivery method: API vs. Batch Delivery, for example. Both the price and the data menu can change quite a bit based on the delivery option.
  • Availability of a full-licensing option: When the internal database becomes really big, full installment becomes a good option. But you would need internal capability for a match-and-append process that involves “soft-matching” on similar names and addresses (imagine good old name-and-address merge routines). It becomes a bit of a commitment, as the match-and-append becomes part of the internal database update process.
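To see why the processed-vs.-matched distinction matters, here is the back-of-the-envelope sketch promised above, with entirely made-up prices and match rates:

```python
# Hypothetical quotes: the "cheaper" vendor is not the cheaper one
records_sent = 1_000_000
match_rate = 0.55                  # revealed by the test-match job

price_per_k_processed = 18.00      # Vendor A charges per processed record
price_per_k_matched = 25.00        # Vendor B charges per matched record

matched = records_sent * match_rate
cost_a = records_sent / 1_000 * price_per_k_processed   # $18,000
cost_b = matched / 1_000 * price_per_k_matched          # $13,750

print(f"Vendor A: ${cost_a:,.0f}, or "
      f"${cost_a / matched * 1_000:.2f} per 1,000 matched")  # ~$32.73
print(f"Vendor B: ${cost_b:,.0f}, or "
      f"${price_per_k_matched:.2f} per 1,000 matched")       # $25.00
```

At a 55 percent match rate, the vendor with the higher sticker price ends up meaningfully cheaper per usable record.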

Business First
Evaluating a database is a project in itself, and these nine evaluation criteria make a good guideline. Depending on the business, of course, more conditions could be added to the list. And that leads to the final point, one that I did not even include in the list: The database (or all data, for that matter) should be useful in meeting the business goals.

I have been saying that “Big Data Must Get Smaller,” and this whole Big Data movement should be about (1) Cutting down on the noise, and (2) Providing answers to decision-makers. If the data sources in question do not serve the business goals, cut them out of the plan, or cut loose the vendor if they are from external sources. It would be an easy decision if you “know” that the database in question is filled with dirty, sporadic and outdated data that cost lots of money to maintain.

But if that database is needed for your business to grow, clean it, update it, expand it and restructure it to harness better answers from it. Just like the way you’d maintain your cherished automobile to get more mileage out of it. Not all databases are created equal for sure, and some are definitely more equal than others. You just have to open your eyes to see the differences.

‘Big Data’ Is Like Mining Gold for a Watch – Gold Can’t Tell Time

It is often quoted that 2.5 quintillion bytes of data are collected each day. That surely sounds like a big number, considering 1 quintillion bytes (or 1 exabyte, if that sounds fancier) is equal to 1 billion gigabytes. Looking back only about 20 years, I remember my beloved 386-based desktop computer had a hard drive that could barely hold 300 megabytes, which was considered quite large in those ancient days. Now, my phone can hold about 65 gigabytes, which, by the way, means nothing to me. I just know that figure equates to about 6,000 songs, plus all my personal information, with room to spare for hundreds of photos and videos. So how do I fathom the size of 2.5 quintillion bytes? I don’t. I give up. I’d rather count the number of stars in the universe. And I have been in the database business for more than 25 years.

But I don’t feel bad about that. If a pile of data requires a computer to process it, then it is already too “big” for our brains. In the age of “Big Data,” size matters, but emphasizing the size element is missing the point. People want to understand the data in their own terms and want to use them in decision-making processes. Throwing the raw data around to people without math or computing skills is like galleries handing out paint and brushes to people who want paintings on the wall. Worse yet, continuing to point out how “big” the Big Data world is to them is like quoting the number of rice grains on this planet in front of a hungry man, when he doesn’t even care how many grains are in one bowl. He just wants to eat a bowl of “cooked” rice, and right this moment.

To be a successful data player, one must be the master of the following three steps:

  • Collection;
  • Refinement; and
  • Delivery.

Collection and storage are obviously important in the age of Big Data. However, that in itself shouldn’t be the goal. I hear lots of bragging about how much data can be collected and stored, and how fast the data can be retrieved.

Great, you can retrieve any transaction detail going back 20 years in less than 0.5 seconds. Congratulations. But can you now tell me who is more likely to be a loyal customer for the next five years, with an annual spending potential of more than $250? Or who is more likely to quit using the service in the next 60 days? Who is more likely to be on a cruise ship leaving the dock on the East Coast heading for Europe between Thanksgiving and Christmas, with an onboard spending potential greater than $300? Who is more likely to respond to emails with free shipping offers? Where should I open my next store selling fancy children’s products? What do my customers look like, and where do they go between 6 and 9 p.m.?

Answers to these types of questions do not come from the raw data, but they should be derived from the data through the data refinement process. And that is the hard part. Asking the right questions, expressing the goals in a mathematical format, throwing out data that don’t fit the question, merging data from a diverse array of sources, summarizing the data into meaningful levels, filling in the blanks (there will be plenty—even these days), and running statistical models to come up with scores that look like an answer to the question are all parts of the data refinement process. It is a lot like manufacturing gold watches, where mining gold is just an important first step. But a piece of gold won’t tell you what time it is.
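As a toy example of what one pass of refinement can look like, here is a sketch in Python that turns raw transactions into a customer-level profile with a crude loyalty score. It is a stand-in for a real statistical model, with hypothetical file and column names, not anyone’s production method.

```python
import pandas as pd

# Raw material: one row per transaction (hypothetical file and columns)
tx = pd.read_csv("transactions.csv", parse_dates=["date"])

# Summarize to a meaningful level: one row per customer
profile = tx.groupby("customer_id").agg(
    total_spend=("amount", "sum"),
    orders=("amount", "size"),
    last_order=("date", "max"),
)

# A crude stand-in for a model score: recent, frequent, high-spend
recency_days = (pd.Timestamp.today() - profile["last_order"]).dt.days
profile["loyalty_score"] = (profile["total_spend"].rank(pct=True)
                            + profile["orders"].rank(pct=True)
                            + (1 - recency_days.rank(pct=True))) / 3
```

The raw transaction log could answer none of the questions above; the refined, scored profile at least begins to.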

The final step is to deliver that answer—which, by now, should be in a user-friendly format—to the user at the right time in the right format. Often, data-related products emphasize only this part, as it is the most intimate one to the users. After all, it provides an illusion that the user is in total control, being able to touch the data so nicely displayed on the screen. Such tool sets may produce impressive-looking reports and dazzling graphics. But, lest we forget, they are only representations of the data refinement processes. In addition, no tool set will ever do the thinking part for anyone. I’ve seen so many missed opportunities where decision-makers invested obscene amounts of money in fancy tool sets, believing those tools would conduct all the logical and data refinement work for them, automatically. That is like believing that purchasing a top-of-the-line Fender Stratocaster guarantees you will play like Eric Clapton in the near future. Yes, tool sets are important as delivery mechanisms of refined data, but none of them replace the refinement part. Skipping that part is like skipping guitar practice after spending $3,000 on a guitar.

Big Data business should be about providing answers to questions. It should be about humans who are the subjects of data collection and, in turn, the ultimate beneficiaries of information. It’s not about IT or tool sets that come and go like hit songs. But it should be about inserting advanced use of data into everyday decision-making processes by all kinds of people, not just the ones with statistics degrees. The goal of data players must be to make it simple—not bigger and more complex.

I boldly predict that missing these points will make “Big Data” a dirty word within the next three years. Emphasizing the size element alone will lead to unbalanced investments, which will then lead to disappointing results with not much to show for them in this cruel age of ROI. That is a sure way to kill the buzz. Not that I am that fond of the expression “Big Data”; though, I admit, one benefit has been that I no longer have to spend 10 minutes explaining what I do for a living. Nonetheless, all the Big Data folks may need an exit plan if we are indeed heading for the days when it becomes yet another disappointing buzzword. So let’s do this one right, and start thinking about refining the data first and foremost.

Collection and storage are just so last year.