Beware of One-Size-Fits-All Customer Data Solutions

In the data business, the ability to fine-tune database structure and toolsets to meet unique business requirements is key to success, not just flashy features and functionalities. Beware of technology providers who insist on a “one-size-fits-all” customer data solution, unless the price of entry is extremely low. Always check the tech provider’s exception management skills and their determination to finish the last mile. Too often, many just freeze at the thought of any customization.

The goal of any data project is to create monetary value out of available data. Whether it is about increasing revenue or reducing cost, data activities through various types of basic and advanced analytics must yield tangible results. Marketers are not doing all this data-related work to entertain geeks and nerds (no offense); no one is paying for data infrastructure, analytics toolsets, and, most importantly, the human cost, just to support the intellectual curiosity of a bunch of specialists.

Therefore, when it comes to evaluating any data play, the criteria that CEOs and CFOs bring to the table matter the most. Yes, I shared a long list of CDP evaluation criteria from the users’ and technical points of view last month, but let me emphasize that, like any business activity, data work is ultimately about the bottom line.

That means we have to maintain a balance between the cost of doing business and the usability of data assets. Unfortunately, these two factors pull in opposite directions: to make customer data more useful, one must put more time and money into it. Most datasets are unstructured, unrefined, uncategorized, and plain dirty. And the messiness level is not uniform.

Start With the Basics

Now, there are many commoditized toolsets out in the market to clean the data and weave them together to create a coveted Customer-360 view. In fact, if a service provider or a toolset isn’t even equipped to do the basic part, I suggest working with someone who can.

For example, a service provider must know the definition of dirty data. They may have to ask the client to gauge the tolerance level (for messy data), but basic parameters must be in place already.

What is a good email address, for instance? It should have all the proper components, like an @ sign and a valid ending such as .com, .net, or .org. Permission flags must be attached properly. Primary and secondary email addresses must be set by predetermined rules. Addresses must be tagged properly if delivery fails, even once. The list goes on. I can think of similar sets of rules for name, address, company name, phone number, and other basic data fields.
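
To make the point concrete, here is a minimal sketch of what such predetermined rules might look like in code. It is illustrative only: the field names (email, permission_flag, bounce_count) and the short list of accepted endings are hypothetical placeholders, not anyone’s production standard.

```python
import re

# A minimal sketch of basic email hygiene rules, not a production validator; the field
# names (email, permission_flag, bounce_count) and the short domain list are hypothetical.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.(com|net|org|edu|gov)$", re.IGNORECASE)

def is_usable_email(record: dict) -> bool:
    """Well-formed address, opt-in permission on file, and no failed deliveries."""
    email = (record.get("email") or "").strip().lower()
    if not EMAIL_PATTERN.match(email):
        return False                       # malformed: no @ sign or recognizable ending
    if not record.get("permission_flag"):
        return False                       # no permission, no send
    if record.get("bounce_count", 0) > 0:
        return False                       # tagged after even one failed delivery
    return True

print(is_usable_email({"email": "jane.doe@example.com", "permission_flag": True}))   # True
print(is_usable_email({"email": "asian.tourist@nowhere", "permission_flag": True}))  # False
```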

Why are these important? Because it is not possible to create that Customer-360 view without properly cleaned and standardized Personally Identifiable Information (PII). Anyone who is in this game must be a master of that. The ability to clean basic information and to match seemingly unmatchable entities is just a prerequisite in this game.

Even Basic Data Hygiene and Matching Routines Must Be Tweaked

Even with basic match routines, users must be able to dictate the tightness or looseness of the matching logic. If the goal of customer communication involves legal notifications (as in the banking and investment industries), one should not merge two entities just because they look similar. If the goal is mainly to maximize campaign effectiveness, one may merge similar-looking entities using various “fuzzy” matching techniques, employing Soundex, nickname tables, and abbreviated or hashed match keys. If the database is filled with business entities for B2B marketing, then the so-called commoditized merge rules become more complicated still.
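
As an illustration of what “dictating tightness and looseness” can mean in practice, here is a toy sketch of a tunable name-matching rule. The nickname table and the thresholds are hypothetical; a real routine would also bring in Soundex codes, parsed addresses, and hashed match keys.

```python
from difflib import SequenceMatcher

# A toy sketch of tunable match tightness; the nickname table and thresholds are
# hypothetical, and real routines add Soundex, address parsing, and hashed keys.
NICKNAMES = {"bob": "robert", "bill": "william", "liz": "elizabeth"}

def normalize(full_name: str) -> str:
    parts = full_name.strip().lower().split()
    if parts:
        parts[0] = NICKNAMES.get(parts[0], parts[0])   # expand common nicknames
    return " ".join(parts)

def is_match(name_a: str, name_b: str, threshold: float = 0.90) -> bool:
    """Higher threshold = tighter matching (legal mail); lower = looser (campaigns)."""
    return SequenceMatcher(None, normalize(name_a), normalize(name_b)).ratio() >= threshold

print(is_match("Bob Smith", "Robert Smith"))         # True via the nickname table
print(is_match("Jon Smith", "John Smith", 0.95))     # False under tight, legal-grade rules
print(is_match("Jon Smith", "John Smith", 0.90))     # True under looser campaign rules
```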

The first sign of trouble often becomes visible at this basic stage. Beware of providers that insist on “one-size-fits-all” rules in the name of some universal matching routine. There was no such thing even in the age of direct marketing (i.e., the really old days). How are we going to navigate a complex omnichannel marketing environment with just a few hard-set rules that can’t be modified?

Simple matching logic based only on name, address, and email becomes much more complex when you add new online and offline channels, as they all come with different types of match keys. Even within the offline world, the quality of customer names collected in physical stores differs vastly from that of self-entered information on a website, which comes with a shipping address. For example, I have seen countless invalid names like “Mickey Mouse,” “Asian Tourist,” or “No Name Provided” in store data. Conversely, no one who wants to receive the merchandise at their address would create an entry reading “First Name: Asian” and “Last Name: Tourist.”

Sure, I’m providing simple examples to illustrate the fallacy of “one-size-fits-all” rules. But by definition, a CDP is an amalgamation of vastly different data sources, online and offline. Exceptions are the rules.

Dissecting Transaction Elements

Up to this point, we are still in the realm of “basic” stuff, which is mostly commoditized in the technology market. Now, let’s get into more challenging parts.

Once data weaving is done through PII fields and various proxies of individuals across networks and platforms, then behavioral, demographic, geo-location, and movement data must be consolidated around each individual. Now, demographic data from commercial data compilers are already standardized (one would hope), regardless of their data sources. Every other customer data type varies depending on your business.

The simplest form of transaction record would come from retail businesses, where you sell widgets for set prices through certain channels. And what is a transaction record in that sense? “Who” bought “what,” “when,” for “how much,” through “what channel.” Even from such a simplified viewpoint, things are not so uniform.

Let’s start with an easy one: the common date/time stamp. Is it in the form of a UTC time code? That would be simple. Do we need to know the day-part of the transaction? Eventually, but by what standard? Do we need to convert timestamps into the local time of the transaction? Yes, because we need to tell evening buyers and daytime buyers apart, and we can’t use Coordinated Universal Time for that (unless you only operate in the U.K.).
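
Here is a minimal sketch of that conversion, assuming the feed carries ISO-format UTC timestamps and each store has a known time zone. The day-part boundaries are an arbitrary choice, which is exactly the “by what standard?” question above.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# A sketch only: assumes ISO-format UTC timestamps and a known store time zone;
# the day-part boundaries below are an arbitrary, hypothetical standard.
def day_part(utc_ts: str, store_tz: str = "America/New_York") -> str:
    """Convert a UTC timestamp to local store time and bucket it into a day-part."""
    utc = datetime.fromisoformat(utc_ts).replace(tzinfo=timezone.utc)
    hour = utc.astimezone(ZoneInfo(store_tz)).hour
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 17:
        return "afternoon"
    if 17 <= hour < 22:
        return "evening"
    return "night"

print(day_part("2024-05-03T23:30:00"))  # 23:30 UTC is 7:30 p.m. in New York -> "evening"
```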

“How much” isn’t so bad. It is made up of net price, tax, shipping, discount, coupon redemption, and finally, the total paid amount (for completed transactions). Sounds easy? Let’s just say that out of the thousands of transaction files I’ve encountered in my lifetime, I couldn’t find any “one rule” that governs how merchants handle returns, refunds, or coupon redemptions.

Some create multiple entries for each action, with or without a common transaction ID (crazy, right?). Many customer data sources contain mathematical errors all over. Inevitable file cutoff dates create orphan records, where only a return transaction is found without any linkage to the original transaction. Yes, we are not building an accounting system out of a marketing database, but no one should count canceled and returned transactions as valid transactions for any analytics. “One-size-fits-all?” I laugh at that notion.
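
For illustration, here is one possible rule set in code. The column names, status values, and tolerance are hypothetical; the point is that someone has to decide, explicitly, which rows count.

```python
# A sketch of one possible rule set for excluding returns and reconciling amounts;
# the column names, status values, and tolerance are hypothetical, since every merchant differs.
transactions = [
    {"txn_id": "T1", "status": "completed", "net": 50.0, "tax": 4.0, "shipping": 5.0,
     "discount": 10.0, "paid": 49.0},
    {"txn_id": "T2", "status": "returned", "net": 30.0, "tax": 2.4, "shipping": 0.0,
     "discount": 0.0, "paid": -32.4},
]

def is_valid_for_analytics(txn: dict, tolerance: float = 0.01) -> bool:
    """Keep only completed transactions whose components actually add up."""
    if txn["status"] != "completed":
        return False                                  # drop returns, cancellations, orphans
    expected = txn["net"] + txn["tax"] + txn["shipping"] - txn["discount"]
    return abs(expected - txn["paid"]) <= tolerance   # catch the mathematical errors

clean = [t for t in transactions if is_valid_for_analytics(t)]
print([t["txn_id"] for t in clean])   # ['T1']
```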

“Channel” may not be so bad. But at what level? What if the client has over 1,000 retail store locations all over the world? Should there be a subcategory under “Retail” as a channel? What about multiple websites with different brand names? How would we organize all that? If this type of basic – but essential – data isn’t organized properly, you won’t even be able to share store level reports with the marketing and sales teams, who wouldn’t care for a minute about “why” such basic reports are so hard to obtain.

The “what” part can be really complicated. Or very simple, if product SKUs are well-organized with proper product descriptions and, more importantly, predetermined product categories. A good sign is the presence of a multi-level product category table, where you see entries like an apparel category broken down into Men, Women, Children, etc., and Women’s Apparel broken down further into Formalwear, Sportswear, Casualwear, Underwear, Lingerie, Beachwear, Fashion, Accessories, etc.

For merchants with vast arrays of products, three to five levels of subcategories may be necessary even for simple BI reports, or further, advanced modeling and segmentation. But I’ve seen too many cases of incongruous and inconsistent categories (totally useless), recycled category names (really?), and weird categories such as “Summer Sales” or “Gift” (which are clearly for promotional events, not products).
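
A sketch of what a usable multi-level category table can look like, keyed here by a hypothetical SKU prefix; real tables are far larger, but the structure is the point.

```python
# A sketch of a multi-level category table keyed by a hypothetical SKU prefix;
# real tables are much larger, but the three-level structure is the point.
CATEGORY_TABLE = {
    "WAP-SPT": ("Apparel", "Women", "Sportswear"),
    "WAP-FRM": ("Apparel", "Women", "Formalwear"),
    "MAP-CAS": ("Apparel", "Men", "Casualwear"),
}

def categorize(sku: str) -> tuple:
    """Return (level_1, level_2, level_3) for a SKU, or park it for manual review."""
    return CATEGORY_TABLE.get(sku[:7], ("Uncategorized", "Uncategorized", "Uncategorized"))

print(categorize("WAP-SPT-00123"))  # ('Apparel', 'Women', 'Sportswear')
print(categorize("GIFT-2024-001"))  # uncategorized: "Gift" is a promotion, not a product category
```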

All these items must be fixed and categorized properly if they are not adequate for analytics. Otherwise, the gatekeepers of information are just dumping the hard work on poor end-users and analysts. Good luck creating any usable reports or models out of uncategorized product information; you might as well leave it as an unknown field, since product reports will have as many rows as there are SKUs in the system. It will be a challenge finding any insight in that kind of messy report.

Behavioral Data Are Complex and Unique to Your Business

Now, all this was about the relatively simple “transaction” part. Shall we get into the online behavior data? Oh, it gets much dirtier, as any “tag” data are only as good as the person or department that tagged the web pages in question. Let’s just say I’ve seen all kinds of variations of one channel (or “Source”) called “Facebook.” Not in one place, either, as they show up in “Medium” or “Device” fields, too. Who is going to clean up that mess?
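
Someone, in practice, writes an alias table. Here is a minimal sketch; the variant spellings are hypothetical but typical of what free-form tagging produces.

```python
# A sketch of normalizing free-form channel tags; the variants are hypothetical but
# typical of what shows up in "Source", "Medium", or even "Device" fields.
SOURCE_ALIASES = {
    "facebook": "Facebook", "fb": "Facebook", "facebook.com": "Facebook",
    "m.facebook.com": "Facebook", "face book": "Facebook",
}

def normalize_source(raw: str) -> str:
    key = (raw or "").strip().lower()
    return SOURCE_ALIASES.get(key, "Other/Unmapped")  # park unknowns for review, don't guess

for raw in ["FB", "m.facebook.com", "Face Book", "newsletter"]:
    print(raw, "->", normalize_source(raw))
```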

I don’t mean to scare you, but these are just common examples in the retail industry. If you are in any subscription, continuity, travel, hospitality, or credit business, things get much more complicated.

For example, there isn’t any one “transaction date” in the travel industry. There are Reservation Date, Booking Confirmation Date, Payment Date, Travel Date, Travel Duration, Cancellation Date, Modification Date, etc., and all these dates matter if you want to figure out what the traveler is about. If you get all of these down properly and calculate the distances between them, you may be able to tell whether the individual is traveling for business or for leisure. But only if all these data are in usable form.
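
As a rough illustration, here is a naive sketch of that kind of inference, assuming the dates are already parsed and standardized. The thresholds (a short booking window plus a short weekday stay) are hypothetical heuristics, not an established rule.

```python
from datetime import date

# A naive sketch, assuming the dates are already parsed and standardized; the
# thresholds (short booking window, short weekday stay) are hypothetical heuristics.
def likely_trip_type(reservation: date, travel: date, duration_days: int) -> str:
    lead_time = (travel - reservation).days       # distance between two of those dates
    weekday_departure = travel.weekday() < 5      # Monday through Friday
    if lead_time <= 7 and duration_days <= 3 and weekday_departure:
        return "likely business"
    return "likely leisure"

print(likely_trip_type(date(2024, 6, 3), date(2024, 6, 5), 2))     # likely business
print(likely_trip_type(date(2024, 1, 10), date(2024, 7, 20), 10))  # likely leisure
```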

Always Consider Exception Management Skills

Some of you may be in businesses where turn-key solutions may be sufficient. And there are plenty of companies that provide automated, but simpler and cheaper options. The proper way to evaluate your situation would be to start with specific objectives and prioritize them. What are the functionalities you can’t live without, and what is the main goal of the data project? (Hopefully not hoarding the customer data.)

Once you set the organizational goals, try not to deviate from them so casually in the name of cost savings and automation. Your bosses and colleagues (i.e., mostly the “bottom line” folks) may not care much about the limitations of toolsets and technologies (i.e., geeky concerns).

Omnichannel marketing that requires a CDP is already complicated. So, beware of sales pitches like “All your dreams will come true with our CDP solution!” Ask some hard questions, and see if they balk at the word “customization.” Your success may depend more on their ability to handle exceptions than on some commoditized functions they acquired a long time ago. Unless, of course, you really believe that you will safely get to your destination in “autopilot” mode.

 

Clean Up After Yourselves, Marketers

You know what can really suck? When a piece of marketing is spot on … until it isn’t. Let’s look at a couple email examples and see what went awry.

Let’s take a look at this email from American Red Cross … I’m a blood donor, and I regularly receive emails from the nonprofit, alerting me about blood drives and more. And hey, when the subject line is “MELISSA, This is Your Week’s Best Email,” it’s got to be good, right?
All right, this email is definitely on brand for me … photo of a puppy hugging a kitten? Check. Photo of a baby seal with a super cute smile on its face? Check. Let’s read on.

Ohmigod that puppy is so happy look at that … wait a second.

In the final paragraph, the email reads: “As an AB donor, MELISSA, your help is especially important.”

Oh really? I’m an A+ donor.

I’ve been donating blood off and on for the past 18 years. I’m registered as an A+ donor. So where did they get AB?

Look, it’s not the end of the world, but the incorrect personalized data stopped me dead in my tracks. And no, I didn’t schedule a donation in May.

And in April, a reader forwarded me the following email he received from Inc.:

The reader (who asked me not to share his name) let me know that, while Cornell University might hit the Inc. 5000 requirements, he does not work for Cornell. He’s also not an officer or trustee of the university. He is an alum (Go Big Red!) and an active volunteer, and sure, maybe his email address is @cornell.edu.

But so are the email addresses of a lot of Cornell students.

The lesson to be learned from these emails? Clean your lists, marketers. According to Experian Data Quality, dirty data costs marketers approximately 12 percent of their revenue. It makes you look bad, can cost you a sale, or at the very least gets people talking about you in ways you didn’t want.

 

Not All Databases Are Created Equal

Not all databases are created equal. No kidding. That is like saying that not all cars are the same, or not all buildings are the same. But somehow, “judging” databases isn’t so easy. First off, there is no tangible “tire” that you can kick when evaluating databases or data sources. Actually, kicking the tire is quite useless, even when you are inspecting an automobile. Can you really gauge the car’s handling, balance, fuel efficiency, comfort, speed, capacity or reliability based on how it feels when you kick “one” of the tires? I can guarantee that your toes will hurt if you kick it hard enough, and even then you won’t be able to tell the tire pressure within 20 psi. If you really want to evaluate an automobile, you will have to sign some papers and take it out for a spin (well, more than one spin, but you know what I mean). Then, how do we take a database out for a spin? That’s when the tool sets come into play.

However, even when the database in question is attached to analytical, visualization, CRM or drill-down tools, it is not so easy to evaluate it completely, as such practice reveals only a few aspects of a database, hardly all of them. That is because such tools are like window treatments of a building, through which you may look into the database. Imagine a building inspector inspecting a building without ever entering it. Would you respect the opinion of the inspector who just parks his car outside the building, looks into the building through one or two windows, and says, “Hey, we’re good to go”? No way, no sir. No one should judge a book by its cover.

In the age of Big Data (you should know by now that I am not too fond of that term), everything digitized is considered data. And data reside in databases. And databases are supposed to be designed to serve specific purposes, just like buildings and cars are. Although many modern databases are just mindless piles of accumulated data, provided that the database design is decent and functional, we can still imagine many different types of databases depending on their purposes and contents.

Now, most of the Big Data discussions these days are about the platform, environment, or tool sets. I’m sure you heard or read enough about those, so let me boldly skip all that and their related techie words, such as Hadoop, MongoDB, Pig, Python, MapReduce, Java, SQL, PHP, C++, SAS or anything related to that elusive “cloud.” Instead, allow me to show you the way to evaluate databases—or data sources—from a business point of view.

For businesspeople and decision-makers, it is not about NoSQL vs. RDB; it is just about the usefulness of the data. And the usefulness comes from the overall content and database management practices, not just platforms, tool sets and buzzwords. Yes, tool sets are important, but concert-goers do not care much about the types and brands of musical instruments that are being used; they just care if the music is entertaining or not. Would you be impressed with a mediocre guitarist just because he uses the same brand of guitar that his guitar hero uses? Nope. Likewise, the usefulness of a database is not about the tool sets.

In my past column, titled “Big Data Must Get Smaller,” I explained that there are three major types of data, with which marketers can holistically describe their target audience: (1) Descriptive Data, (2) Transaction/Behavioral Data, and (3) Attitudinal Data. In short, if you have access to all three dimensions of the data spectrum, you will have a more complete portrait of customers and prospects. Because I already went through that subject in-depth, let me just say that such types of data are not the basis of database evaluation here, though the contents should be on top of the checklist to meet business objectives.

In addition, throughout this series, I have been repeatedly emphasizing that the database and analytics management philosophy must originate from business goals. Basically, the business objective must dictate the course for analytics, and databases must be designed and optimized to support such analytical activities. Decision-makers—and all involved parties, for that matter—suffer a great deal when that hierarchy is reversed. And unfortunately, that is the case in many organizations today. Therefore, let me emphasize that the evaluation criteria that I am about to introduce here are all about usefulness for decision-making processes and supporting analytical activities, including predictive analytics.

Let’s start digging into key evaluation criteria for databases. This list would be quite useful when examining internal and external data sources. Even databases managed by professional compilers can be examined through these criteria. The checklist could also be applicable to investors who are about to acquire a company with data assets (as in, “Kick the tire before you buy it.”).

1. Depth
Let’s start with the most obvious one. What kind of information is stored and maintained in the database? What are the dominant data variables in the database, and what is so unique about them? Variety of information matters for sure, and uniqueness is often related to specific business purposes for which databases are designed and created, along the lines of business data, international data, specific types of behavioral data like mobile data, categorical purchase data, lifestyle data, survey data, movement data, etc. Then again, mindless compilation of random data may not be useful for any business, regardless of the size.

Generally, data dictionaries (the lack of one is a sure sign of trouble) reveal the depth of the database, but we need to dig deeper, as transaction and behavioral data are much more potent predictors, and harder to manage, in comparison to demographic and firmographic data, which are very much commoditized already. Likewise, lifestyle variables derived from surveys that may have been conducted a long time ago are far less valuable than actual purchase history data, as what people say they do and what they actually do are two completely different things. (For more details on the types of data, refer to the second half of “Big Data Must Get Smaller.”)

Innovative ideas should not be overlooked, as data packaging is often very important in the age of information overflow. If someone or some company transformed many data points into user-friendly formats using modeling or other statistical techniques (imagine pre-developed categorical models targeting a variety of human behaviors, or pre-packaged segmentation or clustering tools), such effort deserves extra points, for sure. As I emphasized numerous times in this series, data must be refined to provide answers to decision-makers. That is why the sheer size of the database isn’t so impressive, and the depth of the database is not just about the length of the variable list and the number of bytes that go along with it. So, data collectors, impress us—because we’ve seen a lot.

2. Width
No matter how deep the information goes, if the coverage is not wide enough, the database becomes useless. Imagine well-organized, buyer-level POS (Point of Sale) data coming from actual stores in “real-time” (though I am sick of this word, as it is also overused). The data go down to SKU-level details and payment methods. Now imagine that the data in question are collected in only two stores—one in Michigan, and the other in Delaware. This, by the way, is not a completely made-up story; I faced similar cases in the past. Needless to say, we had to make many assumptions that we didn’t want to make in order to make the data useful, somehow. And I must say that it was far from ideal.

Even in the age when data are collected everywhere by every device, no dataset is ever complete (refer to “Missing Data Can Be Meaningful“). The limitations are everywhere. It could be about brand, business footprint, consumer privacy, data ownership, collection methods, technical limitations, distribution of collection devices, and the list goes on. Yes, Apple Pay is making a big splash in the news these days. But would you believe that the data collected only through Apple iPhone can really show the overall consumer trend in the country? Maybe in the future, but not yet. If you can pick only one credit card type to analyze, such as American Express for example, would you think that the result of the study is free from any bias? No siree. We can easily assume that such analysis would skew toward the more affluent population. I am not saying that such analyses are useless. And in fact, they can be quite useful if we understand the limitations of data collection and the nature of the bias. But the point is that the coverage matters.

Further, even within multisource databases in the market, the coverage should be examined variable by variable, simply because some data points are really difficult to obtain even by professional data compilers. For example, any information that crosses between the business and the consumer world is sparsely populated in many cases, and the “occupation” variable remains mostly blank or unknown on the consumer side. Similarly, any data related to young children is difficult or even forbidden to collect, so a seemingly simple variable, such as “number of children,” is left unknown for many households. Automobile data used to be abundant on a household level in the past, but a series of laws made sure that the access to such data is forbidden for many users. Again, don’t be impressed with the existence of some variables in the data menu, but look into it to see “how much” is available.

3. Accuracy
In any scientific analysis, a “false positive” is a dangerous enemy. In fact, it is worse than not having the information at all. Many folks just assume that any data coming out of a computer is accurate (as in, “Hey, the computer says so!”). But data are not completely free from human errors.

Sheer accuracy of information is hard to measure, especially when the data sources are unique and rare. And errors can happen at any stage, from data collection to imputation. If there are other known sources, comparing data from multiple sources is one way to ensure accuracy. Watching out for fluctuations in the distributions of important variables from update to update is another good practice.

Nonetheless, the overall quality of the data is not just up to the person or department who manages the database. Yes, in this business, the last person who touches the data is responsible for all the mistakes that were made to it up to that point. However, when garbage goes in, garbage comes out. So, when there are errors, everyone who touched the database at any point must share in the burden of guilt.

Recently, I was part of a project that involved data collected from retail stores. We ran all kinds of reports and tallies to check the data, and edited many data values out when we encountered obvious errors. The funniest one that I saw was the first name “Asian” and the last name “Tourist.” As an openly Asian-American person, I was semi-glad that they didn’t put in “Oriental Tourist” (though I still can’t figure out who decided that word is for objects, but not people). We also found names like “No info” or “Not given.” Heck, I saw in the news that a refugee from Afghanistan (he was a translator for the U.S. troops) obtained a new first name, “Fnu,” when he was granted an entry visa. It is short for “First Name Unknown,” now the first name in his new passport. Welcome to America, Fnu. Compared to that, “Andolini” becoming “Corleone” on Ellis Island is almost cute.

Data entry errors are everywhere. When I used to deal with data files from banks, I found that many last names were “Ira.” Well, it turned out that it wasn’t really the customers’ last names, but they all happened to have opened “IRA” accounts. Similarly, movie phone numbers like 777-555-1234 are very common. And fictitious names, such as “Mickey Mouse,” or profanities that are not fit to print are abundant, as well. At least fake email addresses can be tested and eliminated easily, and erroneous addresses can be corrected by time-tested routines, too. So, yes, maintaining a clean database is not so easy when people freely enter whatever they feel like. But it is not an impossible task, either.

We can also train employees regarding data entry principles, to a certain degree. (As in, “Do not enter your own email address,” “Do not use bad words,” etc.). But what about user-generated data? Search and kill is the only way to do it, and the job would never end. And the meta-table for fictitious names would grow longer and longer. Maybe we should just add “Thor” and “Sponge Bob” to that Mickey Mouse list, while we’re at it. Yet, dealing with this type of “text” data is the easy part. If the database manager in charge is not lazy, and if there is a bit of a budget allowed for data hygiene routines, one can avoid sending emails to “Dear Asian Tourist.”
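
A minimal sketch of that “search and kill” meta-table approach; the lists here are tiny and hypothetical, and in real life they only grow.

```python
# A sketch of a "search and kill" meta-table; the lists here are tiny and hypothetical,
# and in practice they grow with every update.
FAKE_NAMES = {"mickey mouse", "asian tourist", "no name provided", "no info",
              "not given", "thor", "sponge bob", "test test"}
FAKE_PHONE_SUFFIXES = {"5551234", "5555555"}

def looks_fictitious(first: str, last: str, phone: str = "") -> bool:
    full_name = f"{first} {last}".strip().lower()
    digits = "".join(ch for ch in phone if ch.isdigit())
    return full_name in FAKE_NAMES or digits[-7:] in FAKE_PHONE_SUFFIXES

print(looks_fictitious("Mickey", "Mouse"))                  # True
print(looks_fictitious("Jane", "Doe", "777-555-1234"))      # True: a movie-style 555 number
print(looks_fictitious("Maria", "Garcia", "201-867-5309"))  # False
```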

Numeric errors are much harder to catch, as numbers do not look wrong to human eyes. That is when comparison to other known sources becomes important. If such examination is not possible on a granular level, then median value and distribution curves should be checked against historical transaction data or known public data sources, such as U.S. Census Data in the case of demographic information.

When it’s about a company’s own data, follow your instincts and get rid of data that look too good or too bad to be true. We can all afford to lose a few records in our databases, and there is nothing wrong with deleting “outliers” with extreme values. An erroneous name, like “No Information,” may be attached to a seven-figure lifetime spending sum, and you know that can’t be right.

The main takeaways are: (1) Never trust the data just because someone bothered to store them in computers, and (2) Constantly look for bad data in reports and listings, at times using old-fashioned eyeballing methods. Computers do not know what is “bad” until we specifically tell them what bad data are. So, don’t give up, and keep at it. And if it’s about someone else’s data, insist on data tallies and data hygiene stats.

4. Recency
Outdated data are really bad for prediction or analysis, and that is a different kind of badness. Many call it a “Data Atrophy” issue, as no matter how fresh and accurate a data point may be today, it will surely deteriorate over time. Yes, data have a finite shelf-life, too. Let’s say that you obtained a piece of information called “Golf Interest” on an individual level. That information could be coming from a survey conducted a long time ago, or some golf equipment purchase data from a while ago. In any case, someone who is attached to that flag may have stopped shopping for new golf equipment, as he doesn’t play much anymore. Without a proper database update and a constant feed of fresh data, irrelevant data will continue to drive our decisions.

The crazy thing is that, the harder it is to obtain certain types of data—such as transaction or behavioral data—the faster they will deteriorate. By nature, transaction or behavioral data are time-sensitive. That is why it is important to install time parameters in databases for behavioral data. If someone purchased a new golf driver, when did he do that? Surely, having bought a golf driver in 2009 (“Hey, time for a new driver!”) is different from having purchased it last May.

So-called “Hot Line Names” literally cease to be hot after two to three months, or in some cases much sooner. The evaporation period may be different for different product types, as one may stay in the market longer for an automobile than for a new printer. Part of the job of a data scientist is to defer the expiration date of data, finding leads or prospects who are still “warm,” or even “lukewarm,” with available valid data. But no matter how much statistical work goes into making the data “look” fresh, eventually the models will cease to be effective.
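
One simple way to express that deferral is a decay score rather than a hard expiration date. This is a sketch with a hypothetical 90-day half-life; the right evaporation period differs by product type, as noted above.

```python
import math

# A sketch of deferring the "expiration date" of a behavioral flag with exponential
# decay; the 90-day half-life is a hypothetical parameter that varies by product type.
def warmth(days_since_event: int, half_life_days: float = 90.0) -> float:
    """1.0 means piping hot; values near 0.0 mean the flag has effectively expired."""
    return math.exp(-math.log(2) * days_since_event / half_life_days)

print(round(warmth(30), 2))   # ~0.79 -> still warm
print(round(warmth(365), 2))  # ~0.06 -> a golf driver bought last year is effectively cold
```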

For decision-makers who do not make real-time decisions, a real-time database update could be an expensive solution. But the databases must be updated constantly (I mean daily, weekly, monthly or even quarterly). Otherwise, someone will eventually end up making a wrong decision based on outdated data.

5. Consistency
No matter how much effort goes into keeping the database fresh, not all data variables will be updated or filled in consistently. And that is the reality. The interesting thing is that, especially in advanced analytics, we can still produce decent predictions if the data are consistent. It may sound crazy, but even not-so-accurate data can be used in predictive analytics, if they are “consistently” wrong. Modeling is about developing an algorithm that differentiates targets and non-targets, and if the descriptive variables are “consistently” off (or outdated, like census data from five years ago) on both sides, the model can still perform.

Conversely, if there is a huge influx of a new type of data, or any drastic change in data collection or in a business model that supports such data collection, all bets are off. We may end up predicting such changes in business models or in methodologies, not the differences in consumer behavior. And that is one of the worst kinds of errors in the predictive business.

Last month, I talked about dealing with missing data (refer to “Missing Data Can Be Meaningful“), and I mentioned that data can be inferred via various statistical techniques. And such data imputation is OK, as long as it returns consistent values. I have seen so many so-called professionals messing up popular models, like “Household Income,” from update to update. If the inferred values jump dramatically due to changes in the source data, there is no amount of effort that can save the targeting models that employed such variables, short of re-developing them.

That is why a time-series comparison of important variables in databases is so important. Any changes of more than 5 percent in distribution of variables when compared to the previous update should be investigated immediately. If you are dealing with external data vendors, insist on having a distribution report of key variables for every update. Consistency of data is more important in predictive analytics than sheer accuracy of data.
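
Here is a small sketch of that time-series check, reading the 5 percent rule as a shift of more than five percentage points in any value’s share of the variable between updates; the variable and the counts are made up.

```python
# A sketch of the time-series comparison: flag any value of a variable whose share
# shifts by more than five percentage points between updates. The counts are made up.
def drifted_categories(prev_counts: dict, curr_counts: dict, threshold: float = 0.05) -> list:
    prev_total = sum(prev_counts.values()) or 1
    curr_total = sum(curr_counts.values()) or 1
    flagged = []
    for category in set(prev_counts) | set(curr_counts):
        prev_share = prev_counts.get(category, 0) / prev_total
        curr_share = curr_counts.get(category, 0) / curr_total
        if abs(curr_share - prev_share) > threshold:
            flagged.append((category, round(prev_share, 3), round(curr_share, 3)))
    return flagged

prev = {"<50K": 400, "50-100K": 450, "100K+": 150}   # last update's Household Income counts
curr = {"<50K": 250, "50-100K": 500, "100K+": 250}   # this update's counts
print(drifted_categories(prev, curr))  # '<50K' and '100K+' shifted too much; investigate
```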

6. Connectivity
As I mentioned earlier, there are many types of data. And the predictive power of data multiplies as different types of data get to be used together. For instance, demographic data, which is quite commoditized, still plays an important role in predictive modeling, even when dominant predictors are behavioral data. It is partly because no one dataset is complete, and because different types of data play different roles in algorithms.

The trouble is that many modern datasets do not share any common matching keys. On the demographic side, we can easily imagine using PII (Personally Identifiable Information), such as name, address, phone number or email address for matching. Now, if we want to add some transaction data to the mix, we would need some match “key” (or a magic decoder ring) by which we can link it to the base records. Unfortunately, many modern databases completely lack PII, right from the data collection stage. The result is that such a data source would remain in a silo. It is not like all is lost in such a situation, as they can still be used for trend analysis. But to employ multisource data for one-to-one targeting, we really need to establish the connection among various data worlds.

Even if the connection cannot be made to household, individual or email levels, I would not give up entirely, as we can still target based on IP addresses, which may lead us to some geographic denominations, such as ZIP codes. I’d take ZIP-level targeting anytime over no targeting at all, even though there are many analytical and summarization steps required for that (more on that subject in future articles).

Not having PII or any hard matchkey is not a complete deal-breaker, but the maneuvering space for analysts and marketers decreases significantly without it. That is why the existence of PII, or even ZIP codes, is the first thing that I check when looking into a new data source. I would like to free them from isolation.

7. Delivery Mechanisms
Users judge databases based on the visualization or reporting tool sets that are attached to them. As I mentioned earlier, that is like judging an entire building based just on the window treatments. But for many users, that is the reality. After all, how would a casual user without a programming or statistical background even “see” the data? Through tool sets, of course.

But that is only one end of it. There are so many types of platforms and devices, and the data must flow through them all. The important point is that data is useless if it is not in the hands of decision-makers, through the device of their choice, at the right time. Such flow can be actualized via API feeds, FTP, or good, old-fashioned batch installments, and no database should stay too far away from the decision-makers. In my earlier column, I emphasized that data players must be good at (1) Collection, (2) Refinement, and (3) Delivery (refer to “Big Data is Like Mining Gold for a Watch—Gold Can’t Tell Time”). Delivering the answers to inquirers properly closes one iteration of the information flow. And the answers must continue to flow to the users.

8. User-Friendliness
Even when state-of-the-art (I apologize for using this cliché) visualization, reporting or drill-down tool sets are attached to the database, if the data variables are too complicated or not intuitive, users will get frustrated and eventually move away from it. If that happens after pouring a sick amount of money into a data initiative, that would be a shame. But it happens all the time. In fact, I am not going to name names here, but I once saw a ridiculously hard-to-understand data dictionary from a major data broker in the U.S.; it looked like the data layout was designed for robots, by robots. Please. Data scientists must try to humanize the data.

This whole Big Data movement has momentum now, and in the interest of not killing it, data players must make every aspect of this data business easier for the users, not harder. Simpler data fields, intuitive variable names, meaningful value sets, pre-packaged variables in the form of answers, and a complete data dictionary are not too much to ask after the hard work of developing and maintaining the database.

This is why I insist that data scientists and professionals must be businesspeople first. The developers should never forget that end-users are not trained data experts. And guess what? Even professional analysts would appreciate intuitive variable sets and complete data dictionaries. So, pretty please, with sugar on top, make things easy and simple.

9. Cost
I saved this important item for last for a good reason. Yes, the dollar sign is a very important factor in all business decisions, but it should not be the sole deciding factor when it comes to databases. That means CFOs should not dictate the decisions regarding data or databases without considering the input from CMOs, CTOs, CIOs or CDOs who should be, in turn, concerned about all the other criteria listed in this article.

Playing with the data costs money. And, at times, a lot of money. When you add up all the costs for hardware, software, platforms, tool sets, maintenance and, most importantly, the man-hours for database development and maintenance, the sum becomes very large very fast, even in the age of the open-source environment and cloud computing. That is why many companies outsource the database work to share the financial burden of having to create infrastructures. But even in that case, the quality of the database should be evaluated based on all criteria, not just the price tag. In other words, don’t just pick the lowest bidder and hope to God that it will be alright.

When you purchase external data, you can also apply these evaluation criteria. A test-match job with a data vendor will reveal many of the details listed here; metrics such as match rate and variable fill-rate, along with the completeness of the data dictionary, should be carefully examined. In short, what good is a lower unit price per 1,000 records if the match rate is horrendous and even the matched data are filled with missing or sub-par inferred values? Also consider that, once you commit to an external vendor and start building models and an analytical framework around their data, it becomes very difficult to switch vendors later on.

When shopping for external data, consider the following when it comes to pricing options:

  • Number of variables to be acquired: Don’t just go for the full option. Pick the ones that you need (involve analysts), unless you get a fantastic deal for an all-inclusive option. Generally, most vendors provide multiple-packaging options.
  • Number of records: Processed vs. Matched. Some vendors charge based on “processed” records, not just matched records. Depending on the match rate, it can make a big difference in total cost (see the quick arithmetic sketch after this list).
  • Installment/update frequency: Real-time, weekly, monthly, quarterly, etc. Think carefully about how often you would need to refresh “demographic” data, which doesn’t change as rapidly as transaction data, and how big the incremental universe would be for each update. Obviously, a real-time API feed can be costly.
  • Delivery method: API vs. Batch Delivery, for example. Price, as well as the data menu, changes quite a bit based on the delivery options.
  • Availability of a full-licensing option: When the internal database becomes really big, full installment becomes a good option. But you would need the internal capability for a match-and-append process that involves “soft-match,” using similar names and addresses (imagine good old name-and-address merge routines). It becomes a bit of a commitment, as the match-and-append becomes part of the internal database update process.
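
And here is the quick arithmetic sketch promised above for the “processed vs. matched” pricing point; the unit prices and the match rate are hypothetical, but they show how a lower per-thousand rate can still cost more overall.

```python
# A quick arithmetic sketch of "processed vs. matched" pricing; the unit prices and
# the match rate are hypothetical.
records_sent = 1_000_000
match_rate = 0.55
price_per_k_processed = 12.00  # dollars per 1,000 records processed
price_per_k_matched = 18.00    # dollars per 1,000 records matched

cost_if_processed = records_sent / 1000 * price_per_k_processed
cost_if_matched = records_sent * match_rate / 1000 * price_per_k_matched

print(f"Charged on processed records: ${cost_if_processed:,.0f}")  # $12,000
print(f"Charged on matched records:   ${cost_if_matched:,.0f}")    # $9,900
```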

Business First
Evaluating a database is a project in itself, and these nine evaluation criteria will be a good guideline. Depending on the business, of course, more conditions could be added to the list. And that is the final point, one I did not even include in the list: the database (or all data, for that matter) should be useful in meeting the business goals.

I have been saying that “Big Data Must Get Smaller,” and this whole Big Data movement should be about (1) Cutting down on the noise, and (2) Providing answers to decision-makers. If the data sources in question do not serve the business goals, cut them out of the plan, or cut loose the vendor if they are from external sources. It would be an easy decision if you “know” that the database in question is filled with dirty, sporadic and outdated data that cost lots of money to maintain.

But if that database is needed for your business to grow, clean it, update it, expand it and restructure it to harness better answers from it. Just like the way you’d maintain your cherished automobile to get more mileage out of it. Not all databases are created equal for sure, and some are definitely more equal than others. You just have to open your eyes to see the differences.

5 Steps to Customer Data Hygiene: It’s Not Sexy, But It’s Essential

Are you happy with the quality of the information in your marketing database? Probably not. A new report from NetProspex confirms: 64 percent of company records in the database of a typical B-to-B marketer have no phone number attached.

Pretty much eliminates phone as a reliable communications medium, doesn’t it?

And 88 percent are missing basic firmographic data, like industry, revenue or employee size—so profiling and segmentation are pretty tough. In fact, the NetProspex report concluded that 84 percent of B-to-B marketing databases are “barely functional.” Yipes. So, what can you do about it?

This is not a new problem. Dun & Bradstreet reports regularly on how quickly B-to-B data degrades. Get this: Every year, in the U.S., business postal addresses change at a rate of 20.7 percent. If your customer is a new business, the rate is 27.3 percent. Phone numbers change at a rate of 18 percent, and 22.7 percent among new businesses. Even company names fluctuate: 12.4 percent overall, and a staggering 36.4 percent among new businesses.
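
To see how quickly that compounds, here is a back-of-the-envelope sketch. It assumes the annual change rate simply applies year after year, which is a simplification, but the direction is clear.

```python
# A back-of-the-envelope sketch of how those change rates compound; it assumes the
# annual rate simply applies year after year, which is a simplification.
def share_still_current(annual_change_rate: float, years: int) -> float:
    return (1 - annual_change_rate) ** years

print(round(share_still_current(0.207, 1), 2))  # ~0.79 of postal addresses still current after 1 year
print(round(share_still_current(0.207, 3), 2))  # ~0.50 after 3 years with no hygiene effort
```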

No wonder your sales force is always complaining that your data is no good (although they probably use more colorful words).

Here are five steps you can take to maintain data accuracy, a process known as “data hygiene.”

1. Key enter the data correctly in the first place.
Sounds obvious, but it’s often overlooked. This means following address guidelines from the Postal Service (for example, USPS Publication 28), and standardizing such complex things as job functions and company names. But it also means training for your key-entry personnel. These folks are often at the bottom of the status heap, but they are handling one of your most important corporate assets. So give them the respect they deserve.

2. Harness customer-facing personnel to update the data.
Leverage the access of customer-facing personnel to refresh contact information. Train and motivate call center personnel, customer service, salespeople and distributors—anyone with direct customer contact—to request updated information at each meeting. When it comes to sales people, this is an entirely debatable matter. You want sales people selling, not entering data. But it’s worth at least a conversation to see if you can come up with a painless way to extract fresh contact updates as sales people interact with their accounts.

3. Use data-cleansing software, internally or from a service provider, and delete obsolete records.
Use the software tools that are available, which will de-duplicate, standardize and sometimes append missing fields. These won’t correct much—it’s mostly email and postal address standardization—but they will save you time, and they are much cheaper than other methods.

4. Allow customers access to their records online, so they can make changes.
Consider setting up a customer preference center, where customers can manage the data you have on them, and indicate how they want to hear from you. Offer a premium or incentive, or even a discount, to obtain higher levels of compliance.

5. Outbound phone or email to verify, especially to top customers.
Segment your file, and conduct outbound confirmation campaigns for the highest value accounts. This can be by mail, email or telephone, and done annually. When you have some results, decide whether to put your less valuable accounts through the same process.

Do you have any favorite hygiene techniques to add to my list?

A version of this article appeared in Biznology, the digital marketing blog.

4 Predictions for B-to-B Marketing in 2013

It’s that time of the year when observers can’t resist making predictions about developments on the horizon. I hereby take up that tradition, offering up four random prognostications for where B-to-B digital marketing is headed in 2013. My topics include Facebook, content marketing, personal branding and data hygiene—certainly an eclectic mix. I encourage readers to add their own.

Facebook Is Ready, At Last, for the B-to-B Prime Time
It took a while, but Facebook (FB) marketing is now ready for mainstream B-to-B, in support of branding, lead generation and customer relationship marketing goals for enterprises of all sizes. There are several reasons for this—FB’s universality being one of them. But the critical driver is the recent arrival of the Facebook Exchange (FBX) ad platform, which will allow banner ad bidding and retargeting to specific individuals, based on data matching.

So, while I used to argue that Facebook should be at the bottom of a B-to-B marketer’s to-do list, I am revising my view for 2013. Talking to my pals at Edmund Optics, where I serve on the board of directors, I am hearing confirmation of these developments. Edmund’s target audience is optical engineers and others interested in science and technology. Years ago, I would have advised them to ignore FB and focus on more targeted social networks.

But now, EO has turned its Facebook page into an effective environment for engaging these guys, with weekly “Geeky Friday” offers, and the enormously popular Zombie Apocalypse Survival Guide at Halloween, where engineers were invited to design zombie-blasting tools using Edmund products. Facebook is now a top referring source for EO’s website, up 60 percent from last year. I stand corrected.

More and Better Content
B-to-B marketers were early to the content marketing game. In fact, I would argue that B-to-B has been a leading force in this area, in recognition of the importance of prospect education and thought leadership in the complex selling process. B-to-B marketers will continue to excel at creating valuable materials—digital, paper-based, video, you name it—to attract prospects and deepen relationships.

How do I know this? A new study from the Content Marketing Institute and MarketingProfs, which says that 54 percent of B-to-B marketers plan to increase their content marketing budgets in 2013. Their biggest content challenge for next year? Ironically, it’s producing enough content.

Personal Branding as a Way of Life
Business people and consumers alike are realizing that their online personas have a growing impact on both their everyday lives and their professional careers. Rather than letting their personal brands evolve organically, individuals will make more proactive efforts to build and manage their images online, benefiting from the guidance of an emerging community of personal brand experts like William Arruda and Kirsten Dixson. This means establishing unique brand positioning and developing active, consistent messaging across Internet media, especially social networks, to explain who they are and what their capabilities are. Personal branding is no longer just for celebrities or the self-employed; with the rise of social media, it is for everyone.

Renewed Interest in Data Hygiene
Whenever I give a seminar on B-to-B marketing, I ask attendees to take out their business cards and look at them carefully. Then, I say, “Raise your hand if anything on the card is new in the last 12 months.” Invariably, 30 percent of the hands go up.

The high rate of change in B-to-B—whether moving to a different company, a new title, even a new mail stop—is obvious. But only recently has it begun to sink in that addressing people incorrectly, or campaigning with undeliverable mail or email addresses, not only wastes marketing dollars, but also means lost business opportunity. So enough about big data. The focus in 2013 will be clean data.

And if you want some tips on how to keep your B-to-B data clean, have a look at my white paper: “Our Data is a Mess! How to Clean Up Your Marketing Database.”

So, those are my predictions. I hope readers will add some of their own. What do you think we’ll be seeing in B-to-B digital marketing in 2013?

A version of this post appeared in Biznology, the digital marketing blog.