Not All Databases Are Created Equal

Not all databases are created equal. No kidding. That is like saying that not all cars are the same, or not all buildings are the same. But somehow, “judging” databases isn’t so easy. First off, there is no tangible “tire” that you can kick when evaluating databases or data sources. Actually, kicking the tire is quite useless, even when you are inspecting an automobile. Can you really gauge the car’s handling, balance, fuel efficiency, comfort, speed, capacity or reliability based on how it feels when you kick “one” of the tires? I can guarantee that your toes will hurt if you kick it hard enough, and even then you won’t be able to tell the tire pressure within 20 psi. If you really want to evaluate an automobile, you will have to sign some papers and take it out for a spin (well, more than one spin, but you know what I mean). Then, how do we take a database out for a spin? That’s when the tool sets come into play.

However, even when the database in question is attached to analytical, visualization, CRM or drill-down tools, it is not so easy to evaluate it completely, as such practice reveals only a few aspects of a database, hardly all of them. That is because such tools are like window treatments of a building, through which you may look into the database. Imagine a building inspector inspecting a building without ever entering it. Would you respect the opinion of the inspector who just parks his car outside the building, looks into the building through one or two windows, and says, “Hey, we’re good to go”? No way, no sir. No one should judge a book by its cover.

In the age of Big Data (you should know by now that I am not too fond of that word), everything digitized is considered data. And data reside in databases. And databases are supposed to be designed to serve specific purposes, just like buildings and cars are. Although many modern databases are just mindless piles of accumulated data, granted a decent and functional database design, we can still imagine many different types of databases, depending on their purposes and contents.

Now, most of the Big Data discussions these days are about the platform, environment, or tool sets. I’m sure you heard or read enough about those, so let me boldly skip all that and their related techie words, such as Hadoop, MongoDB, Pig, Python, MapReduce, Java, SQL, PHP, C++, SAS or anything related to that elusive “cloud.” Instead, allow me to show you the way to evaluate databases—or data sources—from a business point of view.

For businesspeople and decision-makers, it is not about NoSQL vs. RDB; it is just about the usefulness of the data. And the usefulness comes from the overall content and database management practices, not just platforms, tool sets and buzzwords. Yes, tool sets are important, but concert-goers do not care much about the types and brands of musical instruments that are being used; they just care if the music is entertaining or not. Would you be impressed with a mediocre guitarist just because he uses the same brand of guitar that his guitar hero uses? Nope. Likewise, the usefulness of a database is not about the tool sets.

In my past column, titled “Big Data Must Get Smaller,” I explained that there are three major types of data, with which marketers can holistically describe their target audience: (1) Descriptive Data, (2) Transaction/Behavioral Data, and (3) Attitudinal Data. In short, if you have access to all three dimensions of the data spectrum, you will have a more complete portrait of customers and prospects. Because I already went through that subject in-depth, let me just say that such types of data are not the basis of database evaluation here, though the contents should be on top of the checklist to meet business objectives.

In addition, throughout this series, I have been repeatedly emphasizing that the database and analytics management philosophy must originate from business goals. Basically, the business objective must dictate the course for analytics, and databases must be designed and optimized to support such analytical activities. Decision-makers—and all involved parties, for that matter—suffer a great deal when that hierarchy is reversed. And unfortunately, that is the case in many organizations today. Therefore, let me emphasize that the evaluation criteria that I am about to introduce here are all about usefulness for decision-making processes and supporting analytical activities, including predictive analytics.

Let’s start digging into key evaluation criteria for databases. This list would be quite useful when examining internal and external data sources. Even databases managed by professional compilers can be examined through these criteria. The checklist could also be applicable to investors who are about to acquire a company with data assets (as in, “Kick the tire before you buy it.”).

1. Depth
Let’s start with the most obvious one. What kind of information is stored and maintained in the database? What are the dominant data variables in the database, and what is so unique about them? Variety of information matters for sure, and uniqueness is often related to specific business purposes for which databases are designed and created, along the lines of business data, international data, specific types of behavioral data like mobile data, categorical purchase data, lifestyle data, survey data, movement data, etc. Then again, mindless compilation of random data may not be useful for any business, regardless of the size.

Generally, data dictionaries (the lack of one is a sure sign of trouble) reveal the depth of the database, but we need to dig deeper, as transaction and behavioral data are much more potent predictors and harder to manage in comparison to demographic and firmographic data, which are very much commoditized already. Likewise, lifestyle variables that are derived from surveys that may have been conducted a long time ago are far less valuable than actual purchase history data, as what people say they do and what they actually do are two completely different things. (For more details on the types of data, refer to the second half of “Big Data Must Get Smaller.”)

Innovative ideas should not be overlooked, as data packaging is often very important in the age of information overflow. If someone or some company transformed many data points into user-friendly formats using modeling or other statistical techniques (imagine pre-developed categorical models targeting a variety of human behaviors, or pre-packaged segmentation or clustering tools), such effort deserves extra points, for sure. As I emphasized numerous times in this series, data must be refined to provide answers to decision-makers. That is why the sheer size of the database isn’t so impressive, and the depth of the database is not just about the length of the variable list and the number of bytes that go along with it. So, data collectors, impress us—because we’ve seen a lot.

2. Width
No matter how deep the information goes, if the coverage is not wide enough, the database becomes useless. Imagine well-organized, buyer-level POS (Point of Sale) data coming from actual stores in “real-time” (though I am sick of this word, as it is also overused). The data go down to SKU-level details and payment methods. Now imagine that the data in question are collected in only two stores—one in Michigan, and the other in Delaware. This, by the way, is not a completely made-up story, and I faced similar cases in the past. Needless to say, we had to make many assumptions that we didn’t want to make in order to make the data useful, somehow. And I must say that it was far from ideal.

Even in the age when data are collected everywhere by every device, no dataset is ever complete (refer to “Missing Data Can Be Meaningful”). The limitations are everywhere. It could be about brand, business footprint, consumer privacy, data ownership, collection methods, technical limitations, distribution of collection devices, and the list goes on. Yes, Apple Pay is making a big splash in the news these days. But would you believe that data collected only through Apple iPhones can really show the overall consumer trend in the country? Maybe in the future, but not yet. If you could pick only one credit card type to analyze, such as American Express, would you think that the result of the study is free from any bias? No siree. We can easily assume that such analysis would skew toward the more affluent population. I am not saying that such analyses are useless. In fact, they can be quite useful if we understand the limitations of data collection and the nature of the bias. But the point is that the coverage matters.

Further, even within multisource databases in the market, the coverage should be examined variable by variable, simply because some data points are really difficult to obtain even by professional data compilers. For example, any information that crosses between the business and the consumer world is sparsely populated in many cases, and the “occupation” variable remains mostly blank or unknown on the consumer side. Similarly, any data related to young children is difficult or even forbidden to collect, so a seemingly simple variable, such as “number of children,” is left unknown for many households. Automobile data used to be abundant on a household level in the past, but a series of laws made sure that access to such data is forbidden for many users. Again, don’t be impressed with the existence of some variables in the data menu, but look into them to see “how much” is available.

3. Accuracy
In any scientific analysis, a “false positive” is a dangerous enemy. In fact, it is worse than not having the information at all. Many folks just assume that any data coming out of a computer are accurate (as in, “Hey, the computer says so!”). But data are not completely free from human errors.

Sheer accuracy of information is hard to measure, especially when the data sources are unique and rare. And errors can happen at any stage, from data collection to imputation. If there are other known sources, comparing data from multiple sources is one way to ensure accuracy. Watching out for fluctuations in distributions of important variables from update to update is another good practice.

Nonetheless, the overall quality of the data is not just up to the person or department who manages the database. Yes, in this business, the last person who touches the data is responsible for all the mistakes that were made to it up to that point. However, when the garbage goes in, the garbage comes out. So, when there are errors, everyone who touched the database at any point must share in the burden of guilt.

Recently, I was part of a project that involved data collected from retail stores. We ran all kinds of reports and tallies to check the data, and edited many data values out when we encountered obvious errors. The funniest one that I saw was the first name “Asian” and the last name “Tourist.” As an openly Asian-American person, I was semi-glad that they didn’t put in “Oriental Tourist” (though I still can’t figure out who decided that word is for objects, but not people). We also found names like “No info” or “Not given.” Heck, I saw in the news that a refugee from Afghanistan (he was a translator for the U.S. troops) obtained a new first name when he was granted an entry visa: “Fnu,” short for “First Name Unknown,” right there in his new passport. Welcome to America, Fnu. Compared to that, “Andolini” becoming “Corleone” on Ellis Island is almost cute.

Data entry errors are everywhere. When I used to deal with data files from banks, I found that many last names were “Ira.” Well, it turned out that it wasn’t really the customers’ last names, but they all happened to have opened “IRA” accounts. Similarly, movie phone numbers like 777-555-1234 are very common. And fictitious names, such as “Mickey Mouse,” or profanities that are not fit to print are abundant, as well. At least fake email addresses can be tested and eliminated easily, and erroneous addresses can be corrected by time-tested routines, too. So, yes, maintaining a clean database is not so easy when people freely enter whatever they feel like. But it is not an impossible task, either.

We can also train employees regarding data entry principles, to a certain degree. (As in, “Do not enter your own email address,” “Do not use bad words,” etc.). But what about user-generated data? Search and kill is the only way to do it, and the job would never end. And the meta-table for fictitious names would grow longer and longer. Maybe we should just add “Thor” and “Sponge Bob” to that Mickey Mouse list, while we’re at it. Yet, dealing with this type of “text” data is the easy part. If the database manager in charge is not lazy, and if there is a bit of a budget allowed for data hygiene routines, one can avoid sending emails to “Dear Asian Tourist.”
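To make that “search and kill” idea a bit more concrete, here is a minimal sketch of such a name-hygiene routine in Python; the column names and the fictitious-name meta-table are purely illustrative assumptions, not anyone’s actual production list.

```python
import pandas as pd

# Meta-table of known fictitious names; in practice this list keeps growing.
FAKE_NAMES = {"mickey mouse", "thor", "sponge bob", "asian tourist",
              "no info", "not given", "test test"}

def flag_suspect_names(df: pd.DataFrame) -> pd.DataFrame:
    """Add a 'suspect_name' flag for records whose full name matches the meta-table."""
    full_name = (df["first_name"].fillna("") + " " + df["last_name"].fillna(""))
    full_name = full_name.str.strip().str.lower()
    out = df.copy()
    out["suspect_name"] = full_name.isin(FAKE_NAMES)
    return out

# Usage: review the flagged records before deleting them outright.
customers = pd.DataFrame({
    "first_name": ["Asian", "Jane", "Mickey"],
    "last_name": ["Tourist", "Smith", "Mouse"],
})
print(flag_suspect_names(customers))
```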

Numeric errors are much harder to catch, as numbers do not look wrong to human eyes. That is when comparison to other known sources becomes important. If such examination is not possible on a granular level, then median value and distribution curves should be checked against historical transaction data or known public data sources, such as U.S. Census Data in the case of demographic information.

When it comes to your company’s own data, follow your instincts and get rid of data that look too good or too bad to be true. We all can afford to lose a few records in our databases, and there is nothing wrong with deleting the “outliers” with extreme values. Erroneous names, like “No Information,” may be attached to a seven-figure lifetime spending sum, and you know that can’t be right.

The main takeaways are: (1) Never trust the data just because someone bothered to store them in computers, and (2) Constantly look for bad data in reports and listings, at times using old-fashioned eyeballing methods. Computers do not know what is “bad” until we specifically tell them what bad data are. So, don’t give up, and keep at it. And if it’s about someone else’s data, insist on data tallies and data hygiene stats.

4. Recency
Outdated data are really bad for prediction or analysis, and that is a different kind of badness. Many call it a “Data Atrophy” issue, as no matter how fresh and accurate a data point may be today, it will surely deteriorate over time. Yes, data have a finite shelf-life, too. Let’s say that you obtained a piece of information called “Golf Interest” on an individual level. That information could be coming from a survey conducted a long time ago, or some golf equipment purchase data from a while ago. In any case, someone who is attached to that flag may have stopped shopping for new golf equipment, as he doesn’t play much anymore. Without a proper database update and a constant feed of fresh data, irrelevant data will continue to drive our decisions.

The crazy thing is that, the harder it is to obtain certain types of data—such as transaction or behavioral data—the faster they will deteriorate. By nature, transaction or behavioral data are time-sensitive. That is why it is important to install time parameters in databases for behavioral data. If someone purchased a new golf driver, when did he do that? Surely, having bought a golf driver in 2009 (“Hey, time for a new driver!”) is different from having purchased it last May.
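As a concrete illustration of such time parameters, here is a minimal sketch, with assumed field names and cutoffs, of how a raw purchase date could be turned into recency buckets, so that a driver bought in 2009 and one bought last May are no longer treated the same.

```python
# Minimal sketch: converting a raw purchase date into recency buckets,
# so "bought a golf driver" carries a time dimension. Cutoffs are assumptions.
from datetime import date

def months_since(purchase_date: date, as_of: date) -> int:
    return (as_of.year - purchase_date.year) * 12 + (as_of.month - purchase_date.month)

def recency_bucket(purchase_date: date, as_of: date) -> str:
    m = months_since(purchase_date, as_of)
    if m <= 3:
        return "hot"       # recent enough to still be "in the market"
    if m <= 12:
        return "warm"
    return "lapsed"

print(recency_bucket(date(2009, 6, 1), date(2014, 10, 1)))   # 'lapsed'
print(recency_bucket(date(2014, 5, 15), date(2014, 10, 1)))  # 'warm'
```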

So-called “Hot Line Names” literally cease to be hot after two to three months, or in some cases much sooner. The evaporation period may be different for different product types, as one may stay in the market longer for an automobile than for a new printer. Part of the job of a data scientist is to defer the expiration date of data, finding leads or prospects who are still “warm,” or even “lukewarm,” with available valid data. But no matter how much statistical work goes into making the data “look” fresh, eventually the models will cease to be effective.

For decision-makers who do not make real-time decisions, a real-time database update could be an expensive solution. But databases must still be updated regularly (I mean daily, weekly, monthly or even quarterly). Otherwise, someone will eventually end up making a wrong decision based on outdated data.

5. Consistency
No matter how much effort goes into keeping the database fresh, not all data variables will be updated or filled in consistently. And that is the reality. The interesting thing is that, especially when using them for advanced analytics, we can still provide decent predictions if the data are consistent. It may sound crazy, but even not-so-accurate data can be used in predictive analytics, if they are “consistently” wrong. Modeling is about developing an algorithm that differentiates targets from non-targets, and if the descriptive variables are “consistently” off (or outdated, like census data from five years ago) on both sides, the model can still perform.

Conversely, if there is a huge influx of a new type of data, or any drastic change in data collection or in a business model that supports such data collection, all bets are off. We may end up predicting such changes in business models or in methodologies, not the differences in consumer behavior. And that is one of the worst kinds of errors in the predictive business.

Last month, I talked about dealing with missing data (refer to “Missing Data Can Be Meaningful”), and I mentioned that data can be inferred via various statistical techniques. And such data imputation is OK, as long as it returns consistent values. I have seen so many so-called professionals messing up popular models, like “Household Income,” from update to update. If the inferred values jump dramatically due to changes in the source data, there is no amount of effort that can save the targeting models that employed such variables, short of re-developing them.

That is why a time-series comparison of important variables in databases is so important. Any changes of more than 5 percent in distribution of variables when compared to the previous update should be investigated immediately. If you are dealing with external data vendors, insist on having a distribution report of key variables for every update. Consistency of data is more important in predictive analytics than sheer accuracy of data.
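For illustration, here is a minimal sketch of such a time-series distribution check in Python; the sample values, column handling and the 5 percent threshold (interpreted here as percentage points of share) are assumptions for the example.

```python
# Minimal sketch of an update-to-update distribution check for a key variable.
# A shift of more than 5 percentage points in any category is flagged for review.
import pandas as pd

def distribution_shift(prev: pd.Series, curr: pd.Series, threshold: float = 0.05) -> pd.DataFrame:
    """Compare category shares of a variable between two database updates."""
    prev_share = prev.value_counts(normalize=True, dropna=False)
    curr_share = curr.value_counts(normalize=True, dropna=False)
    report = pd.DataFrame({"previous": prev_share, "current": curr_share}).fillna(0.0)
    report["change"] = report["current"] - report["previous"]
    report["investigate"] = report["change"].abs() > threshold
    return report.sort_values("change")

# Usage: run this for every key variable (e.g., inferred income bands)
# after each update, and investigate anything flagged.
prev_update = pd.Series(["A", "A", "B", "C", "C", "C"])
curr_update = pd.Series(["A", "B", "B", "B", "C", "C"])
print(distribution_shift(prev_update, curr_update))
```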

6. Connectivity
As I mentioned earlier, there are many types of data. And the predictive power of data multiplies as different types of data get to be used together. For instance, demographic data, which is quite commoditized, still plays an important role in predictive modeling, even when dominant predictors are behavioral data. It is partly because no one dataset is complete, and because different types of data play different roles in algorithms.

The trouble is that many modern datasets do not share any common matching keys. On the demographic side, we can easily imagine using PII (Personally Identifiable Information), such as name, address, phone number or email address for matching. Now, if we want to add some transaction data to the mix, we would need some match “key” (or a magic decoder ring) by which we can link it to the base records. Unfortunately, many modern databases completely lack PII, right from the data collection stage. The result is that such a data source would remain in a silo. It is not like all is lost in such a situation, as they can still be used for trend analysis. But to employ multisource data for one-to-one targeting, we really need to establish the connection among various data worlds.

Even if the connection cannot be made to household, individual or email levels, I would not give up entirely, as we can still target based on IP addresses, which may lead us to some geographic denominations, such as ZIP codes. I’d take ZIP-level targeting anytime over no targeting at all, even though there are many analytical and summarization steps required for that (more on that subject in future articles).
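As a rough illustration of those summarization steps, here is a minimal sketch that rolls a siloed, PII-free source up to the ZIP level for targeting; the field names are assumptions, and the ZIP codes are presumed to have come from a separate IP-geolocation step that is outside the scope of this sketch.

```python
# Minimal sketch: summarizing event-level, PII-free data into a ZIP-level
# targeting index (100 = average). Column names are illustrative assumptions.
import pandas as pd

def zip_level_index(events: pd.DataFrame) -> pd.DataFrame:
    """Summarize event-level data into a ZIP-level targeting index."""
    summary = events.groupby("zip").agg(
        events=("event_id", "count"),
        buyers=("purchased", "sum"),
    )
    summary["conversion_rate"] = summary["buyers"] / summary["events"]
    overall = summary["buyers"].sum() / summary["events"].sum()
    summary["index"] = (summary["conversion_rate"] / overall * 100).round(0)
    return summary.sort_values("index", ascending=False)

events = pd.DataFrame({
    "event_id": range(6),
    "zip": ["10001", "10001", "19901", "19901", "19901", "48201"],
    "purchased": [1, 0, 1, 1, 0, 0],
})
print(zip_level_index(events))
```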

Not having PII or any hard matchkey is not a complete deal-breaker, but the maneuvering space for analysts and marketers decreases significantly without it. That is why the existence of PII, or even ZIP codes, is the first thing that I check when looking into a new data source. I would like to free them from isolation.

7. Delivery Mechanisms
Users judge databases based on the visualization or reporting tool sets attached to them. As I mentioned earlier, that is like judging the entire building based just on the window treatments. But for many users, that is the reality. After all, how would a casual user without a programming or statistical background even “see” the data? Through tool sets, of course.

But that is only one end of it. There are so many types of platforms and devices, and the data must flow through them all. The important point is that data is useless if it is not in the hands of decision-makers, through the device of their choice, at the right time. Such flow can be actualized via API feeds, FTP, or good, old-fashioned batch installments, and no database should stay too far away from the decision-makers. In my earlier column, I emphasized that data players must be good at (1) Collection, (2) Refinement, and (3) Delivery (refer to “Big Data is Like Mining Gold for a Watch—Gold Can’t Tell Time”). Delivering the answers to inquirers properly closes one iteration of the information flow. And the answers must continue to flow to the users.

8. User-Friendliness
Even when state-of-the-art (I apologize for using this cliché) visualization, reporting or drill-down tool sets are attached to the database, if the data variables are too complicated or not intuitive, users will get frustrated and eventually move away from the database. If that happens after pouring a sick amount of money into a data initiative, that would be a shame. But it happens all the time. I am not going to name names here, but I once saw a ridiculously hard-to-understand data dictionary from a major data broker in the U.S.; it looked like the data layout was designed for robots, by robots. Please. Data scientists must try to humanize the data.

This whole Big Data movement has a momentum now, and in the interest of not killing it, data players must make every aspect of this data business easy for the users, not harder. Simpler data fields, intuitive variable names, meaningful value sets, pre-packaged variables in forms of answers, and completeness of a data dictionary are not too much to ask after the hard work of developing and maintaining the database.

This is why I insist that data scientists and professionals must be businesspeople first. The developers should never forget that end-users are not trained data experts. And guess what? Even professional analysts would appreciate intuitive variable sets and complete data dictionaries. So, pretty please, with sugar on top, make things easy and simple.

9. Cost
I saved this important item for last for a good reason. Yes, the dollar sign is a very important factor in all business decisions, but it should not be the sole deciding factor when it comes to databases. That means CFOs should not dictate the decisions regarding data or databases without considering the input from CMOs, CTOs, CIOs or CDOs who should be, in turn, concerned about all the other criteria listed in this article.

Playing with the data costs money. And, at times, a lot of money. When you add up all the costs for hardware, software, platforms, tool sets, maintenance and, most importantly, the man-hours for database development and maintenance, the sum becomes very large very fast, even in the age of the open-source environment and cloud computing. That is why many companies outsource the database work to share the financial burden of having to create infrastructures. But even in that case, the quality of the database should be evaluated based on all criteria, not just the price tag. In other words, don’t just pick the lowest bidder and hope to God that it will be alright.

When you purchase external data, you can also apply these evaluation criteria. A test-match job with a data vendor will reveal many of the details listed here; metrics such as match rate and variable fill rate, along with the completeness of the data dictionary, should be carefully examined. In short, what good is a lower unit price per 1,000 records if the match rate is horrendous and even the matched data are filled with missing or sub-par inferred values? Also consider that, once you commit to an external vendor and start building models and analytical frameworks around its data, it becomes very difficult to switch vendors later on.
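For illustration, here is a minimal sketch of how those two test-match metrics, match rate and variable fill rate, could be computed from a vendor’s returned file; the column names and figures are hypothetical.

```python
# Minimal sketch: match rate and per-variable fill rates from a vendor test match.
# Column names and counts are hypothetical.
import pandas as pd

def test_match_metrics(returned: pd.DataFrame, records_sent: int) -> dict:
    """Summarize a vendor test match: overall match rate and per-variable fill rates."""
    match_rate = len(returned) / records_sent
    fill_rates = returned.notna().mean().to_dict()  # share of matched records with a value
    return {"match_rate": round(match_rate, 3),
            "fill_rates": {k: round(v, 3) for k, v in fill_rates.items()}}

# Usage with a tiny illustrative file: 5 records sent, 3 came back matched.
appended = pd.DataFrame({
    "household_income": [55000, None, 72000],
    "number_of_children": [None, None, 2],
})
print(test_match_metrics(appended, records_sent=5))
```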

When shopping for external data, consider the following when it comes to pricing options:

  • Number of variables to be acquired: Don’t just go for the full option. Pick the ones that you need (involve analysts), unless you get a fantastic deal for an all-inclusive option. Generally, most vendors provide multiple packaging options.
  • Number of records: Processed vs. Matched. Some vendors charge based on “processed” records, not just matched records. Depending on the match rate, it can make a big difference in total cost (see the back-of-the-envelope comparison after this list).
  • Installment/update frequency: Real-time, weekly, monthly, quarterly, etc. Think carefully about how often you would need to refresh “demographic” data, which doesn’t change as rapidly as transaction data, and how big the incremental universe would be for each update. Obviously, a real-time API feed can be costly.
  • Delivery method: API vs. Batch Delivery, for example. Price, as well as the data menu, changes quite a bit based on the delivery options.
  • Availability of a full-licensing option: When the internal database becomes really big, full installment becomes a good option. But you would need the internal capability for a match-and-append process that involves “soft-match,” using similar names and addresses (imagine good old name-and-address merge routines). It becomes a bit of a commitment, as the match-and-append becomes a part of the internal database update process.
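And here is the back-of-the-envelope comparison referenced in the list above, showing how “per processed record” and “per matched record” pricing can flip depending on the match rate; all prices and rates are hypothetical.

```python
# Back-of-the-envelope comparison of "per processed record" vs. "per matched
# record" pricing. All figures are hypothetical.

records_sent = 1_000_000
match_rate = 0.55                    # 55% of records come back matched

price_per_m_processed = 12.00        # $ per 1,000 processed records
price_per_m_matched = 20.00          # $ per 1,000 matched records

cost_processed_basis = records_sent / 1_000 * price_per_m_processed
cost_matched_basis = records_sent * match_rate / 1_000 * price_per_m_matched

print(f"Processed-record pricing: ${cost_processed_basis:,.0f}")  # $12,000
print(f"Matched-record pricing:   ${cost_matched_basis:,.0f}")    # $11,000
# With a low match rate, the seemingly higher per-matched-record price can end
# up cheaper; the break-even point depends entirely on the match rate.
```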

Business First
Evaluating a database is a project in itself, and these nine evaluation criteria will serve as a good guideline. Depending on the business, of course, more conditions could be added to the list. And that leads to the final point, which I did not even include in the list: the database (or all data, for that matter) should be useful in meeting the business goals.

I have been saying that “Big Data Must Get Smaller,” and this whole Big Data movement should be about (1) Cutting down on the noise, and (2) Providing answers to decision-makers. If the data sources in question do not serve the business goals, cut them out of the plan, or cut loose the vendor if they are from external sources. It would be an easy decision if you “know” that the database in question is filled with dirty, sporadic and outdated data that cost lots of money to maintain.

But if that database is needed for your business to grow, clean it, update it, expand it and restructure it to harness better answers from it. Just like the way you’d maintain your cherished automobile to get more mileage out of it. Not all databases are created equal for sure, and some are definitely more equal than others. You just have to open your eyes to see the differences.

‘Big Data’ Is Like Mining Gold for a Watch – Gold Can’t Tell Time

It is often quoted that 2.5 quintillion bytes of data are collected each day. That surely sounds like a big number, considering 1 quintillion bytes (or an exabyte, if that sounds fancier) is equal to 1 billion gigabytes. Looking back only about 20 years, I remember my beloved 386-based desktop computer had a hard drive that could barely hold 300 megabytes, which was considered to be quite large in those ancient days. Now, my phone can hold about 65 gigabytes, which, by the way, means nothing to me. I just know that figure equates to about 6,000 songs, plus all my personal information, with room to spare for hundreds of photos and videos. So how do I fathom the size of 2.5 quintillion bytes? I don’t. I give up. I’d rather count the number of stars in the universe. And I have been in the database business for more than 25 years.

But I don’t feel bad about that. If a pile of data requires a computer to process it, then it is already too “big” for our brains. In the age of “Big Data,” size matters, but emphasizing the size element is missing the point. People want to understand the data in their own terms and want to use them in decision-making processes. Throwing the raw data around to people without math or computing skills is like galleries handing out paint and brushes to people who want paintings on the wall. Worse yet, continuing to point out how “big” the Big Data world is to them is like quoting the number of rice grains on this planet in front of a hungry man, when he doesn’t even care how many grains are in one bowl. He just wants to eat a bowl of “cooked” rice, and right this moment.

To be a successful data player, one must be the master of the following three steps:

  • Collection;
  • Refinement; and
  • Delivery.

Collection and storage are obviously important in the age of Big Data. However, that in itself shouldn’t be the goal. I hear lots of bragging about how much data can be collected and stored, and how fast the data can be retrieved.

Great, you can retrieve any transaction detail going back 20 years in less than 0.5 seconds. Congratulations. But can you now tell me who are more likely to be loyal customers for the next five years, with annual spending potential of more than $250? Or who is more likely to quit using the service in the next 60 days? Who is more likely to be on a cruise ship leaving the dock on the East Coast heading for Europe between Thanksgiving and Christmas, with onboard spending potential greater than $300? Who is more likely to respond to emails with free shipping offers? Where should I open my next store selling fancy children’s products? What do my customers look like, and where do they go between 6 and 9 p.m.?

Answers to these types of questions do not come from the raw data, but they should be derived from the data through the data refinement process. And that is the hard part. Asking the right questions, expressing the goals in a mathematical format, throwing out data that don’t fit the question, merging data from a diverse array of sources, summarizing the data into meaningful levels, filling in the blanks (there will be plenty—even these days), and running statistical models to come up with scores that look like an answer to the question are all parts of the data refinement process. It is a lot like manufacturing gold watches, where mining gold is just an important first step. But a piece of gold won’t tell you what time it is.

The final step is to deliver that answer—which, by now, should be in a user-friendly format—to the user at the right time in the right format. Often, lots of data-related products only emphasize this part, as it is the most intimate one to the users. After all, it provides an illusion that the user is in total control, being able to touch the data so nicely displayed on the screen. Such tool sets may produce impressive-looking reports and dazzling graphics. But, lest we forget, they are only representations of the data refinement processes. In addition, no tool set will ever do the thinking part for anyone. I’ve seen so many missed opportunities where decision-makers invested obscene amounts of money in fancy tool sets, believing they will conduct all the logical and data refinement work for them, automatically. That is like believing that purchasing the top of the line Fender Stratocaster will guarantee that you will play like Eric Clapton in the near future. Yes, the tool sets are important as delivery mechanisms of refined data, but none of them replace the refinement part. Doing so would be like skipping guitar practice after spending $3,000 on a guitar.

Big Data business should be about providing answers to questions. It should be about humans who are the subjects of data collection and, in turn, the ultimate beneficiaries of information. It’s not about IT or tool sets that come and go like hit songs. But it should be about inserting advanced use of data into everyday decision-making processes by all kinds of people, not just the ones with statistics degrees. The goal of data players must be to make it simple—not bigger and more complex.

I boldly predict that missing these points will make “Big Data” a dirty word in the next three years. Emphasizing the size element alone will lead to unbalanced investments, which will then lead to disappointing results with not much to show for them in this cruel age of ROI. That is a sure way to kill the buzz. Not that I am that fond of the expression “Big Data”; though, I admit, one benefit has been that I no longer have to spend 10 minutes explaining what I do for a living. Nonetheless, all the Big Data folks may need an exit plan if we are indeed heading for the days when it will be yet another disappointing buzzword. So let’s do this one right, and start thinking about refining the data first and foremost.

Collection and storage are just so last year.

Privacy in the Age of Big Data

Consumers reveal more than ever before, consciously through social media and, just as importantly, unconsciously through their behaviors. This data gives marketers great power, which they can use to design better products, hone messages and, most importantly, sell more by providing consumers what they want. That’s all good from a marketer’s perspective, but for consumers, the scope of data collection can often cross a line, becoming too intrusive or too loosely held. Marketers have to balance the opportunities of Big Data with the concerns of consumers or they risk a serious backlash.

For some people, the line has already been crossed. When Edward Snowden revealed information about the government’s data collection policies, he presented it as a scandal. But, in many ways, what the NSA does differs mainly in scope from what many private marketers do. A German politician recently went to court to force T-Mobile to release the full amount of metadata that it collects from his cellphone behavior. The results highlighted just how much a company can know from this data—not just about an individual’s behavior and interests, but also about his or her friends, and which of them are most influential. Even for marketers who strongly believe in the social utility that this enables, it highlights just how core an issue privacy has become.

So how can marketers get the most from data without alarming consumers? Transparency and value. For some consumers, there’s really no good use of personal data, so opt-outs have to be clear and easy to use. The best way to collect and use data is if the value to the consumer is so clear that he or she will opt in to a program.

One company that has framed its data collection as a service that’s worth joining is Waze, which Google recently bought for over a billion dollars. Google beat out rivals Facebook and Apple because high-quality maps are one of the most important infrastructure tools for the big mobile players. Well before Google bought the company, CEO David Bardin said, “Waze relies on the wisdom of crowds: We haven’t spent billions of dollars a year, we’ve cooperated with millions of users. Google is the No. 1 player. But, a few years out, there’s no excuse that we wouldn’t pass them.”

Drivers sign up for the app to find out about traffic conditions ahead, which is inherently useful information. Once they’re logged in, they automatically send information about their speed and location to Waze. Waze invites active participation, too, encouraging users to fix map errors and report accidents, weather disruptions, police and gas stations. Users get points for using the service and more points for actively reporting issues. With 50 million users, this decentralized data entry system is incredibly efficient at producing real-time road conditions and maps. Even Google’s own map service can’t match the refresh speed of Waze maps.

Points for check-ins don’t fully explain why people are so invested in Waze. Dynamic graphics help, with charming icons and those de rigueur 3D zooming maps. Even more important than a great interface, however, is that Waze serves a real-time need while making users feel part of a community working together to solve problems. Waze is part of a broader movement to crowdsource solutions that rely on consumers or investors who believe in the mission of a company, not just its utility. People contribute to Waze because they want to help fellow travelers, as well as speed their own journeys. It’s shared self-interest.

Waze has a relatively easy task of proving the worth of a data exchange. Other companies need to work harder to show that they use data to enhance user experiences—but the extra effort is not optional.

As marketers become better at using data, they will need to prove the value of the data they use, and they need to be transparent on how they’re using it. If they don’t, marketers will have their own Snowdens to worry about.

Don’t Get Trashed — Is Recycling Discarded Mail Profitable? — Part II

In our previous post of “Marketing Sustainably,” we introduced an expert discussion on whether or not recycling collection of discarded mail, catalogs, printed communications and paper packaging is profitable, and why this is an important business consideration for the direct marketing field.

In this post, we continue and conclude the discussion with our two experts, Monica Garvey, director of sustainability, Verso Paper, and Meta Brophy, director of procurement operations, Consumer Reports. The conversation is based on a Town Square presentation that took place at the Direct Marketing Association’s recent DMA2012 annual conference.

Chet Dalzell: If much of the recovered fiber goes overseas, what’s the benefit to my company or organization in supporting recycling in North America?

Monica Garvey: The benefit—companies can promote that they support the use of recycled paper because they believe that recovered fiber is a valuable resource that can supplement virgin fiber. Recycling extends the life of a valuable natural resource, and contributes to a company’s socially responsible positioning. While it’s true that the less fiber supply there is locally, the higher the cost for the products made from that recovered fiber domestically, it’s still important to encourage recycling collection. Because recovered fiber is a global commodity, it is subject to demand-and-supply price fluctuations. If demand should drop overseas, and prices moderate, there may be greater supply at more moderate prices here at home, helping North American manufacturers; however, this is very unlikely. RISI, the leading information provider for the global forest products industry, projects that over the next five years, world recovered paper demand will continue to grow aggressively from fiber-poor regions such as China and India. Demand will run up against limited supply of recovered paper in the U.S. and other parts of the developed world and create a growing shortage of recovered paper worldwide.

CD: Is there a way to guarantee that recovered fiber stays at home (in the United States, for example)?

Meta Brophy: Yes! Special partnerships and programs exist that collect paper at local facilities and use the fiber domestically, allocating the recovered paper for specific use. ReMag, for example, is a private firm that places kiosks at local collection points—retailers, supermarket chains—where consumers can drop their catalogs, magazines and other papers and receive discounts, coupons and retailer promotions in exchange. These collections ensure a quality supply of recovered fiber for specific manufacturing uses. It’s a win-win for all stakeholders involved.

I recommend mailers use the DMA “Recycle Please” logo and participate in programs such as ReMag to encourage more consumers to recycle, and to increase the convenience and ease of recycling.

CD: What’s the harm of landfilling discarded paper—there’s plenty of landfill space out there, right?

MG: Landfill costs vary significantly around the country—depending on hauling distances, and the costs involved in operating landfills. In addition, there are also environmental costs. By diverting usable fiber from landfills, we not only extend the useful life of a valuable raw material, but also reduce greenhouse gas emissions (methane) that result when landfilled paper products degrade over time. There are also greenhouse gases that are released from hauling post-consumer waste. While carbon emissions may not yet be assessed, taxed or regulated in the United States, many national and global brands already participate in strategies to calculate and reduce their carbon emissions, and their corporate owners may participate in carbon trading regimes.

CD: You’ve brought up regulation, Monica. I’ve heard of “Extended Producer Responsibility” (EPR) legislation. Does EPR extend to direct marketers in any way?

MG: EPR refers to policy intended to shift responsibility for the end-of-life of products and/or packaging from the municipality to the manufacturer/brand owner. It can be expressed at a state level via specific product legislation, framework legislation, governor’s directive, or a solid waste management plan. EPR has begun to appear in proposals at the state level in the United States. EPR, for better or worse, recognizes that there are costs associated with waste management on all levels—not just landfilling, but waste-to-energy, recycling collection and even reuse.

These waste management costs currently are paid for through our taxes, but governments are looking to divert such costs so that they are paid for by those who actually make and use the scrutinized products. Thus EPR can result in increased costs, were states to enact such regulation on particular products such as paper, packaging, and electronic and computer equipment. The greatest pressure to enact EPR most likely focuses on products whose end-of-life disposition involves hazardous materials, and where recycling and return programs may make only a negligible difference. Many will state that the natural fibers in paper, along with its extremely high recovery rate of 67 percent, make paper a poor choice for inclusion in any state EPR legislation. That is also why the more we support the efficiency and effectiveness of existing recycling collection programs, the less pressure there may be to enact EPR regulations directly. It will likely vary state to state, where specific concerns and challenges may exist.

CD: Does the public really care if this material gets recycled? Do they participate in recycling programs?

MB: Yes, they do. Even a public that’s skeptical of “greenwashing”—environmental claims that are suspect, unsubstantiated or less than credible—participates in recycling collection in greater numbers. Both EPA and American Forest & Paper Association data tell us the amount of paper collected is now well more than half of total paper produced, and still growing—despite the recent recession and continued economic uncertainty. Recycling collection programs at the hometown level are politically popular, too—people like to take actions that they believe can make a difference. And as long as the costs of landfilling exceed the costs or possible revenue gain of recycling, it’s good for the taxpayer, too.

CD: At the end of the day, what’s in recycling for my brand, and the direct marketing business overall?

MB: I see at least three direct benefits—and nearly no downside. First, a brand’s image benefits when it embraces social responsibility as an objective. Second, being a responsible steward of natural resources, and promoting environmental performance in a way that avoids running afoul of the Federal Trade Commission’s new Green Guides on environmental claims—positions a brand well in practice and public perception. And, third, and I see this firsthand in my own organization, both the employee base and the supply chain are more deeply engaged and motivated as a result, too. Certainly, in the direct marketing business overall, there are similar gains—and I’m excited that the DMA has embraced this goal for our marketing discipline.

Inside the Recycling Tub: Catalogs & Direct Mail, Post-Consumer

The year was 1990. Earth Day turned 20 years old. The darling book that year was 50 Simple Things You Can Do to Save the Earth. Its author’s top recommendation was “Stop Junk Mail.” The book was a “cultural phenomena,” as one reviewer recalled, selling more than 5 million copies in all.

During the early 1990s, millions of consumers sent their requests to the then-Mail Preference Service (MPS, now DMAchoice) to remove themselves from national mailing lists, partially as a result of the media hype around that publication and its recommendation that consumers sign up for MPS. Even some cities and towns urged their citizens (with taxpayer money) to get off mailing lists. I don’t think the Direct Marketing Association ever publicly released its MPS consumer registration figures, but the file swelled to the point where some saturation mailers came close to not using it, for fear it would disqualify them from the lowest postage within certain ZIP Codes where new MPS registrants were concentrated. (DMA developed a saturation mailer format at the time to preserve MPS utility.)

Removing names from a mailing list is what solid waste management professionals call “source reduction”—an act that prevents the production of mail (and later waste) in the first place.

One of the reasons “junk mail” met with some consumer hostility then was simply because once you were done with a catalog or mail piece, wanted or not, there was no place to put it except in the trash. It seemed to many, “All this waste!” (that actually amounted to about 2.3 percent of the municipal solid waste stream back then).

Thankfully, there were other marketplace and public policy dynamics tied to support of the green movement, circa 1990. In a word, “recycling” (like source reduction) was seen as a part of responsible solid waste management. At the time, North American paper mills were scrambling to get recovered fiber to manufacture paper products and packaging with recycled content. Some states (and the federal government) set minimum recycled-content and “post-consumer” recycled-content percentage requirements for the paper they procured, while California mandated diversion goals for solid waste from its landfills. Increasingly, foreign trading partners were clamoring for America’s discarded paper to meet their ravenous demand for fiber. The cumulative result was an aggressive increase in the amount of paper collected for recycling and in the number of collection points across the United States.

All this boded well for catalogs and direct mail, as far as their collection rates were concerned. Catalogs and magazines are considered equivalent when it comes to their fiber makeup. They do tend to have more hardwood (short, thinner fibers) versus softwood (long, strong fibers), since the hardwood gives a nice, smooth printing surface. When they are collected for recycling, recovered catalogs and magazines are suitable for lower-quality paper/packaging grades, as well as for tissue. Some of the fiber does wind up getting used as post-consumer content in new magazines and catalogs, but producers of such papers much prefer recovered office paper (ideally not mixed with other lower-quality post-consumer papers) as their source of post-consumer content, as the quality is better for making higher-quality magazine/catalog papers. (See link below from Verso Paper.)

Most direct mail when recovered is classified as mixed papers, and is suitable for tissue, packaging and other recovered-fiber products. (Today, a lot of paper recovery mixes it all together, and with positive reuse.) By 2007, DMA had received permission from the Federal Trade Commission to begin allowing mailers to place “recyclable” messages and seals on catalogs and mail pieces (roughly 60 percent of U.S. households must have access to local recycling options before “recyclable” labels can be used). Upon this FTC opinion, DMA promptly launched its “Recycle Please” logo program. By 2010, in addition, thousands of U.S. post offices were placing “Read-Respond-Recycle” collection bins for mixed paper in their lobbies.

When the U.S. Environmental Protection Agency began tracking “Standard Mail” in its biennial Municipal Solid Waste Characterization Report in 1990, the recovery rate (through recycling collection) was near 5 percent. By 2009 (the most current year reported), the recovery rate had increased more than 10-fold to 63 percent—but I cite this figure with a big asterisk. There will be a discussion in a future post on why the EPA MSW recycling data may not be as accurate (and as optimistic) as these findings seem to present. In fact, the EPA itself has asked for public comment on how its current MSW study methodology can be improved—again, more on that in another post.

While I’m not an expert on solid waste reporting, I certainly can see the positive direction underway here, no matter what the actual recovery rate may be. The more catalogs and direct mail are recovered for their fiber, the better the chances that the fiber will be put to efficient use in the supply chain, rather than ending up in a landfill. That helps relieve pressure on paper and packaging pricing, which is good for our bottom lines.

It might also, just a little bit, make a consumer think to herself, “I love my junk mail”—as she takes the no-longer-needed mail at week’s or season’s end and places it into a recycling tub. Recycling makes us feel good. It is simple to do. Recycling may not truly save the Earth, but it certainly does extend the life of an important renewable natural resource: wood fiber.

Helpful links: