How Marketers Can Throw Away Data, Without Regrets

Yes, data is an asset. But not if the data doesn’t generate any value. (There is no sentimental value to data, unless we are talking about building a museum of old data.) So here’s how to throw away data.

Last month, I talked about data hoarders (refer to “Don’t Be a Data Hoarder”). This time, let me share some ideas about how to throw away data.

I’ve heard about people who specialize in cleaning other people’s closets and storage spaces. Looking at the results — turning a hoarder’s house into presentable living quarters — I am certain that they have their own set of rules and methodologies for deciding what to throw out, what goes together, and how to organize the items that are to be kept.

I recently had a relatable experience, as I sold a house and moved to a smaller place, all in the name of age-appropriate downsizing. We lived in the old home for 22 years, raising two children. We thought that our kids had taken much of their stuff when they moved out, but as you may have guessed already, no, we still had plenty to sort through. After all, we are talking about the accumulated possessions of four individuals over 22 long years. Enough to invoke a philosophical question: “Why do humans gather so much stuff during their short lifespans?” Maybe we all carry a bit of the hoarder gene after all. Or maybe we’re just too lazy to sort things out on a regular basis.

My rule was rather simple: If I hadn’t touched an item for more than three years (two years for apparel), give it away or throw it out. The one exception was for things with high sentimental value, which, unfortunately, could lead right back into hoarding behavior (as in “Oh, I can’t possibly throw out this ‘Best Daddy in the World’ mug, though it looks totally hideous.”). So, when I was in doubt, I chucked it.

But after all of this, I may have to move to an even smaller place to be able to claim a minimalist lifestyle. Or should I just hire a cleanup specialist? One thing is for sure, though: The cleanup job should be done in phases.

Useless junk — i.e., things that generate no monetary or sentimental value — is a liability. Yes, data is an asset. But not if the data doesn’t generate any value. (There is no sentimental value to data, unless we are talking about building a museum of old data.)

So, how do we really clean the house? I’ve seen some harsh methods like “If the data is more than three years old, just dump it.” Unless the business model has gone through some drastic changes rendering the past data completely useless, I strongly recommend against such a crude tactic. If trend analysis or a churn prediction model is in the plan, you will definitely regret throwing away data just because they are old. Then again, as I wrote last month, no one should keep every piece of data since the beginning of time, either.
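
To make the contrast concrete, here is a minimal sketch of the kind of retention rule I have in mind, as opposed to a blanket “older than three years, dump it” purge. The three-year cutoff, the monthly grain, and the field names are illustrative assumptions, not a prescription.

```python
from collections import defaultdict
from datetime import date

DETAIL_CUTOFF_YEARS = 3  # illustrative cutoff; set it by your own analytic needs

def split_for_retention(transactions, today=None):
    """Keep recent transactions at full detail; roll older ones into
    customer-by-month summaries instead of deleting them outright."""
    today = today or date.today()
    cutoff = date(today.year - DETAIL_CUTOFF_YEARS, today.month, 1)

    keep_detail = []
    summaries = defaultdict(lambda: {"orders": 0, "amount": 0.0})
    for t in transactions:  # each t: {"customer_id", "order_date", "amount"}
        if t["order_date"] >= cutoff:
            keep_detail.append(t)  # recent detail stays as-is
        else:
            key = (t["customer_id"], t["order_date"].strftime("%Y-%m"))
            summaries[key]["orders"] += 1            # old data lives on,
            summaries[key]["amount"] += t["amount"]  # just in summary form
    return keep_detail, dict(summaries)
```

The point is that the “old” rows still feed trend analysis and churn models as monthly summaries; nothing of analytical value is actually destroyed.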

Like any other data-related activity, the cleanup job starts with goal-setting. How will you know what to keep, if you don’t even know what you are about to do? If you “do” know what is on the horizon, then follow your own plan. If you don’t, the No. 1 step would be a companywide need analysis, as different types of data are required for different tasks.

The Process of Ridding Yourself of Data

First, ask the users and analysts:

  • What is in the marketing plan?
  • What types of predictions would be required for such marketing goals? Be as specific as possible:
    • Forecasting and Time-Series Analysis — You will need to keep some “old” data for sure for these.
    • Product Affinity Models for Cross-sell/Upsell — You must keep “who bought what, for how much, when, and through what channel” types of data.
    • Attribution Analysis and Response Models — This type of analytics requires past promotion and response history data for at least a few calendar years.
    • Product Development and Planning — You would need SKU-level transaction data, but not from the beginning of time.
    • Etc.
  • What do you have? Do a full inventory and categorize it by data type, as you may have much more than you thought (see the inventory sketch after this list). Some examples are:
    • PII (Personally Identifiable Information): Name, Address, Email, Phone Number, Various IDs, etc. These are valuable connectors to other data sources, such as Geo/Demographic Data.
    • Order/Transaction Data: Transaction Date, Amount, Payment Methods
    • Item/SKU-Level Data: Products, Price, Units
    • Promotion/Response History: Source, Channel, Offer, Creative, Drop/Wave, etc.
    • Life-to-Date/Past ‘X’ Months Summary Data: Not as good as detailed, event-level data, but summary data may be enough for trend analysis or forecasting.
    • Customer Status Flags: Active, Dormant, Delinquent, Canceled
    • Surveys/Product Registration: Attitudinal and Lifestyle Data
    • Customer Communication History Data: Call-center and web interaction data
    • Online Behavior: Open, Click-through, Page views, etc.
    • Social Media: Sentiment/Intentions
    • Etc.
  • What kind of data did you buy? Surprisingly large amounts of data are acquired from third-party data sources, and kept around indefinitely.
  • Where are they? On what platform, and how are they stored?
  • Who is accessing them? Through what channels and platforms? Via what methods or software? Search for them, as you may uncover data users in unexpected places. You do not want to throw things out without asking them.
  • Who is updating them? Data that are not regularly updated are most likely to be junk.
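
If it helps to see those questions as a working document, here is a minimal sketch of what such an inventory could look like once the answers come back. Every field, category name, and threshold here is an assumption for illustration; your own inventory will look different.

```python
from dataclasses import dataclass
from datetime import date
from typing import List

@dataclass
class DataAsset:
    name: str                # e.g., "Promotion/Response History"
    platform: str            # where it lives and how it is stored
    owner: str               # who updates it
    users: List[str]         # who actually accesses it
    last_updated: date
    planned_uses: List[str]  # which marketing goals it supports

def archive_candidates(inventory: List[DataAsset], today: date) -> List[DataAsset]:
    """Flag assets that support no stated goal and haven't been touched in
    over a year -- candidates to tuck away, not necessarily to delete."""
    return [
        a for a in inventory
        if not a.planned_uses and (today - a.last_updated).days > 365
    ]

# Example: an asset with no planned use and no recent updates gets flagged.
inventory = [
    DataAsset("Survey Responses 2011", "legacy FTP server", "unknown", [],
              date(2012, 1, 5), []),
    DataAsset("Order/Transaction Data", "warehouse", "IT", ["analytics team"],
              date(2024, 6, 1), ["churn model", "forecasting"]),
]
print([a.name for a in archive_candidates(inventory, date(2024, 7, 1))])
# -> ['Survey Responses 2011']
```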

Taking Stock

Now, I’m not suggesting actually “deleting” data on a source level in the age of cheap storage. All I am saying is that not all data points are equally important, and some data can be easily tucked away. In short, if data don’t fit your goals, don’t bring them out to the front.

Essentially, this is the first step of the data refinement process. The emergence of the Data Lake concept is rooted here. Big Data was too big, so users wanted to put the more useful data in more easily accessible places. Now, the trouble with the Data Lake is that the lake water is still not drinkable; it requires further refinement. However, just as I admitted that I may have to move yet again to thin out my stuff further, the cleaning process should be done in phases, and the Data Lake may as well be the first station.

In contrast, the Analytics Sandbox that I often discussed in this series would be more of a data haven for analysts, where every variable is cleaned, standardized, categorized, consolidated, and summarized for advanced analytics and targeting (refer to “Chicken or the Egg? Data or Analytics?” and “It’s All about Ranking”). Basically, it’s data on silver platters for professional analysts — humans or machines.

At the end of such data refinement processes, the end-users will see data in the form of “answers to questions.” As in, scores that describe targets in a concise manner, like “Likelihood of being an early adopter,” or “Likelihood of being a bargain-seeker.” To get to that stage, useful data must flow through the pipeline constantly and smoothly. But not all data are required to do that (refer to “Data Must Flow, But Not All of Them”).
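
To illustrate the refinement I’m describing, here is a minimal sketch of turning raw, event-level transactions into one analyst-ready row per customer, the kind of variable you would park in an Analytics Sandbox. The variable names (recency, frequency, monetary) are my own illustrative choices.

```python
from collections import defaultdict
from datetime import date

def build_sandbox_rows(transactions, as_of: date):
    """Consolidate raw transaction events into one summarized,
    model-ready row per customer."""
    rows = defaultdict(lambda: {"orders": 0, "total_spent": 0.0, "last_order": None})
    for t in transactions:  # each t: {"customer_id", "order_date", "amount"}
        r = rows[t["customer_id"]]
        r["orders"] += 1
        r["total_spent"] += t["amount"]
        if r["last_order"] is None or t["order_date"] > r["last_order"]:
            r["last_order"] = t["order_date"]

    # Derive the variables that analysts (or machines) actually consume.
    return {
        cid: {
            "frequency": r["orders"],
            "monetary": round(r["total_spent"], 2),
            "recency_days": (as_of - r["last_order"]).days,
        }
        for cid, r in rows.items()
    }
```

From rows like these, a model can produce the “likelihood of being a bargain-seeker” type of score that end-users actually see.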

For the folks who just want to cut to the chase, allow me to share a cheat sheet.

Disclaimer: You should really plan to do some serious need analysis to select and purge data from your value chain. Nonetheless, you may be able to kick-start the majority of customer-related analytics if you start with this basic list.

Because different business models call for different data menus, I divided the list by major industry type. If your industry is not listed here, use your imagination along with a need analysis.

Cheat Sheet

Merchandising: Most retailers fall into this category. Basically, you provide products and services in exchange for payment. (A minimal record layout is sketched after the list below.)

  • Who: Customer ID / PII
  • What: Product SKU / Category
  • When: Purchase Date
  • How Much: Total Paid, Net Price, Discount/Coupon, Tax, Shipping, Return
  • Channel/Device: Store, Web, App, etc.
  • Payment Method
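
As a minimal sketch, the merchandising list above could be captured in a record layout like this. Field names and types are illustrative assumptions; map them to whatever your order system actually calls them.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class MerchandisingTransaction:
    customer_id: str          # Who: links back to PII kept elsewhere
    product_sku: str          # What
    product_category: str
    purchase_date: date       # When
    total_paid: float         # How much
    net_price: float
    discount: float
    tax: float
    shipping: float
    returned: bool
    channel: str              # Store, Web, App, etc.
    payment_method: str
    coupon_code: Optional[str] = None
```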

Subscription: This business model is coming back in full force, as a new generation of shoppers prefers subscription over ownership. It gets a little more complicated, as shipment/delivery and payment may follow different cycles. (A sketch of how the status-type fields can be derived follows the list.)

  • Who: Subscriber ID/PII
  • Brand/Title/Property
  • Dates: First Subscription, Renewal, Payment, Delinquent, Cancelation, Reactivation, etc.
  • Paid Amounts by Pay Period
  • Number of Payments/Turns
  • Payment Method
  • Auto Payment Status
  • Subscription Status
  • Number of Renewals
  • Subscription Terms
  • Acquisition Channel/Device
  • Acquisition Source
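
Because so many of these subscription fields are really functions of a handful of dates, here is a minimal sketch of deriving a few of them. The 30-day grace period and the status labels are assumptions for illustration only.

```python
from datetime import date, timedelta
from typing import List, Optional

def derive_subscription_fields(first_subscription: date,
                               payment_dates: List[date],
                               term_days: int,
                               canceled_on: Optional[date],
                               today: date,
                               grace_days: int = 30) -> dict:
    """Derive status-type fields from the raw subscription dates."""
    if canceled_on is not None:
        status = "Canceled"
    else:
        last_paid = max(payment_dates) if payment_dates else first_subscription
        paid_through = last_paid + timedelta(days=term_days)
        if today <= paid_through:
            status = "Active"
        elif today <= paid_through + timedelta(days=grace_days):
            status = "Delinquent"  # past due, but within the assumed grace period
        else:
            status = "Dormant"
    return {
        "subscription_status": status,
        "number_of_payments": len(payment_dates),
        "number_of_renewals": max(len(payment_dates) - 1, 0),
        "tenure_days": (today - first_subscription).days,
    }
```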

Hospitality: Most hotels and travel services fall under this category. This is even more complicated than the subscription model, as the booking date, the travel date, and the gap between them all play important parts in prediction and personalization. (A sketch of those derived timing fields follows the list.)

  • Who: Guest ID / PII
  • Brand/Property
  • Region
  • Booking Site/Source
  • Transaction Channel/Device
  • Booking Date/Time/Day of Week
  • Travel (Arrival) Date/Time
  • Travel Duration
  • Transaction Amount: Total Paid, Net Price, Discount, Coupon, Fees, Taxes
  • Number of Rooms/Parties
  • Room Class/Price Band
  • Payment Method
  • Corporate Discount Code
  • Special Requests
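
Since the booking date, the arrival date, and the gap between them do so much of the predictive work here, the derived fields might look like the minimal sketch below. The seven-day “last-minute” threshold is purely an illustrative assumption.

```python
from datetime import date

def booking_features(booking_date: date, arrival_date: date,
                     departure_date: date, total_paid: float) -> dict:
    """Derive the timing features that matter most for hospitality models."""
    lead_days = (arrival_date - booking_date).days   # the booking-to-travel gap
    nights = (departure_date - arrival_date).days
    return {
        "booking_lead_days": lead_days,
        "last_minute_booking": lead_days <= 7,       # assumed threshold
        "travel_duration_nights": nights,
        "booking_day_of_week": booking_date.strftime("%A"),
        "spend_per_night": round(total_paid / nights, 2) if nights else total_paid,
    }
```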

Promotion Data: On top of these basic lists of behavioral data, you would need promotion history to get into the “what worked” part of analytics, leading to real response models. (A sketch of matching promotions to responses follows the list.)

  • Promotion Channel
  • Source of Data/List
  • Offer Type
  • Creative Details
  • Segment/Model (for selection/targeting)
  • Drop/Contact Date
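
To get from these promotion fields to the “what worked” part, the promotion history has to be matched back to behavior. Here is a minimal sketch of stamping a response flag on each contact; the 14-day attribution window and the field names are illustrative assumptions, not a rule.

```python
from datetime import timedelta

ATTRIBUTION_WINDOW_DAYS = 14  # assumed window; tune it to your sales cycle

def flag_responses(promotions, transactions):
    """Mark a promotion contact as 'responded' if the same customer
    purchased within the attribution window after the drop date."""
    purchases_by_customer = {}
    for t in transactions:  # each t: {"customer_id", "order_date"}
        purchases_by_customer.setdefault(t["customer_id"], []).append(t["order_date"])

    flagged = []
    for p in promotions:    # each p: {"customer_id", "drop_date", "channel", "offer_type", ...}
        window_end = p["drop_date"] + timedelta(days=ATTRIBUTION_WINDOW_DAYS)
        responded = any(
            p["drop_date"] <= d <= window_end
            for d in purchases_by_customer.get(p["customer_id"], [])
        )
        flagged.append({**p, "responded": responded})
    return flagged
```

Rows like these are what actually feed a response model.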

Summing It All Up

I am certain that you have much more data, and would need more data categories than the ones on this list. For one, promotion data would be much more complicated if you gathered all types of touch data from Google tags, plus your own mail and email promotion history from multiple vendors. Like I said, this is a cheat sheet, and at some point, you’d have to go deeper.

Plus, you will still have to agonize over how far back in time to go for a proper data inventory. That really depends on your business, as the data cycle for big-ticket items like home furniture or automobiles is far longer than that of consumables and budget-priced items.

When in doubt, start asking your analysts. If they are not sure — i.e., insisting that they must have “everything, all the time” — then call for outside help. Knowing what to keep, based on business objectives, is the first step of building an analytics roadmap, anyway.

No matter how overwhelming this cleanup job may seem, it is something that most organizations must go through — at some point. Otherwise, your own IT department may decide to throw away “old” data, unilaterally. That is more like a foreclosure situation, and you won’t even be able to finish necessary data summary work before some critical data are gone. So, plan for streamlining the data flow like you just sold a house and must move out by a certain date. Happy cleaning, and don’t forget to whistle while you work.

‘Big Data’ Is Like Mining Gold for a Watch – Gold Can’t Tell Time

It is often quoted that 2.5 quintillion bytes of data are collected each day. That surely sounds like a big number, considering 1 quintillion bytes (or an exabyte, if that sounds fancier) equal 1 billion gigabytes. Looking back only about 20 years, I remember my beloved 386-based desktop computer had a hard drive that could barely hold 300 megabytes, which was considered quite large in those ancient days. Now, my phone can hold about 65 gigabytes; which, by the way, means nothing to me. I just know that figure equates to about 6,000 songs, plus all my personal information, with room to spare for hundreds of photos and videos. So how do I fathom the size of 2.5 quintillion bytes? I don’t. I give up. I’d rather count the number of stars in the universe. And I have been in the database business for more than 25 years.

But I don’t feel bad about that. If a pile of data requires a computer to process it, then it is already too “big” for our brains. In the age of “Big Data,” size matters, but emphasizing the size element is missing the point. People want to understand the data in their own terms and want to use them in decision-making processes. Throwing the raw data around to people without math or computing skills is like galleries handing out paint and brushes to people who want paintings on the wall. Worse yet, continuing to point out how “big” the Big Data world is to them is like quoting the number of rice grains on this planet in front of a hungry man, when he doesn’t even care how many grains are in one bowl. He just wants to eat a bowl of “cooked” rice, and right this moment.

To be a successful data player, one must be the master of the following three steps:

  • Collection;
  • Refinement; and
  • Delivery.

Collection and storage are obviously important in the age of Big Data. However, that in itself shouldn’t be the goal. I hear lots of bragging about how much data can be collected and stored, and how fast the data can be retrieved.

Great, you can retrieve any transaction detail going back 20 years in less than 0.5 seconds. Congratulations. But can you now tell me who is more likely to be a loyal customer for the next five years, with annual spending potential of more than $250? Or who is more likely to quit using the service in the next 60 days? Who is more likely to be on a cruise ship leaving a dock on the East Coast, heading for Europe between Thanksgiving and Christmas, with onboard spending potential greater than $300? Who is more likely to respond to emails with free shipping offers? Where should I open my next store selling fancy children’s products? What do my customers look like, and where do they go between 6 and 9 p.m.?

Answers to these types of questions do not come from the raw data; they must be derived from the data through the data refinement process. And that is the hard part. Asking the right questions, expressing the goals in a mathematical format, throwing out data that don’t fit the question, merging data from a diverse array of sources, summarizing the data into meaningful levels, filling in the blanks (there will be plenty — even these days), and running statistical models to come up with scores that look like an answer to the question are all parts of the data refinement process. It is a lot like manufacturing gold watches, where mining gold is just an important first step. But a piece of gold won’t tell you what time it is.
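
For the statistically inclined, here is a minimal sketch of that last refinement step, turning summarized variables into a “likelihood” score. It assumes scikit-learn is available and uses made-up variables and labels; the point is the shape of the process, not the specific model.

```python
# pip install scikit-learn  (assumed available)
from sklearn.linear_model import LogisticRegression

# Summarized, customer-level variables -- the refined output of the earlier
# steps, not raw transaction logs: [recency_days, orders, total_spent]
X = [
    [30, 12, 840.0],
    [400, 1, 25.0],
    [10, 20, 1500.0],
    [365, 2, 60.0],
]
y = [1, 0, 1, 0]  # 1 = stayed loyal last year (illustrative labels)

model = LogisticRegression().fit(X, y)

# The "answer to a question" an end-user actually sees: a likelihood score.
new_customer = [[45, 8, 620.0]]
print(f"Likelihood of being a loyal customer: {model.predict_proba(new_customer)[0][1]:.2f}")
```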

The final step is to deliver that answer — which, by now, should be in a user-friendly format — to the user at the right time and in the right format. Often, data-related products emphasize only this part, as it is the one most intimate to the users. After all, it provides an illusion that the user is in total control, being able to touch the data so nicely displayed on the screen. Such tool sets may produce impressive-looking reports and dazzling graphics. But, lest we forget, they are only representations of the data refinement processes. In addition, no tool set will ever do the thinking part for anyone. I’ve seen so many missed opportunities where decision-makers invested obscene amounts of money in fancy tool sets, believing they would conduct all the logical and data refinement work for them, automatically. That is like believing that purchasing a top-of-the-line Fender Stratocaster will guarantee that you will play like Eric Clapton in the near future. Yes, the tool sets are important as delivery mechanisms for refined data, but none of them replaces the refinement part. Skipping that part would be like skipping guitar practice after spending $3,000 on a guitar.

Big Data business should be about providing answers to questions. It should be about humans who are the subjects of data collection and, in turn, the ultimate beneficiaries of information. It’s not about IT or tool sets that come and go like hit songs. But it should be about inserting advanced use of data into everyday decision-making processes by all kinds of people, not just the ones with statistics degrees. The goal of data players must be to make it simple—not bigger and more complex.

I boldly predict that missing these points will make “Big Data” a dirty word within the next three years. Emphasizing the size element alone will lead to unbalanced investments, which will then lead to disappointing results with not much to show for them in this cruel age of ROI. That is a sure way to kill the buzz. Not that I am that fond of the expression “Big Data”; though, I admit, one benefit has been that I no longer have to spend 10 minutes explaining what I do for a living. Nonetheless, all the Big Data folks may need an exit plan if we are indeed heading for the days when it will be yet another disappointing buzzword. So let’s do this one right, and start thinking about refining the data first and foremost.

Collection and storage are just so last year.