How Marketers Can Throw Away Data, Without Regrets

Yes, data is an asset. But not if the data doesn’t generate any value. (There is no sentimental value to data, unless we are talking about building a museum of old data.) So here’s how to throw away data.

Last month, I talked about data hoarders (refer to “Don’t Be a Data Hoarder”). This time, let me share some ideas about how to throw away data.

I heard about people who specialize in cleaning other people’s closets and storage spaces. Looking at the result — turning a hoarder’s house into presentable living quarters — I am certain that they have their own set of rules and methodologies for deciding what to throw out, what goes together, and how to organize the items that are to be kept.

I recently had a relatable experience, as I sold a house and moved to a smaller place, all in the name of age-appropriate downsizing. We lived in the old home for 22 years, raising two children. We thought that our kids took much of their stuff when they moved out, but as you may have guessed already, no, we still had so much to sort through. After all, we are talking about the accumulation of possessions by four individuals over 22 long years. Enough to evoke the philosophical question: “Why do humans gather so much stuff during their short lifespans?” Maybe we all carry a bit of the hoarder gene after all. Or we’re just too lazy to sort through our things on a regular basis.

My rule was rather simple: If I haven’t touched an item for more than three years (two years for apparel), give it away or throw it out. One exception was for things with high sentimental value, which, unfortunately, could lead to hoarding behavior all over again (as in “Oh, I can’t possibly throw out this ‘Best Daddy in the World’ mug, though it looks totally hideous.”). So, when I was in doubt, I chucked it.

But after all of this, I may have to move to an even smaller place to be able to claim a minimalist lifestyle. Or should I just hire a cleanup specialist? One thing is for sure, though: the cleanup job should be done in phases.

Useless junk — i.e., things that generate no monetary or sentimental value — is a liability. The same is true of data: it is an asset only while it generates value. (And there is no sentimental value to data, unless we are talking about building a museum of old data.)

So, how do we really clean the house? I’ve seen some harsh methods like “If the data is more than three years old, just dump it.” Unless the business model has gone through some drastic changes rendering the past data completely useless, I strongly recommend against such a crude tactic. If trend analysis or a churn prediction model is in the plan, you will definitely regret throwing away data just because they are old. Then again, as I wrote last month, no one should keep every piece of data since the beginning of time, either.

Like any other data-related activities, the cleanup job starts with goal-setting, too. How will you know what to keep, if you don’t even know what you are about to do? If you “do” know what is on the horizon, then follow your own plan. If you don’t, the No. 1 step would be a companywide Need-Analysis, as different types of data are required for different tasks.

The Process of Ridding Yourself of Data

First, ask the users and analysts:

  • What is in the marketing plan?
  • What type of predictions would be required for such marketing goals? Be as specific as possible:
    • Forecasting and Time-Series Analysis — For these, you will certainly need to keep some “old” data.
    • Product Affinity Models for Cross-sell/Upsell — You must keep “who bought what, for how much, when, and through what channel” types of data.
    • Attribution Analysis and Response Models — This type of analytics requires past promotion and response history data for at least a few calendar years.
    • Product Development and Planning — You would need SKU-level transaction data, but not from the beginning of time.
    • Etc.
  • What do you have? Do the full inventory and categorize them by data types, as you may have much more than you thought. Some examples are:
    • PII (Personally Identifiable Information): Name, Address, Email, Phone Number, Various IDs, etc. These are valuable connectors to other data sources, such as Geo/Demographic Data.
    • Order/Transaction Data: Transaction Date, Amount, Payment Methods
    • Item/SKU-Level Data: Products, Price, Units
    • Promotion/Response History: Source, Channel, Offer, Creative, Drop/Wave, etc.
    • Life-to-Date/Past ‘X’ Months Summary Data: Not as good as detailed, event-level data, but summary data may be enough for trend analysis or forecasting.
    • Customer Status Flags: Active, Dormant, Delinquent, Canceled
    • Surveys/Product Registration: Attitudinal and Lifestyle Data
    • Customer Communication History Data: Call-center and web interaction data
    • Online Behavior: Open, Click-through, Page views, etc.
    • Social Media: Sentiment/Intentions
    • Etc.
  • What kind of data did you buy? Surprisingly large amounts of data are acquired from third-party data sources, and kept around indefinitely.
  • Where are they? On what platform, and how are they stored?
  • Who is accessing them? Through what channels and platforms? Via what methods or software? Search for them, as you may uncover data users in unexpected places. You do not want to throw things out without asking them.
  • Who is updating them? Data that are not regularly updated are most likely to be junk.
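The question list above is essentially a data inventory catalog. As a minimal sketch (the field names and the staleness rule are my own illustrative assumptions, not a standard), each data source could be recorded like this:

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

@dataclass
class DataAsset:
    """One entry in a data inventory catalog (illustrative fields only)."""
    name: str                    # e.g., "Promotion/Response History"
    category: str                # PII, Transaction, Promotion History, ...
    platform: str                # where it is stored
    owner: str                   # who updates it
    users: List[str] = field(default_factory=list)  # who accesses it
    last_updated: Optional[date] = None

    def looks_like_junk(self, today: date, stale_days: int = 365) -> bool:
        """Flag assets with no known users or no recent updates."""
        if not self.users or self.last_updated is None:
            return True
        return (today - self.last_updated).days > stale_days

feed = DataAsset("Promotion History", "Promotion/Response", "warehouse",
                 owner="mkt-ops", users=["response-modeling"],
                 last_updated=date(2024, 1, 15))
print(feed.looks_like_junk(today=date(2024, 6, 1)))  # → False
```

Data with no identifiable users or updater, as noted above, are the prime candidates for the “tuck away” pile.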

Taking Stock

Now, I’m not suggesting actually “deleting” data on a source level in the age of cheap storage. All I am saying is that not all data points are equally important, and some data can be easily tucked away. In short, if data don’t fit your goals, don’t bring them out to the front.

Essentially, this is the first step of the data refinement process. The emergence of the Data Lake concept is rooted here. Big Data was too big, so users wanted to put more useful data in more easily accessible places. Now, the trouble with the Data Lake is that the lake water is still not drinkable, requiring further refinement. However, just as I admitted that I may have to move yet again to finish cleaning out my own stuff, the cleaning process should be done in phases, and the Data Lake may as well be the first station.

In contrast, the Analytics Sandbox that I often discussed in this series would be more of a data haven for analysts, where every variable is cleaned, standardized, categorized, consolidated, and summarized for advanced analytics and targeting (refer to “Chicken or the Egg? Data or Analytics?” and “It’s All about Ranking”). Basically, it’s data on silver platters for professional analysts — humans or machines.

At the end of such data refinement processes, the end-users will see data in the form of “answers to questions.” As in, scores that describe targets in a concise manner, like “Likelihood of being an early adopter,” or “Likelihood of being a bargain-seeker.” To get to that stage, useful data must flow through the pipeline constantly and smoothly. But not all data are required to do that (refer to “Data Must Flow, But Not All of Them”).

For the folks who just want to cut to the chase, allow me to share a cheat sheet.

Disclaimer: You should really plan to do some serious need analysis to select and purge data from your value chain. Nonetheless, you may be able to kick-start a majority of customer-related analytics, if you start with this basic list.

Because different business models call for a different data menu, I divided the list by major industry types. If your industry is not listed here, use your imagination along with a need-analysis.

Cheat Sheet

Merchandising: Most retailers would fall into this category. Basically, you provide products and services upon payment.

  • Who: Customer ID / PII
  • What: Product SKU / Category
  • When: Purchase Date
  • How Much: Total Paid, Net Price, Discount/Coupon, Tax, Shipping, Return
  • Channel/Device: Store, Web, App, etc.
  • Payment Method
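For illustration only, the merchandising fields above map onto a flat transaction record along these lines (all the names and defaults here are my assumptions, not a standard schema):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class MerchTransaction:
    # Who
    customer_id: str
    # What
    sku: str
    category: str
    # When
    purchase_date: date
    # How much
    total_paid: float
    net_price: float
    discount: float = 0.0
    tax: float = 0.0
    shipping: float = 0.0
    is_return: bool = False
    # Channel/device and payment
    channel: str = "store"          # store, web, app, ...
    payment_method: str = "card"

t = MerchTransaction("C001", "SKU-123", "music", date(2024, 3, 2),
                     total_paid=12.59, net_price=11.88, tax=0.71)
print(t.channel, t.total_paid)  # → store 12.59
```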

Subscription: This business model is coming back with full force, as a new generation of shoppers prefers subscription over ownership. It gets a little more complicated, as shipment/delivery and payment may follow different cycles.

  • Who: Subscriber ID/PII
  • Brand/Title/Property
  • Dates: First Subscription, Renewal, Payment, Delinquent, Cancelation, Reactivation, etc.
  • Paid Amounts by Pay Period
  • Number of Payments/Turns
  • Payment Method
  • Auto Payment Status
  • Subscription Status
  • Number of Renewals
  • Subscription Terms
  • Acquisition Channel/Device
  • Acquisition Source

Hospitality: Most hotels and travel services fall under this category. This is even more complicated than the subscription model, as the booking date, the travel date, and the gap between them all play important parts in prediction and personalization.

  • Who: Guest ID / PII
  • Brand/Property
  • Region
  • Booking Site/Source
  • Transaction Channel/Device
  • Booking Date/Time/Day of Week
  • Travel (Arrival) Date/Time
  • Travel Duration
  • Transaction Amount: Total Paid, Net Price, Discount, Coupon, Fees, Taxes
  • Number of Rooms/Parties
  • Room Class/Price Band
  • Payment Method
  • Corporate Discount Code
  • Special Requests

Promotion Data: On top of these basic lists of behavioral data, you would need promotion history to get into the “what worked” part of analytics, leading to real response models.

  • Promotion Channel
  • Source of Data/List
  • Offer Type
  • Creative Details
  • Segment/Model (for selection/targeting)
  • Drop/Contact Date

Summing It All Up

I am certain that you have much more data, and would need more data categories than the ones on this list. For one, promotion data would be much more complicated if you gathered all types of touch data from Google tags along with your own mail and email promotion history from multiple vendors. Like I said, this is a cheat sheet, and at some point, you’d have to go deeper.

Plus, you will still have to agonize over how far back in time to go for a proper data inventory. That really depends on your business, as the data cycle for big-ticket items like home furniture or automobiles is far longer than for consumables and budget-priced items.

When in doubt, start asking your analysts. If they are not sure — i.e., insisting that they must have “everything, all the time”— then call for outside help. Knowing what to keep, based on business objectives, is the first step of building an analytics roadmap, anyway.

No matter how overwhelming this cleanup job may seem, it is something that most organizations must go through — at some point. Otherwise, your own IT department may decide to throw away “old” data, unilaterally. That is more like a foreclosure situation, and you won’t even be able to finish necessary data summary work before some critical data are gone. So, plan for streamlining the data flow like you just sold a house and must move out by a certain date. Happy cleaning, and don’t forget to whistle while you work.

Beyond RFM Data

In the world of predictive analytics, the transaction data is the king of the hill. The master of the domain. The protector of the realm. Why? Because they are hands-down the most powerful predictors. If I may borrow the term that my mentor coined for our cooperative venture more than a decade ago (before anyone even uttered the word “Big Data”), “The past behavior is the best predictor of the future behavior.” Indeed. Back then, we had built a platform that nowadays could easily have qualified as Big Data. The platform predicted people’s future behaviors on a massive scale, and it worked really well, so I still stand by that statement.

How so? At the risk of sounding like a pompous mathematical smartypants (I’m really not), it is because people do not change that much, or if so, not so rapidly. Every move you make is on some predictive curve. What you’ve been buying, clicking, browsing, smelling or coveting somehow leads to the next move. Well, not all the time. (Maybe you just like to “look” at pretty shoes?) But with enough data, we can calculate with some confidence the probability that you would be an outdoors type, or a golfer, or a relaxing type on a cruise ship, or a risk-averse investor, or a wine enthusiast, or into fashion, or a passionate gardener, or a sci-fi geek, or a professional wrestling fan. Beyond the affinity scores listed here, we can predict the future value of each customer or prospect and possible attrition points, as well. And behind all those predictive models (and I have seen countless algorithms), the leading predictors are mostly transaction data, if you are lucky enough to get your hands on them. In the age of ubiquitous data and at the dawn of the “Internet of Things,” more marketers will be in that lucky group if they are diligent about data collection and refinement. Yes, in the near future, even a refrigerator will be able to order groceries, but don’t forget that only the collection mechanism will be different there. We still have to collect, refine and analyze the transaction data.

Last month, I talked about three major types of data (refer to “Big Data Must Get Smaller”), which are:
1. Descriptive Data
2. Behavioral Data (mostly Transaction Data)
3. Attitudinal Data.

If you gain access to all three elements with decent coverage, you will have tremendous predictive power when it comes to human behaviors. Unfortunately, it is really difficult to accumulate attitudinal data on a large scale with individual-level details (i.e., knowing who’s behind all those sentiments). Behavioral data, mostly in the form of transaction data, are also not easy to collect and maintain (non-transaction behavioral data are even bigger and harder to handle), but I’d say it is definitely worth the effort, as most of what we call Big Data falls under this category. By contrast, one can simply purchase descriptive data, which are what we generally call demographic or firmographic data, from data compilers or brokers. The sellers (there are many) will even do the data-append processing for you, and they may also throw in a few free profile reports with it.

Now, when we start talking about the transaction data, many marketers will respond “Oh, you mean RFM data?” Well, that is not completely off-base, because “Recency, Frequency and Monetary” data certainly occupy important positions in the family of transaction data. But they hardly are the whole thing, and the term is misused as frequently as “Big Data.” Transaction data are so much more than simple RFM variables.

RFM Data Is Just a Good Start
The term RFM should be used more as a checklist for marketers, not as design guidelines—or limitations in many cases—for data professionals. How recently did this particular customer purchase our product, and how frequently did she do that and how much money did she spend with us? Answering these questions is a good start, but stopping there would seriously limit the potential of transaction data. Further, this line of questioning would lead the interrogation efforts to simple “filtering,” as in: “Select all customers who purchased anything with a price tag over $100 more than once in past 12 months.” Many data users may think that this query is somewhat complex, but it really is just a one-dimensional view of the universe. And unfortunately, no customer is one-dimensional. And this query is just one slice of truth from the marketer’s point of view, not the customer’s. If you want to get really deep, the view must be “buyer-centric,” not product-, channel-, division-, seller- or company-centric. And the database structure should reflect that view (refer to “It’s All About Ranking,” where the concept of “Analytical Sandbox” is introduced).

Transaction data by definition describe the transactions, not the buyers. If you would like to describe a buyer or if you are trying to predict the buyer’s future behavior, you need to convert the transaction data into “descriptors of the buyers” first. What is the difference? It is the same data looked at through a different window—front vs. side window—but the effect is huge.

Even if we think about just one simple transaction with one item, instead of describing the shopping basket as “transaction happened on July 3, 2014, containing Coldplay’s latest CD ‘Ghost Stories’ priced at $11.88,” a buyer-centric description would read: “A recent CD buyer in the rock genre with an average spending level in the music category under $20.” The trick is to describe the buyer, not the product or the transaction. If that customer has many orders and items in his purchase history (let’s say he downloaded a few songs to his portable devices, as well), the description of the buyer becomes much richer. If you collect all of his past purchase history, it gets even more colorful, as in: “A recent music CD or MP3 buyer in the rock, classical and jazz genres, with 24-month purchases totaling 13 orders containing 16 items, total spending in the $100-$150 range and an $11 average order size.” Of course you would store all of this using many different variables (such as genre indicators, number of orders, number of items, total dollars spent during the past 24 months, average order amount and number of weeks since the last purchase in the music category). But the point is that the story comes out this way when you change the perspective.
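As a toy sketch of that change in perspective: once the customer-level summary variables exist, the buyer-centric portrait is just a rendering of them. The function and its inputs below are entirely hypothetical, chosen to mirror the example in the text:

```python
def describe_buyer(genres, n_orders, n_items, total, avg_order):
    """Render customer-level summary variables as a buyer portrait."""
    return (f"A music buyer in {', '.join(sorted(genres))} genres with "
            f"{n_orders} orders containing {n_items} items, "
            f"total spending ${total:.0f} and ${avg_order:.0f} average order size")

print(describe_buyer({"rock", "classical", "jazz"}, 13, 16, 125.0, 11.0))
# → A music buyer in classical, jazz, rock genres with 13 orders containing 16 items, total spending $125 and $11 average order size
```

The transaction rows themselves never appear in the output; only descriptors of the buyer do.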

Creating a Buyer-Centric Portrait
The whole process of creating a buyer-centric portrait starts with data summarization (or de-normalization). A typical table (or database) structure that captures every transaction detail, such as transaction date and amount, requires an entry for every transaction, and database designers call that the “normal” state. As I explained in my previous article (“It’s All About Ranking”), if you would like to rank customers in terms of value, the data record must be on a customer level, as well. If you are ranking households or companies, you would then need to summarize the data on those levels, too.

Now, this summarization (or de-normalization) is not a process of eliminating duplicate entries of names, as you wouldn’t want to throw away any transaction details. If there are multiple orders per person, what is the total number of orders? What is the total amount of spending on an individual level? What would be the average spending level per transaction, or per year? If you are allowed only one line of entry per person, how would you summarize the purchase dates, as you cannot just add them up? In that case, you can start with the first and last transaction date of each customer. Once you have those for every customer, what would be the tenure of each customer, and what would be the number of days since the last purchase? How many days, on average, pass between orders? Yes, all these figures are related to basic RFM metrics, but they are far more colorful this way.
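In code, that summarization step can be sketched like this; a minimal pure-Python version (the column names are illustrative) that collapses an order-level table into one line per customer:

```python
from datetime import date

# Order-level ("normal") table: (customer_id, order_date, amount)
orders = [
    ("A", date(2024, 1, 5), 30.00),
    ("A", date(2024, 3, 9), 45.00),
    ("B", date(2024, 2, 1), 20.00),
]

# Collapse to one record per customer: count, total, first/last order date
summary = {}
for cust, when, amount in orders:
    s = summary.setdefault(cust, {"orders": 0, "total": 0.0,
                                  "first": when, "last": when})
    s["orders"] += 1
    s["total"] += amount
    s["first"] = min(s["first"], when)
    s["last"] = max(s["last"], when)

print(summary["A"]["orders"], summary["A"]["total"])  # → 2 75.0
print(summary["A"]["first"], summary["A"]["last"])    # → 2024-01-05 2024-03-09
```

Note that no transaction detail is lost in spirit: the dates are not added up, but reduced to first/last, exactly as described above.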

The attached exhibit displays a very simple example of a before and after picture of such a summarization process. On the left-hand side, there resides a typical order table containing customer ID, order number, order date and transaction amount. If a customer has multiple orders in a given period, an equal number of lines are required to record the transaction details. In real life, other order-level information, such as payment method (very predictive, by the way), tax amount, discount or coupon amount and, if applicable, shipping amount, would be on this table as well.

On the right-hand side of the chart, you will find there is only one line per customer. As I mentioned in my previous columns, establishing a consistent and accurate customer ID cannot be neglected—for this reason alone. How would you rely on the summary data if one person may have multiple IDs? The customer may have moved to a new address, or shopped from multiple stores or sites, or there could have been errors in data collection. Relying on email addresses is a big no-no, as we all carry many email addresses. That is why the first step of building a functional marketing database is to go through the data hygiene and consolidation process. (There are many data processing vendors and software packages for it.) Once a persistent customer (or individual) ID system is in place, you can add up the numbers to create customer-level statistics, such as total orders, total dollars, and first and last order dates, as you see in the chart.
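The effect of a persistent ID can be sketched very simply. Here the raw-ID-to-persistent-ID mapping is hard-coded for illustration; in practice it would be the output of the hygiene and matching process just described:

```python
# Hypothetical output of a matching process: raw ID -> persistent ID
id_map = {
    "web-448": "CUST-001",   # same person shopped on the website...
    "store-91": "CUST-001",  # ...and in a store under a different ID
    "web-502": "CUST-002",
}

orders = [
    ("web-448", 25.00),
    ("store-91", 40.00),
    ("web-502", 15.00),
]

# Roll up spending on the persistent-ID level, not the raw-ID level
totals = {}
for raw_id, amount in orders:
    cust = id_map.get(raw_id, raw_id)  # fall back to raw ID if unmatched
    totals[cust] = totals.get(cust, 0.0) + amount

print(totals)  # → {'CUST-001': 65.0, 'CUST-002': 15.0}
```

Without the mapping, the same shopper would look like two shallow customers instead of one valuable one.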

Remember R, F, M, P and C
The real fun begins when you combine these numeric summary figures with product, channel and other important categorical variables. Because product (or service) and channel are the most distinctive dividers of customer behaviors, let’s just add P and C to the famous RFM (remember, we are using RFM just as a checklist here), and call it R, F, M, P and C.

Product (rather, product category) is an important separator, as people often show completely different spending behavior for different types of products. For example, you can send me fancy-shmancy fashion catalogs all you want, but I won’t look at them with any intention to purchase, as most men look at the models and not what they are wearing. So my active purchase history in the sports, home electronics or music categories won’t mean anything in the fashion category. In other words, those so-called “hotline” names should be treated differently for different categories.

Channel information is also important, as there are active online buyers who would never buy certain items, such as apparel or home furnishing products, without physically touching them first. For example, even in the same categories, I would buy guitar strings or golf balls online. But I would not purchase a guitar or a driver without trying them out first. Now, when I say channel, I mean the channel that the customer used to make the purchase, not the channel through which the marketer chose to communicate with him. Channel information should be treated as a two-way street, as no marketer “owns” a customer through a particular channel (refer to “The Future of Online is Offline“).

As an exercise, let’s go back to the basic RFM data and create some actual variables. For “each” customer, we can start with basic RFM measures, as exhibited in the chart:

· Number of Transactions
· Total Dollar Amount
· Number of Days (or Weeks) since the Last Transaction
· Number of Days (or Weeks) since the First Transaction

Notice that the days are counted from today’s point of view (practically the day the database is updated), as the actual date’s significance changes as time goes by (e.g., a day in February would feel different when looked back on from April vs. November). “Recency” is a relative concept; therefore, we should relativize the time measurements to express it.

From these basic figures, we can derive other related variables, such as:

· Average Dollar Amount per Customer
· Average Dollar Amount per Transaction
· Average Dollar Amount per Year
· Lifetime Highest Amount per Item
· Lifetime Lowest Amount per Transaction
· Average Number of Days Between Transactions
· Etc., etc…
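Several of these derived measures are simple ratios over the basic figures. A sketch (the as-of date stands in for the day the database is updated, per the recency note above):

```python
from datetime import date

def derived_rfm(n_orders, total, first, last, as_of):
    """Derive ratio-style variables from the basic customer-level measures."""
    return {
        "avg_order_amount": total / n_orders,
        "days_since_last": (as_of - last).days,
        "tenure_days": (as_of - first).days,
        # average gap between orders needs at least two orders
        "avg_days_between": ((last - first).days / (n_orders - 1)
                             if n_orders > 1 else None),
    }

d = derived_rfm(2, 75.0, date(2024, 1, 5), date(2024, 3, 9), date(2024, 4, 1))
print(d["avg_order_amount"], d["avg_days_between"])  # → 37.5 64.0
```

Because everything is counted back from the as-of date, the recency figures stay relative, as the text recommends.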

Now, imagine you have all these measurements by channel, such as retail, Web, catalog, phone or mail-in, and separately by product category. If you imagine a gigantic spreadsheet, the summarized table would have far fewer rows, but a seemingly endless number of columns. I will discuss categorical and non-numeric variables in future articles. But for this exercise, let’s just imagine having these sets of variables for all major product categories. The result is that the recency factor now becomes more like “Weeks Since Last Online Order”—not just any order. Frequency measurements become more like “Number of Transactions in the Dietary Supplement Category”—not just for any product. Monetary values can be expressed as “Average Spending Level in the Outdoor Sports Category through the Online Channel”—not just the customer’s average dollar amount in general.
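Slicing by category and channel just means keying the same summary on (customer, category, channel) instead of on customer alone; a sketch with made-up rows:

```python
from datetime import date

# (customer, category, channel, order_date, amount)
orders = [
    ("A", "music", "web", date(2024, 2, 1), 11.88),
    ("A", "outdoor", "store", date(2024, 3, 5), 80.00),
    ("A", "music", "web", date(2024, 4, 2), 9.99),
]

# One summary cell per (customer, category, channel) combination
cells = {}
for cust, cat, chan, when, amount in orders:
    key = (cust, cat, chan)
    c = cells.setdefault(key, {"orders": 0, "total": 0.0, "last": when})
    c["orders"] += 1
    c["total"] += amount
    c["last"] = max(c["last"], when)

m = cells[("A", "music", "web")]
print(m["orders"], round(m["total"], 2))  # → 2 21.87
```

Each cell becomes one of those “endless columns” once the table is flattened to one row per customer.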

Why stop there? We may slice and dice the data by offer type, customer status, payment method or time intervals (e.g., lifetime, 24-month, 48-month, etc.) as well. I am not saying that all the RFM variables should be cut up this way, but having “Number of Transactions by Payment Method,” for example, could be very revealing about the customer, as everybody uses multiple payment methods, yet some may never use a debit card for a large purchase. All these little measurements become building blocks in predictive modeling. Now, too many variables can also be troublesome. And knowing the balance (i.e., knowing where to stop) comes from experience and preliminary analysis. That is when experts and analysts should be consulted for this type of uniform variable creation. Nevertheless, the point is that RFM variables are not just three simple measures that happen to be a part of the larger transaction data menu. And we didn’t even touch non-transaction-based behavioral elements, such as clicks, views, miles or minutes.

The Time Factor
So, if such data summarization is so useful for analytics and modeling, should we always include everything that has been collected since the inception of the database? The answer is yes and no. Sorry for being cryptic here, but it really depends on what your product is all about; how the buyers would relate to it; and what you, as a marketer, are trying to achieve. As for going back forever, there is a danger in that kind of data hoarding, as “Life-to-Date” data always favors tenured customers over new customers who have a relatively short history. In reality, many new customers may have more potential in terms of value than a tenured customer with lots of transaction records from a long time ago, but with no recent activity. That is why we need to create a level playing field in terms of time limit.

If a “Life-to-Date” summary is not ideal for predictive analytics, then where should you place the cutoff line? If you are selling cars or home furnishing products, you may need to look at a 4- to 5-year history. If your products are consumables with relatively short purchase cycles, then a one-year examination may be enough. If your product is seasonal in nature—like gardening, vacation or heavily holiday-related items—then you may have to look at a minimum of two consecutive years of history to capture seasonal patterns. If you have mixed seasonality or longevity of products (e.g., selling golf balls and golf club sets through the same store or site), then you may have to summarize the data with multiple timelines, where the above metrics would be separated into 12-month, 24-month, 48-month, etc., windows. If you have lifetime value models or any time-series models in the plan, then you may have to break the timeline down even more finely. Again, this is where you may need professional guidance, but marketers’ input is equally important.
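Multiple timelines are just the same summary computed over filtered windows. A sketch, approximating 12 and 24 months as 365 and 730 days:

```python
from datetime import date, timedelta

# (order_date, amount) for one customer; values are made up
orders = [
    (date(2021, 6, 1), 50.0),   # outside both windows
    (date(2023, 8, 15), 30.0),  # inside the 24-month window only
    (date(2024, 11, 3), 20.0),  # inside both windows
]

def window_total(orders, as_of, days):
    """Total spending within the last `days` days, counted from `as_of`."""
    cutoff = as_of - timedelta(days=days)
    return sum(amt for when, amt in orders if when >= cutoff)

as_of = date(2025, 1, 1)
print(window_total(orders, as_of, 365))  # → 20.0
print(window_total(orders, as_of, 730))  # → 50.0
```

The same pattern extends to any of the R, F, M, P and C measures: compute them once per window, and store each window’s version as its own variable.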

Analytical Sandbox
Lastly, who should be doing all of this data summary work? I talked about the concept of the “Analytical Sandbox,” where all types of data conversion, hygiene, transformation, categorization and summarization are done in a consistent manner, and analytical activities, such as sampling, profiling, modeling and scoring are done with proper toolsets like SAS, R or SPSS (refer to “It’s All About Ranking”). The short and final answer is this: Do not leave that to analysts or statisticians. They are the main players in that playground, not the architects or developers of it. If you are serious about employing analytics for your business, plan to build the Analytical Sandbox along with the team of analysts.

My goal as a database designer has always been serving the analysts and statisticians with “model-ready” datasets on silver platters. My promise to them has been that the modelers would spend no time fixing the data. Instead, they would be spending their valuable time thinking about the targets and statistical methodologies to fulfill the marketing goals. After all, answers that we seek come out of those mighty—but often elusive—algorithms, and the algorithms are made of data variables. So, in the interest of getting the proper answers fast, we must build lots of building blocks first. And no, simple RFM variables won’t cut it.