Last month, I talked about data hoarders (refer to “Don’t Be a Data Hoarder”). This time, let me share some ideas about how to throw away data.
I have heard about people who specialize in cleaning out other people’s closets and storage spaces. Looking at the result, a hoarder’s house turned into presentable living quarters, I am certain that they have their own rules and methodologies for deciding what to throw out, what goes together, and how to organize the items that are kept.
I recently had a relatable experience, as I sold a house and moved to a smaller place, all in the name of age-appropriate downsizing. We lived in the old home for 22 years, raising two children. We thought that our kids took much of their stuff when they moved out, but as you may have guessed already, no, we still had so much to sort through. After all, we are talking about accumulation of possessions by four individuals for 22 long years. Enough to invoke a philosophical question “Why do humans gather so much stuff during their short lifespans?” Maybe we all carry a bit of hoarder genes after all. Or we’re just too lazy to sort things through on a regular basis.
My rule was rather simple: If I haven’t touched an item for more than three years (two years for apparel), give it away or throw it out. The one exception was for things with high sentimental value, which, unfortunately, could lead to hoarding behavior all over again (as in “Oh, I can’t possibly throw out this ‘Best Daddy in the World’ mug, though it looks totally hideous.”). So, when I was in doubt, I chucked it.
But after all of this, I may have to move to an even smaller place to be able to claim a minimalist lifestyle. Or should I just hire a cleanup specialist? One thing is for sure though; the cleanup job should be done in phases.
Useless junk — i.e., things that generate no monetary or sentimental value — is a liability. Yes, data is an asset. But not if the data doesn’t generate any value. (There is no sentimental value to data, unless we are talking about building a museum of old data.)
So, how do we really clean the house? I’ve seen some harsh methods like “If the data is more than three years old, just dump it.” Unless the business model has gone through some drastic changes rendering the past data completely useless, I strongly recommend against such a crude tactic. If trend analysis or a churn prediction model is in the plan, you will definitely regret throwing away data just because they are old. Then again, as I wrote last month, no one should keep every piece of data since the beginning of time, either.
Like any other data-related activity, the cleanup job starts with goal-setting. How will you know what to keep, if you don’t even know what you are about to do? If you “do” know what is on the horizon, then follow your own plan. If you don’t, the No. 1 step would be a companywide need analysis, as different types of data are required for different tasks.
The Process of Ridding Yourself of Data
First, ask the users and analysts:
- What is in the marketing plan?
- What type of predictions would be required for such marketing goals? Be as specific as possible:
  - Forecasting and Time-Series Analysis: You will need to keep some “old” data for sure for these.
  - Product Affinity Models for Cross-sell/Upsell: You must keep the “who bought what, for how much, when, and through what channel” type of data.
  - Attribution Analysis and Response Models: This type of analytics requires past promotion and response history data for at least a few calendar years.
  - Product Development and Planning: You would need SKU-level transaction data, but not from the beginning of time.
- What do you have? Do the full inventory and categorize them by data types, as you may have much more than you thought. Some examples are:
  - PII (Personally Identifiable Information): Name, Address, Email, Phone Number, Various IDs, etc. These are valuable connectors to other data sources such as Geo/Demographic Data.
  - Order/Transaction Data: Transaction Date, Amount, Payment Methods
  - Item/SKU-Level Data: Products, Price, Units
  - Promotion/Response History: Source, Channel, Offer, Creative, Drop/Wave, etc.
  - Life-to-Date/Past ‘X’ Months Summary Data: Not as good as detailed, event-level data, but summary data may be enough for trend analysis or forecasting.
  - Customer Status Flags: Active, Dormant, Delinquent, Canceled
  - Surveys/Product Registration: Attitudinal and Lifestyle Data
  - Customer Communication History Data: Call-center and web interaction data
  - Online Behavior: Open, Click-through, Page views, etc.
  - Social Media: Sentiment/Intentions
- What kind of data did you buy? Surprisingly large amounts of data are acquired from third-party data sources, and kept around indefinitely.
- Where are they? On what platform, and how are they stored?
- Who is accessing them? Through what channels and platforms? Via what methods or software? Search for them, as you may uncover data users in unexpected places. You do not want to throw things out without asking them.
- Who is updating them? Data that are not regularly updated are most likely to be junk.
Now, I’m not suggesting actually “deleting” data on a source level in the age of cheap storage. All I am saying is that not all data points are equally important, and some data can be easily tucked away. In short, if data don’t fit your goals, don’t bring them out to the front.
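To make the inventory questions above a bit more concrete, here is a minimal Python sketch of how the answers could be recorded and turned into a "front, archive, or ask first" suggestion. Everything in it is an assumption for illustration: the `DatasetEntry` fields, the `triage` function, the one-year staleness threshold, and the sample datasets are all hypothetical, and a real need analysis would use richer criteria.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DatasetEntry:
    """One row of a data inventory (all field names are hypothetical)."""
    name: str
    category: str            # e.g. "Order/Transaction", "Third-party"
    last_updated: date       # answers "Who is updating them?"
    active_users: int        # answers "Who is accessing them?"
    needed_for: tuple = ()   # planned analytics that depend on this data


def triage(entry: DatasetEntry, today: date, stale_days: int = 365) -> str:
    """Suggest where a dataset belongs. Nothing is deleted at the source."""
    if entry.needed_for:
        return "front"       # fits a stated goal, so keep it accessible
    stale = (today - entry.last_updated).days > stale_days
    if stale and entry.active_users == 0:
        return "archive"     # tuck it away in cheaper storage
    return "review"          # someone may still use it, so ask first


inventory = [
    DatasetEntry("orders", "Order/Transaction", date(2024, 5, 30), 12, ("forecasting",)),
    DatasetEntry("old_vendor_list", "Third-party", date(2019, 1, 15), 0),
    DatasetEntry("call_center_log", "Customer Communication History", date(2021, 3, 1), 2),
]
today = date(2024, 6, 1)
plan = {entry.name: triage(entry, today) for entry in inventory}
print(plan)
```

The "review" bucket is deliberate: stale data with known users should trigger a conversation, not an automatic move.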
Essentially, this is the first step of the data refinement process. The emergence of the Data Lake concept is rooted here. Big Data was too big, so users wanted to put more useful data in more easily accessible places. Now, the trouble with the Data Lake is that the lake water is still not drinkable, requiring further refinement. However, like I admitted that I may have to move again to clean my stuff out further, the cleaning process should be done in phases, and the Data Lake may as well be the first station.
In contrast, the Analytics Sandbox that I often discussed in this series would be more of a data haven for analysts, where every variable is cleaned, standardized, categorized, consolidated, and summarized for advanced analytics and targeting (refer to “Chicken or the Egg? Data or Analytics?” and “It’s All about Ranking”). Basically, it’s data on silver platters for professional analysts, whether humans or machines.
At the end of such data refinement processes, the end-users will see data in the form of “answers to questions.” As in, scores that describe targets in a concise manner, like “Likelihood of being an early adopter,” or “Likelihood of being a bargain-seeker.” To get to that stage, useful data must flow through the pipeline constantly and smoothly. But not all data are required to do that (refer to “Data Must Flow, But Not All of Them”).
For the folks who just want to cut to the chase, allow me to share a cheat sheet.
Disclaimer: You should really plan to do some serious need analysis to select and purge data from your value chain. Nonetheless, you may be able to kick-start a majority of customer-related analytics, if you start with this basic list.
Because different business models call for a different data menu, I divided the list by major industry types. If your industry is not listed here, use your imagination along with a need-analysis.
Merchandizing: Most retailers would fall into this category. Basically, you would provide products and services upon payment.
- Who: Customer ID / PII
- What: Product SKU / Category
- When: Purchase Date
- How Much: Total Paid, Net Price, Discount/Coupon, Tax, Shipping, Return
- Channel/Device: Store, Web, App, etc.
- Payment Method
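As a rough illustration of how the merchandising list above could be put to work, the following Python sketch rolls item-level purchase records up into a life-to-date summary per customer, the kind of summary that is worth finishing before detailed data gets tucked away. The `Purchase` class, its field names, and the `life_to_date_summary` helper are hypothetical names chosen for this example, not a standard schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Purchase:
    """One purchased item, following the cheat sheet (names are hypothetical)."""
    customer_id: str      # Who
    sku: str              # What
    purchase_date: date   # When
    total_paid: float     # How Much
    channel: str          # Channel/Device
    payment_method: str   # Payment Method


def life_to_date_summary(purchases):
    """Roll item-level records up to one summary record per customer."""
    summary = {}
    for p in purchases:
        s = summary.setdefault(p.customer_id, {
            "items": 0,
            "total_paid": 0.0,
            "first_purchase": p.purchase_date,
            "last_purchase": p.purchase_date,
            "channels": set(),
        })
        s["items"] += 1
        s["total_paid"] += p.total_paid
        s["first_purchase"] = min(s["first_purchase"], p.purchase_date)
        s["last_purchase"] = max(s["last_purchase"], p.purchase_date)
        s["channels"].add(p.channel)
    return summary


purchases = [
    Purchase("C1", "SKU-1", date(2023, 11, 5), 20.0, "Web", "Card"),
    Purchase("C1", "SKU-2", date(2024, 2, 10), 35.5, "Store", "Cash"),
    Purchase("C2", "SKU-1", date(2024, 1, 20), 20.0, "App", "Card"),
]
summary = life_to_date_summary(purchases)
print(summary["C1"])
```

Summary records like these are not as good as event-level data, but they are small enough to keep at the front for a long time.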
Subscription: This business model is coming back with full force, as a new generation of shoppers prefers subscription over ownership. It gets a little more complicated, as shipment/delivery and payment may follow different cycles.
- Who: Subscriber ID/PII
- Dates: First Subscription, Renewal, Payment, Delinquent, Cancelation, Reactivation, etc.
- Paid Amounts by Pay Period
- Number of Payments/Turns
- Payment Method
- Auto Payment Status
- Subscription Status
- Number of Renewals
- Subscription Terms
- Acquisition Channel/Device
- Acquisition Source
Hospitality: Most hotels and travel services fall under this category. This is even more complicated than the subscription model, as booking and travel dates, and the gaps between them, all play important parts in prediction and personalization.
- Who: Guest ID / PII
- Booking Site/Source
- Transaction Channel/Device
- Booking Date/Time/Day of Week
- Travel (Arrival) Date/Time
- Travel Duration
- Transaction Amount: Total Paid, Net Price, Discount, Coupon, Fees, Taxes
- Number of Rooms/Parties
- Room Class/Price Band
- Payment Method
- Corporate Discount Code
- Special Requests
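Since the gap between booking and travel dates matters so much here, a small Python sketch may help show why both dates are worth keeping. The `booking_features` function and its output keys are hypothetical, and the sample booking is made up for illustration.

```python
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def booking_features(booked_at: datetime, arrival: datetime, nights: int) -> dict:
    """Derive simple predictors from booking and arrival timestamps."""
    return {
        # Gap between booking and travel, in whole calendar days
        "lead_days": (arrival.date() - booked_at.date()).days,
        "booking_day_of_week": DAY_NAMES[booked_at.weekday()],
        "arrival_day_of_week": DAY_NAMES[arrival.weekday()],
        "travel_duration_nights": nights,
    }


features = booking_features(
    booked_at=datetime(2024, 3, 4, 21, 15),
    arrival=datetime(2024, 3, 29, 15, 0),
    nights=2,
)
print(features)
```

None of these derived values can be rebuilt later if only one of the two dates survives the cleanup.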
Promotion Data: On top of these basic lists of behavioral data, you would need promotion history to get into the “what worked” part of analytics, leading to real response models.
- Promotion Channel
- Source of Data/List
- Offer Type
- Creative Details
- Segment/Model (for selection/targeting)
- Drop/Contact Date
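To show why promotion history has to be kept next to the behavioral data, here is a minimal Python sketch that matches contacts to responses and computes a response rate per channel and offer. The record layout, the `promo_id` key, and the `response_rates` function are assumptions for this example; real response attribution is considerably messier.

```python
from collections import Counter


def response_rates(contacts, responses):
    """Response rate by (channel, offer), given who was contacted and who responded."""
    responded = set(responses)   # (customer_id, promo_id) pairs
    sent = Counter()
    hits = Counter()
    for c in contacts:
        key = (c["channel"], c["offer"])
        sent[key] += 1
        if (c["customer_id"], c["promo_id"]) in responded:
            hits[key] += 1
    return {key: hits[key] / sent[key] for key in sent}


# Hypothetical promotion history: one row per contact
contacts = [
    {"customer_id": "C1", "promo_id": "P1", "channel": "email", "offer": "10% off"},
    {"customer_id": "C2", "promo_id": "P1", "channel": "email", "offer": "10% off"},
    {"customer_id": "C3", "promo_id": "P1", "channel": "email", "offer": "10% off"},
    {"customer_id": "C4", "promo_id": "P1", "channel": "email", "offer": "10% off"},
    {"customer_id": "C1", "promo_id": "P2", "channel": "mail", "offer": "free shipping"},
    {"customer_id": "C5", "promo_id": "P2", "channel": "mail", "offer": "free shipping"},
]
responses = [("C2", "P1"), ("C5", "P2")]

rates = response_rates(contacts, responses)
print(rates)
```

Without the contact rows (the denominator), the response rows alone say nothing about what worked.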
Summing It All Up
I am certain that you have much more data, and would need more data categories than the ones on this list. For one, promotion data would be much more complicated if you gathered all types of touch data from Google tags and your own mail and email promotion history from multiple vendors. Like I said, this is a cheat sheet, and at some point, you’d have to get deeper.
Plus, you will still have to agonize over how far back in time you would have to go for a proper data inventory. That really depends on your business, as the data cycle for big-ticket items like home furniture or automobiles is far longer than that for consumables and budget-price items.
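One rough way to put a number on the "data cycle" is to measure the typical gap between a customer's consecutive purchases and keep a few cycles' worth of detail. The Python sketch below does that; the `median_gap_days` and `suggested_lookback_days` helpers are hypothetical, and the three-cycle multiplier is an arbitrary rule of thumb assumed for this example, not a recommendation from any standard.

```python
from datetime import date
from statistics import median


def median_gap_days(purchase_dates_by_customer):
    """Median number of days between consecutive purchases, across customers."""
    gaps = []
    for dates in purchase_dates_by_customer.values():
        ordered = sorted(dates)
        gaps.extend((later - earlier).days
                    for earlier, later in zip(ordered, ordered[1:]))
    return median(gaps) if gaps else None   # None if nobody bought twice


def suggested_lookback_days(gap_days, cycles=3):
    """Keep roughly `cycles` purchase cycles of detail (assumed multiplier)."""
    return gap_days * cycles


# Made-up examples: a consumables seller and a furniture seller
consumables = {
    "C1": [date(2024, 1, 1), date(2024, 1, 31), date(2024, 3, 1)],
    "C2": [date(2024, 2, 1), date(2024, 3, 2)],
}
furniture = {
    "F1": [date(2018, 6, 1), date(2023, 6, 1)],
}

consumables_gap = median_gap_days(consumables)
furniture_gap = median_gap_days(furniture)
print(suggested_lookback_days(consumables_gap), suggested_lookback_days(furniture_gap))
```

The point is only the contrast: the same rule yields a lookback of months for consumables and of many years for furniture, so a single companywide cutoff would be too crude.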
When in doubt, start asking your analysts. If they are not sure (i.e., they insist that they must have “everything, all the time”), then call for outside help. Knowing what to keep, based on business objectives, is the first step of building an analytics roadmap, anyway.
No matter how overwhelming this cleanup job may seem, it is something that most organizations must go through — at some point. Otherwise, your own IT department may decide to throw away “old” data, unilaterally. That is more like a foreclosure situation, and you won’t even be able to finish necessary data summary work before some critical data are gone. So, plan for streamlining the data flow like you just sold a house and must move out by a certain date. Happy cleaning, and don’t forget to whistle while you work.