The Art of Data Categorization

Do you know why some reports are unbearably long and filled with numbers that are irrelevant to decision-making? Mostly, it is because the level of detail that the report calls for and the way the data are actually categorized are seriously misaligned. Raw data, with very few exceptions, are not ready for decision-making (through various reports) or statistical modeling (an important part of what we often call advanced analytics).

Machine-learning is getting better at recognition and categorization by leaps and bounds, for sure. My dog has a Facebook page — don’t ask why — and on his last birthday, Facebook correctly converted his age to dog years. I kept hearing that machines have a hard time separating dog and cat pictures, but apparently such an obstacle has been overcome (or do they just use dog years for cats, too?).

In any case, do machines understand the “purpose” of categorization and tagging, as well? Do they understand why it is even necessary to put my dog’s age in dog years? That is an entirely different matter, and whether the work is done by humans or machines, I have seen time and time again that categorization efforts with clear purposes result in improvement in analytics and prediction.

Let’s take the hot topic of personalization as an example. Folks who have read my previous articles may already know that I am not nearly impressed with various marketing efforts under the banner of personalization today. Most are done on a product level, with raw product-level data, when personalization must first and foremost be about the person.

Even at a basic level of personalization, consumers on the receiving end often suspect that some personalization engines don’t even consider categories of products, as a suggested product is often irrelevant, dubious or even stupid (as in, “Hey, I just bought that exact item! Why are they offering it to me again?”). I can think of many reasons why that happens (mostly around data and analytics), but the first wrong gear often is that data are not properly categorized.

Results of analytical efforts for personalization and other complex challenges certainly improve when clean data enter the system. Most analysts spend the majority of their valuable time on data preparation (or even give up on using some granular data) mostly because input data are unclean, unstructured or uncategorized.

Allow me to share some categorization rules that I developed through countless rounds of trial and error during my co-op database days, when we had to put tens of millions of SKUs from over 1,500 sources into one consistent list of categories, solely for the purpose of analytics for individual-level targeting. Whether the actual categorization is done by humans or machines is not the issue; either one has to “learn” which category should be assigned to each item, and that starts with a proper categorization framework.

The rules I am introducing here are for personal-level targeting and customization of messages; therefore, they are “customer-centric” at the core. You may need to develop separate frameworks if the goals are different. Problem statements such as “What will be the most popular product next season?” for instance, would require product-centric categorization. Nonetheless, this framework will be useful when setting up your own, as well.

Without further ado, let’s dive into the list:

Categorize the Buyers, Not the Product

This may not sound intuitive, but it is the first item to remember when setting up a goal-oriented categorization framework. If it is for personalization, and if you are creating a 360-degree view of customers for that purpose, don’t stop at collecting product-level information; convert it into descriptors of buyers. Categorizing items with this goal in mind results in a vastly different, and far more predictive, outcome.

For instance, some items in a nautical catalog, such as a wall-mounted weather station (displaying temperature, air pressure, humidity, etc. on a fancy panel), can also be purchased from an executive gift catalog or website. When assigning categories for items like that, think about the context of the purchase, not just SKU descriptions, to avoid cases where you end up sending nautical catalogs to casual gift buyers. When in doubt, imagine how many purposes baking soda serves; think about the context of the purchase to describe the buyer, depending on the specific purpose (e.g., baking, personal hygiene, deodorization of a refrigerator, domestic cleaning, etc.).

Also consider the price scale and purpose of the purchase, so that you do not end up putting a cheap, everyday lamp and a state-of-the-art home decor lamp in the same category, leading to seriously misaligned offers. You must look beyond simple product descriptions.
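To make the idea concrete, here is a minimal sketch in Python of turning product-level transactions into descriptors of buyers. Everything in it is an assumption for illustration: the sample transactions, the `CONTEXT_CATEGORY` rule table and the `buyer_profiles` helper are hypothetical, not a description of any particular system. The point it shows is that the same SKU can land in different buyer-facing categories depending on the context of the purchase.

```python
from collections import defaultdict

# Hypothetical transactions: (customer_id, sku, purchase_channel, price)
transactions = [
    ("c1", "weather-station", "nautical_catalog", 180.0),
    ("c2", "weather-station", "executive_gift_site", 180.0),
    ("c2", "desk-clock", "executive_gift_site", 95.0),
    ("c1", "boat-cleat", "nautical_catalog", 25.0),
]

# Assumed rule table: the same SKU maps to a different buyer-facing
# category depending on where (in what context) it was purchased.
CONTEXT_CATEGORY = {
    ("weather-station", "nautical_catalog"): "boating",
    ("weather-station", "executive_gift_site"): "executive_gifts",
    ("desk-clock", "executive_gift_site"): "executive_gifts",
    ("boat-cleat", "nautical_catalog"): "boating",
}


def buyer_profiles(rows):
    """Roll transactions up to one record per customer: spend per category."""
    profiles = defaultdict(lambda: defaultdict(float))
    for customer_id, sku, channel, price in rows:
        # Unknown (sku, channel) pairs fall into a catch-all bucket.
        category = CONTEXT_CATEGORY.get((sku, channel), "all_other")
        profiles[customer_id][category] += price
    return {cid: dict(cats) for cid, cats in profiles.items()}


profiles = buyer_profiles(transactions)
print(profiles)
```

In this toy data, both customers bought the same weather station, yet only one of them ends up described as a boating buyer.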

The More Specific, the Better

Basically, don’t be lazy and put a 4K TV under “Home Electronics” and call it a day. For apparel items, the gender break is the easy part, but sub-categories are even more important for prediction. Most modern product categorization schemas are multilevel, like Home Electronics>Home Theater>TV>4K TV. So use all the levels.

I’m not saying that all the minute details are helpful for analytics; I’m just emphasizing that one can combine categories later in the process. But if things are lumped up to begin with, one cannot break them apart without resorting back to the source data.
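Here is a small Python sketch of why the detailed path is the safer starting point: rolling up is a one-line operation, while breaking a lumped category apart is not possible without the source data. The purchase counts and the `roll_up` helper are hypothetical, made up for this illustration.

```python
from collections import Counter

SEP = ">"

# Hypothetical purchase counts, keyed by the full category path.
purchases = Counter({
    "Home Electronics>Home Theater>TV>4K TV": 40,
    "Home Electronics>Home Theater>TV>HD TV": 25,
    "Home Electronics>Home Theater>Soundbar": 10,
    "Home Electronics>Audio>Headphones": 30,
})


def roll_up(counts, depth):
    """Combine detailed categories by keeping only the first `depth` levels."""
    rolled = Counter()
    for path, n in counts.items():
        key = SEP.join(path.split(SEP)[:depth])
        rolled[key] += n
    return rolled


by_level_2 = roll_up(purchases, 2)
by_level_3 = roll_up(purchases, 3)
print(dict(by_level_2))
print(dict(by_level_3))
```

Had the items been stored only as “Home Electronics,” neither of these views could be produced later.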

You will be better off if this type of effort is performed as early in the process as possible. Don’t create some big homework for everyone — especially for the analysts — for later. 

Consistency Over Accuracy

This may sound weird as well, but consistently wrong data may be more predictive than “sometimes” accurate data. Assigning the same item to different categories at different times creates all kinds of havoc in reporting and prediction downstream. We may argue forever about whether a certain type of luxury handbag belongs in a given category, with no clear winner in the end. The key point is that one should not go back and forth on established categorization rules.

If you can’t settle the fight, then use multiple tags for an item (I don’t recommend it personally). In any case, to machines and algorithms, those categories are just a numeric representation of where they belong, without any judgement. Don’t spend too much energy on making human sense out of every assignment. We can always change the “label” at the reporting stage.
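One way to picture this is the Python sketch below, which keeps a single frozen rule table, stores numeric codes, and keeps the human-readable labels in a separate lookup that can change at reporting time. All SKUs, codes and labels are invented for illustration, and `999` as the “All other” code is an assumption.

```python
# Hypothetical, frozen rule table: each SKU gets exactly one category code.
SKU_TO_CODE = {
    "handbag-lux-01": 310,
    "handbag-lux-02": 310,
    "tote-basic-07": 320,
}

# Labels live separately, so they can be renamed at the reporting stage
# without touching the codes that models and reports are built on.
CODE_LABELS = {310: "Luxury Accessories", 320: "Everyday Bags", 999: "All other"}


def assign(sku, table=SKU_TO_CODE, default=999):
    """Return the single category code for a SKU (999 = assumed catch-all)."""
    return table.get(sku, default)


def conflicting_assignments(pairs):
    """Find SKUs that were given more than one code across (sku, code) pairs."""
    seen = {}
    for sku, code in pairs:
        seen.setdefault(sku, set()).add(code)
    return {sku: codes for sku, codes in seen.items() if len(codes) > 1}


# An audit over assignments collected from different sources or time periods.
conflicts = conflicting_assignments(
    [("handbag-lux-01", 310), ("handbag-lux-01", 320), ("tote-basic-07", 320)]
)
print(conflicts)
```

A check like `conflicting_assignments` is cheap to run whenever new sources are loaded, and it flags the back-and-forth before it reaches the analysts.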

Categorize Only as Much as It Matters

When categorizing items for targeting and reporting, we do not have to create a new schema that covers the entire spectrum of items. If targeting is the end goal, you don’t even have to touch the items that did not sell very well, as there are not many buyers behind them. Going further, it is all right to categorize only the top 20 percent of the items in terms of popularity (i.e., number of transactions or revenue dollar amount), if that covers over 80 percent of customer behaviors. Yes, I said don’t be lazy under “The More Specific, the Better,” but there is no point in spending energy categorizing small items that may not even move analytical needles later. In other words, know when to stop and use the “All other” category for insignificant ones.
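The stopping rule can be sketched in a few lines of Python. The item counts are made up, the 80 percent coverage threshold is just the rule of thumb from the paragraph above, and `items_to_categorize` is a hypothetical helper name.

```python
def items_to_categorize(item_counts, coverage=0.80):
    """Return the most popular items whose combined transactions reach `coverage`."""
    total = sum(item_counts.values())
    selected, running = [], 0
    ranked = sorted(item_counts.items(), key=lambda kv: kv[1], reverse=True)
    for item, n in ranked:
        if running >= coverage * total:
            break  # enough customer behavior is already covered
        selected.append(item)
        running += n
    return selected


# Hypothetical transaction counts per item.
item_counts = {"A": 500, "B": 300, "C": 100, "D": 50, "E": 30, "F": 20}

keep = items_to_categorize(item_counts)
keep_set = set(keep)

# Everything outside the selected set goes to the catch-all bucket.
labels = {item: (item if item in keep_set else "All other") for item in item_counts}
print(keep)
print(labels)
```

Swapping transaction counts for revenue only changes the numbers that go into `item_counts`; the cutoff logic stays the same.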

Cut Out the Noise

Not every little detail matters in analytics. For example, the “color” of an item may matter a great deal for inventory management (as in “Hey, we are running low on the toasters in Ferrari red!”). But unless you are thinking about targeting people who only purchase items in red, you may not need such details for customized communication and offers. Break down the elements that make up an item, and go only as far as your specific goal calls for. Consult with analysts when in doubt.

Be Inventive

Creating the category buckets is the first important step of categorization efforts. This is where one must “imagine” what type of category would be useful for reporting and prediction later. Simple food labels could lead to all kinds of interesting “behavioral” categories that may be extremely useful when personalizing offers (refer to “Freeform Data Are Not Exactly Free”). This may sound contradictory to “Cut Out the Noise,” but striking the right balance between “too much” and “too little” is indeed the human function, for now, that I was talking about.


Analytics, as we’ve been saying for a long time, is a “garbage-in-garbage-out” business. But in the age of abundant and ubiquitous data, some “seemingly” useless data can be truly predictive. If we don’t think about “data refinement” (of which categorization is a big part), analysts will end up beating a few popular variables to death or, worse yet, pushing raw data through some analytical engine “hoping” for some good results.

If the current state of personalization is any indication, most available data must be refined in a more systematic and rigorous fashion, whether by machines or humans. And until machines catch up with us in the areas of creativity, intuition and logical deduction, we will have to be the ones who set up the framework.

Data have become too big and complex, and customers too demanding, for marketers to leave anything to chance. Even your off-the-shelf personalization engine will run better with well-categorized data. So, commit to that step, set up proper frameworks and rules, and move on to automation once the organization is ready for it.

AI may take over the world soon, but different types of thinking machines will have to work together to make various marketing efforts truly fruitful. And categorization, along with predictive analytics, is an important component. That is, if you as a consumer believe that machine-driven personalization can use a “human touch.”

Author: Stephen H. Yu

Stephen H. Yu is a world-class database marketer. He has a proven track record in comprehensive strategic planning and tactical execution, effectively bridging the gap between the marketing and technology world with a balanced view obtained from more than 30 years of experience in best practices of database marketing. Currently, Yu is president and chief consultant at Willow Data Strategy. Previously, he was the head of analytics and insights at eClerx, and VP, Data Strategy & Analytics at Infogroup. Prior to that, Yu was the founding CTO of I-Behavior Inc., which pioneered the use of SKU-level behavioral data. “As a long-time data player with plenty of battle experiences, I would like to share my thoughts and knowledge that I obtained from being a bridge person between the marketing world and the technology world. In the end, data and analytics are just tools for decision-makers; let’s think about what we should be (or shouldn’t be) doing with them first. And the tools must be wielded properly to meet the goals, so let me share some useful tricks in database design, data refinement process and analytics.” Reach him at

2 thoughts on “The Art of Data Categorization”

  1. As usual, Stephen’s clarity of thought and expression makes his blogs invaluable. I’d recommend making this one go “viral” around every company that wishes to question its own practice and focus.

    1. Thank you, Peter! I wrote this one for a selfish reason as well. Analytics work so much better with clean and categorized data, even if not every piece of data is put into the right slot. And this step doesn’t even require a degree in statistics; just organizational commitment.
