Data Must Flow, But Not All of Them

Like water or any other resource, data may be locked in the wrong places or in inadequate forms. We hear about all kinds of doomsday scenarios related to the water supply in Africa, and they stem from the uneven distribution of water caused by drastic climate change and border disputes.

The answer is “not really.” Let’s start with seemingly complicated transaction history data. People’s purchasing behavior — the building blocks of models that predict future behavior — can be shortened to the following list:

  • Who bought,
  • What,
  • When,
  • For how much ($),
  • Through what channel.
  • If readily available, we may consider peripheral information, such as payment method, types of stores, etc.
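To make the list concrete, here is a minimal sketch of what one such record could look like. The class and field names are my own assumptions for illustration, not a standard layout, and the sample values are made up.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Transaction:
    """One purchase, reduced to the five core questions (hypothetical field names)."""
    customer_id: str        # who bought
    product: str            # what
    purchased_at: datetime  # when
    amount_paid: float      # for how much ($)
    channel: str            # through what channel
    payment_method: Optional[str] = None  # peripheral, only if readily available


# Made-up sample record; the peripheral field is simply left empty.
example = Transaction(
    customer_id="C-001",
    product="SKU-12345",
    purchased_at=datetime(2016, 3, 14, 10, 30),
    amount_paid=49.99,
    channel="online",
)
```

Keeping the peripheral field optional mirrors the point above: the five core items are required, and everything else is welcome but should never block the handoff.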

Of course, I’m grossly simplifying the matter. But starting a conversation about data transfer this way is much easier than insisting on absolutely everything that may come with an 800-page data dictionary. Even the most guarded IT managers would say, “Yeah, sure. Why not? That list doesn’t look so bad.” The idea is to get all involved parties mobilized toward the end goal faster, without arguing about the logistics of large data transfers. At the idea stage, get buy-in from the stakeholders and start talking about long- and short-term data goals.

When we get to the actual project stage, we can break down the details. Continuing with the transaction data example:

  • Who Bought: This part can be expressed as name, address, email, phone number, and other complete or incomplete IDs. Before insisting on having a perfect format for every field, think about the bare minimum for what you are trying to achieve. Not all marketing projects require pinpoint accuracy in the end.
  • What: It can be a product description, SKU categories in multiple levels, etc. Some of these fields may need serious re-categorization, but again, get ready to make do with what you get.
  • When: This is the easiest part. Just get the time-stamp and the timezone. But, if the project is for any continuity business, be prepared to collect all types of dates, such as member date, subscription start and end date, renewal date, payment date, delinquent date, etc.
  • How Much: This is unfortunately not that simple, as we may have to dig into differences in currencies, and details such as net price, tax, shipping, coupon, discount, return and total paid amount. Do not insist on getting every one of these fields correct. For predictive analytics, all of these numbers will be banded, anyway. Amount paid is the most important one, and discount amount can be very useful for the prediction of “bargain seekers.”
  • Through What Channel: Being consistent is important, as I have seen so many different labels for channels like “online” or “retail.” Also, outbound channel and inbound channel must be treated separately.
  • Other Fields? Payment method is very predictive, but don’t delay the project by insisting on non-essential items.
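Two of the points above, consistent channel labels and banded dollar amounts, can be sketched in a few lines. The label map and the band edges below are hypothetical; real ones would come from the labels actually found in the data and from the goals of the project.

```python
# Hypothetical label map: many raw spellings collapse into a few channels.
CHANNEL_LABELS = {
    "online": "online",
    "web": "online",
    "e-commerce": "online",
    "retail": "retail",
    "store": "retail",
    "in-store": "retail",
}

# Hypothetical band edges in dollars; each upper bound is exclusive.
AMOUNT_BANDS = [
    (25, "under $25"),
    (100, "$25 to under $100"),
    (500, "$100 to under $500"),
]


def normalize_channel(raw_label):
    """Map a raw channel label to a consistent one, or 'unknown' if unmapped."""
    return CHANNEL_LABELS.get(raw_label.strip().lower(), "unknown")


def band_amount(amount_paid):
    """Place an amount paid into the first band whose upper bound exceeds it."""
    for upper_bound, label in AMOUNT_BANDS:
        if amount_paid < upper_bound:
            return label
    return "$500 and up"
```

With these assumptions, `normalize_channel(" Web ")` yields `"online"` and `band_amount(49.99)` falls into the `"$25 to under $100"` band. Once amounts are banded this way, a few cents of error in tax or shipping rarely changes the outcome, which is why chasing perfection in every dollar field is not worth a delay.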

The key here is simplicity for everyone involved in the data handoff. Start with the idea that no dataset is perfect, and analytics is about making the most out of the data provided.

Even seemingly huge online behavior data can be simplified. All of those digital analytics toolsets are already working on streamlined data, anyway. Clicks, page views and conversions by various categories, such as product, channel, source or page type, are the common sets of variables, no matter what toolsets are employed.

Do you want to get into some prediction business based on what is happening on the site? Let’s not try to boil the ocean. Let’s simplify what we would insert into the process. We may just need:

  • Who is looking at it (or a proxy of a person, like a cookie)
  • What the visitor looked at
  • What page or category the object belongs to
  • How long the visitor looked at it
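As a rough illustration, those four items fit into a very small record, and even a simple roll-up of them starts to look like “behavior.” The names and sample values below are assumptions made up for this sketch, not the output of any particular analytics toolset.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PageView:
    """One viewing event with only the four basic properties (hypothetical names)."""
    visitor_id: str  # who is looking (or a proxy, such as a cookie ID)
    item: str        # what the visitor looked at
    category: str    # page or category the item belongs to
    seconds: float   # how long the visitor looked at it


def seconds_by_category(views):
    """Total viewing time for each (visitor, category) pair."""
    totals = defaultdict(float)
    for view in views:
        totals[(view.visitor_id, view.category)] += view.seconds
    return dict(totals)


# Made-up sample events for a single cookie.
views = [
    PageView("cookie-a", "red-sneaker", "shoes", 30.0),
    PageView("cookie-a", "blue-sneaker", "shoes", 15.0),
    PageView("cookie-a", "rain-jacket", "outerwear", 5.0),
]
summary = seconds_by_category(views)
```

Here the visitor behind `cookie-a` spent 45 seconds on shoes and 5 seconds on outerwear, which is already a usable signal of interest.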

Without a doubt, one may think of more data items to use. But these are the very basic properties that make up what we call “behavior.” So, let’s keep it simple in the beginning.

As I have been stating in my articles for three years now, Big Data are big because nothing gets to be thrown out. There’s lots of noise in any data, clean or dirty. The usefulness of the data is determined by the goals, not coolness. Depending on the purpose, the value of each nugget may change dramatically.

That is why the goal must be set first, before anyone tries to move the flow of large bodies of data. If the goal is clear and sound, the amount of data that has to change hands would look completely manageable. Even gargantuan data can be moved in small pieces through conduits the size of straws. We just have to know how to break them up before moving them for the users.
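One way to picture the “straws” is sketched below: keep only the fields the goal requires, then hand the records over in small batches. This is only an illustration under my own assumptions; the function names, field names, and batch size are hypothetical, and a real transfer would involve files or databases rather than in-memory records.

```python
from itertools import islice


def select_fields(records, fields):
    """Yield each record trimmed down to the fields the goal requires."""
    for record in records:
        yield {name: record[name] for name in fields}


def in_batches(items, batch_size):
    """Yield lists of at most batch_size items until the source is exhausted."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical goal: only two fields are needed, so "notes" never leaves the source.
NEEDED_FIELDS = ("customer_id", "amount_paid")
source = (
    {"customer_id": f"C-{i}", "amount_paid": float(i), "notes": "free text"}
    for i in range(5)
)
batches = list(in_batches(select_fields(source, NEEDED_FIELDS), batch_size=2))
```

Five records move as three small batches (two, two, and one), and because both helpers are generators, the full dataset never has to sit in memory at once.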

Author: Stephen H. Yu

Stephen H. Yu is a world-class database marketer. He has a proven track record in comprehensive strategic planning and tactical execution, effectively bridging the gap between the marketing and technology world with a balanced view obtained from more than 30 years of experience in best practices of database marketing. Currently, Yu is president and chief consultant at Willow Data Strategy. Previously, he was the head of analytics and insights at eClerx, and VP, Data Strategy & Analytics at Infogroup. Prior to that, Yu was the founding CTO of I-Behavior Inc., which pioneered the use of SKU-level behavioral data. “As a long-time data player with plenty of battle experiences, I would like to share my thoughts and knowledge that I obtained from being a bridge person between the marketing world and the technology world. In the end, data and analytics are just tools for decision-makers; let’s think about what we should be (or shouldn’t be) doing with them first. And the tools must be wielded properly to meet the goals, so let me share some useful tricks in database design, data refinement process and analytics.” Reach him at
