The answer is “not really.” Let’s start with seemingly complicated transaction history data. People’s purchasing behavior, the building block of models that predict future behavior, can be reduced to the following list:
- Who bought,
- What,
- When,
- For how much ($),
- Through what channel,
- And, if readily available, peripheral information such as payment method, types of stores, etc.
Of course I’m grossly simplifying the matter. But starting a conversation about data transfer this way is much easier than insisting on absolutely everything that may come with an 800-page data dictionary. Even the most guarded IT managers would say, “Yeah, sure. Why not? That list doesn’t look so bad.” The idea is to get all involved parties mobilized toward the end-goal faster, without arguing about the logistics of large data transfer. Instead, at the idea stage, get the buy-in from the stakeholders and start talking about long- and short-term data goals.
When we get to the actual project stage, we can break down the details. Continuing with the transaction data example:
- Who Bought: This part can be expressed as name, address, email, phone number, and other complete or incomplete IDs. Before insisting on having a perfect format for every field, think about the bare minimum for what you are trying to achieve. Not all marketing projects require pinpoint accuracy in the end.
- What: It can be a product description, SKU categories in multiple levels, etc. Some may need some serious re-categorization, but again, get ready to make do with what you get.
- When: This is the easiest part. Just get the timestamp and the time zone. But if the project is for a continuity business, be prepared to collect all types of dates, such as member date, subscription start and end dates, renewal date, payment date, delinquent date, etc.
- How Much: This is unfortunately not that simple, as we may have to dig into differences in currencies, and details such as net price, tax, shipping, coupon, discount, returns and total paid amount. Do not insist on getting every one of these fields correct. For predictive analytics, all of these numbers will be banded (grouped into ranges) anyway, so small inaccuracies rarely matter. Amount paid is the most important one, and discount amount can be very useful for the prediction of “bargain seekers.”
- Through What Channel: Being consistent is important, as I have seen so many different labels for channels like “online” or “retail.” Also, outbound channel and inbound channel must be treated separately.
- Other Fields? Payment method is very predictive, but don’t delay the project by insisting on non-essential items.
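To make the breakdown above a bit more concrete, here is a minimal Python sketch of what one simplified transaction record could look like once it reaches the analytics side. Everything in it is an assumption for illustration: the field names, the channel label map and the dollar band edges are hypothetical, and real cut points and labels would depend on the business.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Hypothetical label map: many raw labels collapse into one consistent vocabulary.
CHANNEL_MAP = {
    "online": "online", "web": "online", "e-commerce": "online",
    "retail": "retail", "store": "retail", "in-store": "retail",
}

# Hypothetical band edges in dollars (assumes non-negative amounts).
BAND_EDGES = [25, 100, 500]
BAND_LABELS = ["under_25", "25_to_99", "100_to_499", "500_plus"]


@dataclass
class Transaction:
    customer_id: str                      # who bought (any usable ID)
    product: str                          # what (SKU or category)
    purchased_at: datetime                # when (timezone-aware timestamp)
    amount_paid: float                    # for how much
    channel: str                          # through what channel (raw label)
    discount: float = 0.0                 # optional: helps flag bargain seekers
    payment_method: Optional[str] = None  # optional peripheral field


def normalize_channel(raw: str) -> str:
    """Map a raw channel label to the shared vocabulary, or 'unknown'."""
    return CHANNEL_MAP.get(raw.strip().lower(), "unknown")


def band_amount(amount: float) -> str:
    """Return the label of the band that the amount falls into."""
    return BAND_LABELS[bisect_right(BAND_EDGES, amount)]


def to_model_row(t: Transaction) -> dict:
    """Reduce a transaction to the handful of fields a model would use."""
    return {
        "customer_id": t.customer_id,
        "product": t.product,
        "purchase_date": t.purchased_at.astimezone(timezone.utc).date().isoformat(),
        "amount_band": band_amount(t.amount_paid),
        "used_discount": t.discount > 0,
        "channel": normalize_channel(t.channel),
        "payment_method": t.payment_method or "unknown",
    }
```

Note that missing or unrecognized values fall back to "unknown" rather than blocking the record, in keeping with the idea of making do with what you get. The sketch covers only the purchase channel; outbound and inbound channels would need separate fields.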
The key here is simplicity for everyone involved in the data handoff. Start with the idea that no dataset is perfect, and analytics is about making the most out of provided data.
Even seemingly huge online behavior data can be simplified. All of those digital analytics toolsets are already working on streamlined data, anyway. Clicks, page views and conversions by various categories, such as product, channel, source or page type, are the common sets of variables, no matter what toolsets are employed.
Do you want to get into some prediction business based on what is happening on the site? Let’s not try to boil the ocean. Let’s simplify what we would insert into the process. We may just need:
- Who is looking at it (or a proxy of a person, like a cookie)
- What the visitor looked at
- What page or category the object belongs to
- How long the visitor looked at it
Without a doubt, one may think of more data items to use. But these are the very basic properties that make up what we call “behavior.” So, let’s keep it simple in the beginning.
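As a rough illustration of how little is needed, the four items above fit into one small record, and a first behavioral summary is just a roll-up by visitor and category. This is a sketch under my own assumptions: `PageView`, `summarize` and the field names are hypothetical, not taken from any particular analytics toolset.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class PageView:
    visitor_id: str   # who is looking (or a proxy, such as a cookie ID)
    item: str         # what the visitor looked at
    category: str     # page or category the item belongs to
    seconds: float    # how long the visitor looked at it


def summarize(views):
    """Total view count and viewing time per (visitor, category) pair."""
    totals = defaultdict(lambda: {"views": 0, "seconds": 0.0})
    for v in views:
        key = (v.visitor_id, v.category)
        totals[key]["views"] += 1
        totals[key]["seconds"] += v.seconds
    return dict(totals)
```

A summary like this, counts and time spent by category for each visitor, is the kind of compact input a prediction model could start from, well before anyone asks for the full clickstream.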
As I have been stating in my articles for three years now, Big Data are big because nothing gets thrown out. There’s lots of noise in any data, clean or dirty. The usefulness of the data is determined by the goals, not coolness. Depending on the purpose, the value of each nugget may change dramatically.
That is why the goal must be set first, before anyone tries to move the flow of large bodies of data. If the goal is clear and sound, the amount of data that has to change hands would look completely manageable. Even gargantuan data can be moved in small pieces through conduits the size of straws. We just have to know how to break them up before moving them for the users.