How to Outsource Analytics

In this series, I have been emphasizing the importance of statistical modeling in almost every article. While there are plenty of benefits of using statistical models in a more traditional sense (refer to “Why Model?”), in the days when “too much” data is the main challenge, I would dare to say that the most important function of statistical models is that they summarize complex data into simple-to-use “scores.”

In this series, I have been emphasizing the importance of statistical modeling in almost every article. While there are plenty of benefits of using statistical models in a more traditional sense (refer to “Why Model?”), in the days when “too much” data is the main challenge, I would dare to say that the most important function of statistical models is that they summarize complex data into simple-to-use “scores.”

The next important feature would be that models fill in the gaps, transforming “unknowns” to “potentials.” You see, even in the age of ubiquitous data, no one will ever know everything about everybody. For instance, out of 100,000 people you have permission to contact, only a fraction will be “known” wine enthusiasts. With modeling, we can assign scores for “likelihood of being a wine enthusiast” to everyone in the base. Sure, models are not 100 percent accurate, but I’ll take “70 percent chance of afternoon shower” over not knowing the weather forecast for the day of the company picnic.

I’ve already explained other benefits of modeling in detail earlier in this series, but if I may cut it really short, models will help marketers:

1. In deciding whom to engage, as they cannot afford to spam the world and annoy everyone who can read, and

2. In determining what to offer once they decide to engage someone, as consumers are savvier than ever and they will ignore and discard any irrelevant message, no matter how good it may look.

OK, then. I hope you are sold on this idea by now. The next question is, who is going to do all that mathematical work? In a country where jocks rule over geeks, it is clear to me that many folks are more afraid of mathematics than public speaking; which, in its own right, ranks higher than death in terms of the fear factor for many people. If I may paraphrase “Seinfeld,” many folks are figuratively more afraid of giving a eulogy than being in the coffin at a funeral. And thanks to a sub-par math education in the U.S. (and I am not joking about this, having graduated high school on foreign soil), yes, the fear of math tops them all. Scary, heh?

But that’s OK. This is a big world, and there are plenty of people who are really good at mathematics and statistics. That is why I purposefully never got into the mechanics of modeling techniques and related programming issues in this series. Instead, I have been emphasizing how to formulate questions, how to express business goals in a more logical fashion and where to invest to create analytics-ready environments. Then the next question is, “How will you find the right math geeks who can make all your dreams come true?”

If you have a plan to create an internal analytics team, there are a few things to consider before committing to that idea. Too many organizations just hire one or two statisticians, dump all the raw data onto them, and hope to God that they will figure some ways to make money with data, somehow. Good luck with that idea, as:

1. I’ve seen so many failed attempts like that (actually, I’d be shocked if it actually worked), and

2. I am sure God doesn’t micromanage statistical units.

(Similarly, I am almost certain that she doesn’t care much for football or baseball scores of certain teams, either. You don’t think God cares more for the Red Sox than the Yankees, do ya?)

The first challenge is locating good candidates. If you post any online ad for “Statistical Analysts,” you will receive a few hundred resumes per day. But the hiring process is not that simple, as you should ask the right questions to figure out who is a real deal, and who is a poser (and there are many posers out there). Even among qualified candidates with ample statistical knowledge, there are differences between the “Doers” and “Vendor Managers.” Depending on your organizational goal, you must differentiate the two.

Then the next challenge is keeping the team intact. In general, mathematicians and statisticians are not solely motivated by money; they also want constant challenges. Like any smart and creative folks, they will simply pack up and leave, if “they” determine that the job is boring. Just a couple of modeling projects a year with some rudimentary sets of data? Meh. Boring! Promises of upward mobility only work for a fraction of them, as the majority would rather deal with numbers and figures, showing no interest in managing other human beings. So, coming up with interesting and challenging projects, which will also benefit the whole organization, becomes a job in itself. If there are not enough challenges, smart ones will quit on you first. Then they need constant mentoring, as even the smartest statisticians will not know everything about challenges associated with marketing, target audiences and the business world, in general. (If you stumble into a statistician who is even remotely curious about how her salary is paid for, start with her.)

Further, you would need to invest to set up an analytical environment, as well. That includes software, hardware and other supporting staff. Toolsets are becoming much cheaper, but they are not exactly free yet. In fact, some famous statistical software, such as SAS, could be quite expensive year after year, although there are plenty of alternatives now. And they need an “analytics-ready” data environment, as I emphasized countless times in this series (refer to “Chicken or the Egg? Data or Analytics?” and “Marketing and IT; Cats and Dogs”). Such data preparation work is not for statisticians, and most of them are not even good at cleaning up dirty data, anyway. That means you will need different types of developers/programmers on the analytics team. I pointed out that analytical projects call for a cohesive team, not some super-duper analyst who can do it all (refer to “How to Be a Good Data Scientist”).

By now you would say “Jeez Louise, enough already,” as all this is just too much to manage to build just a few models. Suddenly, outsourcing may sound like a great idea. Then you would realize there are many things to consider when outsourcing analytical work.

First, where would you go? Everyone in the data industry and their cousins claim that they can take care of analytics. But in reality, it is a scary place where many who have “analytics” in their taglines do not even touch “predictive analytics.”

Analytics is a word that is abused as much as “Big Data,” so we really need to differentiate them. “Analytics” may mean:

  • Business Intelligence (BI) Reporting: This is mostly about the present, such as the display of key success metrics and dashboard reporting. While it is very important to know about the current state of business, much of so-called “analytics” unfortunately stops right here. Yes, it is good to have a dashboard in your car now, but do you know where you should be going?
  • Descriptive Analytics: This is about how the targets “look.” Common techniques such as profiling, segmentation and clustering fall under this category. These techniques are mainly for describing the target audience to enhance and optimize messages to them. But using these segments as a selection mechanism is not recommended, while many dare to do exactly that (more on this subject in future articles).
  • Predictive Modeling: This is about answering the questions about the future. Who would be more likely to behave certain ways? What communication channels will be most effective for whom? How much is the potential spending level of a prospect? Who is more likely to be a loyal and profitable customer? What are their preferences? Response models, various of types of cloning models, value models, and revenue models, attrition models, etc. all fall under this category, and they require hardcore statistical skills. Plus, as I emphasized earlier, these model scores compact large amounts of complex data into nice bite-size packages.
  • Optimization: This is mostly about budget allocation and attribution. Marketing agencies (or media buyers) generally deal with channel optimization and spending analysis, at times using econometrics models. This type of statistical work calls for different types of expertise, but many still insist on calling it simply “analytics.”

Let’s say that for the purpose of customer-level targeting and personalization, we decided to outsource the “predictive” modeling projects. What are our options?

We may consider:

  • Individual Consultants: In-house consultants are dedicated to your business for the duration of the contract, guaranteeing full access like an employee. But they are there for you only temporarily, with one foot out the door all the time. And when they do leave, all the knowledge walks away with them. Depending on the rate, the costs can add up.
  • Standalone Analytical Service Providers: Analytical work is all they do, so you get focused professionals with broad technical and institutional knowledge. Many of them are entrepreneurs, but that may work against you, as they could often be understaffed and stretched thin. They also tend to charge for every little step, with not many freebies. They are generally open to use any type of data, but the majority of them do not have secure sources of third-party data, which could be essential for certain types of analytics involving prospecting.
  • Database Service Providers: Almost all data compilers and brokers have statistical units, as they need to fill in the gap within their data assets with statistical techniques. (You didn’t think that they knew everyone’s income or age, did you?) For that reason, they have deep knowledge in all types of data, as well as in many industry verticals. They provide a one-stop shop environment with deep resource pools and a variety of data processing capabilities. However, they may not be as agile as smaller analytical shops, and analytics units may be tucked away somewhere within large and complex organizations. They also tend to emphasize the use of their own data, as after all, their main cash cows are their data assets.
  • Direct Marketing Agencies: Agencies are very strategic, as they touch all aspects of marketing and control creative processes through segmentation. Many large agencies boast full-scale analytical units, capable of all types of analytics that I explained earlier. But some agencies have very small teams, stretched really thin—just barely handling the reporting aspect, not any advanced analytics. Some just admit that predictive analytics is not part of their core competencies, and they may outsource such projects (not that it is a bad thing).

As you can see here, there is no clear-cut answer to “with whom you should you work.” Basically, you will need to check out all types of analysts and service providers to determine the partner best suitable for your long- and short-term business purposes, not just analytical goals. Often, many marketers just go with the lowest bidder. But pricing is just one of many elements to be considered. Here, allow me to introduce “10 Essential Items to Consider When Outsourcing Analytics.”

1. Consulting Capabilities: I put this on the top of the list, as being a translator between the marketing and the technology world is the most important differentiator (refer to “How to Be a Good Data Scientist”). They must understand the business goals and marketing needs, prescribe suitable solutions, convert such goals into mathematical expressions and define targets, making the best of available data. If they lack strategic vision to set up the data roadmap, statistical knowledge alone will not be enough to achieve the goals. And such business goals vary greatly depending on the industry, channel usage and related success metrics. Good consultants always ask questions first, while sub-par ones will try to force-fit marketers’ goals into their toolsets and methodologies.

Translating marketing goals into specific courses of action is a skill in itself. A good analytical partner should be capable of building a data roadmap (not just statistical steps) with a deep understanding of the business impact of resultant models. They should be able to break down larger goals into smaller steps, creating proper phased approaches. The plan may call for multiple models, all kinds of pre- and post-selection rules, or even external data acquisition, while remaining sensitive to overall costs.

The target definition is the core of all these considerations, which requires years of experience and industry knowledge. Simply, the wrong or inadequate targeting decision leads to disastrous results, no matter how sound the mathematical work is (refer to “Art of Targeting”).

Another important quality of a good analytical partner is the ability to create usefulness out of seemingly chaotic and unstructured data environments. Modeling is not about waiting for the perfect set of data, but about making the best of available data. In many modeling bake-offs, the winners are often decided by the creative usage of provided data, not just statistical techniques.

Finally, the consultative approach is important, as models do not exist in a vacuum, but they have to fit into the marketing engine. Be aware of the ones who want to change the world around their precious algorithms, as they are geeks not strategists. And the ones who understand the entire marketing cycle will give advice on what the next phase should be, as marketing efforts must be perpetual, not transient.

So, how will you find consultants? Ask the following questions:

  • Are they “listening” to you?
  • Can they repeat “your” goals in their own words?
  • Do their roadmaps cover both short- and long-term goals?
  • Are they confident enough to correct you?
  • Do they understand “non-statistical” elements in marketing?
  • Have they “been there, done that” for real, or just in theories?

2. Data Processing Capabilities: I know that some people look down upon the word “processing.” But data manipulation is the most important key step “before” any type of advanced analytics even begins. Simply, “garbage-in, garbage out.” And unfortunately, most datasets are completely unsuitable for analytics and modeling. In general, easily more than 80 percent of model development time goes into “fixing” the data, as most are unstructured and unrefined. I have been repeatedly emphasizing the importance of a “model-ready” (or “analytics-ready”) environment for that reason.

However, the reality dictates that the majority of databases are indeed NOT model-ready, and most of them are not even close to it. Well, someone has to clean up the mess. And in this data business, the last one who touches the dataset becomes responsible for all the errors and mistakes made to it thus far. I know it is not fair, but that is why we need to look at the potential partner’s ability to handle large and really messy data, not just the statistical savviness displayed in glossy presentations.

Yes, that dirty work includes data conversion, edit/hygiene, categorization/tagging, data summarization and variable creation, encompassing all kinds of numeric, character and freeform data (refer to “Beyond RFM Data” and “Freeform Data Aren’t Exactly Free”). It is not the most glorious part of this business, but data consistency is the key to successful implementation of any advanced analytics. So, if a model-ready environment is not available, someone had better know how to make the best of whatever is given. I have seen too many meltdowns in “before” and “after” modeling steps due to inconsistencies in databases.

So, grill the candidates with the following questions:

  • If they support file conversions, edit, categorization and summarization
  • How big of a dataset is too big, and how many files/tables are too many for them
  • How much free-form data are too much for them
  • Ask for sample model variables that they have created in the past

3. Track Records in the Industry: It can be argued that industry knowledge is even more crucial for the success than statistical know-how, as nuances are often “Lost in Translation” without relevant industry experience. In fact, some may not even be able to carry on a proper conversation with a client without it, leading to all kinds of wrong assumptions. I have seen a case where “real” rocket scientists messed up models for credit card campaigns.

The No. 1 reason why industry experience is important is everyone’s success metrics are unique. Just to name a few, financial services (banking, credit card, insurance, investment, etc.), travel and hospitality, entertainment, packaged goods, online and offline retail, catalogs, publication, telecommunications/utilities, non-profit and political organizations all call for different types of analytics and models, as their business models and the way they interact with target audiences are vastly different. For example, building a model (or a database, for that matter) for businesses where they hand over merchandise “before” they collect money is fundamentally different than the ones where exchange happens simultaneously. Even a simple concept of payment date or transaction date cannot be treated the same way. For retailers, recent dates could be better for business, but for subscription business, older dates may carry more weight. And these are just some examples with “dates,” before touching any dollar figures or other fun stuff.

Then the job gets even more complicated, if we further divide all of these industries by B-to-B vs. B-to-C, where available data do not even look similar. On top of that, divisional ROI metrics may be completely different, and even terminology and culture may play a role in all of this. When you are a consultant, you really don’t want to stop the flow of a meeting to clarify some unfamiliar acronyms, as you are supposed to know them all.

So, always demand specific industry references and examine client roasters, if allowed. (Many clients specifically ask vendors not to use their names as references.) Basically, watch out for the ones who push one-size-fits-all cookie-cutter solutions. You deserve way more than that.

4. Types of Models Supported: Speaking of cookie-cutter stuff, we need to be concerned with types of models that the outsourcing partner would support. Sure, nobody employs every technique, and no one can be good at everything. But we need to watch out for the “One-trick Ponies.”

This could be a tricky issue, as we are going into a more technical domain. Plus, marketers should not self-prescribe with specific techniques, instead of clearly stating their business goals (refer to “Marketing and IT; Cats and Dogs”). Some of the modeling goals are:

  • Rank and select prospect names
  • Lead scoring
  • Cross-sell/upsell
  • Segment the universe for messaging strategy
  • Pinpoint the attrition point
  • Assign lifetime values for prospects and customers
  • Optimize media/channel spending
  • Create new product packages
  • Detect fraud
  • Etc.

Unless you have successfully dealt with the outsourcing partner in the past (or you have a degree in statistics), do not blurt out words like Neural-net, CHAID, Cluster Analysis, Multiple Regression, Discriminant Function Analysis, etc. That would be like demanding specific medication before your new doctor even asks about your symptoms. The key is meeting your business goals, not fulfilling buzzwords. Let them present their methodology “after” the goal discussion. Nevertheless, see if the potential partner is pushing one or two specific techniques or solutions all the time.

5. Speed of Execution: In modern marketing, speed to action is the king. Speed wins, and speed gains respect. However, when it comes to modeling or other advanced analytics, you may be shocked by the wide range of time estimates provided by each outsourcing vendor. To be fair they are covering themselves, mainly because they have no idea what kind of messy data they will receive. As I mentioned earlier, pre-model data preparation and manipulation are critical components, and they are the most time-consuming part of all; especially when available data are in bad shape. Post-model scoring, audit and usage support may elongate the timeline. The key is to differentiate such pre- and post-modeling processes in the time estimate.

Even for pure modeling elements, time estimates vary greatly, depending on the complexity of assignments. Surely, a simple cloning model with basic demographic data would be much easier to execute than the ones that involve ample amounts of transaction- and event-level data, coming from all types of channels. If time-series elements are added, it will definitely be more complex. Typical clustering work is known to take longer than regression models with clear target definitions. If multiple models are required for the project, it will obviously take more time to finish the whole job.

Now, the interesting thing about building a model is that analysts don’t really finish it, but they just run out of time—much like the way marketers work on PowerPoint presentations. The commonality is that we can basically tweak models or decks forever, but we have to stop at some point.

However, with all kinds of automated tools and macros, model development time has decreased dramatically in past decades. We really came a long way since the first application of statistical techniques to marketing, and no one should be quoting a 1980s timeline in this century. But some still do. I know vendors are trained to follow the guideline “always under-promise and over-deliver,” but still.

An interesting aspect of this dilemma is that we can negotiate the timeline by asking for simpler and less sophisticated versions with diminished accuracy. If, hypothetically, it takes a week to be 98 percent accurate, but it only takes a day to be 90 percent accurate, what would you pick? That should be the business decision.

So, what is a general guideline? Again, it really depends on many factors, but allow me to share a version of it:

  • Pre-modeling Processing

– Data Conversions: from half a day to weeks

– Data Append/Enhancement: between overnight and two days

– Data Edit and Summarization: Data-dependent

  • Modeling: Ranges from half a day to weeks

– Depends on type, number of models and complexity

  • Scoring: from half a day to one week

– Mainly depends on number of records and state of the database to be scored

I know these are wide ranges, but watch out for the ones that routinely quote 30 days or more for simple clone models. They may not know what they are doing, or worse, they may be some mathematical perfectionists who don’t understand the marketing needs.

6. Pricing Structure: Some marketers would put this on top of the checklist, or worse, use the pricing factor as the only criterion. Obviously, I disagree. (Full disclosure: I have been on the service side of the fence during my entire career.) Yes, every project must make an economic sense in the end, but the budget should not and cannot be the sole deciding factor in choosing an outsourcing partner. There are many specialists under famous brand names who command top dollars, and then there are many data vendors who throw in “free” models, disrupting the ecosystem. Either way, one should not jump to conclusions too fast, as there is no free lunch, after all. In any case, I strongly recommend that no one should start the meeting with pricing questions (hence, this article). When you get to the pricing part, ask what the price includes, as the analytical journey could be a series of long and winding roads. Some of the biggest factors that need to be considered are:

  • Multiple Model Discounts—Less for second or third models within a project?
  • Pre-developed (off-the-shelf) Models—These can be “much” cheaper than custom models, while not custom-fitted.
  • Acquisition vs. CRM—Employing client-specific variables certainly increases the cost.
  • Regression Models vs. Other Types—At times, types of techniques may affect the price.
  • Clustering and Segmentations—They are generally priced much higher than target-specific models.

Again, it really depends on the complexity factor more than anything else, and the pre- and post-modeling process must be estimated and priced separately. Non-modeling charges often add up fast, and you should ask for unit prices and minimum charges for each step.

Scoring charges in time can be expensive, too, so negotiate for discounts for routine scoring of the same models. Some may offer all-inclusive package pricing for everything. The important thing is that you must be consistent with the checklist when shopping around with multiple candidates.

7. Documentation: When you pay for a custom model (not pre-developed, off-the-shelf ones), you get to own the algorithm. Because algorithms are not tangible items, the knowledge is to be transformed in model documents. Beware of the ones who offer “black-box” solutions with comments like, “Oh, it will work, so trust us.”

Good model documents must include the following, at the minimum:

  • Target and Comparison Universe Definitions: What was the target variable (or “dependent” variable) and how was it defined? How was the comparison universe defined? Was there any “pre-selection” for either of the universes? These are the most important factors in any model—even more than the mechanics of the model itself.
  • List of Variables: What are the “independent” variables? How were they transformed or binned? From where did they originate? Often, these model variables describe the nature of the model, and they should make intuitive sense.
  • Model Algorithms: What is the actual algorithm? What are the assigned weight for each independent variable?
  • Gains Chart: We need to examine potential effectiveness of the model. What are the “gains” for each model group, from top to bottom (e.g., 320 percent gain at the top model group in comparison to the whole universe)? How fast do such gains decrease as we move down the scale? How do the gains factors compare against the validation sample? A graphic representation would be nice, too.

For custom models, it is customary to have a formal model presentation, full documentation and scoring script in designated programming languages. In addition, if client files are provided, ask for a waterfall report that details input and output counts of each step. After the model scoring, it is also customary for the vendor to provide a scored universe count by model group. You will be shocked to find out that many so-called analytical vendors do not provide thorough documentation. Therefore, it is recommended to ask for sample documents upfront.

8. Scoring Validation: Models are built and presented properly, but the job is not done until the models are applied to the universe from which the names are ranked and selected for campaigns. I have seen too many major meltdowns at this stage. Simply, it is one thing to develop models with a few hundred thousand record samples, but it is quite another to apply the algorithm to millions of records. I am not saying that the scoring job always falls onto the developers, as you may have an internal team or a separate vendor for such ongoing processes. But do not let the model developer completely leave the building until everything checks out.

The model should have been validated against the validation sample by then, but live scoring may reveal all kinds of inconsistencies. You may also want to back-test the algorithms with past campaign results, as well. In short, many things go wrong “after” the modeling steps. When I hear customers complaining about models, I often find that the modeling is the only part that was done properly, and “before” and “after” steps were all messed up. Further, even machines misunderstand each other, as any differences in platform or scripting language may cause discrepancies. Or, maybe there was no technical error, but missing values may have caused inconsistencies (refer to “Missing Data Can Be Meaningful”). Nonetheless, the model developers would have the best insight as to what could have gone wrong, so make sure that they are available for questions after models are presented and delivered.

9. Back-end Analysis: Good analytics is all about applying learnings from past campaigns—good or bad—to new iterations of efforts. We often call it “closed-loop marketing—while many marketers often neglect to follow up. Any respectful analytics shop must be aware of it, while they may classify such work separately from modeling or other analytical projects. At the minimum, you need to check out if they even offer such services. In fact, so-called “match-back analysis” is not as simple as just matching campaign files against responders in this omnichannel environment. When many channels are employed at the same time, allocation of credit (i.e., “what worked?”) may call for all kinds of business rules or even dedicated models.

While you are at it, ask for a cheaper version of “canned” reports, as well, as custom back-end analysis can be even more costly than the modeling job itself, over time. Pre-developed reports may not include all the ROI metrics that you’re looking for (e.g., open, clickthrough, conversion rates, plus revenue and orders-per-mailed, per order, per display, per email, per conversion. etc.). So ask for sample reports upfront.

If you start breaking down all these figures by data source, campaign, time series, model group, offer, creative, targeting criteria, channel, ad server, publisher, keywords, etc., it can be unwieldy really fast. So contain yourself, as no one can understand 100-page reports, anyway. See if the analysts can guide you with such planning, as well. Lastly, if you are so into ROI analysis, get ready to share the “cost” side of the equation with the selected partner. Some jobs are on the marketers.

10. Ongoing Support: Models have a finite shelf life, as all kinds of changes happen in the real world. Seasonality may be a factor, or the business model or strategy may have changed. Fluctuations in data availability and quality further complicate the matter. Basically assumptions like “all things being equal” only happen in textbooks, so marketers must plan for periodic review of models and business rules.

A sure sign of trouble is decreasing effectiveness of models. When in doubt, consult the developers and they may recommend a re-fit or complete re-development of models. Quarterly reviews would be ideal, but if the cost becomes an issue, start with 6-month or yearly reviews, but never go past more than a year without any review. Some vendors may offer discounts for redevelopment, so ask for the price quote upfront.

I know this is a long list of things to check, but picking the right partner is very important, as it often becomes a long-term relationship. And you may find it strange that I didn’t even list “technical capabilities” at all. That is because:

1. Many marketers are not equipped to dig deep into the technical realm anyway, and

2. The difference between the most mathematically sound models and the ones from the opposite end of the spectrum is not nearly as critical as other factors I listed in this article.

In other words, even the worst model in the bake-off would be much better than no model, if these other business criterion are well-considered. So, happy shopping with this list, and I hope you find the right partner. Employing analytics is not an option when living in the sea of data.

How to Be a Good Data Scientist

I guess no one wants to be a plain “Analyst” anymore; now “Data Scientist” is the title of the day. Then again, I never thought that there was anything wrong with titles like “Secretary,” “Stewardess” or “Janitor,” either. But somehow, someone decided “Administrative Assistant” should replace “Secretary” completely, and that someone was very successful in that endeavor. So much so that, people actually get offended when they are called “Secretaries.” The same goes for “Flight Attendants.” If you want an extra bag of peanuts or the whole can of soda with ice on the side, do not dare to call any service personnel by the outdated title. The verdict is still out for the title “Janitor,” as it could be replaced by “Custodial Engineer,” “Sanitary Engineer,” “Maintenance Technician,” or anything that gives an impression that the job requirement includes a degree in engineering. No matter. When the inflation-adjusted income of salaried workers is decreasing, I guess the number of words in the job title should go up instead. Something’s got to give, right?

I guess no one wants to be a plain “Analyst” anymore; now “Data Scientist” is the title of the day. Then again, I never thought that there was anything wrong with titles like “Secretary,” “Stewardess” or “Janitor,” either. But somehow, someone decided “Administrative Assistant” should replace “Secretary” completely, and that someone was very successful in that endeavor. So much so that, people actually get offended when they are called “Secretaries.” The same goes for “Flight Attendants.” If you want an extra bag of peanuts or the whole can of soda with ice on the side, do not dare to call any service personnel by the outdated title. The verdict is still out for the title “Janitor,” as it could be replaced by “Custodial Engineer,” “Sanitary Engineer,” “Maintenance Technician,” or anything that gives an impression that the job requirement includes a degree in engineering. No matter. When the inflation-adjusted income of salaried workers is decreasing, I guess the number of words in the job title should go up instead. Something’s got to give, right?

Please do not ask me to be politically correct here. As an openly Asian person in America, I am not even sure why I should be offended when someone addresses me as an “Oriental.” Someone explained it to me a long time ago. The word is reserved for “things,” not for people. OK, then. I will be offended when someone knowingly addresses me as an Oriental, now that the memo has been out for a while. So, do me this favor and do not call me an Oriental (at least in front of my face), and I promise that I will not call anyone an “Occidental” in return.

In any case, anyone who touches data for living now wants to be called a Data Scientist. Well, the title is longer than one word, and that is a good start. Did anyone get a raise along with that title inflation? I highly doubt it. But I’ve noticed the qualifications got much longer and more complicated.

I have seen some job requirements for data scientists that call for “all” of the following qualifications:

  • A master’s degree in statistics or mathematics; able to build statistical models proficiently using R or SAS
  • Strong analytical and storytelling skills
  • Hands-on knowledge in technologies such as Hadoop, Java, Python, C++, NoSQL, etc., being able to manipulate the data any which way, independently
  • Deep knowledge in ETL (extract, transform and load) to handle data from all sources
  • Proven experience in data modeling and database design
  • Data visualization skills using whatever tools that are considered to be cool this month
  • Deep business/industry/domain knowledge
  • Superb written and verbal communication skills, being able to explain complex technical concepts in plain English
  • Etc. etc…

I actually cut this list short, as it is already becoming ridiculous. I just want to see the face of a recruiter who got the order to find super-duper candidates based on this list—at the same salary level as a Senior Statistician (another fine title). Heck, while we’re at it, why don’t we add that the candidate must look like Brad Pitt and be able to tap-dance, too? The long and the short of it is maybe some executive wanted to hire just “1” data scientist with all these skillsets, hoping to God that this mad scientist will be able to make sense out of mounds of unstructured and unorganized data all on her own, and provide business answers without even knowing what the question was in the first place.

Over the years, I have worked with many statisticians, analysts and programmers (notice that they are all one-word titles), dealing with large, small, clean, dirty and, at times, really dirty data (hence the title of this series, “Big Data, Small Data, Clean Data, Messy Data”). And navigating through all those data has always been a team effort.

Yes, there are some exceptional musicians who can write music and lyrics, sing really well, play all instruments, program sequencers, record, mix, produce and sell music—all on their own. But if you insist that only such geniuses can produce music, there won’t be much to listen to in this world. Even Stevie Wonder, who can write and sing, and play keyboards, drums and harmonicas, had close to 100 names on the album credits in his heyday. Yes, the digital revolution changed the music scene as much as the data industry in terms of team sizes, but both aren’t and shouldn’t be one-man shows.

So, if being a “Data Scientist” means being a super businessman/analyst/statistician who can program, build models, write, present and sell, we should all just give up searching for one in the near future within your budget. Literally, we may be able to find a few qualified candidates in the job market on a national level. Too bad that every industry report says we need tens of thousands of them, right now.

Conversely, if it is just a bloated new title for good old data analysts with some knowledge in statistical applications and the ability to understand business needs—yeah, sure. Why not? I know plenty of those people, and we can groom more of them. And I don’t even mind giving them new long-winded titles that are suitable for the modern business world and peer groups.

I have been in the data business for a long time. And even before the datasets became really large, I have always maintained the following division of labor when dealing with complex data projects involving advanced analytics:

  • Business Analysts
  • Programmers/Developers
  • Statistical Analysts

The reason is very simple: It is extremely difficult to be a master-level expert in just one of these areas. Out of hundreds of statisticians who I’ve worked with, I can count only a handful of people who even “tried” to venture into the business side. Of those, even fewer successfully transformed themselves into businesspeople, and they are now business owners of consulting practices or in positions with “Chief” in their titles (Chief Data Officer or Chief Analytics Officer being the title du jour).

On the other side of the spectrum, less than a 10th of decent statisticians are also good at coding to manipulate complex data. But even they are mostly not good enough to be completely independent from professional programmers or developers. The reality is, most statisticians are not very good at setting up workable samples out of really messy data. Simply put, handling data and developing analytical frameworks or models call for different mindsets on a professional level.

The Business Analysts, I think, are the closest to the modern-day Data Scientists; albeit that the ones in the past were less so technicians, due to available toolsets back then. Nevertheless, granted that it is much easier to teach business aspects to statisticians or developers than to convert businesspeople or marketers into coders (no offense, but true), many of these “in-between” people—between the marketing world and technology world, for example—are rooted in the technology world (myself included) or at least have a deep understanding of it.

At times labeled as Research Analysts, they are the folks who would:

  • Understand the business requirements and issues at hand
  • Prescribe suitable solutions
  • Develop tangible analytical projects
  • Perform data audits
  • Procure data from various sources
  • Translate business requirements into technical specifications
  • Oversee the progress as project managers
  • Create reports and visual presentations
  • Interpret the results and create “stories”
  • And present the findings and recommended next steps to decision-makers

Sounds complex? You bet it is. And I didn’t even list all the job functions here. And to do this job effectively, these Business/Research Analysts (or Data Scientists) must understand the technical limitations of all related areas, including database, statistics, and general analytics, as well as industry verticals, uniqueness of business models and campaign/transaction channels. But they do not have to be full-blown statisticians or coders; they just have to know what they want and how to ask for it clearly. If they know how to code as well, great. All the more power to them. But that would be like a cherry on top, as the business mindset should be in front of everything.

So, now that the data are bigger and more complex than ever in human history, are we about to combine all aspects of data and analytics business and find people who are good at absolutely everything? Yes, various toolsets made some aspects of analysts’ lives easier and simpler, but not enough to get rid of the partitions between positions completely. Some third basemen may be able to pitch, too. But they wouldn’t go on the mound as starting pitchers—not on a professional level. And yes, analysts who advance up through the corporate and socioeconomic ladder are the ones who successfully crossed the boundaries. But we shouldn’t wait for the ones who are masters of everything. Like I said, even Stevie Wonder needs great sound engineers.

Then, what would be a good path to find Data Scientists in the existing pool of talent? I have been using the following four evaluation criteria to identify individuals with upward mobility in the technology world for a long time. Like I said, it is a lot simpler and easier to teach business aspects to people with technical backgrounds than the other way around.

So let’s start with the techies. These are the qualities we need to look for:

1. Skills: When it comes to the technical aspect of it, the skillset is the most important criterion. Generally a person has it, or doesn’t have it. If we are talking about a developer, how good is he? Can he develop a database without wasting time? A good coder is not just a little faster than mediocre ones; he can be 10 to 20 times faster. I am talking about the ones who don’t have to look through some manual or the Internet every five minutes, but the ones who just know all the shortcuts and options. The same goes for statistical analysts. How well is she versed in all the statistical techniques? Or is she a one-trick pony? How is her track record? Are her models performing in the market for a prolonged time? The thing about statistical work is that time is the ultimate test; we eventually get to find out how well the prediction holds up in the real world.

2. Attitude: This is a very important aspect, as many techies are locked up in their own little world. Many are socially awkward, like characters in Dilbert or “Big Bang Theory,” and most much prefer to deal with the machines (where things are clean-cut binary) than people (well, humans can be really annoying). Some do not work well with others and do not know how to compromise at all, as they do not know how to look at the world from a different perspective. And there are a lot of lazy ones. Yes, lazy programmers are the ones who are more motivated to automate processes (primarily to support their laissez faire lifestyle), but the ones who blow the deadlines all the time are just too much trouble for the team. In short, a genius with a really bad attitude won’t be able to move to the business or the management side, regardless of the IQ score.

3. Communication: Many technical folks are not good at written or verbal communications. I am not talking about just the ones who are foreign-born (like me), even though most technically oriented departments are full of them. The issue is many technical people (yes, even the ones who were born and raised in the U.S., speaking English) do not communicate with the rest of the world very well. Many can’t explain anything without using technical jargon, nor can they summarize messages to decision-makers. Businesspeople don’t need to hear the life story about how complex the project was or how messy the data sets were. Conversely, many techies do not understand marketers or businesspeople who speak plain English. Some fail to grasp the concept that human beings are not robots, and most mortals often fail to communicate every sentence as a logical expression. When a marketer says “Omit customers in New York and New Jersey from the next campaign,” the coder on the receiving end shouldn’t take that as a proper Boolean logic. Yes, obviously a state cannot be New York “and” New Jersey at the same time. But most humans don’t (or can’t) distinguish such differences. Seriously, I’ve seen some developers who refuse to work with people whose command of logical expressions aren’t at the level of Mr. Spock. That’s the primary reason we need business analysts or project managers who work as translators between these two worlds. And obviously, the translators should be able to speak both languages fluently.

4. Business Understanding: Granted, the candidates in question are qualified in terms of criteria one through three. Their eagerness to understand the ultimate business goals behind analytical projects would truly set them apart from the rest on the path to become a data scientist. As I mentioned previously, many technically oriented people do not really care much about the business side of the deal, or even have slight curiosity about it. What is the business model of the company for which they are working? How do they make money? What are the major business concerns? What are the long- and short-term business goals of their clients? Why do they lose sleep at night? Before complaining about incomplete data, why are the databases so messy? How are the data being collected? What does all this data mean for their bottom line? Can you bring up the “So what?” question after a great scientific finding? And ultimately, how will we make our clients look good in front of “their” bosses? When we deal with technical issues, we often find ourselves at a crossroad. Picking the right path (or a path with the least amount of downsides) is not just an IT decision, but more of a business decision. The person who has a more holistic view of the world, without a doubt, would make a better decision—even for a minor difference in a small feature, in terms of programming. Unfortunately, it is very difficult to find such IT people who have a balanced view.

And that is the punchline. We want data scientists who have the right balance of business and technical acumen—not just jacks of all trades who can do all the IT and analytical work all by themselves. Just like business strategy isn’t solely set by a data strategist, data projects aren’t done by one super techie. What we need are business analysts or data scientists who truly “get” the business goals and who will be able to translate them into functional technical specifications, with an understanding of all the limitations of each technology piece that is to be employed—which is quite different from being able to do it all.

If the career path for a data scientist ultimately leads to Chief Data Officer or Chief Analytics Officer, it is important for the candidates to understand that such “chief” titles are all about the business, not the IT. As soon as a CDO, CAO or CTO start representing technology before business, that organization is doomed. They should be executives who understand the technology and employ it to increase profit and efficiency for the whole company. Movie directors don’t necessarily write scripts, hold the cameras, develop special effects or act out scenes. But they understand all aspects of the movie-making process and put all the resources together to create films that they envision. As soon as a director falls too deep into just one aspect, such as special effects, the resultant movie quickly becomes an unwatchable bore. Data business is the same way.

So what is my advice for young and upcoming data scientists? Master the basics and be a specialist first. Pick a field that fits your aptitude, whether it be programming, software development, mathematics or statistics, and try to be really good at it. But remain curious about other related IT fields.

Then travel the world. Watch lots of movies. Read a variety of books. Not just technical books, but books about psychology, sociology, philosophy, science, economics and marketing, as well. This data business is inevitably related to activities that generate revenue for some organization. Try to understand the business ecosystem, not just technical systems. As marketing will always be a big part of the Big Data phenomenon, be an educated consumer first. Then look at advertisements and marketing campaigns from the promotor’s point of view, not just from an annoyed consumer’s view. Be an informed buyer through all available channels, online or offline. Then imagine how the world will be different in the future, and how a simple concept of a monetary transaction will transform along with other technical advances, which will certainly not stop at ApplePay. All of those changes will turn into business opportunities for people who understand data. If you see some real opportunities, try to imagine how you would create a startup company around them. You will quickly realize answering technical challenges is not even the half of building a viable business model.

If you are already one of those data scientists, live up to that title and be solution-oriented, not technology-oriented. Don’t be a slave to technologies, or be whom we sometimes address as a “data plumber” (who just moves data from one place to another). Be a master who wields data and technology to provide useful answers. And most importantly, don’t be evil (like Google says), and never do things just because you can. Always think about the social consequences, as actions based on data and technology affect real people, often negatively (more on this subject in future article). If you want to ride this Big Data wave for the foreseeable future, try not to annoy people who may not understand all the ins and outs of the data business. Don’t be the guy who spoils it for everyone else in the industry.

A while back, I started to see the unemployment rate as a rate of people who are being left behind during the progress (if we consider technical innovations as progress). Every evolutionary stage since the Industrial Revolution created gaps between supply and demand of new skillsets required for the new world. And this wave is not going to be an exception. It is unfortunate that, in this age of a high unemployment rate, we have such hard times finding good candidates for high tech positions. On one side, there are too many people who were educated under the old paradigm. And on the other side, there are too few people who can wield new technologies and apply them to satisfy business needs. If this new title “Data Scientist” means the latter, then yes. We need more of them, for sure. But we all need to be more realistic about how to groom them, as it would take a village to do so. And if we can’t even agree on what the job description for a data scientist should be, we will need lots of luck developing armies of them.