Marketers, How Valid Is Your Test? Hint: It’s Not About the Sample Size

The validity of a test is not tied to the size of the sample itself, but rather to the number of responses that you get from that sample. Choose sample sizes based on your expected response rate, not on tradition, your gut or convenience.

A client I worked with years ago kept fastidious records of test results that involved offers, lists, and creative executions. At the outset of our relationship, the client shared the results of these direct mail campaigns and the corresponding decisions that were made based on those results. The usual response rates were in the 0.3 to 0.5% range, and the test sample sizes were always 25,000. If a particular test cell got 130 responses (0.52%), it was deemed to have beaten a test cell that received 110 responses (0.44%). Make sense? Intuitively, yes. Statistically, no.

In fact, those two cells are statistically equal. With a sample size of 25,000 and a 0.5% response rate, your results can vary by as much as 14.7% at a 90% confidence level. That means the response rate behind that test could plausibly have been as high as 0.57% or as low as 0.43%, making our test cell results of 110 responses (0.44%) and 130 responses (0.52%) statistically indistinguishable. I had to gently encourage the client to consider retesting at larger sample sizes.
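If you want to check that math yourself, here is a minimal Python sketch (assuming the usual normal approximation for a proportion, with z of about 1.645 at 90% confidence; the function name is just illustrative) that reproduces the 14.7% swing:

```python
import math

def relative_margin(sample_size, response_rate, z=1.645):
    """Margin of error as a fraction of the response rate, using the
    normal approximation; z = 1.645 corresponds to 90% confidence."""
    std_err = math.sqrt(response_rate * (1 - response_rate) / sample_size)
    return z * std_err / response_rate

# The client's cells: 25,000 pieces mailed at roughly a 0.5% response rate.
margin = relative_margin(25_000, 0.005)
low, high = 0.005 * (1 - margin), 0.005 * (1 + margin)
print(f"Relative margin at 90% confidence: {margin:.1%}")  # ~14.7%
print(f"Plausible range: {low:.2%} to {high:.2%}")         # ~0.43% to 0.57%
```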

There are statistical formulas for calculating sample size, but a good rule of thumb to follow is that with 250 responses, you can be 90% confident that your results will vary by no more than ±10%. This rule of thumb is valid in any medium, online or offline. For example, if you test 25,000 emails and you get a 1% response rate, that's 250 responses. Similarly, if you buy 250,000 impressions for an online ad and you get a 0.1% response rate, that's also 250 responses. That means you can be 90% confident that (all things held equal) you will get between 0.9% and 1.1% in the email rollout, and between 0.09% and 0.11% with a continuation of the same ad in the same media. (Older editions of Ed Nash's Direct Marketing: Strategy, Planning, Execution contain charts that you can reference at different sample sizes and response rates.)

A smaller number of responses will result in a reduced confidence level or increased variance. For example, with a test size of 10,000 emails and a 1% response rate (100 responses), your results could vary by as much as 16% at a 90% confidence level, rather than 10%. That means you can be 90% confident that you'll get a response rate between 0.84% and 1.16%, all things being held equal. Any response within that range could have been the result of variation within the sample.
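Running the same arithmetic in reverse is how you would size a test up front. The sketch below uses the same normal approximation; the function names and the 0.5% planning figure are illustrative, not from the client's records:

```python
import math

def responses_needed(target_margin, response_rate, z=1.645):
    """Responses required so the observed rate stays within +/- target_margin
    (a relative figure, e.g. 0.10 for 10%) of the true rate."""
    return math.ceil(z**2 * (1 - response_rate) / target_margin**2)

def sample_size_needed(target_margin, response_rate, z=1.645):
    """Pieces to mail (or impressions to buy), given the expected response rate."""
    return math.ceil(responses_needed(target_margin, response_rate, z) / response_rate)

# Planning for +/-10% at 90% confidence with an expected 0.5% response rate:
print(responses_needed(0.10, 0.005))    # ~270 responses, close to the 250 rule of thumb
print(sample_size_needed(0.10, 0.005))  # ~54,000 pieces, more than double the client's 25,000
```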

Marketers are not alone in using their gut rather than statistics to determine sample sizes. Nobel Laureate Daniel Kahneman confesses in his book “Thinking, Fast and Slow”:

“Like many research psychologists, I had routinely chosen samples that were too small and had often obtained results that made no sense … the odd results were actually artifacts of my research method. My mistake was particularly embarrassing because I taught statistics and knew how to compute the sample size that would reduce the risk to an acceptable level. But I had never chosen a sample size by computation. Like my colleagues, I had trusted tradition and my intuition in planning my experiments and had never thought seriously about the issue.”

The most important takeaway here is that the validity of a test is not tied to the size of the sample itself, but rather to the number of responses that you get from that sample. Choose sample sizes based on your expected response rate, not on tradition, your gut or convenience.

Election Polls and the Price of Being Wrong 

The thing about predictive analytics is that the quality of a prediction is eventually exposed as clearly right or wrong. There are casually incorrect outcomes, like a weather report that misses the time the rain will start, and then there are total shockers, like the outcome of the 2016 presidential election.

In my opinion, the biggest losers in this election cycle are pollsters, analysts, statisticians and, most of all, so-called pundits.

I am saying this from a concerned analyst's point of view. We are talking about a colossal and utter failure of prediction on every level here. Except for one or two publications, practically every source missed the mark by more than a mile, not just a couple of points here and there. Even the ones who achieved “guru” status by perfectly predicting the 2012 election outcome called for the wrong winner this time, boldly posting a confidence level of more than 70 percent just a few days before the election.

What Went Wrong? 

The losing party, pollsters and analysts must be in the middle of some deep soul-searching now. In all fairness, let’s keep in mind that no prediction can overcome serious sampling errors and data collection problems. Especially when we deal with sparsely populated areas, where the winner was decisively determined in the end, we must be really careful with the raw numbers of respondents, as errors easily get magnified by incomplete data.

Some of us saw that type of over- or under-projection when the Census Bureau cut the sample size for budgetary reasons during the last survey cycle. For example, in a sparsely populated area, a few migrants from Asia may affect simple projections like “percent Asians” rather drastically. In large cities, conversely, such errors are generally within more manageable ranges, thanks to large sample sizes.

Then there are the human inconsistency elements that many pundits are talking about. People got so sick of all of these survey calls about the election that many started to ignore them completely. I think pollsters must learn that, at times, less is more. I don't even live in a swing state, and I started to hang up on unknown callers long before Election Day. Can you imagine what the folks in swing states must have gone through?

Many are also claiming that respondents were not honest about how they were going to vote. But if that were the case, there are other techniques that surveyors and analysts could have used to project the answer based on “indirect” questions. Instead of simply asking “Whom are you voting for?”, how about asking what their major concerns were? Combined with modeling techniques, a few innocuous probing questions regarding specific issues — such as environment, gun control, immigration, foreign policy, entitlement programs, etc. — could have led us to much more accurate predictions, reducing the shock factor.
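To make the idea concrete, here is a purely hypothetical sketch of that kind of indirect-question modeling. The issue list, the toy data and the vote-intent labels are stand-ins, not anything the 2016 pollsters actually collected:

```python
# Hypothetical sketch: infer vote intent from issue-level answers instead of
# the direct "Whom are you voting for?" question. All data here is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy concern scores (1-5) for: environment, gun control, immigration, entitlements
X_train = rng.integers(1, 6, size=(200, 4))
y_train = rng.integers(0, 2, size=200)            # stand-in for known vote intent
X_undeclared = rng.integers(1, 6, size=(50, 4))   # respondents who dodged the direct question

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
projected_share = model.predict_proba(X_undeclared)[:, 1].mean()
print(f"Projected candidate share among undeclared respondents: {projected_share:.1%}")
```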

In the middle of all this, I’ve read that artificial intelligence without any human intervention predicted the election outcome correctly, by using abundant data coming out of social media. That means machines are already outperforming human analysts. It helps that machines have no opinions or feelings about the outcome one way or another.

Dystopian Future?

Maybe machine learning will start replacing human analysts and other decision-making professions sooner than expected. That means a disenfranchised population will grow even further, dipping into highly educated demographics. The future, regardless of politics, doesn’t look all that bright for the human collective, if that trend continues.

In the predictive business, there is a price to pay for being wrong. Maybe that is why in some countries, there are complete bans on posting poll numbers and result projections days — sometimes weeks — before the election. Sometimes observation and prediction change behaviors of human subjects, as anthropologists have been documenting for years.

Who’s Winning in the Polls?

It depends on whom you ask. Really. It also depends on when you ask them.

Over the next several months, the news media will report on poll after poll that shows either presidential candidate Donald Trump gaining on opponent Hillary Clinton or Hillary surging against Trump. There will be polls on what’s happening in different swing states and among different demographic groups. How accurate they are depends on the methodology used, how the sample was derived and the margin of error associated with the sample size – not to mention how today’s events in the 24/7 news cycle can throw the results of yesterday’s poll into turmoil.

Many years ago, I remember playing with a low-tech exhibit at the Franklin Institute in Philadelphia that was the best illustration of how sample size can affect the outcome of a poll.

My memory may not be entirely accurate on this, but the concept is simple. There was a box that contained 100 marbles: 45 red and 55 blue. You would tilt the box so that all the marbles ran to the top. Then, you would tilt the box the other way. As the marbles rolled back down, 10 were captured in little cups while the rest fell to the bottom. Sometimes, the cups captured more red marbles than blue marbles. Other times, the blue marbles far exceeded the number of red marbles. Do it enough times, and the blue marbles will eventually win.
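You can recreate the exhibit with a short simulation. This sketch simply mirrors the description above (45 red, 55 blue, 10 marbles captured per tilt):

```python
import random

def tilt_the_box(draws=10):
    """One tilt of the box: capture 10 marbles at random from 45 red and 55 blue."""
    box = ["red"] * 45 + ["blue"] * 55
    captured = random.sample(box, draws)
    return captured.count("red")

trials = 10_000
red_wins = sum(tilt_the_box() > 5 for _ in range(trials))
print(f"Red takes the majority of a 10-marble draw in about {red_wins / trials:.0%} of tilts,")
print("even though blue outnumbers red 55 to 45 in the box.")
```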

Nate Cohn draws a comparison between polls and the national pastime in the New York Times:

It’s a lot like baseball. Even great baseball players go 0 for 4 in a game — or have rough stretches for weeks on end. On the other end might be a few multi-hit nights with extra-base hits, or a spectacular few weeks.

Sometimes, these rough stretches or hot streaks really do indicate changes in the underlying ability of a player. More often, they are just part of the noise inevitable with small samples. Taking more polls is like watching more at-bats, and you need many if you want to be confident about whether a candidate is ahead or tied.

That’s why baseball is a statistician’s favorite sport; it has a large sample size. Thirty teams each play 162 games in the regular season for a total of 2,430 contests. As the wins and losses converge toward the mean, the best teams win about 60 percent of their games and the worst teams win about 40 percent.

So be wary of placing your faith and trust in the poll du jour. It’s a long season.