Forgive the borrowed interest, but predictive modeling is to marketers as sex is to schoolboys.

They’re all talking about it, but few are doing it. And among those who are, fewer are doing it right.

In customer relationship marketing (CRM), predictive modeling uses data to predict the likelihood of a customer taking a specific action. It’s a three-step process:

1. Examine the characteristics of the customers who took a desired action

2. Compare them against the characteristics of customers who didn’t take that action

3. Determine which characteristics are most predictive of the customer taking the action and the value or degree to which each variable is predictive
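As a rough illustration of the first two steps, the sketch below compares average characteristics of customers who took an action with those who didn't. The records, field names and the `compare_groups` helper are invented for illustration; a real model would go on to weigh the variables jointly rather than one at a time.

```python
from statistics import mean

# Hypothetical customer records; "responded" marks the desired action.
customers = [
    {"age": 34, "income": 72, "prior_purchases": 5, "responded": True},
    {"age": 51, "income": 64, "prior_purchases": 4, "responded": True},
    {"age": 45, "income": 58, "prior_purchases": 6, "responded": True},
    {"age": 38, "income": 61, "prior_purchases": 1, "responded": False},
    {"age": 47, "income": 70, "prior_purchases": 0, "responded": False},
    {"age": 44, "income": 66, "prior_purchases": 2, "responded": False},
]

def compare_groups(records, outcome, characteristics):
    """Return {characteristic: (responder mean, non-responder mean, gap)}."""
    took = [r for r in records if r[outcome]]
    did_not = [r for r in records if not r[outcome]]
    summary = {}
    for name in characteristics:
        a = mean(r[name] for r in took)
        b = mean(r[name] for r in did_not)
        summary[name] = (a, b, a - b)
    return summary

summary = compare_groups(customers, "responded", ["age", "income", "prior_purchases"])
for name, (a, b, gap) in summary.items():
    print(f"{name}: responders={a:.1f} non-responders={b:.1f} gap={gap:+.1f}")
```

In this made-up data, prior purchases separate the two groups far more than age or income do, which is the kind of signal step 3 then quantifies.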

Predictive modeling is useful in allocating CRM resources efficiently. If a model predicts that certain customers are less likely to respond to a specific offer, then fewer resources can be allocated to those customers, allowing more resources to be allocated to those who are more likely to respond.
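One simple way to act on model scores, sketched here with invented numbers, is to rank customers by predicted likelihood to respond and contact only as many as the budget covers. The scores, costs and the `allocate_contacts` helper are assumptions for illustration, not a recommended allocation rule.

```python
def allocate_contacts(scores, budget, cost_per_contact):
    """Pick the highest-scoring customers that the budget can cover."""
    n = int(budget // cost_per_contact)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n]

# Hypothetical predicted response likelihoods per customer.
scores = {"A": 0.32, "B": 0.05, "C": 0.18, "D": 0.02, "E": 0.27}

selected = allocate_contacts(scores, budget=3.00, cost_per_contact=1.00)
print(selected)
```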

**Data Inputs**

A predictive model will only be as good as the input data that’s used in the modeling process. You need the data that define the dependent variable; that is, the outcome the model is trying to predict (such as response to a particular offer). You’ll also need the data that define the independent variables, or the characteristics that will be predictive of the desired outcome (such as age, income, purchase history, etc.). Attitudinal and behavioral data may also be predictive, such as an expressed interest in weight loss, fitness, healthy eating, etc.

The more variables that are fed into the model at the beginning, the more likely the modeling process will identify relevant predictors. Modeling is an iterative process, and those variables that are not at all predictive will fall out in the early iterations, leaving those that are most predictive for more precise analysis in later iterations. The danger in not having enough independent variables to model is that the resultant model will only explain a portion of the desired outcome.

For example, a predictive model created to determine the factors affecting physician prescribing of a particular brand was inconclusive because there weren’t enough independent variables to explain the outcome fully. In a standard regression analysis, the number of RXs written in a specific timeframe was set as the dependent variable. There were only three independent variables available: sales calls, physician samples and direct mail promotions to physicians. And while each of the three variables turned out to have a positive effect on prescriptions written, the R-squared value of the regression equation was low at 0.44, meaning that these variables explained only 44 percent of the variance in RXs. The other 56 percent of the variance came from factors that were not included in the model input.
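To make "percent of variance explained" concrete, here is a minimal sketch that fits a one-variable regression by hand and computes R-squared, the share of variance in the outcome that the fitted line accounts for. The data are invented and are not the figures from the physician example. In this one-variable case, the statistic Excel labels "Multiple R" is the square root of R-squared.

```python
from statistics import mean

def r_squared(x, y):
    """R-squared of a one-variable least-squares fit of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Variance left over after the fit vs. total variance in the outcome.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Invented data: sales calls per physician and RXs written.
sales_calls = [1, 2, 3, 4, 5, 6]
rxs = [4, 9, 5, 12, 8, 14]

r2 = r_squared(sales_calls, rxs)
multiple_r = r2 ** 0.5
print(f"R-squared: {r2:.2f}  Multiple R: {multiple_r:.2f}  unexplained: {1 - r2:.2f}")
```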

**Sample Size**

Larger samples will produce more robust models than smaller ones. Some modelers recommend a minimum data set of 10,000 records, 500 of those with the desired outcome. Others report acceptable results with as few as 100 records with the desired outcome. But in general, size matters.

Regardless, it is important to hold out a validation sample from the modeling process. That allows the model to be applied to the hold-out sample to validate its ability to predict the desired outcome.
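A hold-out sample can be set aside with a simple random split before any modeling starts. The sketch below assumes a 30 percent hold-out and a fixed random seed; both are illustrative choices, and `split_holdout` is a hypothetical helper.

```python
import random

def split_holdout(records, holdout_fraction=0.3, seed=42):
    """Shuffle a copy of the records and split into (modeling, holdout)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]

# Stand-in for 1,000 customer records.
records = list(range(1000))

modeling, holdout = split_holdout(records)
print(len(modeling), len(holdout))
```

The model is built on `modeling` only; `holdout` is scored afterward to see whether the predictions hold up on records the model has never seen.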

**Important First Steps**

**1. Define Your Outcome.** What do you want the model to do for your business? Predict likelihood to opt-in? Predict likelihood to respond to a particular offer? Your objective will drive the data set that you need to define the dependent variable. For example, if you’re looking to predict likelihood to respond to a particular offer, you’ll need to have prospects who responded and prospects who didn’t in order to discriminate between them.

**2. Gather the Data to Model.** This requires tapping into several data sources, including your CRM database, as well as external sources where you can get data appended (see below).

**3. Set the Timeframe.** Determine the time period for the data you will analyze. For example, if you’re looking to model likelihood to respond, the start and end points for the data should be far enough in the past that you have a sufficient sample of responders and non-responders.

**4. Examine Variables Individually.** Some variables will not be correlated with the outcome, and these can be eliminated prior to building the model.
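Step 4 can be sketched as a screen that computes each variable's correlation with the outcome and drops those below a cutoff. The variable names, the data and the 0.1 cutoff are assumptions for illustration; a modeler would choose the screening statistic and threshold to suit the data.

```python
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either series is constant."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def screen_variables(columns, outcome, min_abs_r=0.1):
    """Keep variables whose correlation with the outcome clears the cutoff."""
    kept = {}
    for name, values in columns.items():
        r = pearson(values, outcome)
        if abs(r) >= min_abs_r:
            kept[name] = r
    return kept

# Invented data: 1 = responded, 0 = did not.
outcome = [1, 1, 0, 0, 1, 1, 0, 0]
columns = {
    "prior_purchases": [5, 4, 1, 0, 6, 5, 2, 1],
    "signup_weekday_parity": [1, 2, 1, 2, 1, 2, 1, 2],  # unrelated by design
}

kept = screen_variables(columns, outcome)
print(kept)
```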

**Data Sources**

*Independent variable data* may include:

- In-house database fields
- Data overlays (demographics, HH income, lifestyle interests, presence of children, marital status, etc.) from a data provider such as Experian, Epsilon or Acxiom.
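Appending overlay data amounts to matching provider records to in-house records on a shared key. The sketch below assumes a common `customer_id` field and invented values; actual provider files and matching rules (often name and address based) will differ.

```python
def append_overlay(in_house, overlay, key="customer_id"):
    """Attach overlay fields to each in-house record that has a match."""
    overlay_by_key = {row[key]: row for row in overlay}
    merged = []
    for row in in_house:
        extra = overlay_by_key.get(row[key], {})
        merged.append({**extra, **row})  # in-house values win on conflicts
    return merged

# Hypothetical in-house records and provider overlay.
in_house = [
    {"customer_id": 1, "purchases": 4},
    {"customer_id": 2, "purchases": 0},
]
overlay = [
    {"customer_id": 1, "hh_income": 85000, "children_present": True},
]

merged = append_overlay(in_house, overlay)
for row in merged:
    print(row)
```

Records without a match keep only their in-house fields, so the match rate is worth checking before the overlay variables go into a model.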

**Don’t Try This at Home**

While you can do regression analysis in Microsoft Excel, if you’re going to invest a lot of promotion budget in the outcome, you should definitely leave the number crunching to the professionals. Expert modelers know how to analyze modeling results and make adjustments where necessary.