Introduction & purpose

Uplift modeling is a technique for predicting the response of customers while taking into account the incremental impact of targeting them with, for example, a marketing action.

In other words, the kind of question that uplift modeling answers looks like this: is there a way to distinguish the customers that respond (or not) to a marketing action when targeted from the customers that respond (or not) when not targeted? Its applications are quite diverse: churn, up-sell, cross-sell. The technique is relatively old (the first papers are from 1999), but there is not much literature on it; in particular, the evaluation from a cost/benefit perspective is not obvious, and that is the subject of this post.

The conclusion of a profit analysis of this kind is which scenarios are relevant for uplift modeling compared to a “traditional” model, depending on the profit for a successful response, the cost of targeting, and the cost of an unsuccessful response. For the uplift model to make a big difference, you probably need a scenario with a relatively high cost of targeting. If you also take into account that you would ideally need an A/B experiment to get the training data, the main conclusion is that the cost/profit analysis becomes even more important than usual, i.e. in addition to the technical assessment of the accuracy of the model.

Dataset used

A very good publicly available dataset is the MineThatData E-Mail Analytics and Data Mining Challenge, 2008, usually referred to as the “Hillstrom challenge”. It contains records for 64K customers who last purchased within 12 months, split randomly into 3 groups; 1/3 of the customers were not targeted at all, and the other 2/3 were targeted by two different e-mail campaigns. Some of the questions of the challenge can be answered with an uplift classifier: what is the incremental response of the customers that were targeted by either of the two campaigns? Is there a way to optimally select a subset of the customers that should definitely be targeted (i.e. they will likely respond to an action)? What about a subset of customers that could be removed from the campaigns (since they would convert anyway)?

For most of the text below, it will be assumed that a “successful response” of the customer actually means a “conversion”, which is fine for this dataset but not really a good term for, e.g., a churn problem.

References on evaluation

On the evaluation of the model, there are very good and widely accepted references by Radcliffe (2007) and Radcliffe & Surry (2011). This post follows their definition of “Qini curves” and their method for calculating the profitability of the campaign.

References for fitting the model

There are quite a few papers describing different approaches to fitting an uplift model; a particularly interesting one is Jaskowski & Jaroszewicz 2012, since it describes a way to adapt any classification technique to uplift modeling. Besides, this approach is implemented for R in the uplift package.

(Additional useful references are at the bottom of this post.)

R code

Also at the bottom of this post, you will find all the R code used for the evaluation of the “Hillstrom challenge” dataset, both for fitting the model and for plotting the profit curves. If you like, you can use it as a tutorial, and you should be able to follow it even if your knowledge of R is limited.

Evaluation

Let’s start with the evaluation, assuming that the model has already been fitted. Below is the cumulative response curve for the “traditional” classifier; the curve accumulates the number of conversions per ratio of customers that have been targeted.

In this figure:

  • The red curve represents the accumulated conversions when the customers are taken at random (without a classifier)

  • The blue curve represents the accumulated conversions in the order defined by the classifier

  • The area under the classifier (blue) curve and over the random (red) curve should be as big as possible: a large area means that the classifier orders the customers well according to the probability of conversion, i.e. targeting a small(er) percentage of the customers will bring big(ger) profits

  • If 100% of customers are targeted, the expected conversion rate is the baseline (around 12%), regardless of the method to order the customers

Anyhow, this cumulative response curve is “technical”, in the sense that it only assesses the accuracy of the classifier. It does not say anything about the profit of the campaign, nor about the optimal % of customers that should be targeted (i.e. you cannot choose it from the curve).

On the other hand, it shows that the accuracy of this “traditional” classifier is at least decent. A minimal sketch of how such a cumulative response curve can be computed follows.
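
The exact code used to plot these curves is linked further down, in the R code section; the following is only a minimal sketch of the idea, assuming a vector of classifier scores and a 0/1 conversion indicator (the column y of the transformed dataset used later): order the customers by decreasing score and accumulate the conversions.

# Minimal sketch of a cumulative response curve (not the plotting code used here)
cumulative_points <- function(scores, y) {
  ord <- order(scores, decreasing=TRUE)
  data.frame(ratio=seq_along(y) / length(y),
             conversions=cumsum(y[ord]))
}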

The curve below is the “Qini curve” for the uplift classifier; it accumulates the number of conversions per ratio of customers that have been targeted (same as the cumulative response), but subtracting the conversions observed in the control group (i.e. the customers who would have converted anyway, regardless of the marketing action).

In this figure:

  • Again the red curve represents the random classifier

  • The blue curve represents the accumulated conversions in the order defined by the uplift classifier

  • The black curve represents the ideal way to order the customers: first, all the customers that convert because they have been targeted (first slope, up), then the customers that do not convert (flat section), and finally all the customers that convert even if they are not targeted (and thus should not be targeted; second slope, down)

  • Again, the bigger the area under the classifier (blue) curve and over the random (red) curve, the better, since it means that the classifier orders the customers well according to the incremental probability of conversion; thus, for a large area, targeting a small(er) percentage of the customers will bring big(ger) profits

  • If 100% of customers are targeted, although the total number of conversions would again be the 12% baseline, the incremental conversion is around 5%, and that is what the curve accumulates (around 15% of the targeted customers converted, but around 10% in the control group would have converted anyway without being targeted)

Once again, this curve is “technical” in that it only assesses the accuracy of the classifier; it does not imply profits. The accuracy of this classifier is not outstanding, but acceptable; the reason why will become clearer when going into the details of the fit. A minimal sketch of how a Qini curve can be computed follows.
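
Again, this is not the code used for the figures (that one is linked in the R code section); as a minimal sketch, assuming a conversion indicator y and a treatment indicator ct (1 for treated, 0 for control, as in the transformed dataset used later), the Qini curve accumulates the treated conversions and subtracts the control conversions, scaled to the size of the treated group.

# Minimal Qini sketch: incremental conversions accumulated in score order
qini_points <- function(scores, y, ct) {
  ord <- order(scores, decreasing=TRUE)
  y <- y[ord]; ct <- ct[ord]
  n.t <- cumsum(ct)                # treated customers seen so far
  n.c <- cumsum(1 - ct)            # control customers seen so far
  conv.t <- cumsum(y * ct)         # conversions among the treated
  conv.c <- cumsum(y * (1 - ct))   # conversions among the control
  # subtract the control conversions, scaled to the size of the treated group
  inc <- conv.t - conv.c * ifelse(n.c > 0, n.t / n.c, 0)
  data.frame(ratio=seq_along(y) / length(y), incremental=inc)
}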

How to compare the cumulative response with the Qini curve?

From these two curves, there is no way to decide between the “traditional” and the uplift modeling. Of course, when drawing the Qini curve for the “traditional” classifier (see the R code section at the bottom), the results are worse compared to the uplift, but the question is: is the real business scenario better expressed by the Qini curve? If one is worried about high targeting costs (e.g. a discount or an offer when customers are targeted), or about “sleeping dogs” (customers that have a lower probability of conversion when targeted), then intuitively the Qini curve and the uplift model should behave better.

However, assuming that the traditional classifier might have pretty good accuracy (although the comparison of accuracies is a difficult matter), how high does the targeting cost need to be so that the uplift classifier is actually better than the traditional one by a significant margin? Could a good traditional classifier make the same (or more) profit for low enough costs? Analogously, what would be, more precisely, the impact of “sleeping dogs”?

Profit curves in a business scenario

A more accurate comparison can be done with profit curves. There is only one additional difficulty: a business scenario must be defined, using 3 parameters (a sketch of how these parameters turn into a profit figure follows the list):

  • profit.tp: the profit of a conversion (or, in a churn problem, the profit when the customer stays)

  • targeting.cost: the cost of targeting a customer, because e.g. a discount or an offer is made

  • cost.fp: the cost of targeting customers that do not convert (or do not stay), i.e. they do not accept the discount, and they could be “sleeping dogs”
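
The actual profit computation used for the figures follows Radcliffe’s method and is part of the code linked at the bottom. Purely as an illustration (and a simplification), the profit per customer for a given targeting ratio could be sketched as below, assuming the columns y and ct of the transformed dataset described later, and approximating the behaviour of the targeted fraction by its treated rows and the behaviour of the rest by its control rows.

# Hypothetical sketch (not the code used for the figures): profit per customer
# when the top `ratio` of customers, ordered by `scores`, is targeted
profit_at_ratio <- function(scores, data, ratio, profit.tp, targeting.cost, cost.fp) {
  ord <- order(scores, decreasing=TRUE)
  top <- ord[seq_len(floor(ratio * nrow(data)))]
  rest <- setdiff(ord, top)
  # conversion rate of the targeted fraction, estimated from its treated rows
  p.target <- if (any(data$ct[top] == 1)) mean(data$y[top][data$ct[top] == 1]) else 0
  # spontaneous conversion rate of the rest, estimated from its control rows
  p.control <- if (any(data$ct[rest] == 0)) mean(data$y[rest][data$ct[rest] == 0]) else 0
  profit.targeted <- length(top) * (p.target * profit.tp - targeting.cost -
                                    (1 - p.target) * cost.fp)
  profit.untargeted <- length(rest) * p.control * profit.tp
  (profit.targeted + profit.untargeted) / nrow(data)
}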

Scenario 1: churn

One suitable scenario for the analysis would be a mobile operator dealing with churn. Operators often offer a new terminal, which has a high cost, to customers who are likely to leave. Let’s assume that the profit of a customer staying is 400, that the targeting cost is 220 (i.e. basically the terminal), and the average cost of a failed target is 1 (due e.g. to the costs of the call center). With those numbers, the profit comparison will show a clear advantage for the fitted uplift classifier.
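
As an illustration, the hypothetical sketch above could be evaluated for these numbers, using the uplift predictions fitted further down (preds.z.interactions.1se):

# hypothetical usage of the profit sketch with the churn scenario numbers
ratios <- seq(0, 1, by=0.05)
profits <- sapply(ratios, function(r)
  profit_at_ratio(preds.z.interactions.1se, hc_rvtu, r,
                  profit.tp=400, targeting.cost=220, cost.fp=1))
plot(ratios, profits, type="l",
     xlab="ratio of customers targeted", ylab="profit per customer")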

In these figures:

  • The red curve is once again the random baseline. If 0 customers were targeted, the benefit of the campaign would actually be the profit from the customers in the control group (i.e. customers that convert spontaneously, or, in the churn case, customers that stay without any additional offer). If all customers were targeted, the profit for every customer that converts/stays would be profit.tp - targeting.cost; however, for every customer that would have converted anyway without being targeted, the targeting cost could have been saved (i.e. there was no need to offer them a new terminal)

  • The blue curve represents the ordering of the classifier. As usual, the higher the area, the better

  • Ideally (although only under some conditions), the blue curve has its maximum at a ratio of customers lower than 1. When this happens, not all the customers have to be targeted in order to get the maximum profit; in most scenarios, it would actually be impossible to target all of them anyway

In this setup, the uplift classifier would get a profit of 46.4x(# of customers) when targeting only 54% of the customers. The traditional classifier would get a smaller profit, 44.81x(# of customers), while requiring pretty much all the customers to be targeted, which makes it substantially worse. When targeting the same 54% of the customers, the traditional classifier would get a profit much closer to the baseline value.

So far, so good for the uplift model; however, the scenario is quite sensitive to the ratio of profits and costs. If the targeting cost were above 240 (the curves above are for 220), both classifiers would reach their optimal value when targeting 0 customers (i.e. the costs of the call center for the unsuccessful contacts would not be worth the successes, unless more accurate classifiers were fitted).

If the targeting cost were below 200, the profits of both classifiers would be maximized when targeting all customers, which is usually not feasible, as mentioned already (at least not in the churn scenario); targeting a lower ratio, the uplift model would still give higher profits, but the difference between the two would be quite small. See the curves below, for targeting.cost = 180; observe that the difference is negligible for ratios below 40%.

If targeting.cost is pushed even lower, below a certain threshold, the traditional classifier actually becomes better for this example (again, this depends on the accuracy of the classifiers).

Scenario 2: retail seller

Let’s take another good scenario for the uplift: a retail seller makes a profit of 40 on a certain article, and runs an e-mail campaign offering a discount that costs the company 18 per article. Besides, even if the costs of the e-mail campaign do not depend on the number of customers targeted, there is an additional cost associated with the “sleeping dogs” (customers who have a lower probability of conversion if targeted), approximated as an average of 1 per targeted customer who does not convert (since not all customers that do not convert are actually “sleeping dogs”). Find the curves below.
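
With the same hypothetical sketch as before, this scenario only changes the three parameters, e.g.:

# hypothetical usage of the profit sketch for the retail scenario, at 50% targeted
profit_at_ratio(preds.z.interactions.1se, hc_rvtu, 0.5,
                profit.tp=40, targeting.cost=18, cost.fp=1)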

Because of the “sleeping dogs”, and even before looking at the curves, it would intuitively seem like a good idea to target a smaller number of customers. Additionally, for this ratio of profits and costs, you can see above that targeting 100% would actually produce less profit than the 0% baseline. The uplift model again has a better profit, 4.45x(# of customers), when targeting around 50%.

However, once again, the scenario is quite sensitive to changes in costs and profits. For targeting.cost below 15, the traditional classifier starts to be better, and above 20, neither of the classifiers is profitable compared to the baseline. On the other hand, if the estimated cost of the “sleeping dogs” were 2 instead of 1, neither of the classifiers would be profitable; if it were 0.5, the traditional classifier would behave better.

The main question: how do these results generalize?

Of course, the concrete numbers of this example do not generalize at all; however, it will always be true that a cost/benefit analysis is key. Uplift modeling and a Qini curve alone will not say much about how the classifier is going to behave in the business scenario, nor about the comparison with a traditional classifier.

The picture would look much better than above if the accuracy of both classifiers were significantly better, and it would look just different if the relation between the accuracies of the two classifiers were different. However, note that the uplift classifier is always more difficult to fit, and it would ideally need an A/B experiment to be run by the company. This is a call for a more careful assessment.

Let’s dig into the R code for training the models, and also take the opportunity to see why the uplift model is more difficult to fit.

The R code (at last)

First, let’s download the csv of the “Hillstrom challenge”.

url <- "http://www.minethatdata.com/Kevin_Hillstrom_MineThatData_E-MailAnalytics_DataMiningChallenge_2008.03.20.csv"
hc <- read.csv(file=url)

Take a look at the original post of the Hillstrom challenge for a description of the features; segment describes whether the record corresponds to either of the two campaigns or to the control group that has not been targeted. Let’s create a new column to indicate whether the record belongs to the control group or not, and convert to factors some features with logical 0/1 values.

hc$treat <- ifelse(as.character(hc$segment) != "No E-Mail", TRUE, FALSE)
hc$mens <- as.factor(hc$mens)
hc$womens <- as.factor(hc$womens)
hc$newbie <- as.factor(hc$newbie)

The package uplift contains several useful functions for uplift modeling; rvtu transforms the dataset to create a new variable that, when fitted by a classifier, will represent the incremental probability of conversion (i.e. the probability that the customer converts when targeted, minus the probability that the customer converts when not targeted). For details, just run:

require(uplift)
?rvtu

or alternatively go to the source, Jaskowski & Jaroszewicz 2012. The technique works well if the populations of the campaign and the control are similar, which is true for this dataset when selecting either of the two campaigns. You can check it as follows:

require(plyr)
count(hc, "segment")

At last, let’s create the missing variable for the incremental probability, taking into account that the formula needs to include the column that distinguishes the control records in numeric format (created above), plus any column that the classifier will later use. Also, let’s select the data for the control and the “Womens E-Mail” campaign, since it has a higher effect and it is also used by another paper (useful to be able to compare, a few paragraphs further down). For the same reason, visits to the web page are taken as conversions (i.e. the fact that sending an e-mail to a customer produces a visit to the web page counts as a conversion).

hc_rvtu <- rvtu(visit~recency+history_segment+history+mens+womens+zip_code+newbie+channel+trt(as.numeric(treat)),
                data=hc[hc$segment != "Mens E-Mail",],
                method="none")
names(hc_rvtu)
##  [1] "recency"         "history_segment" "history"
##  [4] "mens"            "womens"          "zip_code"
##  [7] "newbie"          "channel"         "ct"
## [10] "y"               "z"

The new columns are:

  • y is the original target, i.e. the conversion (the original column was visit)

  • z is the transformed target for the uplift classifier, used to estimate the incremental probability of conversion

  • ct is 1 for targeted customers and 0 for the control (equivalent, then, to treat)
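
From my understanding of the rvtu convention (worth double-checking against ?rvtu), z groups the targeted converters together with the control non-converters; and, since the campaign and control groups have roughly the same size here, the overall uplift can be recovered directly from z:

# assumption about the rvtu convention: z is 1 for treated converters and
# for control non-converters (check ?rvtu)
all(hc_rvtu$z == ifelse(hc_rvtu$ct == 1, hc_rvtu$y, 1 - hc_rvtu$y))
# with balanced treatment/control groups, 2*P(z=1) - 1 equals the overall uplift
2 * mean(hc_rvtu$z) - 1
mean(hc_rvtu$y[hc_rvtu$ct == 1]) - mean(hc_rvtu$y[hc_rvtu$ct == 0])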

For the exploratory analysis, there is another useful function in the uplift package, explore, which gives a numeric summary of the incremental conversion (i.e. uplift) per value of each of the features, using ranges for the numeric features. The only number missing in the output of the function is the overall uplift.

explore(y~recency+history_segment+history+mens+womens+zip_code+newbie+channel+trt(ct),
        data=hc_rvtu)
# proportions of y == 0 and y == 1 among the targeted customers
count(hc_rvtu[hc_rvtu$ct == 1,], "y")$freq / sum(hc_rvtu$ct == 1)
# proportions of y == 0 and y == 1 among the control customers
count(hc_rvtu[hc_rvtu$ct == 0,], "y")$freq / sum(hc_rvtu$ct == 0)

Besides, it is always a good idea to explore the data visually. This post is far too long already; find below an example of plots comparing the campaign and control records, for the 3 types of conversion (visit to the web page, actual purchase or conversion, and money spent), against the column recency (time elapsed since the last purchase). Similar calls can be used for the rest of the features.

par(mfrow=c(2,3))
boxplot(recency~visit, data=hc[hc$treat,],
        ylab="recency", xlab="visit")
boxplot(recency~conversion, data=hc[hc$treat,],
        ylab="recency", xlab="conversion")
boxplot(split(hc$recency[hc$spend != 0 & hc$treat],
              cut(hc$spend[hc$spend != 0 & hc$treat], 3)),
        ylab="recency", xlab="spend for converted")
mtext("recency target", side=3, line=-3, outer=TRUE, cex=2, font=2)
boxplot(recency~visit, data=hc[! hc$treat,],
        ylab="recency", xlab="visit")
boxplot(recency~conversion, data=hc[! hc$treat,],
        ylab="recency", xlab="conversion")
boxplot(split(hc$recency[hc$spend != 0 & ! hc$treat],
              cut(hc$spend[hc$spend != 0 & ! hc$treat], 3)),
        ylab="recency", xlab="spend for converted")
mtext("recency control", side=3, line=-39, outer=TRUE, cex=2, font=2)

Finally, a simple visual analysis shows that some of the features are strongly correlated with others (as could be expected just from reading the descriptions of the features themselves). It can be seen as follows.

pairs(~recency+history_segment+history,
      data=hc_rvtu, col=hc_rvtu$y+1)

Let’s fit the classifier then, using logistic regression. There are reasons for this decision: it is relatively fast and simple to fit and (cross-)validate, it usually behaves well, and it gives a good approximation of the actual probability of a record belonging to a class (i.e. the probability of a customer converting, for the “traditional” classifier). It is also possible to interpret the results to some extent. Finally, using the lasso as the regularization technique, the optimal model uses only a subset of the features, which performs feature selection and thus helps interpretability.

However, note that in the uplift literature the authors use decision trees most of the time (or some technique for aggregating trees, like boosting or random forests). Actually, most of the papers modify the way the tree is built, so that it is optimized for the uplift setup. The uplift package provides such a function for fitting random forests.

This post will stick to logistic regression, which also makes it more workable when read as a tutorial; random forests take some time to fit and tune on a dataset of this size. Anyhow, the results from the uplift package and the papers will be an excellent frame of comparison.
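
As a pointer only: such a random forest would be fitted with the upliftRF function of the uplift package, roughly along these lines (the exact arguments and defaults should be checked in ?upliftRF; this sketch was not run for this post):

# rough sketch of an uplift random forest; see ?upliftRF for the actual interface
fit.rf <- upliftRF(y~recency+mens+womens+zip_code+newbie+channel+trt(ct),
                   data=hc_rvtu, ntree=100, split_method="KL")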

A simple linear formula like the one below does not give good enough results. It basically includes all the available features (except those suspected of correlation, after some trials to see how the model behaves).

logit.formula <- ~recency+mens+womens+zip_code+newbie+channel

For the “traditional” classifier it could be acceptable, even with low accuracy. However, for the uplift classifier the cross-validation results basically make no sense and indicate that there is not enough “signal” to fit the model. Let’s take a minute to get at least an intuition of why this is happening.

The “traditional” classifier will try to estimate the probability of conversion (y as response) using the features of the formula. All the records of customers that have converted, regardless of whether they were targeted or not, will have y == 1, and they have exactly that in common: they have converted. This is likely quite a good “signal” for fitting the model.

However, the uplift classifier will try to estimate the incremental probability of conversion, using z as response. z is 1 for the customers that were targeted and converted, and also for the customers that were not targeted and did not convert. Those customers will likely have less in common, i.e. less “signal”, compared to the “traditional” setup.
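
One way to see this concretely is to look at who the z == 1 customers actually are; most of them are control customers who simply did not convert:

# composition of the z == 1 class: treated converters vs. control non-converters
table(treated=hc_rvtu$ct[hc_rvtu$z == 1], converted=hc_rvtu$y[hc_rvtu$z == 1])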

Anyway, a formula with interactions is already much better for the fit; that is the one used for the evaluation above.

logit.formula.interactions <- as.formula(paste("~history_segment*mens+history_segment*womens",
                                               "history_segment*newbie",
                                               "zip_code*mens+zip_code*womens+zip_code*newbie",
                                               "channel*mens+channel*womens+channel*newbie",
                                               "channel*history_segment",
                                               sep="+"))

Let’s fit the logit model using glmnet.

set.seed(1024)
require(glmnet)
logit.x.interactions <- model.matrix(logit.formula.interactions, data=hc_rvtu)
logit.z <- hc_rvtu$z
logit.y <- hc_rvtu$y

# traditional classifier, y as response
logit.cv.lasso.y.interactions <- cv.glmnet(logit.x.interactions, logit.y, alpha=1, family="binomial")
plot(logit.cv.lasso.y.interactions)

# uplift classifier, z as response
logit.cv.lasso.z.interactions <- cv.glmnet(logit.x.interactions, logit.z, alpha=1, family="binomial")
plot(logit.cv.lasso.z.interactions)

The traditional classifier uses 21 non-zero features when selecting the optimal result from cross-validation (i.e. using lambda.1se, the largest lambda whose cross-validation error is within one standard error of the minimum; for details see ?cv.glmnet). For the uplift classifier, these are the non-zero coefficients for the optimal and the minimal lambda.
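
The figure of 21 non-zero features can be checked, for example, like this (the intercept is excluded from the count):

# number of non-zero coefficients of the traditional classifier at lambda.1se
sum(coef(logit.cv.lasso.y.interactions, s="lambda.1se") != 0) - 1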

coef(logit.cv.lasso.z.interactions)[which(coef(logit.cv.lasso.z.interactions) != 0),]
## (Intercept)     womens1
##  0.08199330  0.01040769
coef(logit.cv.lasso.z.interactions,
     s=logit.cv.lasso.z.interactions$lambda.min)[which(coef(logit.cv.lasso.z.interactions,
                                                            s=logit.cv.lasso.z.interactions$lambda.min) != 0),]
##               (Intercept)                   womens1
##               0.046530578               0.073274482
## womens1:zip_codeSurburban
##               0.003960451

For lambda.1se, just one coefficient is selected, for the feature womens (which indicates that the customer purchased women’s merchandise in the past year). For lambda.min, just one more coefficient is added: the interaction of the womens feature with the suburban value of zip_code. This gives a pretty understandable idea of which customers the model is selecting as having an incremental probability of conversion.

Is such a simple model good enough? Let’s try to assess it from a technical perspective (the business evaluation was already done above). For both classifiers, lambda.1se is taken, to stay on the safe side.

First, some numeric measures, basically the AUC; in a strictly technical assessment there are no values for costs and profits, and there is no way to determine how many customers to target. The model orders the customers by their probability of converting, and the decision to target them or not depends on a threshold for that probability. The higher the AUC value, the higher the chances of good accuracy for whatever threshold is finally selected. 0.5 would be the AUC of a random choice, and 1 the AUC of the ideal classifier.

require(ROCR)
preds.y.interactions.1se <- predict(logit.cv.lasso.y.interactions, logit.x.interactions,
                                    s=logit.cv.lasso.y.interactions$lambda.1se, type="response")
pred.y <- prediction(preds.y.interactions.1se, hc_rvtu$y)
auc.y <- ROCR:::performance(pred.y, "auc")
as.numeric(auc.y@y.values)
## [1] 0.6174656
preds.z.interactions.1se <- predict(logit.cv.lasso.z.interactions, logit.x.interactions,
                                    s=logit.cv.lasso.z.interactions$lambda.1se, type="response")
pred.z <- prediction(preds.z.interactions.1se, hc_rvtu$z)
auc.z <- ROCR:::performance(pred.z, "auc")
as.numeric(auc.z@y.values)
## [1] 0.5145047

These numbers are not exactly outstanding, but the performance of such classifiers could be acceptable, at least for the “traditional” one. Does it mean that the “traditional” is better than the uplift? Possibly; however, for the uplift the AUC is not necessarily a good measure, since the incremental response of an individual customer is never directly observed.

The Qini curve, already explained in the evaluation section above, should do much better.

qini_from_data(preds.z.interactions.1se,
               hc_rvtu,
               plotit=TRUE,
               x_axis_ratio=TRUE)

## [1] 0.1777919

You will find the R code for the curves on GitHub: qini and cumulative.

The area under the Qini curve is a good way to compare different uplift classifiers. In their paper, Rzepakowski & Jaroszewicz, 2012, actually take the data of the “Hillstrom challenge” and fit an uplift classifier using decision trees, with a comparable result.

On the other hand, fitting a random forest with the uplift package does not improve the numbers significantly (not shown in this post).

As for the comparison between the “traditional” and the uplift classifier, they just correspond to different scenarios. If you plot a Qini curve for the “traditional” classifier, it performs worse than the uplift one, and the opposite happens if you plot a cumulative response curve. You can try the code below.

# 1. compare cumulative responses just taking into account the classifications (which does not make much sense for the uplift)
par(mfrow=c(2,1))
cumulative_response(preds.y.interactions.1se,
                    hc_rvtu,
                    "y",
                    x_axis_ratio=TRUE)

cumulative_response(preds.z.interactions.1se,
                    hc_rvtu,
                    "y",
                    plotit=TRUE,
                    x_axis_ratio=TRUE)


# 2. compare cumulative responses penalized as in the Qini curve (of course the traditional classifier is worse here)
par(mfrow=c(2,1))
cum_penalized_curve(preds.y.interactions.1se,
                    hc_rvtu,
                    "y",
                    "z",
                    x_axis_ratio=TRUE)

cum_penalized_curve(preds.z.interactions.1se,
                    hc_rvtu,
                    "y",
                    "ct",
                    x_axis_ratio=TRUE)

Full list of references

Some additional references that have been very useful as well.

  • On business evaluation in general:

– Provost, F. & Fawcett, T. (2013). Data Science for Business. O’Reilly.

  • Includes a full chapter on uplift:

– Strickland, J. (2014). Predictive Modelling and Analytics. Lulu.

  • And some more papers:

– Lo, V. (2002). The True Lift Model. ACM SIGKDD Explorations Newsletter, 4(2), 72-78.

– Soltys, M., Jaroszewicz, S. & Rzepakowski, P. (2013). Ensemble methods for uplift modeling. Springer.

– Radcliffe, N. J. (2008). Hillstrom’s Mine That Data Analytics Challenge: An Approach Using Uplift Modeling. Stochastic Solutions Limited.

And finally, just in case any of the links get broken, these are the references already mentioned throughout the text.

  • On evaluation:

– Radcliffe, N. J. (2007). Using Control Groups to Target on Predicted Lift: Building and Assessing Uplift Models. Direct Marketing Analytics Journal, Direct Marketing Association, 14-21.

– Radcliffe, N. J. & Surry, P. D. (2011). Real-World Uplift Modelling with Significance-Based Uplift Trees. Stochastic Solutions White Paper 2011.

  • On fitting the model:

– Jaskowski, M. & Jaroszewicz, S. (2012). Uplift modeling for clinical trial data. ICML 2012 Workshop on Clinical Data Analysis, Edinburgh, Scotland, UK.

  • On the evaluation for the “Hillstrom challenge data”:

– Rzepakowski, P. & Jaroszewicz, S. (2012). Uplift modeling in Direct Marketing. Journal of Telecommunications and Information Technology, 2/2012, 43-50.
