Introduction & purpose

This is a follow-up to the most visited post on this site, which is dedicated to uplift modeling. Thanks go to a gentle reader who asked for clarifications and pointed out a potential problem that may occur with any dataset. This post reproduces the problem with a public dataset, in the hope that it is useful to other readers.

To start with, for an introduction to uplift modeling, please read the previous post mentioned above. In any case, here is a very short copy-paste from it:

the kind of question that uplift modeling answers looks like this: is there a way to distinguish the customers that respond (or not) to a marketing action when targeted, from the customers that respond (or not) when not targeted?

The problem that the reader pointed out: what happens when the split between treatment and control (i.e. the split between customers that are targeted and customers that are not targeted) is not done at random? The quick conclusion is that, at least for the technique used in the previous post, such a non-random split will probably break the model. Besides, it will presumably make things difficult for any modeling technique.

This post includes an example with code in R that may be followed as a sort of tutorial. First, the problem is reproduced and the broken model’s performance is shown. Afterwards, there is a test with another modeling algorithm, which works a bit better but also breaks.

Dataset used

The same as in the previous post: the MineThatData E-Mail Analytics and Data Mining Challenge, 2008, usually referred to as the “Hillstrom challenge”.

References

The model used in the previous post uses the very interesting technique described by Jaskowski & Jaroszewicz 2012, which is a way to adapt any classification technique to uplift modeling. Besides, it is implemented for R in the uplift package.
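
As a reminder of what that technique does, here is a minimal sketch of the class variable transformation in base R. The vectors `y` and `treated` are toy data invented for this illustration (they are not the Hillstrom data, and `treated` is a hypothetical name chosen to avoid any confusion with the `ct` column of the uplift package). The relation between the transformed class and the uplift assumes a 50/50 random split between treatment and control.

```r
# Toy data, invented for illustration only (not the Hillstrom dataset).
y       <- c(1, 0, 1, 0, 1, 0)                       # 1 = customer responded
treated <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)  # TRUE = customer was targeted

# Class variable transformation: z = 1 for treated responders and for
# control non-responders, z = 0 otherwise.
z <- ifelse((treated & y == 1) | (!treated & y == 0), 1, 0)

# Assuming a 50/50 random split, uplift(x) = 2 * P(z = 1 | x) - 1.
# There are no covariates here, so the estimate is a single number.
uplift_from_z <- 2 * mean(z) - 1

# Direct estimate for comparison: response rate of treated minus control.
uplift_direct <- mean(y[treated]) - mean(y[!treated])

c(uplift_from_z, uplift_direct)
```

Both numbers coincide on this balanced toy example. The point of the transformation is that any ordinary classifier can then be trained on `z`, which is also why the approach is sensitive to how the split was made.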

As an alternative technique, the same uplift package provides a function called upliftRF, which modifies Random Forests for figuring out the uplift effect. This paper explains the algorithm.

The problem of a non-random split

What happens when the split is not done at random? It means that there is an explanatory variable in the dataset that is able to separate the customers that are targeted from those that are not.

There is a straightforward way to simulate this non-random selection: include the control variable ct in the model’s formula as an explanatory variable (the ct variable takes a 0 if the customer has been targeted, and a 1 if it hasn’t).

logit.formula.fail <- as.formula(paste('~history_segment*mens+history_segment*womens',
                                       'history_segment*newbie',
                                       'zip_code*mens+zip_code*womens+zip_code*newbie',
                                       'channel*mens+channel*womens+channel*newbie',
                                       'channel*history_segment',
                                       'ct', # <=== THIS IS THE CONTROL VARIABLE
                                       sep='+'))

Note that in a more realistic case, the variable that provides information on the split will be somewhat hidden; for example, it could be that only customers older than 50 have been targeted, or only customers in a certain area, etc. The variable may not allow a perfect split, but it can already be bad enough if it predicts the split correctly for a good percentage of customers.
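
One possible way to look for such a hidden variable is to check whether the split itself can be predicted from the explanatory variables. The following is a minimal sketch in base R with simulated data; the `age` covariate, the 90%/10% targeting rates and the helper functions `auc` and `split_auc` are all assumptions made up for this illustration, not part of the original analysis.

```r
set.seed(42)
n   <- 5000
age <- round(runif(n, 18, 80))   # hypothetical covariate

# Random split: targeting ignores age.
treated_random <- rbinom(n, 1, 0.5)
# Non-random split: customers older than 50 are targeted 90% of the time,
# the rest only 10% of the time.
treated_biased <- rbinom(n, 1, ifelse(age > 50, 0.9, 0.1))

# AUC of a score against a 0/1 label, via the rank (Mann-Whitney) formula.
auc <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Fit a logistic regression that tries to predict the split from the covariate
# and report how well it does (in-sample).
split_auc <- function(treated) {
  fit <- glm(treated ~ age, family = binomial)
  auc(fitted(fit), treated)
}

auc_random <- split_auc(treated_random)  # expected to be close to 0.5
auc_biased <- split_auc(treated_biased)  # expected to be well above 0.5
c(random = auc_random, biased = auc_biased)
```

An AUC near 0.5 suggests that the covariates carry no information about who was targeted, while a clearly higher value is a warning sign that the split was not random, even if it is not perfect.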

Symptoms when you fit the classifier

The following code reads the data and prepares it a bit (taken from the previous post).

url = "http://www.minethatdata.com/Kevin_Hillstrom_MineThatData_E-MailAnalytics_DataMiningChallenge_2008.03.20.csv"
hc = read.csv(file=url)
hc$treat <- ifelse(as.character(hc$segment) != "No E-Mail", TRUE, FALSE)
hc$mens <- as.factor(hc$mens)
hc$womens <- as.factor(hc$womens)
hc$newbie <- as.factor(hc$newbie)
library(uplift)
hc_rvtu <- rvtu(visit~recency+history_segment+history+mens+womens+zip_code+newbie+channel+trt(as.numeric(treat)),
                data=hc[hc$segment != "Mens E-Mail",],
                method="none")
logit.x.fail <- model.matrix(logit.formula.fail, data=hc_rvtu)
logit.z.fail <- hc_rvtu$z
logit.y.fail <- hc_rvtu$y

When you fit a traditional classifier using the dataset and the formula logit.formula.fail above, you get the results below: a cross-validated deviance plot along the regularization path and a ROC curve with its AUC.

library(glmnet)
logit.cv.lasso.y.fail <- cv.glmnet(logit.x.fail, logit.y.fail, alpha=1, family='binomial')
plot(logit.cv.lasso.y.fail)

# ROCR is called via its namespace to avoid a clash with uplift::performance
preds.y.fail.1se <- predict(logit.cv.lasso.y.fail, logit.x.fail,
                            s=logit.cv.lasso.y.fail$lambda.1se, type='response')
pred.y.fail <- ROCR::prediction(preds.y.fail.1se, hc_rvtu$y)
auc.y.fail <- ROCR::performance(pred.y.fail, 'auc')
as.numeric(auc.y.fail@y.values)
## [1] 0.6296751
plot(ROCR::performance(pred.y.fail, 'tpr', 'fpr'))

The curves look OK, although the performance is not very impressive. The following is the cumulative response.

# to source these two files, download the repository https://github.com/lrnzcig/sisifoR
# and source the .R files from wherever you have saved them locally
source("~/git/sisifoR/R/profit_curves.R")
source("~/git/sisifoR/R/uplift.R")
cumulative_response(preds.y.fail.1se,
                    hc_rvtu,
                    'y',
                    x_axis_ratio=TRUE,
                    print_max=FALSE,
                    main_title='cumulative response of traditional classifier')
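
For readers who do not want to download the repository, the general idea behind a cumulative response curve can be sketched in a few lines of base R. The helper `cum_response_sketch` below is hypothetical and written only for this illustration; it is not the sisifoR implementation and may differ from it in details such as axis scaling. The scores and responses are toy values.

```r
# Hypothetical helper, written for this sketch; it is NOT the sisifoR function.
cum_response_sketch <- function(score, response) {
  ord <- order(score, decreasing = TRUE)   # best-scored customers first
  data.frame(
    # x axis: share of customers contacted so far
    frac_targeted = seq_along(ord) / length(ord),
    # y axis: share of all responders reached so far
    frac_captured = cumsum(response[ord]) / sum(response)
  )
}

# Toy scores and responses, invented for illustration.
score    <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.1)
response <- c(1,   1,   0,   1,   0,   0)

curve <- cum_response_sketch(score, response)
curve
```

Plotting `curve$frac_captured` against `curve$frac_targeted` gives the curve; a model that ranks customers no better than chance would stay close to the diagonal.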