Introduction & purpose

This is a follow-up to the most visited post on this site, dedicated to uplift modeling; thanks to a gentle reader who asked for clarifications and pointed out a potential problem that may appear in any dataset. This post reproduces the problem with a public dataset, in the hope that it is useful to other readers.

To start with, for an introduction to uplift modeling, please read the previous post mentioned above. In any case, here is a very short excerpt from it:

the kind of question that uplift modeling answers looks like this: is there a way to distinguish the customers that respond (or not) to a marketing action when targeted, from the customers that respond (or not) when not targeted?

The problem that the reader pointed out: what happens when the split between treatment and control (i.e. the split between customers that are targeted and customers that are not targeted) is not done at random? The quick conclusion is that, at least for the technique used in the previous post, such a non-random split will probably break the model. Moreover, it will presumably make things difficult for any modeling technique.

This post includes an example with R code that can be followed as a sort of tutorial. First, the problem is reproduced and the broken model’s performance is shown. Then an alternative modeling algorithm is tested, which works a bit better but eventually breaks as well.

Dataset used

The same as in the previous post: the MineThatData E-Mail Analytics and Data Mining Challenge, 2008, usually referred to as the “Hillstrom challenge”.

References

The model used in the previous post relies on the very interesting technique described by Jaskowski & Jaroszewicz 2012, which is a way to adapt any classification technique to uplift modeling. Moreover, it is implemented for R in the uplift package.

As an alternative technique, the same uplift package provides a function called upliftRF, which modifies Random Forests to estimate the uplift effect. This paper explains the algorithm.

The problem of a non-random split

What happens when the split is not done at random, i.e. when there is an explanatory variable in the dataset that is able to separate targeted from not-targeted customers?

There is a straightforward way to simulate this non-random selection: include the control variable ct in the model’s formula as an explanatory variable (the ct variable takes the value 0 if the customer has been targeted, and 1 if they haven’t).

logit.formula.fail <- as.formula(paste('~history_segment*mens+history_segment*womens',
                                       'history_segment*newbie',
                                       'zip_code*mens+zip_code*womens+zip_code*newbie',
                                       'channel*mens+channel*womens+channel*newbie',
                                       'channel*history_segment',
                                       'ct', # <=== THIS IS THE CONTROL VARIABLE
                                       sep='+'))

Note that in a more realistic case, the variable that reveals the split will be somewhat hidden; for example, it could be that only customers older than 50 have been targeted, or customers in a certain area, etc. The variable may not allow a perfect split, but it can already be bad enough if it gets a good percentage of the customers right.
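As a sketch of how such a partially hidden split could be simulated (the `age` variable below is hypothetical, it does not exist in the Hillstrom dataset; `ct` follows the convention stated above, 0 = targeted, 1 = not targeted):

```r
# Sketch: simulate a non-random treatment assignment driven by a
# customer attribute. `age` is a hypothetical variable, not present
# in the Hillstrom dataset.
set.seed(42)
customers <- data.frame(age = sample(18:80, 1000, replace=TRUE))
# older customers are targeted ~90% of the time, younger ones ~10%
p_target <- ifelse(customers$age > 50, 0.9, 0.1)
customers$ct <- rbinom(nrow(customers), 1, 1 - p_target)
# the split is imperfect, but `age` still predicts `ct` most of the time
mean((customers$ct == 0) == (customers$age > 50))
```

In a real dataset the leaking variable would already be there, of course; the point of the sketch is only that the split does not need to be perfect to cause trouble.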

Symptoms when you fit the classifier

The following code reads the data and prepares it a bit (taken from the previous post).

url = "http://www.minethatdata.com/Kevin_Hillstrom_MineThatData_E-MailAnalytics_DataMiningChallenge_2008.03.20.csv"
hc = read.csv(file=url)
hc$treat <- ifelse(as.character(hc$segment) != "No E-Mail", TRUE, FALSE)
hc$mens <- as.factor(hc$mens)
hc$womens <- as.factor(hc$womens)
hc$newbie <- as.factor(hc$newbie)
library(uplift)
hc_rvtu <- rvtu(visit~recency+history_segment+history+mens+womens+zip_code+newbie+channel+trt(as.numeric(treat)),
                data=hc[hc$segment != "Mens E-Mail",],
                method="none")
logit.x.fail <- model.matrix(logit.formula.fail, data=hc_rvtu)
logit.z.fail <- hc_rvtu$z
logit.y.fail <- hc_rvtu$y

When you fit a traditional classifier using the dataset and the formula logit.formula.fail above, you get the results below: a deviance plot for the regularization and an ROC curve with its AUC.

library(glmnet)
logit.cv.lasso.y.fail <- cv.glmnet(logit.x.fail, logit.y.fail, alpha=1, family='binomial')
plot(logit.cv.lasso.y.fail)

preds.y.fail.1se <- predict(logit.cv.lasso.y.fail, logit.x.fail,
                            s=logit.cv.lasso.y.fail$lambda.1se, type='response')
pred.y.fail <- ROCR::prediction(preds.y.fail.1se, hc_rvtu$y)
auc.y.fail <- ROCR::performance(pred.y.fail, 'auc')
as.numeric(auc.y.fail@y.values)
## [1] 0.6296751
plot(ROCR::performance(pred.y.fail, 'tpr', 'fpr'))

The curves look OK, though the performance is not very impressive. The following is the cumulative response curve.

# in order to source these two files, you need to download the repository
# https://github.com/lrnzcig/sisifoR and source the .R files from wherever
# you have cloned it locally
source("~/git/sisifoR/R/profit_curves.R")
source("~/git/sisifoR/R/uplift.R")
cumulative_response(preds.y.fail.1se,
                    hc_rvtu,
                    'y',
                    x_axis_ratio=TRUE,
                    print_max=FALSE,
                    main_title='cumulative response of traditional classifier')

## [1] 0.1130162

Conclusion: there is some gain from using the classifier compared to a random selection, however small.

On the other hand, look at what happens when one fits an uplift model using Jaskowski & Jaroszewicz 2012’s technique, as in the previous post.

logit.cv.lasso.z.fail <- cv.glmnet(logit.x.fail, logit.z.fail, alpha=1, family='binomial')
plot(logit.cv.lasso.z.fail)

preds.z.fail.1se <- predict(logit.cv.lasso.z.fail, logit.x.fail,
                            s=logit.cv.lasso.z.fail$lambda.1se, type='response')
pred.z.fail <- ROCR::prediction(preds.z.fail.1se, hc_rvtu$z)
auc.z.fail <- ROCR::performance(pred.z.fail, 'auc')
as.numeric(auc.z.fail@y.values)
## [1] 0.8719292
plot(ROCR::performance(pred.z.fail, 'tpr', 'fpr'))

Its performance is surprisingly high. The model has small deviance even though the regularization keeps just one variable, and the AUC is quite high.

And these are the cumulative response and the Qini curves.

cumulative_response(preds.z.fail.1se,
                    hc_rvtu,
                    'y',
                    x_axis_ratio=TRUE,
                    print_max=FALSE,
                    main_title='cumulative response of uplift classifier')

## [1] -0.04259469
qini_from_data(preds.z.fail.1se,
               hc_rvtu,
               plotit=TRUE,
               x_axis_ratio=TRUE,
               ylim=c(-0.15, 0.15))

## [1] -1.404521

The gain is negative; it would thus be better to select customers at random. Why is that?

Why is the gain negative?

Let’s go through the base concepts of uplift again. There are 4 kinds of customers when you put together the control and treatment groups:

  • Customers that are targeted and do convert

  • Customers that are targeted and do not convert

  • Customers that are not targeted and do convert

  • Customers that are not targeted and do not convert

The objective of the uplift classifier is as follows:

  • To identify the customers that would convert when targeted

  • Not to target customers that won’t convert anyway; it would be a waste of resources and the customers might get annoyed

  • It’s also important to identify the customers that do convert even if they are not targeted; better to leave those alone, since if they were targeted, there is a risk they get annoyed and do not convert anymore

  • It’s not so bad to target customers that won’t convert if not targeted; these are customers that would not provide any revenue anyway if left alone

On the other hand, there is one important point to notice about the way in which Jaskowski and Jaroszewicz approach the uplift problem. The variable Z, which is the one used as the target of the classifier, is constructed in the following way:

  • It takes value 1 either for customers that have been targeted and convert, or customers that have not been targeted and do not convert ==> i.e. customers that the model wants to target (the former generate revenue, and the latter won’t generate revenue anyway if not targeted)

  • It takes value 0 either for customers that have been targeted and do not convert, or customers that have not been targeted and convert ==> i.e. customers that the model won’t choose as targets since they might get annoyed; besides, with the latter, there’s even the risk of losing revenue
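In R, the construction of Z can be written down in a couple of lines (a sketch for illustration; in the code of this post, the rvtu function of the uplift package already computes this variable as hc_rvtu$z):

```r
# Sketch of the class variable Z from Jaskowski & Jaroszewicz (2012);
# y is the conversion (0/1), treat is TRUE for targeted customers
z_from <- function(y, treat) {
  as.integer((treat & y == 1) | (!treat & y == 0))
}
# one customer of each of the 4 kinds listed above
z_from(y = c(1, 0, 1, 0), treat = c(TRUE, TRUE, FALSE, FALSE))
## [1] 1 0 0 1
```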

Now, let’s imagine that there is a way for the model to figure out if the customer has been targeted or not (since the split was not done at random, and there is an explanatory variable that gives this information, as mentioned above). Also consider the case in which the conversion rate is relatively low, so that the customers with Z==1 are mostly customers that have not been targeted (and do not convert), and those in Z==0 are mostly customers that have been targeted (and do not convert).

If both happen at the same time (non-random split and low conversion rate), the algorithm can become ‘lazy’: since it has an easy way to split into targeted and not targeted, and the conversion rate is relatively low, ‘it is not worth it’ to take the conversions into account; the algorithm will already get a good result (small deviance) just by splitting between targeted and not targeted.
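A back-of-the-envelope calculation illustrates this laziness; the round figures below (50/50 split, 15% conversion when targeted, 10% when not) are assumptions close to the rates quoted later in this post:

```r
# Composition of the Z == 1 class under a 50/50 split and low conversion rates
p_treat <- 0.5; p_conv_t <- 0.15; p_conv_c <- 0.10
z1_treated   <- p_treat * p_conv_t              # targeted customers that convert
z1_untreated <- (1 - p_treat) * (1 - p_conv_c)  # not-targeted customers that do not convert
z1_untreated / (z1_treated + z1_untreated)
## [1] 0.8571429
```

In other words, about 86% of the Z == 1 class are not-targeted customers, so a variable that leaks the split already classifies Z quite well without looking at conversions at all.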

Thus, it is quite clear that an uplift classifier fit to predict the Z variable will achieve good performance pretty easily in this situation. That’s what happens in the example above, looking in particular at the AUC curve. But what about the performance evaluation of such a model from a business point of view? Let’s focus on the Qini curve, which looks pretty bad above. An ideal classifier will provide a Qini curve very close to the upper black curve if it is able to rank the customers like this:

  1. Customers that are targeted and do convert, which score +1

  2. Customers that are not targeted and do not convert, which score 0

  3. Customers that are targeted and do not convert, which score 0

  4. Customers that are not targeted and do convert, which score -1

When the split is not done at random, the classifier ranks first the customers that are not targeted, regardless of whether they have converted or not. Thus the total score of this first group will be Total Number of Customers * Ratio of Customers that have not been targeted * Conversion Rate of not-targeted Customers * -1 (just think it through using the score rules above). In the example Qini curve above, the conversion rate of not-targeted customers is about 10%, and the ratio of customers that have not been targeted is around 50%. That’s why the first half of the Qini curve (the blue curve) goes down to around -0.1, i.e. down to the conversion rate of not-targeted customers times -1, and it does so for 50% of the customers. It goes down more or less as a straight line, since the classifier does not really distinguish customers that convert from those that do not.

After this, the classifier provides the customers that have been targeted, regardless of whether they have converted or not. The final score across all customers will be the uplift rate. In the example Qini curve above, the blue curve goes up, up to approximately 5%. The uplift rate is the conversion rate of targeted customers (15%) minus the conversion rate of not-targeted customers (10%). Again it goes up more or less as a straight line, since there is no way to distinguish customers that convert from those that do not.
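The two endpoints of this degenerate Qini curve can be double-checked with a couple of lines of R (a sketch; the exact normalization of the y axis depends on the plotting function, qini_from_data here):

```r
# Endpoints of the degenerate Qini curve, using the rates quoted above
p_conv_t <- 0.15  # conversion rate of targeted customers
p_conv_c <- 0.10  # conversion rate of not-targeted customers
dip   <- -p_conv_c            # bottom of the curve, after the not-targeted half
final <- p_conv_t - p_conv_c  # uplift rate, reached at 100% of the customers
c(dip = dip, final = final)
```

The dip matches the -0.1 reached at 50% of the customers, and the final value matches the ~5% uplift described above.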

Alternative uplift fit

The method presented is far from the only way to build an uplift classifier. It has the advantage that it can be used with any model you feel comfortable with, but it has its weaknesses too…

There are other algorithms around, developed specifically for modelling uplift; let’s fit the data with a non-random split of targeted customers and see what happens. The following uses the upliftRF method by Guelman et al. (see references above).

# straightforward fit, without `ct` as an explanatory variable
fit.ok.upliftRF <- upliftRF(y~recency+mens+womens+zip_code+newbie+channel+trt(ct),
                            data=hc_rvtu,
                            ntree=500,
                            split_method='KL',
                            minsplit=100,
                            verbose=FALSE)
summary(fit.ok.upliftRF)

## $call
## upliftRF(formula = y ~ recency + mens + womens + zip_code + newbie +
##     channel + trt(ct), data = hc_rvtu, ntree = 500, split_method = "KL",
##     minsplit = 100, verbose = FALSE)
##
## $importance
##        var   rel.imp
## 1  recency 48.247503
## 2 zip_code 19.564431
## 3  channel 19.281927
## 4   newbie  6.811469
## 5     mens  3.339901
## 6   womens  2.754770
##
## $ntree
## [1] 500
##
## $mtry
## [1] 2
##
## $split_method
## [1] "KL"
##
## attr(,"class")
## [1] "summary.upliftRF"
preds.ok.upliftRF <- as.data.frame(predict(fit.ok.upliftRF, newdata=hc_rvtu))
qini_from_data(preds.ok.upliftRF$pr.y1_ct1 - preds.ok.upliftRF$pr.y1_ct0,
               hc_rvtu,
               plotit=TRUE)

## [1] 0.3928221

This first fit uses the original random split. The result is pretty decent, as can be seen in the Qini curve above.

# First, create a new variable `ct_bis`; if you add `ct` directly to the formula, the algorithm does not use it
# Below, `ct_bis` is equal to `ct` about 90% of the time
hc_rvtu_mod <- hc_rvtu
set.seed(123)
hc_rvtu_mod$ct_bis <- ifelse(round(runif(nrow(hc_rvtu_mod), min=0, max=0.6)) == 0,
                             hc_rvtu_mod$ct,
                             as.integer(! head(hc_rvtu_mod$ct)))
sum(hc_rvtu_mod$ct_bis != hc_rvtu_mod$ct) 
## [1] 3475
sum(hc_rvtu_mod$ct_bis == hc_rvtu_mod$ct) 
## [1] 39218
# fit upliftRF including `ct_bis` in the formula
fit.fail.upliftRF <- upliftRF(y~recency+mens+womens+zip_code+newbie+channel+ct_bis+trt(ct),
                              data=hc_rvtu_mod,
                              ntree=500,
                              split_method='KL',
                              minsplit=100,
                              verbose=FALSE)
summary(fit.fail.upliftRF)

## $call
## upliftRF(formula = y ~ recency + mens + womens + zip_code + newbie +
##     channel + ct_bis + trt(ct), data = hc_rvtu_mod, ntree = 500,
##     split_method = "KL", minsplit = 100, verbose = FALSE)
##
## $importance
##        var   rel.imp
## 1  recency 49.401996
## 2 zip_code 17.621061
## 3  channel 17.013792
## 4   newbie  7.667880
## 5     mens  3.611280
## 6   womens  2.994238
## 7   ct_bis  1.689751
##
## $ntree
## [1] 500
##
## $mtry
## [1] 2
##
## $split_method
## [1] "KL"
##
## attr(,"class")
## [1] "summary.upliftRF"
preds.fail.upliftRF <- as.data.frame(predict(fit.fail.upliftRF, newdata=hc_rvtu_mod))
qini_from_data(preds.fail.upliftRF$pr.y1_ct1 - preds.fail.upliftRF$pr.y1_ct0,
               hc_rvtu_mod,
               plotit=TRUE)

## [1] 0.07556322

However, when a non-random split is simulated, the performance drops heavily (although not under exactly the same conditions as in the rest of the post, due to the way the variable ct_bis is constructed, if you care to take a look at the code; the conditions are close enough anyway). In conclusion, upliftRF seems to be a bit more robust to a non-random split, but it still breaks; this would anyway deserve a deeper investigation.

Conclusion

Jaskowski and Jaroszewicz’s approach to the uplift problem has a number of advantages, but it is not robust in a situation where the split between targeted and not-targeted customers is not done at random.

When you fit your models, you should get worried if you see any of the following suspicious results:

  • The AUC of the uplift classifier is better than the AUC of the traditional classifier (this is weird, since the uplift classifier tries to solve a tougher problem)

  • The AUC of the uplift classifier is good but the Qini curve is negative (again this is weird)

Both are definitely good hints that you are doing something wrong, and that you need to take a look at your data and/or choose a stronger algorithm.

Related posts

  • Cost/benefit evaluation of Uplift Modeling with an example in R
    The original post, which raised the question about not random splits. Is there a way to distinguish the customers that respond (or not) to a marketing action when targeted, from the customers that respond (or not) when they are not targeted? Which scenarios are relevant for uplift modeling, compared to a “traditional” model? How to evaluate them from a business perspective? How to fit the uplift model in R?