Uplift modeling is a technique for predicting how customers respond while taking into account the incremental impact of having been targeted, e.g. by a marketing action.
In other words, the kind of question that uplift modeling answers looks like this: is there a way to distinguish the customers that respond (or not) to a marketing action when targeted from the customers that respond (or not) when not targeted? Its applications are quite diverse: churn, up-sell, cross-sell. The technique is relatively old (the first papers are from 1999), but there is not much literature on it; in particular, how to evaluate it from a cost/benefit perspective is not obvious, and that’s the subject of this post.
As a conclusion of a profit analysis of this kind, one can deduce in which scenarios uplift modeling is relevant compared to a “traditional” model, depending on the profit of a successful response, the cost of targeting, and the cost of an unsuccessful response. For the uplift model to make a big difference, you probably need a scenario with a relatively high cost of targeting. If you also take into account that you would ideally need an A/B experiment to get the training data, the main conclusion is that the cost/profit analysis becomes even more important than usual, i.e. it is needed in addition to the technical assessment of the model’s accuracy.
A very good publicly available dataset is the MineThatData E-mail Analytics and Mining Challenge, 2008, usually referred to as the “Hillstrom challenge”. It contains records for 64K customers who last purchased within 12 months and who were split randomly into 3 groups: 1/3 of the customers were not targeted at all, and the other 2/3 were targeted by two different e-mail campaigns. Some of the questions of the challenge can be answered with an uplift classifier: what is the incremental response of the customers that were targeted by either of the two campaigns? Is there a way to optimally select a subset of the customers that should definitely be targeted (i.e. they will likely respond to an action)? What about a subset of customers that could be removed from the campaigns (since they would convert anyway)?
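As a minimal sketch of the first question, the average incremental response can be estimated as the conversion rate of the targeted customers minus the conversion rate of the untargeted ones, which is valid here because the groups were split at random. The Python below uses made-up rows and hypothetical group labels (`email_a`, `email_b`, `none`); these are not the actual columns or numbers of the challenge dataset.

```python
def incremental_response(rows, control_label="none"):
    """Return (treated_rate, control_rate, uplift) from (group, converted) pairs.

    Any group other than `control_label` counts as targeted, so both
    e-mail campaigns are pooled together.
    """
    treated = [conv for group, conv in rows if group != control_label]
    control = [conv for group, conv in rows if group == control_label]
    treated_rate = sum(treated) / len(treated)
    control_rate = sum(control) / len(control)
    return treated_rate, control_rate, treated_rate - control_rate


# Made-up toy data: (group, converted) with converted = 1 or 0.
rows = [
    ("email_a", 1), ("email_a", 1), ("email_a", 0), ("email_a", 0),
    ("email_b", 1), ("email_b", 0), ("email_b", 0), ("email_b", 0),
    ("none", 1), ("none", 0), ("none", 0), ("none", 0),
]

treated_rate, control_rate, uplift = incremental_response(rows)
print(treated_rate, control_rate, uplift)
```

This only gives the average effect over all customers; the uplift classifier discussed below tries to estimate the same difference per customer.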
For most of the text below, it will be assumed that a “successful response” of the customer actually means a “conversion”, which is fine for this dataset but is not really a good term for e.g. a churn problem.
On the evaluation of the model, there are very good and accepted references by N. J. Radcliffe (2007, 2011). This post follows their definition of “Qini curves” and their method to calculate the profitability of the campaign.
There are quite a few papers describing different approaches to fitting an uplift model; a particularly interesting one is by Jaskowski & Jaroszewicz (2012), since it describes a way to adapt any classification technique to uplift modeling. Besides, this approach is implemented for R in the uplift package.
(Additional useful references are at the bottom of this post.)
Also at the bottom of this post, you will find all the R code used for the evaluation of the “Hillstrom challenge” dataset, both for fitting the model and for plotting the profit curves. If you like, you can use it as a tutorial, and you should be able to follow it even if your knowledge of R is limited.
Let’s start with the evaluation, assuming that the model has already been fitted. Below is the cumulative response curve for the “traditional” classifier; the curve accumulates the number of conversions against the fraction of customers that have been targeted.
In this figure:
The red curve represents the accumulated conversions when the customers are taken at random (without a classifier)
The blue curve represents the accumulated conversions in the order defined by the classifier
The bigger the area under the classifier (blue) curve and over the random (red) curve, the better: a large area means that the classifier orders the customers well according to their probability of conversion, so targeting a small(er) percentage of the customers will bring big(ger) profits
If 100% of customers are targeted, the expected conversion rate is the baseline (around 12%), regardless of the method to order the customers
Anyhow, this cumulative response curve is “technical”, in the sense that it only assesses the accuracy of the classifier. It does not say anything about the profit of the campaign, nor about the optimal % of customers that should be targeted (i.e. you cannot choose it from the curve).
On the other hand, it does show that the accuracy of this “traditional” classifier is at least decent.
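The cumulative response curve described above can be sketched in a few lines: sort the customers by classifier score, best first, and accumulate the conversions. This is a minimal Python illustration on made-up scores and outcomes; the function names are hypothetical and it is not the code used to produce the figure.

```python
def cumulative_response(scores, converted):
    """Return (fraction_targeted, cumulative_conversions) points, best score first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n = len(order)
    points = [(0.0, 0)]
    total = 0
    for rank, i in enumerate(order, start=1):
        total += converted[i]
        points.append((rank / n, total))
    return points


def random_baseline(converted, fractions):
    """Expected cumulative conversions when customers are taken at random."""
    total = sum(converted)
    return [(f, f * total) for f in fractions]


# Made-up toy data: classifier scores and observed conversions (1 = converted).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
converted = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

curve = cumulative_response(scores, converted)              # the "blue" curve
baseline = random_baseline(converted, [f for f, _ in curve])  # the "red" curve

# Positive gaps mean the classifier finds converters earlier than random order.
gaps = [c - b for (_, c), (_, b) in zip(curve, baseline)]
print(curve[-1], baseline[-1], sum(gaps) > 0)
```

Both curves end at the same point, since targeting everybody yields the same conversions whatever the ordering.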
The curve below is the “Qini curve” for the uplift classifier; like the cumulative response, it accumulates the number of conversions against the fraction of customers that have been targeted, but it gives no credit for the conversions seen in the control group (i.e. for the customers who would have converted anyway, regardless of the marketing action).
In this figure:
Again the red curve represents the random classifier
The blue curve represents the accumulated conversions in the order defined by the uplift classifier
The black curve represents the ideal way to order the customers: first, all the customers that convert because they have been targeted (1st slope up), then customers that do not convert (flat slice), and finally all the customers that convert even if they are not targeted (and thus should not be targeted, 2nd slope down)
Again, the bigger the area under the classifier (blue) curve and over the random (red) curve, the better, since it means that the classifier orders the customers well according to the incremental probability of conversion; thus, for a large area, targeting a small(er) percentage of the customers will bring big(ger) profits
If 100% of customers are targeted, although the total number of conversions would again be the baseline 12%, the incremental conversion is around 5%, and that’s what the curve accumulates. (15% of targeted customers converted, but around 10% in the control group would have converted anyway without being targeted)
Once again, this curve is “technical”: it only assesses the accuracy of the classifier and does not imply profits. The accuracy of this classifier is not outstanding but ok; the reason why will become clearer when going into the details of the fit.
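One common way to compute such a curve is to walk down the customers in the order given by the uplift score and, at each step, subtract the control conversions (rescaled to the size of the treated group seen so far) from the treated conversions. The Python below is a minimal sketch of that formulation on made-up data; it is an assumption that this matches the exact definition used for the figure, and the function name is hypothetical.

```python
def qini_curve(uplift_scores, treated, converted):
    """Return (fraction_targeted, incremental_conversions) points.

    At each cutoff: treated conversions minus control conversions,
    with the control count scaled by n_treated / n_control so far.
    """
    order = sorted(range(len(uplift_scores)),
                   key=lambda i: uplift_scores[i], reverse=True)
    n = len(order)
    n_t = n_c = r_t = r_c = 0
    points = [(0.0, 0.0)]
    for rank, i in enumerate(order, start=1):
        if treated[i]:
            n_t += 1
            r_t += converted[i]
        else:
            n_c += 1
            r_c += converted[i]
        # No control customers seen yet: nothing to subtract.
        scaled_control = r_c * n_t / n_c if n_c else 0.0
        points.append((rank / n, r_t - scaled_control))
    return points


# Made-up toy data: uplift scores, treatment flags, observed conversions.
uplift_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
treated = [1, 0, 1, 0, 1, 0, 1, 0]
converted = [1, 0, 1, 0, 0, 0, 0, 1]

qini = qini_curve(uplift_scores, treated, converted)
print(qini)
```

In this toy example the curve climbs to 2 incremental conversions and then drops to 1 at the end, because the lowest-ranked customer converts without being targeted.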
From these two curves, there is no way to decide between the “traditional” and the uplift model. Of course, when drawing the Qini curve for the “traditional” classifier (look at the R code section at the bottom), the results are worse compared to the uplift classifier, but the question is: is the real business scenario expressed better by the Qini curve? If one is worried about high targeting costs (e.g. a discount or an offer when customers are targeted), or about “sleeping dogs” (customers that have a lower probability of conversion when targeted), then intuitively the Qini curve and the uplift model should behave better.
However, assuming that the traditional classifier might have pretty good accuracy (although comparing accuracies is a difficult matter), how high does the targeting cost need to be for the uplift classifier to be significantly better than the traditional one? Could it be that a good traditional classifier makes the same (or more) profit when costs are low enough? Analogously, what exactly would be the impact of “sleeping dogs”?
A more accurate comparison can be done with profit curves. There is just one additional difficulty: a business scenario must be defined, using 3 parameters:
profit.tp: the profit of a conversion (or, in a churn problem, the profit when the customer stays)
targeting.cost: the cost of targeting a customer, because e.g. a discount or an offer is made
cost.fp: the cost of targeting customers that do not convert (or do not stay), i.e. they do not accept the discount, and they could be “sleeping dogs”
One suitable scenario for the analysis would be a mobile operator dealing with churn. Operators often offer a new terminal, which has a high cost, to customers who might be about to leave. Let’s assume that the profit of a customer staying is 400, that the targeting cost is 220 (i.e. basically the terminal), and that the average cost of a failed target is 1 (due e.g. to call center costs). With those numbers, the profit comparison will show a clear advantage for the fitted uplift classifier.
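To make the three parameters concrete, here is a minimal Python sketch of a profit calculation for one targeted group. The formula is an assumption, one plausible reading of the parameters and not necessarily the one used in the R code of this post: only incremental conversions earn profit.tp, every targeted customer who converts costs targeting.cost (they accept the terminal), and every targeted customer who does not convert costs cost.fp. The customer counts are made up for illustration.

```python
def campaign_profit(n_conv_targeted, n_nonconv_targeted, n_conv_anyway,
                    profit_tp=400.0, targeting_cost=220.0, cost_fp=1.0):
    """Profit of targeting a group, under the assumptions stated above.

    n_conv_targeted:    targeted customers who convert (or stay)
    n_nonconv_targeted: targeted customers who do not convert
    n_conv_anyway:      converters who would have converted without targeting
    """
    incremental = n_conv_targeted - n_conv_anyway
    return (incremental * profit_tp
            - n_conv_targeted * targeting_cost
            - n_nonconv_targeted * cost_fp)


# Made-up selections of 1000 customers each, using the scenario's parameters.
# A: mostly persuadable customers (what an uplift ordering aims for).
profit_a = campaign_profit(n_conv_targeted=150, n_nonconv_targeted=850,
                           n_conv_anyway=50)
# B: mostly customers who would have stayed anyway.
profit_b = campaign_profit(n_conv_targeted=300, n_nonconv_targeted=700,
                           n_conv_anyway=250)
print(profit_a, profit_b)
```

Under these assumptions, selection B has more conversions but loses money, because the expensive terminal is mostly handed to customers who would have stayed anyway.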