This post assumes that you use Google Analytics for monitoring the traffic of your website. You probably know that, besides the graphs and dashboards, the tool provides an API for downloading the data, and one can analyze the traffic further using R (among many other means).
As an example of such deeper analysis, this post explores mining Association Rules over the pages of the web site that the users have visited. The algorithm provides sets of pages that are commonly visited together, which could be useful for gaining insight into, for example:
the structure of the site as seen by the users
links that could be added to certain pages to improve the user experience
the seed for a dynamic recommendation system (of a large web site)
This post will go through the steps for setting up Google Analytics, downloading the data from GA into R, and implementing the analysis with the package arules, using a practical example. It can be followed as a tutorial; it does not require deep knowledge of GA or R, even though it uses some “non-beginner” features such as GA custom dimensions and the R package data.table.
Quite surprisingly, it is difficult to find an open dataset suitable for this task. The Wikipedia Clickstream dataset, available e.g. here, does not contain user ids, and neither does the U.S. open GA data, to cite just a couple of interesting and well-known open datasets. This blog itself uses GA with all the configuration described here, but unfortunately its structure is not complex enough to make the analysis worthwhile (nor is its traffic volume significant enough anyway).
In the UCI Machine Learning Repository there is an interesting dataset of anonymous Microsoft Web Data. It is from 1998 (just around the time Google was founded as a company!), and contains data for around 38K randomly selected anonymous users and all the “areas” of the web site that each user visited during one week of February of that year. The data actually comes from server logs (of course, Google Analytics did not exist yet), and the cleanup and conversion of the data into a structure equivalent to what GA would provide is at the bottom of the post, since it obviously is the least interesting step.
There is a complete list of references at the bottom of the page. Short preview:
GA provides quite detailed help pages, and in case of doubt there is always http://stackoverflow.com/
“The Elements of Statistical Learning”, available on-line for free, provides an accessible explanation on Association Rules
There are many resources on how to use apriori with R in practice, and also quite a few papers on recommendation systems for clickstream data
At the time of this writing, the Google Analytics web tool already provides a User Explorer view, including info on sessions from users identified by an internal id. However, the pages visited per user cannot be seen in this view, and the clientIds generated by the tool cannot be downloaded through the GA API either. See here for reference.
To add info on users that can be queried later through the GA API, an easy solution is to use custom dimensions. In any case, you have to add code to your web pages; this would be an example in JavaScript using localStorage (although cookies could be used as well):
var GA_LOCAL_STORAGE_KEY = 'ga:clientId';
var CLIENT_ID;
if (window.localStorage) {
  // Reuse the clientId kept in localStorage, or let GA generate one on the first visit
  ga('create', 'XXX your tracking ID XXX', 'auto', {
    'storage': 'none',
    'clientId': localStorage.getItem(GA_LOCAL_STORAGE_KEY)
  });
  // Store the clientId assigned by the tracker so that it persists across sessions
  ga(function(tracker) {
    localStorage.setItem(GA_LOCAL_STORAGE_KEY, tracker.get('clientId'));
  });
  CLIENT_ID = localStorage.getItem(GA_LOCAL_STORAGE_KEY);
} else {
  ga('create', 'XXX your tracking ID XXX', 'auto');
  CLIENT_ID = "no local storage";
}
// Send the pageview with the client id as the first custom dimension
ga('send', 'pageview', {
  'dimension1': CLIENT_ID
});
This adds the id of the client to the 1st custom dimension of GA. Of course, you need to add it to every page that you want to track. You also need to configure the custom dimension in the GA web tool, under Admin/Custom Definitions/Custom Dimensions, as in the figure below.
Note that dimension1 in the JavaScript code above and in the R code below corresponds to the index 1 of the custom dimension defined in the configuration.
If it is the first time you use the GA API, you need to get credentials for your user. The procedure may change over time, but you can always rely on help pages like this one. As of today, the steps would be:
Go to https://console.developers.google.com/projectselector/apis/credentials. If you do it from a browser tab in which you have already logged in with your GA user id, you will already see your project for GA tracking
Create a new project, if you don’t have one already
When you select your project and continue, you’ll reach https://console.developers.google.com/apis/credentials, where you can create new credentials. Add credentials selecting Other UI, key type JSON, and OAuth 2.0 client Id
You end up with a client.id and a sort of password called client.secret
Finally, you need to enable the API for the project, at https://console.developers.google.com/projectselector/apis/library. Anyhow, if you skip this step, you will get a very clear error message in your R session on your first access; just follow the instructions
There are a few R packages for easy access to the GA API; this post uses RGoogleAnalytics. The first step in R is to install/load the library and create a token.
library(RGoogleAnalytics)
token <- Auth("XXXX client id XXXX",
"XXXX client secret XXXX")
save(token, file="./token_file")
This generates a token file in the current working directory of R; with this token you don't need to use the id/password anymore, and you can start your scripts by just loading and validating the token.
require(RGoogleAnalytics)
load("./token_file")
ValidateToken(token)
## Access Token successfully updated
The attributes of any query to the GA API are as follows:
start and end dates
dimensions and metrics, see here since not all combinations of dimensions and metrics are allowed
table.id: this is the id of the view, go to GA web tool, Admin/Your property/Your view/View settings
optional parameters: e.g. maximum number of records, sort criteria
A typical query for visits over time could be as follows:
query.time <- Init(start.date=start_date,
end.date=end_date,
dimensions=c("ga:date",
"ga:hour",
"ga:minute",
"ga:sourceMedium",
"ga:city",
"ga:country",
"ga:pagePath"),
metrics=c("ga:sessions",
"ga:sessionDuration",
"ga:goalCompletionsAll",
"ga:pageviews",
"ga:newUsers"),
table.id="ga:XXX your view id XXX",
sort="-ga:date,-ga:hour,-ga:minute",
max.results = 10000)
ga.query <- QueryBuilder(query.time)
ga.data.time <- GetReportData(ga.query, token)
ga.data.time[1:2,]
## Status of Query:
## The API returned 15 results
## date hour minute sourceMedium city country
## 1 20160426 17 16 (direct) / (none) Madrid Spain
## 2 20160426 17 13 (direct) / (none) Madrid Spain
## pagePath sessions sessionDuration
## 1 /2016-04-01-uplift-evaluation.html 0 0
## 2 /2016-04-01-uplift-evaluation.html 1 194
## goalCompletionsAll pageviews newUsers
## 1 1 1 0
## 2 1 1 1
The following query uses the custom dimension for the user id and gets the pages visited per user, and thus could be used for the Association Rules analysis.
query.coll <- Init(start.date=start_date,
end.date=end_date,
dimensions=c("ga:date",
"ga:hour",
"ga:minute",
"ga:dimension1",
"ga:pagePath",
"ga:previousPagePath"),
metrics=c("ga:sessions",
"ga:goalCompletionsAll",
"ga:pageviews",
"ga:newUsers"),
table.id="ga:XXX your view id XXX",
sort="-ga:date,-ga:hour,-ga:minute",
max.results = 10000)
ga.query <- QueryBuilder(query.coll)
ga.data.coll <- GetReportData(ga.query, token)
ga.data.coll[1:2,]
## Status of Query:
## The API returned 10 results
## date hour minute dimension1
## 1 20160426 17 16 1591284960.1461683628
## 2 20160426 17 13 1591284960.1461683628
## pagePath previousPagePath
## 1 /2016-04-01-uplift-evaluation.html /2016-04-01-uplift-evaluation.html
## 2 /2016-04-01-uplift-evaluation.html (entrance)
## sessions goalCompletionsAll pageviews newUsers
## 1 0 1 1 0
## 2 1 1 1 1
After this reasonably easy setup, let's analyse the dataset with users' pageviews. As mentioned above, the dataset does not actually come from Google Analytics, but at this point let's assume it has been transformed into a dataframe with an equivalent structure. (You'll find the full R code for loading the dataset further down.)
head(msweb[, c("title", "userId")])
## title userId
## 1 regwiz 10001
## 998 Support Desktop 10001
## 5551 End User Produced View 10001
## 1461 Support Desktop 10002
## 8310 Knowledge Base 10002
## 1463 Support Desktop 10003
where title corresponds to a description of the web page that has been visited. Note that an identical structure can be obtained from the GA data by selecting the right columns:
head(ga.data.coll[, c("pagePath", "dimension1")])
In order to use this with the apriori implementation of the package arules, this data.frame has to be converted to transactions, in which there is one line per user with the list of all the pages that the user has visited.
library(data.table)
msweb.items <- data.table(msweb[, c("title", "userId")])
msweb.items[, l:=.(list(unique(title))), by=userId] # creates list of pages per user, see note
msweb.items <- msweb.items[! duplicated(userId), l] # removes duplicated lines per user and selects only the list of pages
head(msweb.items, 3)
## [[1]]
## [1] regwiz Support Desktop End User Produced View
## 296 Levels: 0 About Microsoft ... WorldWide Offices - US Districts
##
## [[2]]
## [1] Support Desktop Knowledge Base
## 296 Levels: 0 About Microsoft ... WorldWide Offices - US Districts
##
## [[3]]
## [1] Support Desktop Knowledge Base Microsoft.com Search
## 296 Levels: 0 About Microsoft ... WorldWide Offices - US Districts
length(msweb.items)
## [1] 32711
library(arules)
msweb.trans <- as(msweb.items, "transactions") # converts it to transactions
inspect(msweb.trans[1:3])
## items
## 1 {End User Produced View,
## regwiz,
## Support Desktop}
## 2 {Knowledge Base,
## Support Desktop}
## 3 {Knowledge Base,
## Microsoft.com Search,
## Support Desktop}
Note that the transactions do not allow for duplicated elements of the list per user, i.e. if a user has visited a web page twice, that information is lost.
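If you do want to keep that frequency information (for instance to weight the pages later on), a minimal sketch with data.table, assuming the msweb data.frame used above, could count the visits per user and page before collapsing them into transactions:
library(data.table)
# Hypothetical helper table: number of visits per user and page, kept aside
# before the transactions (which deduplicate pages) are built
msweb.counts <- data.table(msweb)[, .(visits = .N), by = .(userId, title)]
head(msweb.counts)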
Once the data is formatted into transactions, let's use the function for plotting item frequencies provided by the arules package.
itemFrequencyPlot(msweb.trans, topN=12)
From the plot we see that the most visited page is, by far, Free Downloads. Products and pages on technical support in general are also heavily visited. The first two could be considered conversions from a web analytics perspective, and the support pages could be a clue of users having problems (although perhaps Windows was more stable back in 1998?).
Finally, to interpret the plot, take into account that a user visits 3.02 pages on average, so the sum of the relative item frequencies is greater than 1.
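Both figures can be checked directly on the transactions object; size() returns the number of items per transaction, and the sum of the relative item frequencies equals the mean transaction size:
mean(size(msweb.trans))          # average number of distinct pages per user
sum(itemFrequency(msweb.trans))  # equals the mean transaction size, hence greater than 1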
Everything is ready for the analysis of Association Rules. However, a short digression would be handy to understand at least the basic concepts.
Association Rules is a popular technique for discovering relations (i.e. frequent associations, connections) in a dataset.
One of the most popular applications is “Market Basket Analysis”, applied in commercial datasets for discovering products that are purchased together by the customers.
Apriori is an algorithm for mining association rules devised by Agrawal et al. in 1994, which can solve the Market Basket Analysis problem in a computationally feasible way. One of the difficulties of the analysis is that the sets of items can be of any size; Apriori exploits the fact that if a set of elements appears frequently in the data, then any subset of that set must be frequent as well. In the implementation, the algorithm starts by evaluating sets of just one element in one pass through all the data, and then in every subsequent pass it looks for sets of increasing size.
As an output, the algorithm provides rules of the form A => B, where both A and B can be sets of elements in the dataset (they can also be single elements). Three main properties of the rules need to be understood:
“support” is the percentage of times that all the items of A and B combined appear in the dataset
“confidence” is the percentage of times where B happens, provided that A has happened
“lift” is the ratio between the confidence of the rule and the support of B alone, i.e. how much more likely it is that A and B happen together than if they occurred independently (a lift of 1 would mean that A and B are actually independent)
Let’s see the 3 concepts in play for an example in the dataset.
## lhs rhs support confidence lift
## 1 {Internet Explorer,
## isapi,
## Windows Family of OSs} => {Free Downloads} 0.0104552 0.8382353 2.530409
This is the rule with the highest support that contains Free Downloads in the rhs, i.e. the right-hand term, or B in the explanation above. The A term is then {Internet Explorer, isapi, Windows Family of OSs}, where, by the way, isapi seems to be a particular kind of DLL which was quite frequently browsed in 1998 and still exists. The 3 concepts for this rule are then:
“support” of 0.01 means that 1% of the sets of pages visited contain the pages Internet Explorer, isapi, Windows Family of OSs and Free Downloads
“confidence” of 0.84 means that 84% of the times that the user has browsed Internet Explorer, isapi and Windows Family of OSs, they have browsed also Free Downloads
“lift” of 2.53 gives an idea of how generalizable the rule is (the lift would be 1 if Internet Explorer, isapi and Windows Family of OSs happened independently of Free Downloads)
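As a sanity check, the three figures can be recomputed by hand from msweb.items, the per-user lists built above; this is just a sketch of the definitions, not how arules computes them internally:
lhs <- c("Internet Explorer", "isapi", "Windows Family of OSs")
rhs <- "Free Downloads"
has.lhs <- sapply(msweb.items, function(pages) all(lhs %in% pages))
has.rhs <- sapply(msweb.items, function(pages) rhs %in% pages)
mean(has.lhs & has.rhs)                                   # support, ~0.0105
sum(has.lhs & has.rhs) / sum(has.lhs)                     # confidence, ~0.84
(sum(has.lhs & has.rhs) / sum(has.lhs)) / mean(has.rhs)   # lift, ~2.53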
apriori allows mining Association Rules as just described; however, as a first step with this kind of data it is a good idea to mine just frequent itemsets, which are sets of items that appear together and thus only have “support”.
rules.tr2 <- apriori(msweb.trans,
parameter=list(minlen=2, support=0.0001, target="frequent"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport support minlen maxlen
## NA 0.1 1 none FALSE TRUE 1e-04 2 10
## target ext
## frequent itemsets FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 3
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[284 item(s), 32711 transaction(s)] done [0.01s].
## sorting and recoding items ... [241 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [0.09s].
## writing ... [179806 set(s)] done [0.02s].
## creating S4 object ... done [0.09s].
rules.tr2
## set of 179806 itemsets
rules.tr2.sorted <- sort(rules.tr2, by="support")
inspect(rules.tr2.sorted[1:5])
## items support
## 4677 {Free Downloads,Internet Explorer} 0.16080218
## 4667 {Free Downloads,Windows Family of OSs} 0.07792486
## 4674 {Free Downloads,isapi} 0.07306411
## 4671 {Free Downloads,Products } 0.06123322
## 4676 {Free Downloads,Microsoft.com Search} 0.06043838
The page Free Downloads appears in the 5 itemsets with the highest support. Let's dig deeper into the itemsets that contain Free Downloads using the function subset.
inspect(subset(rules.tr2.sorted, items %in% "Free Downloads")[1:10])
## items support
## 4677 {Free Downloads,Internet Explorer} 0.16080218
## 4667 {Free Downloads,Windows Family of OSs} 0.07792486
## 4674 {Free Downloads,isapi} 0.07306411
## 4671 {Free Downloads,Products } 0.06123322
## 4676 {Free Downloads,Microsoft.com Search} 0.06043838
## 4662 {Free Downloads,Support Desktop} 0.03640977
## 25376 {Free Downloads,Internet Explorer,Products } 0.03173244
## 4656 {Free Downloads,Knowledge Base} 0.03118217
## 25367 {Free Downloads,isapi,Windows Family of OSs} 0.03026505
## 25379 {Free Downloads,Internet Explorer,isapi} 0.02907279
Note that itemset 25376 contains the items of itemsets 4677 and 4671. A piece of code that removes such redundant itemsets is usually copied and pasted across R tutorials, something like this:
remove_redundant <- function(rules) {
subset.matrix <- is.subset(rules, rules)
subset.matrix[lower.tri(subset.matrix, diag=T)] <- NA
redundant <- colSums(subset.matrix, na.rm=T) >= 1
return(rules[!redundant])
}
inspect(remove_redundant(subset(rules.tr2.sorted, items %in% "Free Downloads")[1:10]))
## items support
## 4677 {Free Downloads,Internet Explorer} 0.16080218
## 4667 {Free Downloads,Windows Family of OSs} 0.07792486
## 4674 {Free Downloads,isapi} 0.07306411
## 4671 {Free Downloads,Products } 0.06123322
## 4676 {Free Downloads,Microsoft.com Search} 0.06043838
## 4662 {Free Downloads,Support Desktop} 0.03640977
## 4656 {Free Downloads,Knowledge Base} 0.03118217
However, when using this you need to take into account that the decision on which line gets removed, in this case 25376 instead of 4677 and 4671, is a bit arbitrary, and that you could be removing information that you are actually interested in. In the end, in the particular case of this example, the interesting question would be: which pages does a user visit together with the conversion page (i.e. the Free Downloads page)?
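One direct way to look at that question, as a sketch, is to keep only the transactions that contain the conversion page (using the %in% operator that arules also provides for transactions) and plot the item frequencies within them:
# Transactions of users who reached the conversion page, and the pages visited along with it
conv.trans <- msweb.trans[msweb.trans %in% "Free Downloads"]
itemFrequencyPlot(conv.trans, topN=10)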
One can also access the internal elements of the itemsets object provided by arules; use unclass to take a look at the structure:
all <- subset(rules.tr2.sorted, items %in% "Free Downloads")
unclass(all[1:5])
attr(all[1:5], "quality") # dataframe with one column for support
attr(all[1:5], "items") # this is an itemMatrix object that contains the list of itemsets
as(as(attr(all[1:5], "items"), "transactions"), "data.frame") # conversion to a data.frame
In a similar way, one could investigate which pages are visited together with another conversion, the Training page.
inspect(subset(rules.tr2.sorted, items %in% "Training")[1:10])
## items support
## 4146 {isapi,Training} 0.009690930
## 4147 {Microsoft.com Search,Training} 0.005839014
## 4149 {Free Downloads,Training} 0.004493901
## 4143 {Support Desktop,Training} 0.003393354
## 4145 {Products ,Training} 0.003362783
## 4148 {Internet Explorer,Training} 0.003271071
## 21280 {isapi,Microsoft.com Search,Training} 0.003026505
## 4144 {Training,Windows Family of OSs} 0.002690227
## 21282 {Free Downloads,isapi,Training} 0.002353948
## 21276 {isapi,Products ,Training} 0.001987099
Let’s figure out as well the most common itemsets related to technical support pages.
supp_terms <- as.character(attribute.lines$title[attribute.lines$title %like% "Support"]) # all pages that contain the keyword
inspect(subset(rules.tr2.sorted, items %in% supp_terms)[1:10])
## items support
## 4659 {isapi,Support Desktop} 0.05942955
## 4650 {Knowledge Base,Support Desktop} 0.05521079
## 4660 {Microsoft.com Search,Support Desktop} 0.04857693
## 4638 {isapi,Windows95 Support} 0.04607013
## 4662 {Free Downloads,Support Desktop} 0.03640977
## 25327 {isapi,Knowledge Base,Support Desktop} 0.03323041
## 4636 {Windows Family of OSs,Windows95 Support} 0.03283299
## 4658 {Products ,Support Desktop} 0.03240500
## 4635 {Support Desktop,Windows95 Support} 0.02959249
## 25283 {isapi,Windows Family of OSs,Windows95 Support} 0.02873651
After the frequent itemsets, let's now mine the rules provided by apriori, of the form A => B.
rules.tr3 <- apriori(msweb.trans,
parameter=list(minlen=2, support=0.0001, target="rules"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport support minlen maxlen
## 0.8 0.1 1 none FALSE TRUE 1e-04 2 10
## target ext
## rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 3
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[284 item(s), 32711 transaction(s)] done [0.01s].
## sorting and recoding items ... [241 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [0.10s].
## writing ... [281346 rule(s)] done [0.06s].
## creating S4 object ... done [0.12s].
rules.tr3.sorted <- sort(rules.tr3, by="lift")
inspect(rules.tr3.sorted[1:5])
## lhs rhs support confidence lift
## 1 {SNA Support} => {SNA Server} 0.0001528538 1.0 1362.9583
## 2 {SNA Support,
## Support Desktop} => {SNA Server} 0.0001222830 1.0 1362.9583
## 3 {Free Downloads,
## isapi,
## MS Schedule+ News,
## Support Desktop} => {MS Schedule+} 0.0001222830 0.8 872.2933
## 4 {Free Downloads,
## isapi,
## Knowledge Base,
## MS Schedule+ News,
## Support Desktop} => {MS Schedule+} 0.0001222830 0.8 872.2933
## 5 {Latin America Region,
## Windows95 Support} => {Argentina} 0.0001222830 0.8 817.7750
inspect(sort(rules.tr3, by="support")[1:5])
## lhs rhs support confidence lift
## 1 {Windows95 Support} => {isapi} 0.04607013 0.8414294 5.163977
## 2 {Windows 95} => {Windows Family of OSs} 0.03243557 0.9146552 6.464841
## 3 {Windows Family of OSs,
## Windows95 Support} => {isapi} 0.02873651 0.8752328 5.371433
## 4 {SiteBuilder Network Membership} => {Internet Site Construction for Developers} 0.02729969 0.8045045 8.172716
## 5 {Free Downloads,
## Windows95 Support} => {isapi} 0.02464003 0.9056180 5.557912
Note that for such a low support threshold, it does not really make sense to order the rules by lift: these are rules that indeed have a very high lift, but hardly ever happen. On the other hand, when the rules are sorted by support, note that the lift of the high-support rules is still quite significant.
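A practical middle ground, as a sketch, is to first filter the rules by a more demanding support threshold and only then sort them by lift:
# Rules that are both reasonably frequent and much stronger than chance
inspect(sort(subset(rules.tr3, support > 0.01), by="lift")[1:5])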
Let's find again the rules that involve Free Downloads on the right-hand side of the rule A => B. apriori can be configured through the appearance argument to get only the rules with a certain rhs.
inspect(sort(apriori(msweb.trans,
parameter=list(minlen=2, support=0.0001, target="rules"),
appearance=list(rhs="Free Downloads", default="lhs")), by="support")[1:5])
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport support minlen maxlen
## 0.8 0.1 1 none FALSE TRUE 1e-04 2 10
## target ext
## rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 3
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[284 item(s), 32711 transaction(s)] done [0.01s].
## sorting and recoding items ... [241 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [0.10s].
## writing ... [42229 rule(s)] done [0.01s].
## creating S4 object ... done [0.03s].
## lhs rhs support confidence lift
## 1 {Internet Explorer,
## isapi,
## Windows Family of OSs} => {Free Downloads} 0.010455199 0.8382353 2.530409
## 2 {Internet Explorer,
## Windows95 Support} => {Free Downloads} 0.008009538 0.8061538 2.433564
## 3 {Internet Explorer,
## isapi,
## Windows95 Support} => {Free Downloads} 0.007459264 0.8472222 2.557538
## 4 {Internet Explorer,
## Windows Family of OSs,
## Windows95 Support} => {Free Downloads} 0.006542142 0.8492063 2.563528
## 5 {Internet Explorer,
## isapi,
## Windows Family of OSs,
## Windows95 Support} => {Free Downloads} 0.006144722 0.8815789 2.661252
Alternatively, as with the itemsets, one could also dig into the output object using the %in% operator on the rhs:
inspect(subset(rules.tr3.sorted, rhs %in% "Free Downloads")[1:5])
Something similar can be done for the rules that contain Training in the rhs.
inspect(sort(apriori(msweb.trans,
parameter=list(minlen=2, support=0.0001, target="rules"),
appearance=list(rhs="Training", default="lhs")), by="support")[1:5])
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport support minlen maxlen
## 0.8 0.1 1 none FALSE TRUE 1e-04 2 10
## target ext
## rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 3
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[284 item(s), 32711 transaction(s)] done [0.01s].
## sorting and recoding items ... [241 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [0.09s].
## writing ... [57 rule(s)] done [0.01s].
## creating S4 object ... done [0.02s].
## lhs rhs support confidence lift
## 1 {isapi,
## Mastering Series,
## Visual Basic} => {Training} 0.0002139953 1.0000000 45.68575
## 2 {Free Downloads,
## isapi,
## Mastering Series} => {Training} 0.0001834245 0.8571429 39.15922
## 3 {Free Downloads,
## Internet Explorer,
## Mastering Series} => {Training} 0.0001834245 0.8571429 39.15922
## 4 {isapi,
## Mastering Series,
## Products } => {Training} 0.0001528538 0.8333333 38.07146
## 5 {Free Downloads,
## isapi,
## Mastering Series,
## Visual Basic} => {Training} 0.0001528538 1.0000000 45.68575
Here the rules have considerable lift but very little support. Let’s review the first one in more detail:
support of 0.0002 means that isapi, Mastering Series, Visual Basic and Training appear together 0.02% of the time
confidence of 1 means that every time that a user visits the pages isapi, Mastering Series and Visual Basic, they visit the Training page as well
lift of 45.69 indicates that the rule is very probably generalizable (it would be 1 if isapi, Mastering Series and Visual Basic were visited independently of Training)
Take into account that the overall support of Training is pretty small, around 2%:
sum(unlist(msweb.items) %in% "Training")/length(msweb.items)
## [1] 0.02188866
thus the rules obtained will have even smaller support. Anyhow, they could give a good idea of what users are often interested in when they are looking for training on the web site.
Finally, these would be the rules for users looking for terms related to technical support.
inspect(sort(subset(rules.tr3.sorted, rhs %in% supp_terms), by="support")[1:5])
## lhs rhs support confidence lift
## 1 {Support Network Program Information} => {Support Desktop} 0.008957232 0.8542274 6.277833
## 2 {isapi,
## Support Network Program Information} => {Support Desktop} 0.005441595 0.9035533 6.640335
## 3 {isapi,
## Knowledge Base,
## Microsoft.com Search,
## Products } => {Support Desktop} 0.005013604 0.8039216 5.908128
## 4 {Knowledge Base,
## Support Network Program Information} => {Support Desktop} 0.004310477 0.9276316 6.817290
## 5 {isapi,
## Knowledge Base,
## NT Server Support} => {Support Desktop} 0.003760203 0.8310811 6.107727
Apriori can be a good exploratory tool for digging into the structure of the web site as seen by the users. If a set of pages of the web site can be seen as conversions of any kind, then the itemsets and the rules may allow you to understand which other pages the users who converted have also visited.
However, note that all along this post there is the assumption that a user visiting a page means that they are interested in it, which is not necessarily so. Sticking to Google Analytics, there is additional info that you could use to try to assess interest:
info on how many times a user has visited a page
assuming the page is not the exit page (the last page in the session), you could get the time that the user has spent on it (see the query sketch after this list)
when the page is actually the bounce page, it could be an indication of lack of interest from the user
using ga:previousPagePath
, as in the queries above, you could get the full picture on every visit; order of page visits could matter
also, depending on the web page, you could add other events to be tracked that could become additional hints
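As an illustration, a sketch of such a query, reusing the same RGoogleAnalytics calls as above, could pull the time on page and the exits per user and page; the metric combination here is an assumption to be checked against the GA dimensions and metrics reference:
# Hedged sketch: engagement metrics per user (custom dimension) and page
query.interest <- Init(start.date=start_date,
                       end.date=end_date,
                       dimensions=c("ga:dimension1",
                                    "ga:pagePath"),
                       metrics=c("ga:pageviews",
                                 "ga:timeOnPage",
                                 "ga:exits"),
                       table.id="ga:XXX your view id XXX",
                       max.results = 10000)
ga.data.interest <- GetReportData(QueryBuilder(query.interest), token)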
Anyway, apriori should be taken as an initial exploratory tool, since it cannot use any measure of interest from the user. When introducing some measure of the user's interest in each page, something closer to a collaborative filtering approach could be set up.
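As a pointer in that direction, the recommenderlab package listed in the references works on user x item rating matrices; a rough, untested sketch, where the coercion path and the "AR" (association rules) method are assumptions to verify against the package documentation:
library(recommenderlab)
# Assumption: coerce the transactions to a logical matrix and then to a binaryRatingMatrix
rating.matrix <- as(as(msweb.trans, "matrix"), "binaryRatingMatrix")
rec <- Recommender(rating.matrix, method="AR")   # recommender based on association rules
predict(rec, rating.matrix[1:3], n=3)            # top-3 page suggestions for the first 3 users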
Look at this link for details on how the dataset is structured. It comes from server logs, and:
lines starting with A are attribute lines that contain the tree structure of the web site
lines starting with C are case lines, each corresponding to a user, and are followed by vote lines (starting with V), which are visits from that same user
The number of fields on each line depends on the type of line; a simple way to read the file is by adding column names:
msweb.orig <- read.csv("~/Downloads/clickstream/anonymous-msweb.data",
header=FALSE,
sep=",",
col.names=c("V1", "V2", "V3", "V4", "V5", "V6"))
The attribute lines are pretty straightforward to process,
attribute.lines <- msweb.orig[msweb.orig$V1 == "A", c("V2", "V4", "V5")]
colnames(attribute.lines) <- c("id", "title", "url")
head(attribute.lines)
## id title url
## 8 1287 International AutoRoute /autoroute
## 9 1288 library /library
## 10 1289 Master Chef Product Information /masterchef
## 11 1297 Central America /centroam
## 12 1215 For Developers Only Info /developer
## 13 1279 Multimedia Golf /msgolf
However the case and vote lines require some post-processing to convert them to a structure similar to the output of Google Analytics.
casevote.lines <- msweb.orig[msweb.orig$V1 == "C" | msweb.orig$V1 == "V", c("V1", "V2")]
head(casevote.lines)
## V1 V2
## 302 C 10001
## 303 V 1000
## 304 V 1001
## 305 V 1002
## 306 C 10002
## 307 V 1001
The transformation is not really complicated: after a C line for a user, all the V lines until the next C correspond to pages visited by that user. However, a direct implementation of that logic can take a long time to run (up to hours). The implementation below uses data.table and the function shift in a loop, and runs in a few seconds: shifting by one row resolves the parents of visits of only one page, shifting by one more row resolves the parents of two-page visits, and so on until all parents are resolved.
casevote.lines$rownames <- as.numeric(rownames(casevote.lines)) # add indexes as a new column
dt <- data.table(casevote.lines[, !(names(casevote.lines) %in% "parent")]) # convert to data.table, excluding the parent column from a previous trial
head(dt)
## V1 V2 rownames
## 1: C 10001 302
## 2: V 1000 303
## 3: V 1001 304
## 4: V 1002 305
## 5: C 10002 306
## 6: V 1001 307
dt[V1 == "C", parent:=rownames, by=rownames] # parents of C lines are themselves
difference <- 1
while (sum(is.na(dt$parent)) > 0) {
dt[, shiftV1 := shift(V1, difference)] # shift by "difference"
dt[shiftV1 == "C" & is.na(parent),
parent:=(rownames-difference), by=rownames] # set parent value of visits with n. of pages == "difference"
difference <- difference + 1
}
casevote.lines$parent = dt$parent
head(casevote.lines)
## V1 V2 rownames parent
## 302 C 10001 302 302
## 303 V 1000 303 302
## 304 V 1001 304 302
## 305 V 1002 305 302
## 306 C 10002 306 306
## 307 V 1001 307 306
After that, a simple merge gives a line per user and page visited,
clicks <- merge(x=casevote.lines[casevote.lines$V1 == "C", c("V2", "parent")],
y=casevote.lines[casevote.lines$V1 == "V", c("V2", "parent")],
by="parent")
colnames(clicks) <- c("remove", "userId", "attributeId")
head(clicks)
## remove userId attributeId
## 1 302 10001 1000
## 2 302 10001 1001
## 3 302 10001 1002
## 4 306 10002 1001
## 5 306 10002 1003
## 6 309 10003 1001
And finally, a merge with attribute.lines gives the pages visited in a human-readable way.
msweb <- merge(x=clicks[,c("userId", "attributeId")],
y=attribute.lines,
by.x=c("attributeId"),
by.y=c("id"))
msweb <- msweb[order(msweb$userId),]
head(msweb)
## attributeId userId title url
## 1 1000 10001 regwiz /regwiz
## 998 1001 10001 Support Desktop /support
## 5551 1002 10001 End User Produced View /athome
## 1461 1001 10002 Support Desktop /support
## 8310 1003 10002 Knowledge Base /kb
## 1463 1001 10003 Support Desktop /support
where msweb is the starting point of the analysis above.
– Hastie T., Tibshirani R. & Friedman J., The Elements of Statistical Learning (a quite accessible explanation of Association Rules)
– Leskovec J., Rajaraman A. & Ullman J., Mining of Massive Datasets (details on market basket analysis implementations for big volumes)
– Hahsler M., recommenderlab: A Framework for Developing and Testing Recommendation Algorithms (a framework for recommendation algorithms, good summary)
– Mobasher B., Dai H., Luo T. & Nakagawa M., Improving the Effectiveness of Collaborative Filtering on Anonymous Web Usage Data