Sisifo - Arules clickstream

Introduction & purpose

This post assumes that you use Google Analytics for monitoring the traffic of your website. You probably know that, besides the graphs and dashboards, the tool provides an API for downloading the data, and one can analyze the traffic further using R (among many other means).

As an instance of a deeper analysis, this post explores the possibility of finding Association Rules over the pages of the web site that the users have visited. The algorithm provides sets of pages that are commonly visited together, which could be useful for finding insights e.g. on:

the structure of the site as seen by the users
links that could be added to certain pages, which could be of interest and would improve the user experience
the seed for a dynamic recommendation system (of a large web site)

This post will go through the steps for setting up Google Analytics, downloading the data from GA to R, and implementing the analysis using the package arules, with a practical example. It may be followed as a tutorial and does not require deep knowledge of GA or R, even though it uses some less basic features such as GA custom dimensions and the R package data.table.

Dataset used

Surprisingly, it is quite difficult to find an open dataset suitable for this task. The Wikipedia Clickstream dataset, available e.g. here, does not contain user ids, and neither does U.S. open GA data, to cite just a couple of interesting and well-known open datasets. This blog uses GA with all the configuration described here, but unfortunately it does not have a complex enough structure to make the analysis worthwhile (nor a significant enough volume anyway).

In the UCI Machine Learning Repository there is an interesting dataset of anonymous Microsoft Web Data. This is from 1998 (just about the time when Google was started as a company!!), and contains data for around 38K anonymous users, randomly selected, and all the “areas” of the web site that each user visited during one week of February of that year. Data is actually taken from machine logs (of course Google Analytics did not exist yet), and the cleanup and conversion of the data into a structure equivalent to what GA would provide is at the bottom of the post, since it is obviously the least interesting step.

References

There is a complete list of references at the bottom of the page. Short preview:

GA provides quite detailed help pages, and in case of doubt http://stackoverflow.com/
“The Elements of Statistical Learning”, available on-line for free, provides an accessible explanation on Association Rules
There are many resources on how to use apriori with R in practice, and also quite a few papers on Recommendation Systems on Clickstream data

Setting up Google Analytics for tracking users

At the time of this writing, Google Analytics web tool already provides a User Explorer view, including info on sessions from users identified by an internal id. However, the pages visited per user cannot be seen using this view, and the clientId’s generated by the tool cannot be downloaded through GA API either. See here for reference.

To add info on users that can be queried later through the GA API, an easy solution is to use custom dimensions. In the end, you have to add code into your web pages; this would be an example in JavaScript using localStorage (although cookies could be used as well):

    var GA_LOCAL_STORAGE_KEY = 'ga:clientId';
    var CLIENT_ID;

    if (window.localStorage) {
      ga('create', 'XXX your tracking ID XXX', 'auto', {
        'storage': 'none',
        'clientId': localStorage.getItem(GA_LOCAL_STORAGE_KEY)
      });
      ga(function(tracker) {
        localStorage.setItem(GA_LOCAL_STORAGE_KEY, tracker.get('clientId'));
      });
      CLIENT_ID = localStorage.getItem(GA_LOCAL_STORAGE_KEY);
    }
    else {
      ga('create', 'XXX your tracking ID XXX', 'auto');
      CLIENT_ID = "no local storage";
    }

    ga('send', 'pageview', {
      'dimension1': CLIENT_ID
    });

This adds the id of the client to the 1st custom dimension of GA. Of course, you need to add it to every page that you want to track. You also need to configure the custom dimension in the GA web tool, under Admin/Custom Definitions/Custom Dimensions, as in the figure below.

Note that dimension1 in the JavaScript code above and the R code below correspond to the index equal to 1 in the variable defined in the configuration.

Downloading data from GA to R

If it is the first time you use GA API, you need to get credentials for your user. The procedure may change over time, but you can always rely on help pages like this one. Anyhow, as of today the steps would be:

Go to https://console.developers.google.com/projectselector/apis/credentials. If you do it from a browser tab in which you are already logged in with your GA user id, you will already see your project for GA tracking
Create a new project, if you don’t have one already
When you select your project and continue, you’ll reach https://console.developers.google.com/apis/credentials, where you can create new credentials. Add credentials selecting Other UI, key type JSON, and OAuth 2.0 client Id
You end up with a client.id and a sort of password called client.secret
Finally, you need to enable the API for the project, at https://console.developers.google.com/projectselector/apis/library. Anyhow, if you skip this step, you will get a very clear error message in your R session when you do your first access; just follow the instructions

There are a few R packages for easy access to the GA API; this post uses RGoogleAnalytics. The first step in R is to install/load the library and create a token.

library(RGoogleAnalytics)

token <- Auth("XXXX client id     XXXX",
              "XXXX client secret XXXX")

save(token, file="./token_file")

This generates a token file in the current working directory of R. With this token you don’t need to use the id/password anymore; you may start your script by just reading and validating the token.

require(RGoogleAnalytics)

load("./token_file")
ValidateToken(token)

## Access Token successfully updated

The attributes of any query to the GA API are as follows:

start and end dates
dimensions and metrics, see here since not all combinations of dimensions and metrics are allowed
table.id: this is the id of the view, go to GA web tool, Admin/Your property/Your view/View settings
optional: e.g. maximum number of registers, sort criteria

One typical query for visits per time could be as follows:

query.time <- Init(start.date=start_date,
                   end.date=end_date,
                   dimensions=c("ga:date",
                                "ga:hour",
                                "ga:minute",
                                "ga:sourceMedium",
                                "ga:city",
                                "ga:country",
                                "ga:pagePath"),
                   metrics=c("ga:sessions",
                             "ga:sessionDuration",
                             "ga:goalCompletionsAll",
                             "ga:pageviews",
                             "ga:newUsers"),
                   table.id="ga:XXX your view id XXX",
                   sort="-ga:date,-ga:hour,-ga:minute",
                   max.results = 10000)

ga.query <- QueryBuilder(query.time)
ga.data.time <- GetReportData(ga.query, token)
ga.data.time[1:2,]

## Status of Query:
## The API returned 15 results

##       date hour minute      sourceMedium   city country
## 1 20160426   17     16 (direct) / (none) Madrid   Spain
## 2 20160426   17     13 (direct) / (none) Madrid   Spain
##                             pagePath sessions sessionDuration
## 1 /2016-04-01-uplift-evaluation.html        0               0
## 2 /2016-04-01-uplift-evaluation.html        1             194
##   goalCompletionsAll pageviews newUsers
## 1                  1         1        0
## 2                  1         1        1

The following query uses the custom dimension for the user id, and gets the pages visited per user, thus could be used for the Association Rules analysis.

query.coll <- Init(start.date=start_date,
                   end.date=end_date,
                   dimensions=c("ga:date",
                                "ga:hour",
                                "ga:minute",
                                "ga:dimension1",
                                "ga:pagePath",
                                "ga:previousPagePath"),
                   metrics=c("ga:sessions",
                             "ga:goalCompletionsAll",
                             "ga:pageviews",
                             "ga:newUsers"),
                   table.id="ga:XXX your view id XXX",
                   sort="-ga:date,-ga:hour,-ga:minute",
                   max.results = 10000)

ga.query <- QueryBuilder(query.coll)
ga.data.coll <- GetReportData(ga.query, token)
ga.data.coll[1:2,]

## Status of Query:
## The API returned 10 results

##       date hour minute            dimension1
## 1 20160426   17     16 1591284960.1461683628
## 2 20160426   17     13 1591284960.1461683628
##                             pagePath                   previousPagePath
## 1 /2016-04-01-uplift-evaluation.html /2016-04-01-uplift-evaluation.html
## 2 /2016-04-01-uplift-evaluation.html                         (entrance)
##   sessions goalCompletionsAll pageviews newUsers
## 1        0                  1         1        0
## 2        1                  1         1        1

Association Rules analysis

After this reasonably easy setup, let’s analyse the dataset with users’ pageviews. As mentioned above, the dataset does not actually come from Google Analytics, but at this point let’s assume it’s been transformed to a dataframe with an equivalent structure. (You’ll find the full R code for loading the dataset further down).

head(msweb[, c("title", "userId")])

##                       title userId
## 1                    regwiz  10001
## 998         Support Desktop  10001
## 5551 End User Produced View  10001
## 1461        Support Desktop  10002
## 8310         Knowledge Base  10002
## 1463        Support Desktop  10003

where title corresponds to a description of the web page that has been visited. Note that this will be identical to the GA data using the right columns:

head(ga.data.coll[, c("pagePath", "dimension1")])

In order to use this with the apriori implementation of the package arules, this data.frame has to be converted to transactions, in which there is one line per user and a field with a list of all the pages that the user has visited.

library(data.table)
msweb.items <- data.table(msweb[, c("title", "userId")])
msweb.items[, l:=.(list(unique(title))), by=userId] # creates list of pages per user, see note
msweb.items <- msweb.items[! duplicated(userId), l] # removes duplicated lines per user and selects only the list of pages
head(msweb.items, 3)

## [[1]]
## [1] regwiz                 Support Desktop        End User Produced View
## 296 Levels:  0 About Microsoft  ... WorldWide Offices - US Districts
##
## [[2]]
## [1] Support Desktop Knowledge Base
## 296 Levels:  0 About Microsoft  ... WorldWide Offices - US Districts
##
## [[3]]
## [1] Support Desktop      Knowledge Base       Microsoft.com Search
## 296 Levels:  0 About Microsoft  ... WorldWide Offices - US Districts

length(msweb.items)

## [1] 32711

library(arules)
msweb.trans <- as(msweb.items, "transactions") # converts it to transactions
inspect(msweb.trans[1:3])

##   items
## 1 {End User Produced View,
##    regwiz,
##    Support Desktop}
## 2 {Knowledge Base,
##    Support Desktop}
## 3 {Knowledge Base,
##    Microsoft.com Search,
##    Support Desktop}

Note that the transactions do not allow for duplicated elements of the list per user, i.e. if a user has visited a web page twice, that information is lost.
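As a rough illustration of the conversion and of what is lost, the following base R sketch builds the same kind of per-user page list with split() and unique(), and separately counts repeat visits. The toy data frame and the names visits, pages.per.user and visit.counts are made up for the example; they are not part of the post's dataset.

```r
# Toy stand-in for the msweb data.frame (hypothetical values).
visits <- data.frame(
  userId = c(10001, 10001, 10001, 10002, 10002, 10002),
  title  = c("regwiz", "Support Desktop", "regwiz",
             "Support Desktop", "Knowledge Base", "Knowledge Base"),
  stringsAsFactors = FALSE
)

# One character vector of distinct pages per user; unique() drops repeat visits.
pages.per.user <- lapply(split(visits$title, visits$userId), unique)

# Repeat visits are only available if they are counted before deduplicating.
visit.counts <- table(visits$userId, visits$title)

pages.per.user
visit.counts
```

A list of character vectors like pages.per.user has the same shape as the list that is coerced to transactions above, while visit.counts keeps the information on repeated visits that the transactions discard.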

Once the data is formatted into transactions, let’s use the function for plotting item frequencies provided by the arules package.

itemFrequencyPlot(msweb.trans, topN=12)

From the plot we see that the most visited page is Free Downloads, by far. Products and the pages on technical support in general are also visited quite often. The first 2 could be considered conversions from a web analytics perspective, and the support pages could be clues of users having problems (although possibly Windows was quite a bit more stable in 1998??)

Finally, to interpret the plot, take into account that a user visits 3.02 pages on average, so the relative item frequencies add up to more than 1.
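A small base R sketch, on made-up baskets, of why the relative frequencies add up to the average number of pages per user rather than to 1 (the names baskets and item.freq are illustrative only):

```r
# Three hypothetical users and the distinct pages each one visited.
baskets <- list(
  c("Free Downloads", "Internet Explorer", "isapi"),
  c("Free Downloads", "Products"),
  c("Support Desktop")
)

n <- length(baskets)

# Relative frequency of each page: share of users whose basket contains it.
item.freq <- table(unlist(baskets)) / n

sum(item.freq)          # total of the relative frequencies
mean(lengths(baskets))  # average number of pages per user
```

Both quantities are the total number of (user, page) pairs divided by the number of users, which is why they coincide.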

Everything is ready for the analysis of Association Rules. However, a short digression would be handy to understand at least the basic concepts.

On Association Rules

Association Rules is a popular technique for discovering relations (i.e. frequent associations, connections) in a dataset.

One of the most popular applications is “Market Basket Analysis”, applied in commercial datasets for discovering products that are purchased together by the customers.

Apriori is a solution for mining association rules devised by Agrawal et al., 1995, which can solve the Market Basket Analysis problem in a computationally feasible way. One of the difficulties of the analysis is that the sets of items can be of any size; Apriori exploits the fact that if a set of elements appears frequently in the data, then any subset of elements of the set must be frequent as well. In the implementation, the algorithm starts evaluating sets of just one element in one pass through all the data, and then in every subsequent pass it looks for sets of increasing size.
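The level-wise idea can be sketched in a few lines of base R. This is only a toy illustration of the pruning principle on made-up baskets, not the arules implementation; the names baskets, min.support, support and all.frequent are assumptions for the example.

```r
# Hypothetical baskets over three items.
baskets <- list(
  c("A", "B", "C"),
  c("A", "B"),
  c("A", "C"),
  c("B", "C"),
  c("A", "B", "C")
)
min.support <- 0.5

# Share of baskets that contain every item of the set.
support <- function(itemset) {
  mean(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

# Pass 1: frequent single items.
items <- sort(unique(unlist(baskets)))
frequent <- as.list(items[vapply(items, support, numeric(1)) >= min.support])
all.frequent <- frequent

# Later passes: sets one item larger, built only from items that are still frequent.
k <- 1
while (length(frequent) > 0) {
  k <- k + 1
  pool <- sort(unique(unlist(frequent)))
  if (length(pool) < k) break
  candidates <- combn(pool, k, simplify = FALSE)
  keys <- vapply(frequent, paste, character(1), collapse = "|")
  # Prune: a candidate can only be frequent if all its (k-1)-subsets are frequent.
  candidates <- Filter(function(cand) {
    subsets <- combn(cand, k - 1, simplify = FALSE)
    all(vapply(subsets, paste, character(1), collapse = "|") %in% keys)
  }, candidates)
  # Count support only for the surviving candidates.
  frequent <- Filter(function(cand) support(cand) >= min.support, candidates)
  all.frequent <- c(all.frequent, frequent)
}

all.frequent
```

In this toy data the three single items and the three pairs pass the threshold, while the only triple {A, B, C} appears in 2 of 5 baskets and is dropped.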

As an output, the algorithm provides rules of the form A => B, where both A and B can be sets of elements in the dataset (can be also single elements). Three main properties about the rules need to be understood:

“support” is the percentage of times that all the items of A and B combined appear in the dataset
“confidence” is the percentage of times where B happens, provided that A has happened
“lift” is the ratio between the confidence of the rule and the support of B alone, i.e. how much more likely it is that A and B happen together than would be expected if A and B were independent (a lift of 1 would mean that A and B are actually independent)
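The three measures can be computed by hand on a handful of made-up baskets; the sketch below uses base R only, and the data and helper names (baskets, contains) are hypothetical.

```r
# Five hypothetical users and the pages each one visited.
baskets <- list(
  c("Internet Explorer", "Free Downloads"),
  c("Internet Explorer", "Free Downloads", "isapi"),
  c("Internet Explorer"),
  c("Free Downloads"),
  c("Support Desktop")
)

# TRUE for every basket that contains all items of the set.
contains <- function(itemset) {
  vapply(baskets, function(b) all(itemset %in% b), logical(1))
}

A <- "Internet Explorer"
B <- "Free Downloads"

support.AB <- mean(contains(c(A, B)))         # share of baskets with both A and B
confidence <- support.AB / mean(contains(A))  # share of the A baskets that also contain B
lift       <- confidence / mean(contains(B))  # confidence relative to the base rate of B

c(support = support.AB, confidence = confidence, lift = lift)
```

Here A and B appear together in 2 of 5 baskets, A appears in 3 and B in 3, so the support is 2/5, the confidence 2/3 and the lift 10/9.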

Let’s see the 3 concepts in play for an example in the dataset.

##   lhs                        rhs                support confidence     lift
## 1 {Internet Explorer,
##    isapi,
##    Windows Family of OSs} => {Free Downloads} 0.0104552  0.8382353 2.530409

This is the rule with the highest support that contains Free Downloads in the rhs, i.e. the right-hand term, or B in the explanation above. The A term is then {Internet Explorer, isapi, Windows Family of OSs}, where, by the way, isapi seems to be a particular kind of DLL which was quite frequently browsed in 1998 and still exists. For this rule, the 3 concepts read as follows:

“support” of 0.01 means that 1% of the sets of pages visited contain the pages Internet Explorer, isapi, Windows Family of OSs and Free Downloads
“confidence” of 0.84 means that 84% of the times that the user has browsed Internet Explorer, isapi and Windows Family of OSs, they have browsed also Free Downloads
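“lift” of 2.53 means that users who have browsed Internet Explorer, isapi and Windows Family of OSs are about 2.5 times as likely to browse Free Downloads as the average user (the lift would be 1 if Internet Explorer, isapi and Windows Family of OSs happened independently of Free Downloads)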
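As a sanity check, the figures of this rule can be turned back into user counts with plain arithmetic. The sketch below only reuses numbers printed in the post (the 32711 transactions and the support, confidence and lift of the rule); the variable names are made up.

```r
n.users    <- 32711       # number of transactions (users) in the dataset
support    <- 0.0104552   # support of the rule, as printed above
confidence <- 0.8382353   # confidence of the rule
lift       <- 2.530409    # lift of the rule

# Users who visited all four pages (lhs and rhs together).
users.A.and.B <- round(support * n.users)

# Users who visited the three lhs pages, whether or not they reached Free Downloads.
users.A <- round(users.A.and.B / confidence)

# Share of all users who visited Free Downloads, implied by confidence / lift.
support.B <- confidence / lift

c(users.A.and.B = users.A.and.B, users.A = users.A, support.B = support.B)
```

With these inputs the rule rests on 342 of the 408 users who visited the three lhs pages, and the implied share of users visiting Free Downloads is about 0.33.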

Further analysis on the dataset: itemsets

apriori allows mining Association Rules as described just above. However, as a first step for this kind of data it is a good idea to mine just frequent itemsets, which are sets of items that appear together, and thus only have “support”.

rules.tr2 <- apriori(msweb.trans,
                     parameter=list(minlen=2, support=0.0001, target="frequent"))

## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport support minlen maxlen
##          NA    0.1    1 none FALSE            TRUE   1e-04      2     10
##             target   ext
##  frequent itemsets FALSE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 3
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[284 item(s), 32711 transaction(s)] done [0.01s].
## sorting and recoding items ... [241 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [0.09s].
## writing ... [179806 set(s)] done [0.02s].
## creating S4 object  ... done [0.09s].

rules.tr2

## set of 179806 itemsets

rules.tr2.sorted <- sort(rules.tr2, by="support")
inspect(rules.tr2.sorted[1:5])

##      items                                  support
## 4677 {Free Downloads,Internet Explorer}     0.16080218
## 4667 {Free Downloads,Windows Family of OSs} 0.07792486
## 4674 {Free Downloads,isapi}                 0.07306411
## 4671 {Free Downloads,Products }             0.06123322
## 4676 {Free Downloads,Microsoft.com Search}  0.06043838

The page Free Downloads appears in the 5 itemsets with the highest support. Let’s dig deeper into the itemsets that contain Free Downloads using the function subset.

inspect(subset(rules.tr2.sorted, items %in% "Free Downloads")[1:10])

##       items                                        support
## 4677  {Free Downloads,Internet Explorer}           0.16080218
## 4667  {Free Downloads,Windows Family of OSs}       0.07792486
## 4674  {Free Downloads,isapi}                       0.07306411
## 4671  {Free Downloads,Products }                   0.06123322
## 4676  {Free Downloads,Microsoft.com Search}        0.06043838
## 4662  {Free Downloads,Support Desktop}             0.03640977
## 25376 {Free Downloads,Internet Explorer,Products } 0.03173244
## 4656  {Free Downloads,Knowledge Base}              0.03118217
## 25367 {Free Downloads,isapi,Windows Family of OSs} 0.03026505
## 25379 {Free Downloads,Internet Explorer,isapi}     0.02907279

Note that the itemset 25376 contains the items of the itemsets 4677 and 4671. A piece of code that removes such redundant itemsets is often copied and pasted across R tutorials, something like this:

remove_redundant <- function(rules) {
  subset.matrix <- is.subset(rules, rules)
  subset.matrix[lower.tri(subset.matrix, diag=TRUE)] <- NA
  redundant <- colSums(subset.matrix, na.rm=TRUE) >= 1
  return(rules[!redundant])
}
inspect(remove_redundant(subset(rules.tr2.sorted, items %in% "Free Downloads")[1:10]))

##      items                                  support
## 4677 {Free Downloads,Internet Explorer}     0.16080218
## 4667 {Free Downloads,Windows Family of OSs} 0.07792486
## 4674 {Free Downloads,isapi}                 0.07306411
## 4671 {Free Downloads,Products }             0.06123322
## 4676 {Free Downloads,Microsoft.com Search}  0.06043838
## 4662 {Free Downloads,Support Desktop}       0.03640977
## 4656 {Free Downloads,Knowledge Base}        0.03118217

However, when using this you need to take into account that the decision on which lines are removed, in this case 25376 instead of 4677 and 4671, is a bit arbitrary, and that you could be removing info that you are actually interested in. In the end, in the particular case of this example, the interesting question would be: which pages does a user visit together with the conversion page (i.e. the Free Downloads page)?

One can access the internal elements of the itemsets object provided by arules; use unclass to take a look at the structure,

all <- subset(rules.tr2.sorted, items %in% "Free Downloads")
unclass(all[1:5])
attr(all[1:5], "quality")  # dataframe with one column for support
attr(all[1:5], "items")    # this is an itemMatrix object that contains the list of itemsets
as(as(attr(all[1:5], "items"), "transactions"), "data.frame") # conversion to a data.frame

In a similar way, one could investigate which pages are visited together with another conversion, the Training page.

inspect(subset(rules.tr2.sorted, items %in% "Training")[1:10])

##       items                                 support
## 4146  {isapi,Training}                      0.009690930
## 4147  {Microsoft.com Search,Training}       0.005839014
## 4149  {Free Downloads,Training}             0.004493901
## 4143  {Support Desktop,Training}            0.003393354
## 4145  {Products ,Training}                  0.003362783
## 4148  {Internet Explorer,Training}          0.003271071
## 21280 {isapi,Microsoft.com Search,Training} 0.003026505
## 4144  {Training,Windows Family of OSs}      0.002690227
## 21282 {Free Downloads,isapi,Training}       0.002353948
## 21276 {isapi,Products ,Training}            0.001987099

Let’s figure out as well the most common itemsets related to technical support pages.

supp_terms <- as.character(attribute.lines$title[attribute.lines$title %like% "Support"]) # all pages that contain the keyword
inspect(subset(rules.tr2.sorted, items %in% supp_terms)[1:10])

##       items                                           support
## 4659  {isapi,Support Desktop}                         0.05942955
## 4650  {Knowledge Base,Support Desktop}                0.05521079
## 4660  {Microsoft.com Search,Support Desktop}          0.04857693
## 4638  {isapi,Windows95 Support}                       0.04607013
## 4662  {Free Downloads,Support Desktop}                0.03640977
## 25327 {isapi,Knowledge Base,Support Desktop}          0.03323041
## 4636  {Windows Family of OSs,Windows95 Support}       0.03283299
## 4658  {Products ,Support Desktop}                     0.03240500
## 4635  {Support Desktop,Windows95 Support}             0.02959249
## 25283 {isapi,Windows Family of OSs,Windows95 Support} 0.02873651

Further analysis on the dataset: rules

After the frequent itemsets, let’s mine now the rules provided by apriori, of the form A => B.

rules.tr3 <- apriori(msweb.trans,
                     parameter=list(minlen=2, support=0.0001, target="rules"))

## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport support minlen maxlen
##         0.8    0.1    1 none FALSE            TRUE   1e-04      2     10
##  target   ext
##   rules FALSE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 3
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[284 item(s), 32711 transaction(s)] done [0.01s].
## sorting and recoding items ... [241 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [0.10s].
## writing ... [281346 rule(s)] done [0.06s].
## creating S4 object  ... done [0.12s].

rules.tr3.sorted <- sort(rules.tr3, by="lift")
inspect(rules.tr3.sorted[1:5])

##   lhs                       rhs                 support confidence      lift
## 1 {SNA Support}          => {SNA Server}   0.0001528538        1.0 1362.9583
## 2 {SNA Support,
##    Support Desktop}      => {SNA Server}   0.0001222830        1.0 1362.9583
## 3 {Free Downloads,
##    isapi,
##    MS Schedule+ News,
##    Support Desktop}      => {MS Schedule+} 0.0001222830        0.8  872.2933
## 4 {Free Downloads,
##    isapi,
##    Knowledge Base,
##    MS Schedule+ News,
##    Support Desktop}      => {MS Schedule+} 0.0001222830        0.8  872.2933
## 5 {Latin America Region,
##    Windows95 Support}    => {Argentina}    0.0001222830        0.8  817.7750

inspect(sort(rules.tr3, by="support")[1:5])

##   lhs                                 rhs                                            support confidence     lift
## 1 {Windows95 Support}              => {isapi}                                     0.04607013  0.8414294 5.163977
## 2 {Windows 95}                     => {Windows Family of OSs}                     0.03243557  0.9146552 6.464841
## 3 {Windows Family of OSs,
##    Windows95 Support}              => {isapi}                                     0.02873651  0.8752328 5.371433
## 4 {SiteBuilder Network Membership} => {Internet Site Construction for Developers} 0.02729969  0.8045045 8.172716
## 5 {Free Downloads,
##    Windows95 Support}              => {isapi}                                     0.02464003  0.9056180 5.557912

Note that for such a low support threshold, it does not really make sense to order the rules by lift: these are rules that indeed have a very high lift, but that hardly ever apply. On the other hand, when the rules are sorted by support, the lift of the rules with the highest support is still well above 1.

Let’s find again the rules that involve Free Downloads in the right-hand side of the rule A => B. The apriori function accepts an appearance argument that restricts the output to the rules with a certain rhs.

inspect(sort(apriori(msweb.trans,
                     parameter=list(minlen=2, support=0.0001, target="rules"),
                     appearance=list(rhs="Free Downloads", default="lhs")), by="support")[1:5])

## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport support minlen maxlen
##         0.8    0.1    1 none FALSE            TRUE   1e-04      2     10
##  target   ext
##   rules FALSE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 3
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[284 item(s), 32711 transaction(s)] done [0.01s].
## sorting and recoding items ... [241 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [0.10s].
## writing ... [42229 rule(s)] done [0.01s].
## creating S4 object  ... done [0.03s].
##   lhs                        rhs                  support confidence     lift
## 1 {Internet Explorer,
##    isapi,
##    Windows Family of OSs} => {Free Downloads} 0.010455199  0.8382353 2.530409
## 2 {Internet Explorer,
##    Windows95 Support}     => {Free Downloads} 0.008009538  0.8061538 2.433564
## 3 {Internet Explorer,
##    isapi,
##    Windows95 Support}     => {Free Downloads} 0.007459264  0.8472222 2.557538
## 4 {Internet Explorer,
##    Windows Family of OSs,
##    Windows95 Support}     => {Free Downloads} 0.006542142  0.8492063 2.563528
## 5 {Internet Explorer,
##    isapi,
##    Windows Family of OSs,
##    Windows95 Support}     => {Free Downloads} 0.006144722  0.8815789 2.661252

Alternatively, as with the itemsets, one could also dig into the output object using subset and the %in% operator.

inspect(subset(rules.tr3.sorted, rhs %in% "Free Downloads")[1:5])

Something similar can be done for the rules that contain Training in the rhs.

inspect(sort(apriori(msweb.trans,
                     parameter=list(minlen=2, support=0.0001, target="rules"),
                     appearance=list(rhs="Training", default="lhs")), by="support")[1:5])

## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport support minlen maxlen
##         0.8    0.1    1 none FALSE            TRUE   1e-04      2     10
##  target   ext
##   rules FALSE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 3
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[284 item(s), 32711 transaction(s)] done [0.01s].
## sorting and recoding items ... [241 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [0.09s].
## writing ... [57 rule(s)] done [0.01s].
## creating S4 object  ... done [0.02s].
##   lhs                    rhs             support confidence     lift
## 1 {isapi,
##    Mastering Series,
##    Visual Basic}      => {Training} 0.0002139953  1.0000000 45.68575
## 2 {Free Downloads,
##    isapi,
##    Mastering Series}  => {Training} 0.0001834245  0.8571429 39.15922
## 3 {Free Downloads,
##    Internet Explorer,
##    Mastering Series}  => {Training} 0.0001834245  0.8571429 39.15922
## 4 {isapi,
##    Mastering Series,
##    Products }         => {Training} 0.0001528538  0.8333333 38.07146
## 5 {Free Downloads,
##    isapi,
##    Mastering Series,
##    Visual Basic}      => {Training} 0.0001528538  1.0000000 45.68575

Here the rules have considerable lift but very little support. Let’s review the first one in more detail:

support of 0.0002 means that isapi, Mastering Series, Visual Basic and Training appear together 0.02% of the time
confidence of 1 means that every time that a user visits the pages isapi, Mastering Series and Visual Basic, they visit the Training page as well
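lift of 45.69 means that users who visit isapi, Mastering Series and Visual Basic are about 46 times as likely to visit Training as the average user (it would be 1 if isapi, Mastering Series and Visual Basic were visited independently of Training); with a support this small, though, the figure rests on a handful of users and should be read with caution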

Take into account that the overall support of Training is pretty small, around 2%:

sum(unlist(msweb.items) %in% "Training")/length(msweb.items)

## [1] 0.02188866

thus the rules obtained will have even smaller support. Anyhow, they could give a good idea of what users are often interested in when they are looking for training on the web site.
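The small support of Training also explains why every rule in this listing has a large lift. A short arithmetic sketch, using only figures printed above (the variable names are made up):

```r
n.users          <- 32711        # number of transactions (users)
support.training <- 0.02188866   # share of users who visited Training, computed above
min.confidence   <- 0.8          # confidence threshold shown in the apriori parameter listing

# Lift of the first rule: its confidence (1.0) divided by the support of the rhs.
lift.rule1 <- 1.0 / support.training

# Smallest lift that any rule with Training in the rhs can have at this confidence threshold.
min.lift <- min.confidence / support.training

# Number of users behind the first rule (support 0.0002139953).
users.rule1 <- round(0.0002139953 * n.users)

c(lift.rule1 = lift.rule1, min.lift = min.lift, users.rule1 = users.rule1)
```

Since lift is confidence divided by the support of the rhs, any rule that passes the 0.8 confidence threshold with a rhs visited by about 2% of users ends up with a lift above 36, even when it is backed by only a few users.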

Finally, these would be the rules for users looking for terms related to technical support.

inspect(sort(subset(rules.tr3.sorted, rhs %in% supp_terms), by="support")[1:5])

##   lhs                                      rhs                   support confidence     lift
## 1 {Support Network Program Information} => {Support Desktop} 0.008957232  0.8542274 6.277833
## 2 {isapi,
##    Support Network Program Information} => {Support Desktop} 0.005441595  0.9035533 6.640335
## 3 {isapi,
##    Knowledge Base,
##    Microsoft.com Search,
##    Products }                           => {Support Desktop} 0.005013604  0.8039216 5.908128
## 4 {Knowledge Base,
##    Support Network Program Information} => {Support Desktop} 0.004310477  0.9276316 6.817290
## 5 {isapi,
##    Knowledge Base,
##    NT Server Support}                   => {Support Desktop} 0.003760203  0.8310811 6.107727

Conclusions & getting further

Apriori could be a good exploratory tool for digging into the structure of the web site as seen by the users. If a set of pages of the web site can be seen as conversions of any kind, then the itemsets and the rules may allow you to understand what other pages the user with the conversion has also visited.

However, note that throughout this post there is the assumption that a user visiting a page means that they are interested in it, which is not necessarily so. Sticking to Google Analytics, there’s additional info that you could use to try to assess interest:

info on how many times a user has visited a page
assuming the page is not the exit page (the last page in the session), you could get the time that the user has spent on it
when the page is actually the bounce page, it could be an indication of lack of interest from the user
using ga:previousPagePath, as in the queries above, you could get the full picture on every visit; order of page visits could matter
also, depending on the web page, you could add other events to be tracked that could become additional hints
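As a minimal sketch of the first hint, the pageviews returned by a query like query.coll could be added up per user and page. The toy data frame below is made up and only mimics the dimension1, pagePath and pageviews columns of that query; the threshold of 2 views is an arbitrary assumption.

```r
# Hypothetical rows with the same columns as the GA query output.
ga.toy <- data.frame(
  dimension1 = c("u1", "u1", "u1", "u2", "u2"),
  pagePath   = c("/a.html", "/a.html", "/b.html", "/a.html", "/c.html"),
  pageviews  = c(1, 2, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Total pageviews per user and page.
views <- aggregate(pageviews ~ dimension1 + pagePath, data = ga.toy, FUN = sum)

# Keep only the pages a user came back to (arbitrary cut-off of 2 views).
repeat.visits <- views[views$pageviews >= 2, ]

repeat.visits
```

A table like views could then serve as a crude interest score per user and page, instead of the plain visited / not visited information used by apriori.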

Anyway, apriori should be taken as an initial exploratory tool, and cannot use any measure of interest from the user. When introducing some measure of the interest of the user on each page, something more similar to a collaborative filtering approach could be set up.

And finally, R code for transforming the original dataset into a structure similar to GA’s

Look at this link for details on how the dataset is structured. It comes from the logs of the machines, and:

lines starting with A are attribute lines that contain the tree structure of the web site
lines starting with C are case lines, correspond to a user, and are followed by vote lines, which are visits from that same user

The number of fields on each line depends on the type of line; a simple way to read the file is to supply column names,

msweb.orig <- read.csv("~/Downloads/clickstream/anonymous-msweb.data",
                       header=FALSE,
                       sep=",",
                       col.names=c("V1", "V2", "V3", "V4", "V5", "V6"))
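
To see why supplying col.names helps, here is a minimal sketch on three made-up lines that mimic the layout described above (they are illustrative, not read from the real file). Since col.names has six entries, read.csv builds six columns and shorter lines are padded out to that width.

```r
# Three illustrative lines: an attribute line, a case line and a vote line
sample.lines <- c('A,1287,1,"International AutoRoute","/autoroute"',
                  'C,"10001",10001',
                  'V,1000,1')

# Same call as for the real file, but reading from a string instead of a path
sample.df <- read.csv(text = paste(sample.lines, collapse = "\n"),
                      header = FALSE,
                      sep = ",",
                      col.names = c("V1", "V2", "V3", "V4", "V5", "V6"))
dim(sample.df)        # three rows, six columns
sample.df$V6          # no line has a sixth field, so this column is all NA
```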

The attribute lines are pretty straightforward to process,

attribute.lines <- msweb.orig[msweb.orig$V1 == "A", c("V2", "V4", "V5")]
colnames(attribute.lines) <- c("id", "title", "url")
head(attribute.lines)

##      id                           title         url
## 8  1287         International AutoRoute  /autoroute
## 9  1288                         library    /library
## 10 1289 Master Chef Product Information /masterchef
## 11 1297                 Central America   /centroam
## 12 1215        For Developers Only Info  /developer
## 13 1279                 Multimedia Golf     /msgolf

However, the case and vote lines require some post-processing to convert them to a structure similar to the output of Google Analytics.

casevote.lines <- msweb.orig[msweb.orig$V1 == "C" | msweb.orig$V1 == "V", c("V1", "V2")]
head(casevote.lines)

##     V1    V2
## 302  C 10001
## 303  V  1000
## 304  V  1001
## 305  V  1002
## 306  C 10002
## 307  V  1001

The transformation is not really complicated: after a C line for a user, all the V lines until the next C correspond to pages visited by that user. However, a direct row-by-row implementation of that logic can take a long time to run (up to hours). The implementation below uses data.table and its function shift inside a loop, and runs in a few seconds: shifting by one row resolves the parent of the first visit after each C line, shifting by two rows resolves the parent of the second visit, and so on until every row has a parent.

library(data.table) # provides data.table() and shift(); harmless if already loaded
casevote.lines$rownames <- as.numeric(rownames(casevote.lines)) # add indexes as a new column
dt <- data.table(casevote.lines[, !(names(casevote.lines)
                                    %in% "parent")]) # convert to data.table, excluding the parent column from a previous trial
head(dt)

##    V1    V2 rownames
## 1:  C 10001      302
## 2:  V  1000      303
## 3:  V  1001      304
## 4:  V  1002      305
## 5:  C 10002      306
## 6:  V  1001      307

dt[V1 == "C", parent:=rownames, by=rownames] # parents of C lines are themselves
difference <- 1
while (sum(is.na(dt$parent)) > 0) {
  dt[, shiftV1 := shift(V1, difference)] # V1 of the row located "difference" rows above
  dt[shiftV1 == "C" & is.na(parent),
     parent:=(rownames-difference), by=rownames] # rows still without parent that sit "difference" rows below a C line
  difference <- difference + 1
}
casevote.lines$parent <- dt$parent
head(casevote.lines)

##     V1    V2 rownames parent
## 302  C 10001      302    302
## 303  V  1000      303    302
## 304  V  1001      304    302
## 305  V  1002      305    302
## 306  C 10002      306    306
## 307  V  1001      307    306
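
As a cross-check, the same parent assignment can also be sketched without a loop: every C line opens a new group, so a cumulative sum over the C flags numbers the groups, and each row takes the rowname of the C line that opens its group. This is only an alternative sketch on a toy data frame (the `toy` object is made up to mirror the rows shown above), and it assumes the rows are in file order and that the first row is a C line.

```r
# Toy rows in the same layout as casevote.lines (illustrative copy of the head above)
toy <- data.frame(V1 = c("C", "V", "V", "V", "C", "V"),
                  V2 = c(10001, 1000, 1001, 1002, 10002, 1001),
                  rownames = 302:307,
                  stringsAsFactors = FALSE)

# Each C line starts a new group; cumsum numbers the groups 1, 2, ...
group <- cumsum(toy$V1 == "C")

# The parent of a row is the rowname of the C line that opens its group
toy$parent <- toy$rownames[toy$V1 == "C"][group]
toy
```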

After that, a simple merge gives a line per user and page visited,

clicks <- merge(x=casevote.lines[casevote.lines$V1 == "C", c("V2", "parent")],
                y=casevote.lines[casevote.lines$V1 == "V", c("V2", "parent")],
                by="parent")
colnames(clicks) <- c("remove", "userId", "attributeId")
head(clicks)

##   remove userId attributeId
## 1    302  10001        1000
## 2    302  10001        1001
## 3    302  10001        1002
## 4    306  10002        1001
## 5    306  10002        1003
## 6    309  10003        1001

And finally a merge with attribute.lines gives the pages visited in a way understandable by a human being.

msweb <- merge(x=clicks[,c("userId", "attributeId")],
               y=attribute.lines,
               by.x=c("attributeId"),
               by.y=c("id"))
msweb <- msweb[order(msweb$userId),]
head(msweb)

##      attributeId userId                  title      url
## 1           1000  10001                 regwiz  /regwiz
## 998         1001  10001        Support Desktop /support
## 5551        1002  10001 End User Produced View  /athome
## 1461        1001  10002        Support Desktop /support
## 8310        1003  10002         Knowledge Base      /kb
## 1463        1001  10003        Support Desktop /support

The resulting msweb is the starting point of the analysis above.

Full list of references

On Association Rules and Apriori, beyond the quite useful Wikipedia page, in well-known “bibles”:

– Hastie, T., Tibshirani R. & Friedman J., The Elements of Statistical Learning - has a quite accessible explanation

– Leskovec, J., Rajaraman A. & Ullman J., Mining of Massive Datasets - details on market basket analysis implementation for big volumes

On collaborative filtering:

– Hahsler M., recommenderlab: A Framework for Developing and Testing Recommendation Algorithms - a framework for recommendation algorithms, good summary

– Mobasher B., Dai H., Luo T. & Nakagawa M.: Improving the effectiveness of Collaborative Filtering on Anonymous Web Usage Data
