### Introduction & purpose

This is a follow-up to last month’s post on Association Rules applied to Google Analytics data. It can happen that you talk to a customer and tell them how this technique could be a nice first step for the analysis of their web site, and they show interest (yes!), but they haven’t set a custom GA dimension for identifying user sessions… so you cannot apply the technique, at least not as explained in last month’s post. It can happen, and it actually has happened. Of course, your customer has historic data covering months or even years, which makes the perfect starting point for your analysis, and you cannot waste that chance.

There are various possible approaches to figure out sessions from those unidentified registers; in this post you will find a detailed description and some code that has worked well in practice for a medium-sized dataset. Note that this solution, as shown here, infers sessions but not user sessions, i.e. if the same user has connected several times at different points in time, there will be no way to match those different sessions to the same user. Well, unless there is some info on the user.

Note that you (or your customer) could have info on the user (i.e. the custom id of the user is set in GA) and still face an apparently different problem which has the same kind of solution. For example, you may want to identify the different user sessions because you need to associate events to sessions, as in this StackOverflow question. You’ll recognize the accepted answer, since it is actually a summary of this post.

##### Dataset used

If you have a web site and GA active on it, the best option is to generate your own dataset with the following GA query:

query.coll.d <- Init(start.date=start_date.d,
                     end.date=end_date.d,
                     dimensions=c("ga:date",
                                  "ga:hour",
                                  "ga:minute",
                                  "ga:sourceMedium",
                                  "ga:pagePath",
                                  "ga:previousPagePath",
                                  "ga:sessionDurationBucket"),
                     metrics=c("ga:sessions",
                               "ga:pageviews",
                               "ga:newUsers"),
                     sort="-ga:date,-ga:hour,-ga:minute",
                     max.results = 20000)  # obviously depends on the # of registers you want to process

ga.query.d <- QueryBuilder(query.coll.d)
ga.data.coll.d <- GetReportData(ga.query.d, token, paginate_query=TRUE)  # pagination is needed for a large amount of registers

If you don’t have the setup ready for accessing GA from R, you’ll find details on how to do it again in last month’s post.

If you don’t have access to any GA site but you still want to play with the code below, you can use this R data file, which contains data from this humble blog. You can’t expect much volume in the file, as the blog is just two months old at the time of this writing, but it should be enough for demo purposes.

load("GADeduceSessionsInput.Rdata")
ga.data.coll.d[1:2,]
##       date hour minute      sourceMedium
## 1 20160427   13     36 (direct) / (none)
## 2 20160426   17     16 (direct) / (none)
##                             pagePath                   previousPagePath
## 1 /2016-04-01-uplift-evaluation.html                         (entrance)
## 2 /2016-04-01-uplift-evaluation.html /2016-04-01-uplift-evaluation.html
##   sessionDurationBucket sessions pageviews newUsers          datetime
## 1                     0        1         1        0 (160427 13:36:00)
## 2                   194        0         1        0 (160426 17:16:00)
dim(ga.data.coll.d)
## [1] 140  11

### Figuring out the sessions

All the explanations below assume you are using exactly the GA query above, and thus the custom user id is not available.

The proposed procedure can be summarized as follows:

• first construct a timestamp for each record from ga:date, ga:hour and ga:minute

• then group the records by ga:sessionDurationBucket

• for each of the groups, in a loop: (a) take the most recent timestamp among the records not yet assigned and subtract from it the session duration, (ga:sessionDurationBucket/60+1)/(60*24) expressed as a fraction of a day, to get the “first” timestamp of the session, (b) assign to the session all unassigned records between that “first” timestamp and the most recent one, and (c) if there are records still unassigned, go back to (a) starting from the most recent timestamp of those remaining records
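To make steps (a) to (c) concrete, here is a minimal toy sketch of the loop: the timestamps are plain minute counts instead of chron datetimes, and the validation against the sessions column is left out, so only the window assignment is shown.

```r
# toy sketch of steps (a)-(c): timestamps as plain minute counts, sorted
# newest first (as in the GA query); one group with sessionDurationBucket = 90
ts      <- c(113, 113, 112, 111, 53, 52, 51)  # timestamps in minutes
window  <- ceiling(90 / 60) + 1               # session duration in minutes (+1 margin)
session <- rep(NA_integer_, length(ts))
id <- 0L
i  <- 1L
repeat {
  # (a)+(b): assign every still-unassigned record within the duration window
  idx <- (ts >= ts[i] - window) & is.na(session)
  id  <- id + 1L
  session[idx] <- id
  # (c): continue from the most recent record still unassigned
  rem <- which(is.na(session))
  if (length(rem) == 0) break
  i <- min(rem)
}
session
# [1] 1 1 1 1 2 2 2
```

Two inferred sessions: the first four records fall within 3 minutes of the newest timestamp, and the remaining three form a second group.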

At the end of the process, depending on your input data, not all the registers may end up assigned to a session. It’s usually just a negligible number of registers that are left out; more details below.

An additional validation can be done: in the normal case, within a session bucket the column sessions has the value 1 for the record belonging to the entrance page and 0 for the rest of the records; thus one can check that the set of records the loop assigns to a session belongs to one session only. For example:

ga.data.coll.d[ga.data.coll.d$sessionDurationBucket == 90,]
##        date hour minute     sourceMedium
## 61 20160412   23     53 google / organic
## 62 20160412   23     53 google / organic
## 63 20160412   23     53 google / organic
## 64 20160412   23     53 google / organic
## 65 20160412   23     52 google / organic
## 66 20160412   23     51 google / organic
## 67 20160412   23     51 google / organic
##                              pagePath                   previousPagePath
## 61 /2016-04-01-uplift-evaluation.html                        /about.html
## 62                        /about.html                      /credits.html
## 63                      /credits.html                        /about.html
## 64                        /index.html /2016-04-01-uplift-evaluation.html
## 65                        /about.html /2016-04-01-uplift-evaluation.html
## 66                                  /                         (entrance)
## 67 /2016-04-01-uplift-evaluation.html                                  /
##    sessionDurationBucket sessions pageviews newUsers          datetime
## 61                    90        0         1        0 (160412 23:53:00)
## 62                    90        0         1        0 (160412 23:53:00)
## 63                    90        0         1        0 (160412 23:53:00)
## 64                    90        0         1        0 (160412 23:53:00)
## 65                    90        0         1        0 (160412 23:52:00)
## 66                    90        1         1        0 (160412 23:51:00)
## 67                    90        0         1        0 (160412 23:51:00)

This in fact gives a clue on when the algorithm will make a mistake. In your input dataset, if there are several sessions that:

• started exactly at the same timestamp (up to the minute)

• had the same sessionDurationBucket, i.e. the same session duration

• and come from the same sourceMedium

those registers will not be split correctly. In practice, this usually happens with several sessions of duration 0 that enter the site at exactly the same time, possibly coming from a bot. Anyhow, if you figure out or experience any other situation in which this may happen, please post a comment!

##### The R code

Let’s explain it in more detail with the actual code. First the timestamp calculation:

require(chron)
ga.data.coll.d$datetime <- chron(ga.data.coll.d$date,
                                 paste(ga.data.coll.d$hour, ga.data.coll.d$minute, '00', sep=":"),
                                 format=c("ymd","h:m:s"))

The following function uses a global variable in R (userId) for generating and storing a unique id, and it will be used by the loop. The global variable is needed since, in each of the passes through the loop, there is no visibility of the last id that was used.

getNewId <- function() {
    current <- get("userId", envir = .GlobalEnv)
    current <- current + 1
    assign("userId", current, envir = .GlobalEnv)
    return(current)
}

A variable to make the code more readable: 1 minute defined as the fraction of a day that it represents.

one_minute <- 1/(60*24)

And the main function, which contains the processing for each of the groups:

setUserId <- function(datetimes, sessions, sessionDurationBucket) {
    if (length(datetimes) == 1) {
        # special case: just one row received for this sessionDurationBucket
        return(getNewId())
    }
    if (sum(sessions) == 1) {
        # special case: just one session, all rows belong to the same user
        return(getNewId())
    }
    users <- rep(NA, length(datetimes))
    if (sessionDurationBucket == 0) {
        # special case: sessions of 1 page and length 0 => assign a user per session
        for (i in 1:length(datetimes)) {
            users[i] <- getNewId()
        }
        return(users)
    }
    # this is the generic case described in the paragraphs above
    # ... convert sessionDurationBucket to minutes
    minutes <- ceiling(as.numeric(sessionDurationBucket)/60) + 1
    i <- 1
    while(TRUE) {
        # ... step (a): from the list of input datetimes (sorted newest first),
        # the most recent one not assigned yet is datetimes[i];
        # the duration of the session is minutes*one_minute,
        # and index_lag is TRUE for the indexes from the oldest to the newest
        # timestamp within the session duration
        # (is.na(users) is used to get the records not assigned yet)
        index_lag <- (datetimes >= datetimes[i] - minutes*one_minute) & is.na(users)
        # ... check the sum of the column sessions for the indexes of the session
        sessions_lag <- sum(sessions[index_lag])
        if (sessions_lag == 1) {
            # ... step (b): assign the records to a new id
            users[index_lag] <- getNewId()
        } else {
            # ... number of sessions is not 1, keep those registers unassigned
            users[index_lag] <- 0
        }
        # ... step (c): check if there are users not yet assigned, and continue/break the loop
        rem_users <- which(is.na(users))
        if (length(rem_users) == 0) {
            break
        }
        i <- min(rem_users)
    }
    return(users)
}

And finally, just apply the function to the registers using data.table. Note that:

• the starting userId (i.e. the global variable) is set to 1, since the value 0 is reserved in the function setUserId for the registers with errors (those which cannot be identified properly)

• data.table groups the registers by sessionDurationBucket (and additionally by sourceMedium, so that there are fewer chances of error). The function setUserId receives as parameters (1) the datetimes of all the records with the same sessionDurationBucket (and sourceMedium), (2) analogously, the column sessions for all the registers with the same sessionDurationBucket (and sourceMedium), and (3) a number corresponding to the sessionDurationBucket

require(data.table)
ga.data.coll.dt <- data.table(ga.data.coll.d)
userId <- 1
ga.data.coll.dt[, userId:=setUserId(datetime, sessions, sessionDurationBucket),
                by=list(sessionDurationBucket, sourceMedium)]

Use the following to clean up the records that could not be assigned:

ga.data.coll.dt.clean <- ga.data.coll.dt[userId!=0]

And take a look at the results! The code below uses plyr to output some stats.

length(unique(ga.data.coll.dt.clean$userId))
## [1] 66
ga.data.coll.dt.clean[1:5,]
##        date hour minute      sourceMedium
## 1: 20160427   13     36 (direct) / (none)
## 2: 20160426   17     16 (direct) / (none)
## 3: 20160426   17     13 (direct) / (none)
## 4: 20160426   17     08 (direct) / (none)
## 5: 20160426   17     05 (direct) / (none)
##                              pagePath                   previousPagePath
## 1: /2016-04-01-uplift-evaluation.html                         (entrance)
## 2: /2016-04-01-uplift-evaluation.html /2016-04-01-uplift-evaluation.html
## 3: /2016-04-01-uplift-evaluation.html                         (entrance)
## 4: /2016-04-01-uplift-evaluation.html /2016-04-01-uplift-evaluation.html
## 5: /2016-04-01-uplift-evaluation.html                         (entrance)
##    sessionDurationBucket sessions pageviews newUsers          datetime
## 1:                     0        1         1        0 (160427 13:36:00)
## 2:                   194        0         1        0 (160426 17:16:00)
## 3:                   194        1         1        1 (160426 17:13:00)
## 4:                  1323        0         1        0 (160426 17:08:00)
## 5:                     0        1         1        1 (160426 17:05:00)
##    userId
## 1:      2
## 2:     19
## 3:     19
## 4:     20
## 5:      3
library(plyr)
count(ga.data.coll.dt.clean[,unique(sourceMedium),by=userId], "V1")
##                                                          V1 freq
## 1                                         (direct) / (none)   41
## 2               forum.topic75939736.6hopping.com / referral    2
## 3                            getrichquickly.info / referral    1
## 4                                          google / organic    6
## 5                                       linkedin / referral    1
## 6                                   linkedin.com / referral    8
## 7                                  lnkdin_groups / referral    2
## 8                                makeprogress.ga / referral    2
## 9                        smartphonediscount.info / referral    1
## 10 website-stealer-warning-alert.hdmoviecams.com / referral    2
counts <- count(ga.data.coll.dt.clean[,unique(sourceMedium),by=userId], "V1")
counts$percent <- counts$freq/sum(counts$freq)
counts[order(counts$freq, decreasing=TRUE),]
##                                                          V1 freq
## 1                                         (direct) / (none)   41
## 6                                   linkedin.com / referral    8
## 4                                          google / organic    6
## 2               forum.topic75939736.6hopping.com / referral    2
## 7                                  lnkdin_groups / referral    2
## 8                                makeprogress.ga / referral    2
## 10 website-stealer-warning-alert.hdmoviecams.com / referral    2
## 3                            getrichquickly.info / referral    1
## 5                                       linkedin / referral    1
## 9                        smartphonediscount.info / referral    1
##       percent
## 1  0.62121212
## 6  0.12121212
## 4  0.09090909
## 2  0.03030303
## 7  0.03030303
## 8  0.03030303
## 10 0.03030303
## 3  0.01515152
## 5  0.01515152
## 9  0.01515152
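If you prefer not to load plyr, the same kind of per-source counts can be obtained with base R. A minimal sketch on toy userId/sourceMedium pairs (made-up values, not the real dataset):

```r
# base R alternative to plyr::count: one sourceMedium per userId, then tabulate
users <- data.frame(userId       = c(1, 1, 2, 3),
                    sourceMedium = c("google / organic", "google / organic",
                                     "(direct) / (none)", "(direct) / (none)"),
                    stringsAsFactors = FALSE)
per_user <- unique(users[, c("userId", "sourceMedium")])  # one row per user
freq <- table(per_user$sourceMedium)                      # users per source/medium
freq / sum(freq)                                          # and the proportions
```

Here user 1 counts once for google / organic even though it has two registers, mirroring the unique(sourceMedium) by userId step above.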

And finally, if your aim is to apply Association Rules to the registers, the following code will convert them into transactions for arules.

site.items <- ga.data.coll.dt.clean[, list(pagePath, userId)]
site.items[, l:=.(list(unique(pagePath))), by=userId]   # creates list of pages per user
site.items <- site.items[! duplicated(userId), l]       # removes duplicated lines per user and selects only the list of pages
head(site.items, 3)
## [[1]]
## [1] "/2016-04-01-uplift-evaluation.html"
##
## [[2]]
## [1] "/2016-04-01-uplift-evaluation.html"
##
## [[3]]
## [1] "/2016-04-01-uplift-evaluation.html"
## [2] "/index.html"
## [3] "/"
length(site.items)
## [1] 66
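As an aside, the same list of unique pages per user could also be built without data.table. A small base R sketch on toy data, with column names mirroring the ones above:

```r
# base R equivalent of the two data.table steps: unique pagePath values per userId
df <- data.frame(userId   = c(1, 1, 2, 2, 2),
                 pagePath = c("/", "/", "/", "/about.html", "/"),
                 stringsAsFactors = FALSE)
items <- lapply(split(df$pagePath, df$userId), unique)
items[["2"]]
# [1] "/"           "/about.html"
```

split groups the pages by user and unique removes the repeated views, which is exactly what arules needs as input.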
library(arules)
site.items <- as(site.items, "transactions") # converts it to transactions
inspect(site.items[1:3])
##   items
## 1 {/2016-04-01-uplift-evaluation.html}
## 2 {/2016-04-01-uplift-evaluation.html}
## 3 {/,/2016-04-01-uplift-evaluation.html,/index.html}
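From here, a hypothetical last step would be to mine the rules with apriori. The thresholds below are illustrative only, and a tiny stand-in transactions object is used so that the snippet runs on its own; in practice you would pass the site.items object built above:

```r
library(arules)

# stand-in transactions so the sketch is self-contained; replace with site.items
trans <- as(list(c("/", "/index.html"),
                 c("/", "/index.html", "/about.html"),
                 c("/", "/about.html")), "transactions")

# illustrative thresholds; tune support/confidence to your data volume
rules <- apriori(trans, parameter = list(supp = 0.5, conf = 0.6, minlen = 2))
inspect(sort(rules, by = "lift"))
```

With the volume of this demo dataset, expect to lower the support threshold considerably before any interesting rule shows up.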