This is a follow-up to last month’s post on Association Rules applied to Google Analytics data. It can happen that you talk to a customer, tell them how this technique could be a nice first step for the analysis of their web site, they show interest (yes!)... and then it turns out they haven’t set a custom GA dimension for identifying user sessions, so you cannot apply the technique, at least not as explained in last month’s post. It can happen, and it actually has happened. Of course, your customer has historical data covering months or even years, which makes the perfect starting point for your analysis, and you cannot waste that chance.
There are various possible approaches to infer sessions from those unidentified records; in this post you will find a detailed description and some code that has worked well in practice for a medium-sized dataset. Note that this solution, as shown here, infers sessions but not user sessions: if the same user has connected several times at different points in time, there will be no way to match those different sessions to the same user. Well, as long as there is no info on the user.
Note that you (or your customer) could have info on the user (i.e. the custom id of the user is set in GA) and still face an apparently different problem that has the same kind of solution. For example, you may want to identify the different user sessions because you need to associate events to sessions, as in this StackOverflow question. You’ll recognize the accepted answer, since it is actually a summary of this post.
If you have a web site with GA active on it, the best option is to generate your own dataset with the following GA query:
query.coll.d <- Init(start.date=start_date.d,
                     end.date=end_date.d,
                     dimensions=c("ga:date",
                                  "ga:hour",
                                  "ga:minute",
                                  "ga:sourceMedium",
                                  "ga:pagePath",
                                  "ga:previousPagePath",
                                  "ga:sessionDurationBucket"),
                     metrics=c("ga:sessions",
                               "ga:pageviews",
                               "ga:newUsers"),
                     table.id="XXXXXX", # your table id
                     sort="-ga:date,-ga:hour,-ga:minute",
                     max.results = 20000) # obviously depends on the # of records you want to process
ga.query.d <- QueryBuilder(query.coll.d)
ga.data.coll.d <- GetReportData(ga.query.d, token, paginate_query=TRUE) # pagination is needed for a large number of records
If you don’t have the setup ready for accessing GA from R, you’ll find details on how to do it in last month’s post.
If you don’t have access to any GA site but still want to play with the code below, you can use this R data file, which contains data from this humble blog. Don’t expect much volume in the file (the blog is just two months old at the time of this writing), but it should be enough for demo purposes.
load("GADeduceSessionsInput.Rdata")
ga.data.coll.d[1:2,]
## date hour minute sourceMedium
## 1 20160427 13 36 (direct) / (none)
## 2 20160426 17 16 (direct) / (none)
## pagePath previousPagePath
## 1 /2016-04-01-uplift-evaluation.html (entrance)
## 2 /2016-04-01-uplift-evaluation.html /2016-04-01-uplift-evaluation.html
## sessionDurationBucket sessions pageviews newUsers datetime
## 1 0 1 1 0 (160427 13:36:00)
## 2 194 0 1 0 (160426 17:16:00)
dim(ga.data.coll.d)
## [1] 140 11
All the explanations below assume you are using exactly the GA query above, and thus the custom user id is not available.
The proposed procedure can be summarized as follows:
- first, construct a timestamp for each record from ga:date, ga:hour and ga:minute
- then, group the records by ga:sessionDurationBucket
- for each of the groups (whose records arrive sorted from newest to oldest), in a loop: (a) take the most recent timestamp not yet assigned and subtract from it the session duration, (ceiling(ga:sessionDurationBucket/60)+1)/(60*24) days, which gives the earliest timestamp the session can contain; (b) assign to the session all unassigned records between those two timestamps; and (c) if there are records still unassigned, take the most recent of them as the new starting timestamp and go back to (a)
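To make the windowing concrete, here is a minimal, self-contained sketch of a single pass of steps (a) and (b) on hypothetical data. It uses plain day fractions instead of chron timestamps (a chron datetime is just a numeric day fraction, so the arithmetic is identical):

```r
one_minute <- 1/(60*24)

# hypothetical records, sorted from newest to oldest as the GA query returns them
datetimes <- c(10, 9, 3, 0) * one_minute
duration  <- 4   # minutes, i.e. ceiling(sessionDurationBucket/60) + 1
users     <- rep(NA, length(datetimes))

# (a) the most recent unassigned timestamp is datetimes[1];
#     the session window reaches back 'duration' minutes from it
index_lag <- (datetimes >= datetimes[1] - duration*one_minute) & is.na(users)

# (b) only the first two records fall inside the window
which(index_lag)  # 1 2
```

The real implementation below repeats this step until every record is either assigned to a session or flagged as ambiguous.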
At the end of the process, some records may remain unassigned to any session, depending on your input data. It’s usually a negligible number of records that are left out; more details below.
An additional validation can be done: in the normal case, within one session the column sessions will have the value 1 for the record belonging to the entrance page and 0 for the rest of the records; thus one can validate that the set of records the loop assigns to a session belongs to one session only. For example:
ga.data.coll.d[ga.data.coll.d$sessionDurationBucket == 90,]
## date hour minute sourceMedium
## 61 20160412 23 53 google / organic
## 62 20160412 23 53 google / organic
## 63 20160412 23 53 google / organic
## 64 20160412 23 53 google / organic
## 65 20160412 23 52 google / organic
## 66 20160412 23 51 google / organic
## 67 20160412 23 51 google / organic
## pagePath previousPagePath
## 61 /2016-04-01-uplift-evaluation.html /about.html
## 62 /about.html /credits.html
## 63 /credits.html /about.html
## 64 /index.html /2016-04-01-uplift-evaluation.html
## 65 /about.html /2016-04-01-uplift-evaluation.html
## 66 / (entrance)
## 67 /2016-04-01-uplift-evaluation.html /
## sessionDurationBucket sessions pageviews newUsers datetime
## 61 90 0 1 0 (160412 23:53:00)
## 62 90 0 1 0 (160412 23:53:00)
## 63 90 0 1 0 (160412 23:53:00)
## 64 90 0 1 0 (160412 23:53:00)
## 65 90 0 1 0 (160412 23:52:00)
## 66 90 1 1 0 (160412 23:51:00)
## 67 90 0 1 0 (160412 23:51:00)
This in fact gives a clue on when the algorithm will make a mistake. If the input dataset contains several sessions that:
- started at exactly the same timestamp (up to the minute),
- had the same sessionDurationBucket (i.e. the same session duration),
- and came from the same sourceMedium,
then those records will not be split correctly.
In practice, this usually happens with several sessions of duration 0 that enter the site at exactly the same time, possibly coming from a bot. Anyhow, if you figure out or experience any other situation in which this may happen, please post a comment!
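To make the ambiguity concrete, here is a hypothetical toy case (again with timestamps as plain day fractions): two sessions, each two records long, start in the same minute and share the same duration bucket. Both fall inside a single session window, so the window contains two entrance pages and there is no way to tell the sessions apart:

```r
one_minute <- 1/(60*24)

# two overlapping sessions; sessions == 1 marks an entrance page
datetimes <- c(5, 5, 4, 4) * one_minute
sessions  <- c(1, 0, 1, 0)
duration  <- 2   # minutes

window <- datetimes >= datetimes[1] - duration*one_minute
sum(sessions[window])  # 2 => more than one session in the window, can't be split
```

This is exactly the situation in which the main function flags the records with a userId of 0.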
Let’s explain it in more detail with the actual code. First, the timestamp calculation:
require(chron)
ga.data.coll.d$datetime <- chron(ga.data.coll.d$date,
                                 paste(ga.data.coll.d$hour,
                                       ga.data.coll.d$minute,
                                       '00',
                                       sep=":"),
                                 format=c("ymd","h:m:s"))
The following function uses an R global variable (userId) for generating and storing a unique id, and it will be used by the main loop. The global variable is needed because each pass through the loop has no visibility of the last id that was used.
getNewId <- function() {
  current <- get("userId", envir = .GlobalEnv)
  current <- current + 1
  assign("userId", current, envir = .GlobalEnv)
  return(current)
}
A variable to make the code more readable: 1 minute, expressed as the fraction it represents of 1 day.
one_minute <- 1/(60*24)
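For illustration, this is how a duration bucket in seconds maps to a window width in these units; 194 is one of the sessionDurationBucket values in the sample data above, and the rounding matches the conversion used in the main function:

```r
one_minute <- 1/(60*24)

# 194 seconds => rounded up to 4 minutes, plus 1 minute of margin
minutes <- ceiling(194/60) + 1
minutes               # 5
minutes * one_minute  # ~0.00347, the same span expressed as a fraction of a day
```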
And the main function, which contains the processing for each of the groups:
setUserId <- function(datetimes, sessions, sessionDurationBucket) {
  if (length(datetimes) == 1) {
    # special case: just one row received for this sessionDurationBucket
    return(getNewId())
  }
  if (sum(sessions) == 1) {
    # special case: just one session, all rows belong to the same session
    return(getNewId())
  }
  users <- rep(NA, length(datetimes))
  if (sessionDurationBucket == 0) {
    # special case: sessions of 1 page and length 0 => assign a user per session
    for (i in 1:length(datetimes)) {
      users[i] <- getNewId()
    }
    return(users)
  }
  # this is the generic case described in the paragraphs above
  # ... convert sessionDurationBucket to minutes (rounded up, plus 1 of margin)
  minutes <- ceiling(as.numeric(sessionDurationBucket)/60) + 1
  i <- 1
  while(TRUE) {
    # ... step (a): the records arrive sorted from newest to oldest, so the
    #     most recent datetime not yet assigned is datetimes[i]; the session
    #     window reaches back minutes*one_minute from it, and index_lag is
    #     TRUE for the unassigned records inside that window
    #     (is.na(users) filters out the records already assigned)
    index_lag <- (datetimes >= datetimes[i] - minutes*one_minute) & is.na(users)
    # ... count the sessions (entrance pages) that fall inside the window
    sessions_lag <- sum(sessions[index_lag])
    if (sessions_lag == 1) {
      # ... step (b): assign the records to a new id
      users[index_lag] <- getNewId()
    } else {
      # ... the number of sessions in the window is not 1 => flag those
      #     records with 0, since they cannot be assigned unambiguously
      users[index_lag] <- 0
    }
    # ... step (c): check if there are records not yet assigned, and continue/break
    rem_users <- which(is.na(users))
    if (length(rem_users) == 0) {
      break
    }
    i <- min(rem_users)
  }
  return(users)
}
And finally, just apply the function to the records using data.table. Note that:
- the starting userId (i.e. the global variable) is set to 1, since the value 0 is reserved in setUserId for the records with error (those that cannot be identified properly)
- data.table groups the records by sessionDurationBucket (and additionally by sourceMedium, so that there are fewer chances of error). The function setUserId receives as parameters: (1) the datetimes of all the records with the same sessionDurationBucket (and sourceMedium); (2) analogously, the sessions column for those same records; and (3) the number corresponding to the sessionDurationBucket
require(data.table)
ga.data.coll.dt <- data.table(ga.data.coll.d)
userId <- 1
ga.data.coll.dt[, userId:=setUserId(datetime, sessions, sessionDurationBucket), by=list(sessionDurationBucket, sourceMedium)]
Use the following to clean up the records that could not be assigned:
ga.data.coll.dt.clean <- ga.data.coll.dt[userId!=0]
And take a look at the results! The code below uses plyr to output some stats.
length(unique(ga.data.coll.dt.clean$userId))
## [1] 66
ga.data.coll.dt.clean[1:5,]
## date hour minute sourceMedium
## 1: 20160427 13 36 (direct) / (none)
## 2: 20160426 17 16 (direct) / (none)
## 3: 20160426 17 13 (direct) / (none)
## 4: 20160426 17 08 (direct) / (none)
## 5: 20160426 17 05 (direct) / (none)
## pagePath previousPagePath
## 1: /2016-04-01-uplift-evaluation.html (entrance)
## 2: /2016-04-01-uplift-evaluation.html /2016-04-01-uplift-evaluation.html
## 3: /2016-04-01-uplift-evaluation.html (entrance)
## 4: /2016-04-01-uplift-evaluation.html /2016-04-01-uplift-evaluation.html
## 5: /2016-04-01-uplift-evaluation.html (entrance)
## sessionDurationBucket sessions pageviews newUsers datetime
## 1: 0 1 1 0 (160427 13:36:00)
## 2: 194 0 1 0 (160426 17:16:00)
## 3: 194 1 1 1 (160426 17:13:00)
## 4: 1323 0 1 0 (160426 17:08:00)
## 5: 0 1 1 1 (160426 17:05:00)
## userId
## 1: 2
## 2: 19
## 3: 19
## 4: 20
## 5: 3
library(plyr)
count(ga.data.coll.dt.clean[,unique(sourceMedium),by=userId], "V1")
## V1 freq
## 1 (direct) / (none) 41
## 2 forum.topic75939736.6hopping.com / referral 2
## 3 getrichquickly.info / referral 1
## 4 google / organic 6
## 5 linkedin / referral 1
## 6 linkedin.com / referral 8
## 7 lnkdin_groups / referral 2
## 8 makeprogress.ga / referral 2
## 9 smartphonediscount.info / referral 1
## 10 website-stealer-warning-alert.hdmoviecams.com / referral 2
counts <- count(ga.data.coll.dt.clean[,unique(sourceMedium),by=userId], "V1")
counts$percent <- counts$freq/sum(counts$freq)
counts[order(counts$freq, decreasing=TRUE),]
## V1 freq
## 1 (direct) / (none) 41
## 6 linkedin.com / referral 8
## 4 google / organic 6
## 2 forum.topic75939736.6hopping.com / referral 2
## 7 lnkdin_groups / referral 2
## 8 makeprogress.ga / referral 2
## 10 website-stealer-warning-alert.hdmoviecams.com / referral 2
## 3 getrichquickly.info / referral 1
## 5 linkedin / referral 1
## 9 smartphonediscount.info / referral 1
## percent
## 1 0.62121212
## 6 0.12121212
## 4 0.09090909
## 2 0.03030303
## 7 0.03030303
## 8 0.03030303
## 10 0.03030303
## 3 0.01515152
## 5 0.01515152
## 9 0.01515152
And finally, if your aim is to apply Association Rules to the records, the following code will convert them to transactions for arules.
site.items <- ga.data.coll.dt.clean[, list(pagePath, userId)]
site.items[, l:=.(list(unique(pagePath))), by=userId] # creates list of pages per user
site.items <- site.items[! duplicated(userId), l] # removes duplicated lines per user and selects only the list of pages
head(site.items, 3)
## [[1]]
## [1] "/2016-04-01-uplift-evaluation.html"
##
## [[2]]
## [1] "/2016-04-01-uplift-evaluation.html"
##
## [[3]]
## [1] "/2016-04-01-uplift-evaluation.html"
## [2] "/index.html"
## [3] "/"
length(site.items)
## [1] 66
library(arules)
site.items <- as(site.items, "transactions") # converts it to transactions
inspect(site.items[1:3])
## items
## 1 {/2016-04-01-uplift-evaluation.html}
## 2 {/2016-04-01-uplift-evaluation.html}
## 3 {/,/2016-04-01-uplift-evaluation.html,/index.html}