Association Rules is a powerful technique for the exploratory analysis of any dataset with categorical values. It finds categories that often occur together. It has been applied to market basket analysis, i.e. finding out what customers usually buy together. And in a previous post, it was applied to clickstream data to discover which web pages are usually visited together.
As powerful a technique as Association Rules is, its results are usually a bit difficult to communicate (at least, that has been the experience of the writer of this post, who has used the technique quite a few times professionally). However, at the Spanish R users event 2019, the netCoin package was presented in a practical lab and proved to be a good solution for generating a visualization of a frequent itemset analysis.
In this post, netCoin is applied to the 1998 Microsoft clickstream data available at the UCI ML repository, to which Association Rules was applied in a previous post.
netCoin for clickstream data

When running Association Rules on a dataset, the results are rules, which are quite descriptive and bring insights, but are a bit cumbersome to read. As an example, let's take a look at the 10 most important frequent itemsets (i.e. pages that are visited together) for the dataset of Microsoft's visits in 1998.
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## NA 0.1 1 none FALSE TRUE 5 0.01 2
## maxlen target ext
## 10 frequent itemsets FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 327
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[284 item(s), 32711 transaction(s)] done [0.01s].
## sorting and recoding items ... [47 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [150 set(s)] done [0.00s].
## creating S4 object ... done [0.01s].
## items support count
## [1] {Free Downloads,Internet Explorer} 0.16080218 5260
## [2] {Free Downloads,Windows Family of OSs} 0.07792486 2549
## [3] {Free Downloads,isapi} 0.07306411 2390
## [4] {Free Downloads,Products } 0.06123322 2003
## [5] {Free Downloads,Microsoft.com Search} 0.06043838 1977
## [6] {isapi,Support Desktop} 0.05942955 1944
## [7] {Knowledge Base,Support Desktop} 0.05521079 1806
## [8] {Internet Explorer,Microsoft.com Search} 0.05328483 1743
## [9] {Microsoft.com Search,Products } 0.04989147 1632
## [10] {Microsoft.com Search,Support Desktop} 0.04857693 1589
(For details on how to perform this analysis, take a look at the previous post.)
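For reference, here is a minimal sketch of the arules call that produces output like the one above; it assumes the msweb data frame built by the preparation code at the bottom of this post, and mirrors the parameter specification printed above (support 0.01, minlen 2, maxlen 10).
library(arules)
# One transaction per user; items are the page titles the user visited
# (assumes the msweb data frame created at the bottom of this post)
trans <- as(lapply(split(as.character(msweb$title), msweb$userId), unique),
            "transactions")
# Mine frequent itemsets with the parameters shown in the specification above
itemsets <- apriori(trans,
                    parameter=list(target="frequent itemsets",
                                   support=0.01, minlen=2, maxlen=10))
# The 10 itemsets with the highest support
inspect(head(sort(itemsets, by="support"), 10))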
Imagine yourself trying to explain these to a customer. It is doable, of course, but everything would certainly go more smoothly if you could first provide them with an image like the picture below, which shows the results of applying netCoin to the same dataset.
The information is in essence the same as what can be obtained from a frequent itemset analysis of the data (i.e. the results shown above), with the advantage that it can be easily grasped, and some qualitative conclusions can be drawn quickly. For example:
there are two main kinds of users: the first seek support on MS products (and free downloads? was that possible in 1998?), while the second are developers
there is additionally a smaller number of users interested in games
users looking for free downloads are mostly interested in Internet Explorer; when interested in an OS, they likely go to Windows95
users looking for products are interested in either Windows 95 or NT
developers are a smaller percentage of users, but they very likely visit a subset of pages sequentially
It seems natural to think about applying a clustering technique to the users of a web page. This could certainly be done by applying a community algorithm to the network graph that results from netCoin. The package is able to export the graph to igraph, and from there one could apply any community algorithm (a sketch of this appears later in this post).
The only difficulty would be adding the users themselves to the graph. But let's leave this for a further (hopefully interesting) post.
Applying netCoin to the clickstream dataset

The dataset is described in a previous post and is easily reproducible by running the code at the bottom of this page. It contains users' visits to web pages.
## attributeId userId title url
## 1 1000 10001 regwiz /regwiz
## 998 1001 10001 Support Desktop /support
## 5551 1002 10001 End User Produced View /athome
## 1461 1001 10002 Support Desktop /support
## 8310 1003 10002 Knowledge Base /kb
## 1463 1001 10003 Support Desktop /support
To apply netCoin, the first step is to convert this into an adjacency matrix, in which:
all the pages of the website are the columns of the matrix
each row of the matrix represents a visit from a user
values are 1 if the user visited the page and 0 otherwise
# One row per user, one column per page title; a cell is 1 if the user
# visited the page at least once, 0 otherwise
msweb.adj <- reshape2::dcast(msweb, as.formula("userId~title"),
                             value.var="title",
                             fun.aggregate=function(x) as.integer(length(x) >= 1))
# Make the page titles syntactically valid R column names
value_names <- names(msweb.adj)[names(msweb.adj) != "userId"]
value_names <- make.names(value_names)
names(msweb.adj) <- c("userId", value_names)
head(msweb.adj[, c("End.User.Produced.View", "Microsoft.com.Search", "Middle.East", "misc", "MS.Access",
"Norway", "regwiz", "SQL.Support", "Support.Desktop")])
## End.User.Produced.View Microsoft.com.Search Middle.East misc MS.Access Norway
## 1 1 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 1 0 0 0 0
## 4 0 0 0 0 0 1
## 5 0 0 0 1 0 0
## 6 0 1 0 0 0 0
## regwiz SQL.Support Support.Desktop
## 1 1 0 1
## 2 0 0 1
## 3 0 0 1
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
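As a quick sanity check, the same 0/1 matrix can be rebuilt with base R's table(); this is just a sketch, and it assumes both approaches end up with users and page titles in the same order.
# Cross-tabulate users against page titles and binarize the counts
tab <- table(msweb$userId, msweb$title)
msweb.adj2 <- as.data.frame.matrix((tab >= 1) * 1L)
names(msweb.adj2) <- make.names(names(msweb.adj2))
# Spot-check one cell: user 10001 did visit the Support Desktop page
msweb.adj2["10001", "Support.Desktop"]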
This adjacency matrix is the input from which netCoin calculates the coincidence matrix, which counts how many times two web pages are visited together. The diagonal simply holds how many times each page is visited.
library(netCoin)
C <- coin(msweb.adj[, value_names]) # coincidence matrix
C[1:3, 1:3]
## About.Microsoft. Access.Development ActiveX.Data.Objects
## About.Microsoft. 123 0 0
## Access.Development 0 215 0
## ActiveX.Data.Objects 0 0 7
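Since a coincidence count is just the number of rows in which two columns are both 1, the raw counts can also be reproduced with a plain matrix cross-product; a minimal check, which should match the counts shown above.
# t(A) %*% A on the 0/1 matrix yields co-visit counts off the diagonal;
# the diagonal holds each page's visit count
A <- as.matrix(msweb.adj[, value_names])
crossprod(A)[1:3, 1:3]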
From the coincidence matrix, it is easy to obtain a network that represents which events happen together.
N <- asNodes(C) # node data frame
E <- edgeList(C) # edge data frame
Net <- netCoin(N, E) # network object
And the package provides a handy visualization tool (although, as noted above, the graph can also be exported to igraph quite easily). The following line opens the visualization in your web browser.
plot(Net)
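As mentioned earlier, a community algorithm could then be applied after exporting to igraph. Here is a minimal sketch; it assumes the first column of N holds the node names and the first two columns of E hold the link endpoints (check names(N) and names(E) in your netCoin version).
library(igraph)
# Build an undirected graph from netCoin's node and edge data frames
g <- graph_from_data_frame(E[, 1:2], directed=FALSE,
                           vertices=N[, 1, drop=FALSE])
# Louvain community detection as an example of clustering the pages
communities <- cluster_louvain(g)
table(membership(communities))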
The graph in the image at the beginning of this post can be obtained from the visualization above by following these steps (a programmatic sketch follows the list):
filtering out the nodes with frequency smaller than 1000 (i.e. removing the pages that have been visited fewer than 1000 times)
filtering out the links with a Haberman value smaller than 13 (the Haberman value indicates how strong a link is; it was proposed a long time ago in this paper and is used by the netCoin package)
setting the size of the nodes to show their frequency
setting the color of the nodes to show their degree (i.e. how many connections a page has to other pages, or how often it is visited along with other pages)
setting the width of the links to show their Haberman value
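These steps are interactive, but a similar filtered view can be sketched programmatically. The column names below (frequency in N, Haberman in E) are assumptions that may differ across netCoin versions; inspect names(N) and names(E) first.
# Keep frequently visited pages and strong links, then rebuild the network
N.sub <- N[N$frequency >= 1000, ]
E.sub <- E[E$Haberman >= 13 &
           E[[1]] %in% N.sub[[1]] &
           E[[2]] %in% N.sub[[1]], ]
plot(netCoin(N.sub, E.sub))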
Finally, a recap of the preparation of the dataset, described in more detail in a previous post. That is, in order to reproduce the results presented here, you should first download the Anonymous Microsoft Web Data Data Set and run the following code.
library(data.table)
msweb.orig <- read.csv("~/Downloads/anonymous-msweb.data",
                       header=FALSE,
                       sep=",",
                       col.names=c("V1", "V2", "V3", "V4", "V5", "V6"))
# "A" lines describe the website pages (attributes)
attribute.lines <- msweb.orig[msweb.orig$V1 == "A", c("V2", "V4", "V5")]
colnames(attribute.lines) <- c("id", "title", "url")
head(attribute.lines)
# "C" lines start a user case; the "V" lines that follow are that user's visits
casevote.lines <- msweb.orig[msweb.orig$V1 == "C" | msweb.orig$V1 == "V", c("V1", "V2")]
head(casevote.lines)
casevote.lines$rownames <- as.numeric(rownames(casevote.lines)) # add indexes as a new column
dt <- data.table(casevote.lines[, !(names(casevote.lines)
                                    %in% "parent")]) # convert to a data.table, excluding the parent column from a previous trial
head(dt)
dt[V1 == "C", parent:=rownames, by=rownames] # parents of C lines are themselves
difference <- 1
while (sum(is.na(dt$parent)) > 0) {
dt[, shiftV1 := shift(V1, difference)] # shift by "difference"
dt[shiftV1 == "C" & is.na(parent),
parent:=(rownames-difference), by=rownames] # set parent value of visits with n. of pages == "difference"
difference <- difference + 1
}
casevote.lines$parent = dt$parent
head(casevote.lines)
# Pair each visit ("V" line) with its owning user ("C" line) via the parent index
clicks <- merge(x=casevote.lines[casevote.lines$V1 == "C", c("V2", "parent")],
                y=casevote.lines[casevote.lines$V1 == "V", c("V2", "parent")],
                by="parent")
colnames(clicks) <- c("remove", "userId", "attributeId")
head(clicks)
# Attach page titles and URLs to each click
msweb <- merge(x=clicks[,c("userId", "attributeId")],
               y=attribute.lines,
               by.x=c("attributeId"),
               by.y=c("id"))
msweb <- msweb[order(msweb$userId),]
head(msweb)
saveRDS(msweb, "msweb.rds")