Introduction & purpose

Association Rules is a powerful technique for exploratory analysis of any dataset with categorical values. It finds categories that often occur together. It has been applied to market basket analysis, i.e. finding what customers usually buy together, and in a previous post it was applied to clickstream data to discover which web pages are usually visited together.

As powerful as Association Rules is as a technique, its results are usually a bit difficult to communicate (at least, that is the experience of the writer of this post, who has used the technique quite a few times in their professional life). However, at the Spanish R users event 2019, the netCoin package was presented in a practical lab and proved to be a good solution for visualizing a frequent itemset analysis.

In this post, netCoin is applied to clickstream data from Microsoft in 1998, available at the UCI ML repository, to which Association Rules was applied in a previous post.

Results: visualization of frequent itemsets in a graph using netCoin for clickstream data

When running Association Rules on a dataset, the results are rules, which are really descriptive and bring insights, but are a bit cumbersome to read. As an example, let’s take a look at the 10 frequent itemsets with the highest support (i.e. the sets of pages most often visited together) for the dataset of Microsoft’s visits in 1998.

## Apriori
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##          NA    0.1    1 none FALSE            TRUE       5    0.01      2
##  maxlen            target   ext
##      10 frequent itemsets FALSE
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## Absolute minimum support count: 327
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[284 item(s), 32711 transaction(s)] done [0.01s].
## sorting and recoding items ... [47 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [150 set(s)] done [0.00s].
## creating S4 object  ... done [0.01s].
##      items                                    support    count
## [1]  {Free Downloads,Internet Explorer}       0.16080218 5260
## [2]  {Free Downloads,Windows Family of OSs}   0.07792486 2549
## [3]  {Free Downloads,isapi}                   0.07306411 2390
## [4]  {Free Downloads,Products }               0.06123322 2003
## [5]  {Free Downloads, Search}    0.06043838 1977
## [6]  {isapi,Support Desktop}                  0.05942955 1944
## [7]  {Knowledge Base,Support Desktop}         0.05521079 1806
## [8]  {Internet Explorer, Search} 0.05328483 1743
## [9]  { Search,Products }         0.04989147 1632
## [10] { Search,Support Desktop}   0.04857693 1589

(For details on how to perform this analysis, take a look at the previous post).
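To read the support column above: support is the share of transactions (here, user visits) that contain the itemset, i.e. count divided by the number of transactions. The following is a minimal sketch in base R that uses only numbers printed in the output above; the variable names are illustrative, and the rounding rule for the minimum count is inferred from the reported value rather than taken from the arules source.

```r
n_transactions <- 32711   # from "set transactions ... 32711 transaction(s)"
min_support    <- 0.01    # from the parameter specification

# Absolute minimum support count (assumed to be rounded down, as the output suggests)
min_count <- floor(min_support * n_transactions)

# Support of {Free Downloads, Internet Explorer}: count / number of transactions
support_top <- 5260 / n_transactions

print(min_count)
print(round(support_top, 8))
```

An itemset is therefore kept only if at least `min_count` visits contain all of its pages.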

Imagine yourself trying to explain these to a customer. It is doable of course, but everything would certainly go more smoothly if you could first show them an image like the picture below, which shows the results of applying netCoin to the same dataset.

The information is in essence the same as what a frequent itemset analysis of the data provides (i.e. the results shown above), with the advantage that it can be easily grasped and some qualitative conclusions can be drawn quickly. For example:

  • there are two main kinds of users: the first seek support on MS products (and free downloads? was that possible in 1998?), while the second are developers

  • there is additionally a smaller number of users interested in games

  • users looking for free downloads are mostly interested in Internet Explorer; when interested in an OS, they likely go to Windows 95

  • users looking for products are interested either in Windows 95 or NT

  • developers are a smaller percentage of users, but they very likely visit a subset of pages sequentially

Further steps

It seems natural to think about applying a clustering technique to the users of a website. This could certainly be done by applying a community detection algorithm to the network graph that results from netCoin. The package is able to export the graph to igraph, and from there one could apply any community detection algorithm.

The only difficulty would be to add the users themselves to the graph. But let’s leave this for a further (hopefully interesting) post.

R code

How to apply netCoin to the clickstream dataset

The dataset is described in a previous post and is easily reproducible by running the code at the bottom of this page. It contains users’ visits to web pages.

##      attributeId userId                  title      url
## 1           1000  10001                 regwiz  /regwiz
## 998         1001  10001        Support Desktop /support
## 5551        1002  10001 End User Produced View  /athome
## 1461        1001  10002        Support Desktop /support
## 8310        1003  10002         Knowledge Base      /kb
## 1463        1001  10003        Support Desktop /support

To apply netCoin, the first step is to convert this to an adjacency matrix, in which:

  • all the pages of the website are the columns of the matrix

  • each row of the matrix represents a visit from a user

  • values are 1 if the user visited the page and 0 otherwise
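The target layout can be illustrated on a toy example. This is a sketch in base R, not the code used for this post (which relies on reshape2::dcast); the data frame `toy` and its values are made up for illustration.

```r
# Hypothetical toy clickstream: one row per (user, page) visit
toy <- data.frame(
  userId = c(1, 1, 2, 3, 3, 3),
  title  = c("Support", "Downloads", "Support", "Downloads", "Games", "Games"),
  stringsAsFactors = FALSE
)

# table() counts visits per user and page; "> 0" turns the counts into
# 0/1 flags, so a page visited twice by the same user still gets a 1
counts  <- unclass(table(toy$userId, toy$title))
toy_adj <- (counts > 0) * 1L

print(toy_adj)
```

Rows are users, columns are pages, and repeated visits to the same page collapse into a single 1.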

msweb.adj <- reshape2::dcast(msweb, as.formula("userId~title"),
                             value.var="title", fun.aggregate=function(x) as.integer(length(x) >= 1))
value_names <- names(msweb.adj)[names(msweb.adj) != "userId"]
value_names <- make.names(value_names)
names(msweb.adj) <- c("userId", value_names)
head(msweb.adj[, c("End.User.Produced.View", "", "Middle.East", "misc", "MS.Access",
                   "Norway", "regwiz", "SQL.Support", "Support.Desktop")])
##   End.User.Produced.View Middle.East misc MS.Access Norway
## 1                      1                    0           0    0         0      0
## 2                      0                    0           0    0         0      0
## 3                      0                    1           0    0         0      0
## 4                      0                    0           0    0         0      1
## 5                      0                    0           0    1         0      0
## 6                      0                    1           0    0         0      0
##   regwiz SQL.Support Support.Desktop
## 1      1           0               1
## 2      0           0               1
## 3      0           0               1
## 4      0           0               0
## 5      0           0               0
## 6      0           0               0

This adjacency matrix is the input netCoin uses to calculate the coincidence matrix. The coincidence matrix records how many times two web pages are visited in the same visit. The diagonal entries simply represent how many times each page is visited.
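What the coincidence matrix contains can be checked by hand on a small example. The sketch below uses base R's crossprod() on a made-up 0/1 matrix; it illustrates the counting only and is not a description of how coin() is implemented internally.

```r
# Hypothetical 0/1 matrix: 4 users (rows) x 3 pages (columns)
A <- matrix(
  c(1, 1, 0,
    1, 0, 0,
    1, 1, 1,
    0, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("Support", "Downloads", "Games"))
)

# t(A) %*% A: off-diagonal entries count users who visited both pages,
# diagonal entries count users who visited the page at all
co <- crossprod(A)

print(co)
```

The result is symmetric, which is why a single edge per pair of pages is enough in the network built from it.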


library(netCoin)

C <- coin(msweb.adj[, value_names]) # coincidence matrix
C[1:3, 1:3]
##                      About.Microsoft. Access.Development ActiveX.Data.Objects
## About.Microsoft.                  123                  0                    0
## Access.Development                  0                215                    0
## ActiveX.Data.Objects                0                  0                    7

From the coincidence matrix, it is easy to obtain a network that represents these events that happen together.

N <- asNodes(C) # node data frame
E <- edgeList(C) # edge data frame

Net <- netCoin(N, E) # network object

The package also provides a handy visualization tool, although the graph can also be exported to igraph quite easily. The following line opens the visualization in your web browser.

plot(Net) # interactive graph, rendered as a web page

The graph for the image at the beginning of this post can be obtained from the visualization above by following these steps:

  • filtering out the nodes with a frequency smaller than 1000 (i.e. removing the pages that have been visited fewer times)

  • filtering out the links with a Haberman value smaller than 13 (the Haberman value indicates how strong the link is; it was proposed a long time ago in this paper and is used by the netCoin package)

  • setting the size of the nodes to show their frequency

  • setting the color of the nodes to show their degree (i.e. how many connections the page has to other pages, how often it is visited along with other pages)

  • setting the width of the links to show their Haberman value
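For intuition about the Haberman threshold, the sketch below computes the value for one pair of pages from four counts. It assumes the Haberman value is the adjusted standardized residual of the 2x2 presence/absence table, which is an assumption about the measure and not taken from netCoin's source; the function name `haberman_residual` and the example numbers are hypothetical.

```r
# n: number of visits; f_i, f_j: visits containing each page; f_ij: visits containing both
haberman_residual <- function(f_ij, f_i, f_j, n) {
  expected <- f_i * f_j / n                        # co-visits expected under independence
  (f_ij - expected) /
    sqrt(expected * (1 - f_i / n) * (1 - f_j / n)) # adjusted standard error
}

h <- haberman_residual(f_ij = 60, f_i = 100, f_j = 200, n = 1000)
print(h)

# Cross-check against the standardized residuals reported by chisq.test()
# (rows: page i present/absent, columns: page j present/absent)
tab <- matrix(c(60, 40, 140, 760), nrow = 2, byrow = TRUE)
print(chisq.test(tab, correct = FALSE)$stdres[1, 1])
```

Large positive values mean the two pages are visited together far more often than independence would predict, which is what the threshold on links selects for.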

Preparation of the dataset

Finally, a recap of the preparation of the dataset, described in more detail in a previous post. To reproduce the results presented here, first download the Anonymous Microsoft Web Data Data Set and then run the following code.


msweb.orig <- read.csv("~/Downloads/",
                       col.names=c("V1", "V2", "V3", "V4", "V5", "V6"))

attribute.lines <- msweb.orig[msweb.orig$V1 == "A", c("V2", "V4", "V5")]
colnames(attribute.lines) <- c("id", "title", "url")

casevote.lines <- msweb.orig[msweb.orig$V1 == "C" | msweb.orig$V1 == "V", c("V1", "V2")]

library(data.table)

casevote.lines$rownames <- as.numeric(rownames(casevote.lines)) # add row indexes as a new column
dt <- data.table(casevote.lines[, !(names(casevote.lines)
                                    %in% "parent")]) # convert to data.table, excluding the parent column from a previous trial

dt[V1 == "C", parent:=rownames, by=rownames] # parents of C lines are themselves
difference <- 1
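# the is.na() conditions are reconstructed from the loop's purpose: repeat until every line has a parent
while (sum(is.na(dt$parent)) > 0) {
  dt[, shiftV1 := shift(V1, difference)] # shift by "difference"
  dt[shiftV1 == "C" & is.na(parent),
     parent:=(rownames-difference), by=rownames] # set parent value of visits with n. of pages == "difference"
  difference <- difference + 1
}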
casevote.lines$parent = dt$parent

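clicks <- merge(x=casevote.lines[casevote.lines$V1 == "C", c("V2", "parent")],
                y=casevote.lines[casevote.lines$V1 == "V", c("V2", "parent")],
                by="parent") # join each visit to its case line (key reconstructed from the parent column built above)
colnames(clicks) <- c("remove", "userId", "attributeId")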

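msweb <- merge(x=clicks[,c("userId", "attributeId")],
               y=attribute.lines,
               by.x="attributeId", by.y="id") # attach title and url (arguments reconstructed from the columns shown earlier)
msweb <- msweb[order(msweb$userId),]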

saveRDS(msweb, "msweb.rds")
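
The parent-assignment loop above amounts to a forward fill: every V (vote) line belongs to the most recent C (case) line before it. As an alternative sketch under that assumption, the same grouping can be obtained in base R with cumsum(). The toy data frame `cv_lines` is made up and only mimics the V1/V2 layout used above; this is not the code used for the post.

```r
# Hypothetical C/V lines: C = start of a user's visit, V = a page seen in it
cv_lines <- data.frame(
  V1 = c("C", "V", "V", "C", "V", "C", "V", "V", "V"),
  V2 = c(10001, 1000, 1001, 10002, 1001, 10003, 1001, 1003, 1004),
  stringsAsFactors = FALSE
)

# cumsum() increases by one at every C line, so each V line inherits the
# index of the closest preceding C line
session <- cumsum(cv_lines$V1 == "C")
user_of_session <- cv_lines$V2[cv_lines$V1 == "C"]

is_visit <- cv_lines$V1 == "V"
clicks_toy <- data.frame(
  userId      = user_of_session[session[is_visit]],
  attributeId = cv_lines$V2[is_visit]
)

print(clicks_toy)
```

This avoids the explicit loop and the intermediate parent column, at the cost of relying on the file order of the C and V lines.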