Introduction & purpose

The purpose of this post is to perform a clustering of the data from a pulse oximeter, as a step in an exploratory analysis of the nightly data of a patient.

When one starts using a device such as a pulse oximeter, it is useful first of all for monitoring the quality of the patient's sleep, and for intervening if there is a problem. However, since the device captures data every night, one would also like to have an overview of what has happened over the last month (or year), whether there is an improvement or a deterioration… And eventually to start finding patterns in the data. A clustering looks like a good initial step, and for the series at hand, a distance based on Dynamic Time Warping seems suitable.

The result, while not really matching the expectations, is at least useful; this is a case in which a "spectral" clustering (obtained here with single linkage, as discussed below) describes the data much better than other linkage criteria like Ward, since it is not possible to obtain a nice and clean division of the series into clusters whose members are very similar to each other and different from the members of the other groups. Translated to the actual case: the patient is not having either very good or very bad nights, but rather a spectrum of nights that go from good, to not so good, to intermediate, and all the way down to bad and awful.

(And yes, instead of the accelerometer of previous posts, we’ve now switched to a pulse oximeter!)

Dataset used

The dataset consists of a total of 41 nights from two people, captured mostly around January and February of 2017. 39 nights belong to the patient of interest, and 2 additional nights of a healthy subject were added as a sanity check.

All the data was captured through the application in a github repository for a pulse oximeter monitor. This monitor provides graphs that can be viewed in a web browser in a different room from the one in which the patient is sleeping (very useful for monitoring at home), and additionally it stores all the data in a MongoDB database.

References

There are plenty of references on the web around the definition of Dynamic Time Warping as a distance. The classic example is voice recognition: it makes sense that the same word, even pronounced by the same person twice at different moments in time, will produce similar but not identical sound waves. A Euclidean distance may be able to get close: it would mean comparing the wave levels at analogous moments in time for the two words. However, what happens if the two words were uttered at very different speeds? When evaluating the distance, DTW decides which are the optimal moments in time at which to compare the wave levels. A good reference is the dtwclust package vignette, and if you want to see an example in which a sine and a cosine (thus, essentially the same wave, but with some offset) are compared using the DTW distance, just look at the help with ?dtw.
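To get a feeling for the alignment, here is a minimal sketch adapted from that very help page, comparing the two waves (the naive per-point comparison is shown alongside for contrast):

# minimal sketch adapted from the example in ?dtw
library(dtw)
idx <- seq(0, 6.28, len = 100)
query <- sin(idx)
reference <- cos(idx)              # essentially the same wave, with an offset
alignment <- dtw(query, reference, keep = TRUE)
alignment$normalizedDistance       # should be small: the warping absorbs the offset
mean(abs(query - reference))       # the naive per-point difference is much larger
plot(alignment, type = "twoway")   # visualize how the points were paired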

Additionally, for clustering with DTW, this one and this other one, a couple of questions on Cross Validated, have been useful; these are just two among many others.

Clustering and DTW

Once the data is clean and prepared, and the distance has been registered with proxy (see the R code at the bottom of the page), launching a hierarchical clustering is not really different from the usual.

The following code is for Ward linkage (it takes a few minutes: evaluating the DTW distance combined with hierarchical clustering is a bit heavy):

library(dtwclust)
ks.v4 <- c(3L, 8L, 18L)
hc_ndtw.agg.ward.v4 <- tsclust(list_spo2.4clust.v4, type = "h", k = ks.v4,
                               distance = "nDTW",
                               control = hierarchical_control(method = "ward.D2"))
# internal cluster validity indices, among them the silhouette
names(hc_ndtw.agg.ward.v4) <- paste0("k_", ks.v4)
cvis.v4 <- sapply(hc_ndtw.agg.ward.v4, cvi, type = "internal")

While the following is for single linkage, the "spectral" view:

hc_ndtw.agg.single.v4 <- tsclust(list_spo2.4clust.v4, type = "h",
                                 distance = "nDTW",
                                 control = hierarchical_control(method = "single"))

Besides, in order to visualize the results more easily, the functions in this github R script are loaded into the environment.

Summary of results

If you go straight to the dendrogram of the clustering with Ward linkage, you might think that the results are decent:

par(mfrow=c(1,1))
plot(hc_ndtw.agg.ward.v4$k_3)

From the plot, it looks like a height of 0.1, which selects 8 clusters, would give a good result. However, if you look at the silhouette, the results are not so congruent (some of the series included in the same cluster are actually distant from each other).

cvis.v4["Sil",]
##        k_3        k_8       k_18
## 0.07014402 0.06799493 0.31301241

This indicates that only when the number of clusters is comparable to the number of elements do the resulting clusters start to be cohesive, which of course is not really useful. You may inspect the clusters using the following:

browse_cutree_clustering(hc_ndtw.agg.ward.v4$k_3, list_spo2.clean, 0.1)

For example, these 4 series are included in the first cluster; the first two belong to relatively bad nights, while the last two belong to the healthy subject.

#plot_4series(list_spo2.clean,
#             'p_17-01-19', 'p_17-01-20', 'h_17-04-27', 'h_17-04-28',
#             "4 series in cluster 1")
for (night in c('p_17-01-19', 'p_17-01-20', 'h_17-04-27', 'h_17-04-28')) {
  plot(list_spo2.clean[[night]], type = "l", xlab = night,
       ylim = c(75, 100), ylab = "spo2")
  abline(h = 90, col = "red")  # reference line at 90% saturation
}

However, when using the "spectral" clustering (i.e. single linkage), the results are far more congruent. This is the dendrogram:

par(mfrow=c(1,1))
plot(hc_ndtw.agg.single.v4)

While you can take a look at the series one after the other using

browse_spectral_clustering(hc_ndtw.agg.single.v4, list_spo2.clean)

Note that the series are not fully ordered from best to worst nights; that is not the purpose of the spectral view, which simply picks out first the series that is most different from all the others. That said, there is some sensible ordering of the series, and the worst nights end up at the beginning of the spectrum (at the left of the dendrogram): there are not that many very bad nights for this patient, most of them are better, and thus the bad ones are the most different from the rest from the DTW-distance point of view. These are those first 4:

#plot_4series(list_spo2.clean,
#             'p_17-02-06', 'p_17-01-23', 'p_17-02-12', 'p_17-02-01',
#             "4 last series in spectral clustering")
for (night in c('p_17-02-01', 'p_17-02-12', 'p_17-01-23', 'p_17-02-06')) {
  plot(list_spo2.clean[[night]], type = "l", xlab = night,
       ylim = c(75, 100), ylab = "spo2")
  abline(h = 90, col = "red")  # reference line at 90% saturation
}

R Code

Finally, the R code. For the data preparation, the first step is to download this RData file (13MB) and then load it into your R session.

load("<your download path>/Pulsioximeter2Subjects.RData")

There are two dataframes there: history.sub1 contains 39 nights of data from the patient, while history.sub2 contains just 2 nights of a healthy subject.

library(chron)
summary(history.sub1)
##      date                spo2             bpm               pi
##  Length:1406065     Min.   : 60.00   Min.   : 25.00   Min.   : 0.06
##  Class :character   1st Qu.: 95.00   1st Qu.: 63.00   1st Qu.: 0.93
##  Mode  :character   Median : 97.00   Median : 79.00   Median : 1.40
##                     Mean   : 96.53   Mean   : 81.19   Mean   : 1.74
##                     3rd Qu.: 98.00   3rd Qu.: 98.00   3rd Qu.: 2.10
##                     Max.   :100.00   Max.   :162.00   Max.   :20.00
##    date_chron
##  Min.   :(16-12-07 21:20:35)
##  1st Qu.:(17-01-28 08:06:36)
##  Median :(17-02-07 23:57:13)
##  Mean   :(17-02-08 12:34:53)
##  3rd Qu.:(17-02-18 03:03:03)
##  Max.   :(17-04-28 07:30:17)
summary(history.sub2)
##      date                spo2            bpm             pi
##  Length:23939       Min.   :90.00   Min.   :33.00   Length:23939
##  Class :character   1st Qu.:95.00   1st Qu.:38.00   Class :character
##  Mode  :character   Median :96.00   Median :40.00   Mode  :character
##                     Mean   :95.79   Mean   :41.31
##                     3rd Qu.:97.00   3rd Qu.:42.00
##                     Max.   :99.00   Max.   :83.00
##    date_chron
##  Min.   :(17-04-18 17:46:48)
##  1st Qu.:(17-04-28 01:10:40)
##  Median :(17-04-28 22:48:36)
##  Mean   :(17-04-28 12:49:17)
##  3rd Qu.:(17-04-29 02:10:45)
##  Max.   :(17-04-29 05:32:54)

The columns of the dataframes that contain the measures are spo2 for oxygen saturation, bpm for the pulse, and pi for the perfusion index, a measure of the quality of the pulse signal which is only available for the patient. There is one data point per second for the patient, and one data point every two seconds for the healthy subject. (This is because the healthy subject used another pulse oximeter; the datasets are “equalized” in the code below. Note also that only the spo2 column is used in the current post, so the missing pi for the healthy subject is no problem.)
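As a quick sanity check of those sampling rates (date_chron is a chron, i.e. fractional days, hence the 86400 factor):

# quick sanity check of the sampling period, in seconds
median(diff(as.numeric(history.sub1$date_chron))) * 86400   # ~1 for the patient
median(diff(as.numeric(history.sub2$date_chron))) * 86400   # ~2 for the healthy subject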

The dataset history.sub1, for the patient, contains some tests in December of 2016, but the complete nights actually start on 19th Jan 2017 and finish on 25th Feb 2017, plus an additional night captured on the 27th of April 2017 (this last night will become important, since in it the curve looks just like any night of a healthy subject).

The dataset history.sub2, for the healthy subject, contains just two nights, the 27th and 28th of April 2017; the remaining data you see in the summary are also trials and tests. Both datasets were captured using the already mentioned pulse oximeter monitor repository, from a csv file with minimal postprocessing; see the function below:

prepare_radevents_csv <- function(file_name) {
  h <- read.csv(file_name, header=TRUE, sep=",", stringsAsFactors=FALSE)
  # add date column as chron
  h$date_chron <- chron(sapply(strsplit(h$date, "T"), "[", 1),
                        substring(sapply(strsplit(h$date, "T"), "[", 2), 1, 8),
                        format=c('y-m-d', 'h:m:s'))
  # remove rows with error
  message("Number of rows in error:", nrow(h[h$spo2 == -1 | h$bpm == -1 | h$pi == -1,]),
          appendLF = TRUE)
  h_clean <- h[h$spo2 != -1 & h$bpm != -1 & h$pi != -1,]
  return(h_clean)
}
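For instance, this is what the date-parsing lines do to a single, made-up timestamp string:

# what the parsing above does to one made-up timestamp string
d <- "17-02-07T23:57:13.123Z"
chron(sapply(strsplit(d, "T"), "[", 1),
      substring(sapply(strsplit(d, "T"), "[", 2), 1, 8),
      format=c('y-m-d', 'h:m:s'))
## [1] (17-02-07 23:57:13)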

The two datasets have to be reshaped to perform the clustering; the clustering library expects a list in which each element is a vector containing one of the series to compare, in our case the spo2 data for one night.

The following function takes the datasets as they come from the .RData file and converts them to the list of series, between the 2 dates provided as parameters. Each night is considered to start at 6pm, although of course both the patient and the healthy subject go to bed later. The only slightly tricky line of the function is the last one, which uses tapply (a toy illustration follows the function).

# creates list of series between 2 dates, for clustering
# output is a list in which one element contains the series for one day
assign_long_series_to_dates <- function(df, min_date_string, max_date_string) {
  # input strings into chron types
  min_date <- dates(min_date_string, format=c('y-m-d'))
  max_date <- dates(max_date_string, format=c('y-m-d'))
  # restrict the dataframe to the requested range (6pm to 6pm) and sort it
  lower <- chron(min_date, '18:00:00', format=c('y-m-d','h:m:s'))
  upper <- chron(max_date + 1, '18:00:00', format=c('y-m-d','h:m:s'))
  df_sorted_range <- df[df$date_chron >= lower & df$date_chron <= upper,]
  df_sorted_range <- df_sorted_range[order(df_sorted_range$date_chron),]
  # construct list of dates to which the data points will be assigned
  # i.e. from 0am till the morning are actually assigned to the previous day
  series_dates <- dates(df_sorted_range$date_chron - as.numeric(times("18:00:00")))
  # assign data points to groups according to `series_dates`
  tapply(df_sorted_range$spo2, series_dates, list)
}
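As a toy illustration of that last line (with made-up values): tapply splits the first vector by the second, and FUN=list keeps each group as a vector, so the result is a list with one series per day.

# toy illustration of the tapply(values, groups, list) idiom
vals <- c(97, 96, 95, 98)
days <- c("d1", "d1", "d2", "d2")
tapply(vals, days, list)   # $d1 contains 97 96; $d2 contains 95 98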

The two datasets are converted as follows, adding a p prefix for the patient, and an h prefix for the healthy subject:

list_spo2.sub1 <- assign_long_series_to_dates(history.sub1, '17-01-19', '17-04-28')
names(list_spo2.sub1) <- paste("p", names(list_spo2.sub1), sep="_")
list_spo2.sub2 <- assign_long_series_to_dates(history.sub2, '17-04-27', '17-04-28')
names(list_spo2.sub2) <- paste("h", names(list_spo2.sub2), sep="_")
list_spo2.clean <- c(list_spo2.sub1, list_spo2.sub2)
names(list_spo2.clean)
##  [1] "p_17-01-19" "p_17-01-20" "p_17-01-21" "p_17-01-22" "p_17-01-23"
##  [6] "p_17-01-24" "p_17-01-25" "p_17-01-26" "p_17-01-27" "p_17-01-28"
## [11] "p_17-01-29" "p_17-01-30" "p_17-01-31" "p_17-02-01" "p_17-02-02"
## [16] "p_17-02-03" "p_17-02-04" "p_17-02-05" "p_17-02-06" "p_17-02-07"
## [21] "p_17-02-08" "p_17-02-09" "p_17-02-10" "p_17-02-11" "p_17-02-12"
## [26] "p_17-02-13" "p_17-02-14" "p_17-02-15" "p_17-02-16" "p_17-02-17"
## [31] "p_17-02-18" "p_17-02-19" "p_17-02-20" "p_17-02-21" "p_17-02-22"
## [36] "p_17-02-23" "p_17-02-24" "p_17-02-25" "p_17-04-27" "h_17-04-27"
## [41] "h_17-04-28"

The list list_spo2.clean thus combines 41 nights for the two subjects: 39 nights for p and 2 nights for h. However, the granularity of the series is too high, with one data point every 1 or 2 seconds (depending on the pulse oximeter used by each person); one data point every 30 seconds is more than enough for the current analysis, and makes everything run faster and the plots clearer. The function reinterpolate in the package dtwclust reduces the granularity very easily.
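Here is a minimal sketch of what reinterpolate does, on a toy vector:

# minimal sketch of dtwclust::reinterpolate on a toy vector
library(dtwclust)
x <- sin(seq(0, 2 * pi, length.out = 300))
x30 <- reinterpolate(x, new.length = length(x) / 30)
length(x30)   # 10 points instead of 300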

An additional consideration is that any value of spo2 above approximately 90 is not very relevant: even the healthiest person shows some variation in saturation values during the night, and the interest is only in the low values. The function interpol_transform.v4 below handles both the flattening and the reinterpolation.

library(dtwclust)
interpol_transform.v4 <- function(l, factor=30) {
  # every value at or above 90 is flattened to 95
  temp <- l
  temp[temp >= 90] <- 95
  # reinterpolate
  temp <- reinterpolate(temp, length(temp)/factor)
  return(temp)
}

# equalizing: subject 1 data comes with double rate with respect to subject 2
list_spo2.sub1_interpol.v4 <- sapply(list_spo2.sub1, interpol_transform.v4, factor=30)
list_spo2.sub2_interpol.v4 <- sapply(list_spo2.sub2, interpol_transform.v4, factor=15)
list_spo2.4clust.v4 <- c(list_spo2.sub1_interpol.v4, list_spo2.sub2_interpol.v4)
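A quick check that both subjects end up with comparable granularity (roughly one data point every 30 seconds):

# quick check: both subjects should now have one data point every ~30 seconds
lengths(list_spo2.4clust.v4[c('p_17-01-19', 'h_17-04-27')])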

Note that after this transformation, the healthy subject's nights are going to look like a straight line at value 95. Now, list_spo2.4clust.v4 may actually be used for clustering. The final preparation step is registering the nDTW distance with proxy. This is very well explained in the dtwclust package vignette (launch vignette("dtwclust")).

# Normalized asymmetric DTW distance
ndtw <- function(x, y, ...) {
  dtw::dtw(x, y, step.pattern = asymmetric,
           distance.only = TRUE, ...)$normalizedDistance
}

# Registering the function with 'proxy'
if (!proxy::pr_DB$entry_exists("nDTW"))
  proxy::pr_DB$set_entry(FUN = ndtw, names=c("nDTW"),
                         loop = TRUE, type = "metric", distance = TRUE,
                         description = "Normalized asymmetric DTW")
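Once registered, the distance is available to anything that goes through proxy; for instance, a small cross-distance matrix between the first three nights:

# example: a small cross-distance matrix between the first three nights
proxy::dist(list_spo2.4clust.v4[1:3], method = "nDTW")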

Related posts

  • Association Rules applied to Google Analytics data for insights on the structure of a web site
    Although it uses a very different technique, this is another post on the site about clustering. When your data is categorical, clustering based on distance does not work well (except in some scenarios when using cosine distance) and you may consider Association Rules. In the post, Association Rules are applied to clickstream data from Google Analytics, as a way to understand how users visit a web site.