The purpose of this post is to perform a clustering of pulse oximeter data, as a step in an exploratory analysis of the nightly data of a patient.
When one starts using a device such as a pulse oximeter, it is useful first of all for monitoring the quality of the patient's sleep, and intervening if there is a problem. However, since the device is capturing data every night, one would also like to have an overview of what has happened over the last month (or year), whether there is an improvement or a deterioration… And eventually to start finding patterns in the data. A clustering looks like a good initial step, and for the series at hand, a distance based on Dynamic Time Warping seems suitable.
The result, while not really matching the expectations, is at least useful; this is a case in which single linkage, which chains the series into a spectrum, describes the data much better than other linkage criteria like Ward, since it is not possible to obtain a nice and clean division of the series into clusters or groups whose members are very similar to each other and different from the members of the other groups. That is, translated to the actual case: the patient is not having either very good or very bad nights, but rather a spectrum of nights that goes from good, through not so good and intermediate, all the way to bad and awful.
(And yes, instead of the accelerometer of previous posts, we’ve now switched to a pulse oximeter!)
The dataset consists of a total of 41 nights from two people, captured mostly during January and February of 2017: 39 nights belong to the patient of interest, and 2 additional nights from a healthy subject were added as a sanity check.
All the data was captured through the application in a GitHub repository for a pulse oximeter monitor. This monitor provides graphs that can be viewed in a web browser from a different room than the one in which the patient is sleeping (very useful for monitoring at home), and additionally it stores all the data in a
There are plenty of references on the web about the definition of Dynamic Time Warping as a distance. The perfect example is voice recognition: it makes sense that the same word, even pronounced twice by the same person at different moments in time, will produce similar but not identical sound waves. A Euclidean distance may be able to get close: it would mean comparing the wave levels at analogous moments in time for the two words. However, what happens if the two words were uttered at very different speeds? When evaluating the distance, DTW decides which are the optimal moments in time at which to compare the wave levels. A good reference is just the
dtwclust package vignette, and if you want to take a look at an example in which a sine and a cosine (thus, essentially the same wave, but with some offset) are compared using the DTW distance, just look at the help with
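As a quick illustration of the idea (a minimal sketch, not taken from the package help: the series and comparison below are made up here), comparing a sine and a cosine with a point-by-point distance versus DTW shows the effect of the warping:

```r
library(dtwclust)

t <- seq(0, 4 * pi, length.out = 100)
x <- sin(t)  # the "reference" wave
y <- cos(t)  # essentially the same wave, phase-shifted

# Point-by-point L1 distance: compares the waves at the same instants,
# so the phase offset is counted as error all along the series
d_lock <- sum(abs(x - y))

# DTW (dtw_basic uses the L1 norm by default): warps the time axis so
# that analogous moments of the two waves are compared instead
d_dtw <- dtw_basic(x, y)

d_dtw < d_lock  # TRUE: the warping absorbs most of the offset
```

The warping path cannot remove everything (the endpoints of the two series must still be matched to each other), but the remaining cost is a small fraction of the lock-step distance.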
Once the data is clean and prepared, and the distance has been registered with
proxy (see the R code at the bottom of the page), launching a hierarchical clustering is not really different from the usual.
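The registration itself can be sketched as follows. This is a hypothetical version, not the one used in this post (which is in the R code at the bottom of the page); in particular, normalizing by the summed series lengths is an assumption, just one common choice for a length-independent DTW:

```r
library(dtwclust)  # provides dtw_basic()
library(proxy)     # provides the pr_DB distance registry

# Hypothetical "nDTW": raw DTW cost divided by the summed lengths,
# so that long and short nights become comparable (an assumption here)
ndtw <- function(x, y, ...) {
  dtw_basic(x, y, ...) / (length(x) + length(y))
}

# Register it so that tsclust(..., distance = "nDTW") can find it
if (!pr_DB$entry_exists("nDTW")) {
  pr_DB$set_entry(FUN = ndtw, names = "nDTW",
                  loop = TRUE, type = "metric", distance = TRUE)
}
```

Once an entry exists in `pr_DB`, any function that dispatches through proxy (like `tsclust`) can use it by name.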
The following code runs the clustering with Ward linkage (it takes a few minutes; evaluating the DTW distance combined with hierarchical clustering is a bit heavy):
library(dtwclust)

ks.v4 <- c(3L, 8L, 18L)
hc_ndtw.agg.ward.v4 <- tsclust(list_spo2.4clust.v4, type = "h", k = ks.v4,
                               distance = "nDTW",
                               control = hierarchical_control(method = "ward.D2"))

# ... silhouette
names(hc_ndtw.agg.ward.v4) <- paste0("k_", ks.v4)
cvis.v4 <- sapply(hc_ndtw.agg.ward.v4, cvi, type = "internal")
While the following would be for single linkage:
hc_ndtw.agg.single.v4 <- tsclust(list_spo2.4clust.v4, type = "h",
                                 distance = "nDTW",
                                 control = hierarchical_control(method = "single"))
Besides, in order to visualize the results more easily, the functions in this GitHub R script are loaded into the environment.
If you go straight to the dendrogram of the clustering using Ward linkage, you might think that the results are decent:
From the plot, it looks like a height of 0.1, which selects 8 clusters, would give a good result. However, if you look at the Silhouette index, the results are not so coherent (some of the series included in the same cluster are actually distant from each other).
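As a generic base-R illustration of this cut-then-check workflow (using a toy built-in dataset, not the dtwclust objects of this post), cutting a dendrogram and computing the average silhouette width looks like:

```r
library(cluster)  # for silhouette()

# Toy data, only to illustrate the mechanics
d <- dist(scale(mtcars[, c("mpg", "hp", "wt")]))
hc <- hclust(d, method = "ward.D2")

cl <- cutree(hc, k = 8)      # or cutree(hc, h = some_height) to cut at a height
sil <- silhouette(cl, d)

# Average silhouette width; values near 0 mean the clusters are not
# cohesive, values near 1 mean tight, well-separated clusters
mean(sil[, "sil_width"])
```

In this post the equivalent numbers come from `cvi(..., type = "internal")`, which reports the Silhouette index among other internal validity measures.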
##        k_3        k_8       k_18
## 0.07014402 0.06799493 0.31301241
This would indicate that the resulting clusters only start to be cohesive when the number of clusters is comparable to the number of elements, which of course is not really useful. You may look at it using the following:
browse_cutree_clustering(hc_ndtw.agg.ward.v4$k_3, list_spo2.clean, 0.1)
For example, these 4 series are included in the first cluster; the first two belong to relatively bad nights, while the last two belong to the healthy subject.
# plot_4series(list_spo2.clean,
#              'p_17-01-19', 'p_17-01-20', 'h_17-04-27', 'h_17-04-28',
#              "4 series in cluster 1")
plot(list_spo2.clean[['p_17-01-19']], type = "l", xlab = 'p_17-01-19',
     ylim = c(75, 100), ylab = "spo2")
abline(a = 90, b = 0, col = "red")
plot(list_spo2.clean[['p_17-01-20']], type = "l", xlab = 'p_17-01-20',
     ylim = c(75, 100), ylab = "spo2")
abline(a = 90, b = 0, col = "red")