The purpose of this post is to cluster data from a pulse oximeter, as a step in an exploratory analysis of the nightly data of a patient.
When one starts using a device such as a pulse oximeter, it is useful first of all for monitoring the quality of the patient's sleep, so that one can intervene if there is a problem. However, since the device captures data every night, one would also like an overview of what has happened over the last month (or year), whether there is an improvement or a deterioration… and eventually to start finding patterns in the data. A clustering looks like a good initial step, and for the series at hand, a distance based on Dynamic Time Warping seems suitable.
The result, while not really matching the expectations, is at least useful; this is a case in which a spectral clustering describes the data much better than other linkage criteria (such as Ward), since it is not possible to obtain a clean division of the series into clusters whose members are very similar to each other and different from the members of the other clusters. Translated to the actual case: the patient is not having either very good or very bad nights, but rather a spectrum of nights that goes from good, through not so good and intermediate, all the way to bad and awful.
(And yes, instead of the accelerometer of previous posts, we've now switched to a pulse oximeter!)
The dataset consists of a total of 41 nights from two people, captured mostly around January and February of 2017. 39 nights belong to the patient of interest, and 2 additional nights of a healthy subject were added as a sanity check.
All the data was captured through the application in a GitHub repository for a pulse oximeter monitor. This monitor provides graphs that can be viewed in a web browser in a different room from the one in which the patient is sleeping (very useful for monitoring at home), and it additionally stores all the data in a MongoDB database.
There are plenty of references on the web about the definition of Dynamic Time Warping as a distance. The classic example is voice recognition: it makes sense that the same word, even pronounced twice by the same person at different moments in time, will produce similar but not identical sound waves. A Euclidean distance may get close; it would mean comparing the wave levels at analogous moments in time for the two words. However, what happens if the two words were uttered at very different speeds? When evaluating the distance, DTW decides which are the optimal moments in time at which to compare the wave levels. A good reference is the dtwclust package vignette, and if you want to see an example in which a sine and a cosine (thus, essentially the same wave, but with an offset) are compared using the DTW distance, just look at the help with ?dtw. Additionally, for clustering with DTW, this one and this other one are two questions on Cross Validated that have been useful, among many others.
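As a quick sketch of that example (assuming the dtw package is installed), a sine and a cosine of the same frequency can be compared directly; DTW warps the time axis, so the phase offset contributes much less than a point-by-point comparison would suggest:

```r
library(dtw)

# essentially the same wave, with an offset: a sine and a cosine
idx <- seq(0, 2 * pi, length.out = 100)
query <- sin(idx)
reference <- cos(idx)

# DTW searches for the optimal alignment between the two series
alignment <- dtw(query, reference, keep = TRUE)
alignment$distance            # accumulated cost of the optimal alignment
alignment$normalizedDistance  # the same, normalized by the path length
```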
Once the data is clean and prepared, and the distance has been registered with proxy (see the R code at the bottom of the page), launching a hierarchical clustering is not really different from the usual.
The following code is for Ward linkage (it takes a few minutes; evaluating the DTW distance combined with hierarchical clustering is a bit heavy):
library(dtwclust)
ks.v4 <- c(3L,8L,18L)
hc_ndtw.agg.ward.v4 <- tsclust(list_spo2.4clust.v4, type = "h", k = ks.v4,
distance = "nDTW",
control = hierarchical_control(method = "ward.D2"))
# ... silhouette
names(hc_ndtw.agg.ward.v4) <- paste0("k_", ks.v4)
cvis.v4 <- sapply(hc_ndtw.agg.ward.v4, cvi, type = "internal")
While the following would be for the spectral clustering:
hc_ndtw.agg.single.v4 <- tsclust(list_spo2.4clust.v4, type = "h",
distance = "nDTW",
control = hierarchical_control(method = "single"))
Besides, in order to visualize the results more easily, the functions in this GitHub R script are loaded into the environment.
If you go straight to the dendrogram of the clustering with Ward linkage, you might think that the results are decent:
par(mfrow=c(1,1))
plot(hc_ndtw.agg.ward.v4$k_3)
From the plot, it looks like a height of 0.1, which selects 8 clusters, would give a good result. However, if you look at the silhouette, the results are not so congruent (some of the series included in the same cluster are actually distant from each other).
cvis.v4["Sil",]
## k_3 k_8 k_18
## 0.07014402 0.06799493 0.31301241
This would indicate that only when the number of clusters is comparable to the number of elements do the resulting clusters start to be more cohesive, which of course is not really useful. You may look at it using the following:
browse_cutree_clustering(hc_ndtw.agg.ward.v4$k_3, list_spo2.clean, 0.1)
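For reference, cutting a dendrogram at a given height is done with base R's cutree; a toy sketch on made-up one-dimensional data (not the spo2 series) shows how a height threshold translates into a number of clusters:

```r
# toy data: two tight pairs plus an outlier
x <- c(1, 1.1, 5, 5.2, 9)
hc <- hclust(dist(x), method = "ward.D2")

# cutting below the large merge heights keeps only the tight pairs together
cl <- cutree(hc, h = 2)
length(unique(cl))  # 3 clusters: {1, 1.1}, {5, 5.2}, {9}
```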
For example, these 4 series are included in the first cluster; the first two belong to relatively bad nights, while the last two belong to the healthy subject.
#plot_4series(list_spo2.clean,
# 'p_17-01-19', 'p_17-01-20', 'h_17-04-27', 'h_17-04-28',
# "4 series in cluster 1")
plot(list_spo2.clean[['p_17-01-19']], type="l", xlab='p_17-01-19', ylim=c(75,100), ylab="spo2")
abline(a=90, b=0, col="red")
plot(list_spo2.clean[['p_17-01-20']], type="l", xlab='p_17-01-20', ylim=c(75,100), ylab="spo2")
abline(a=90, b=0, col="red")
plot(list_spo2.clean[['h_17-04-27']], type="l", xlab='h_17-04-27', ylim=c(75,100), ylab="spo2")
abline(a=90, b=0, col="red")
plot(list_spo2.clean[['h_17-04-28']], type="l", xlab='h_17-04-28', ylim=c(75,100), ylab="spo2")
abline(a=90, b=0, col="red")
However, when using the spectral clustering (i.e. single linkage), the results are far more congruent. This is the dendrogram:
par(mfrow=c(1,1))
plot(hc_ndtw.agg.single.v4)
You can take a look at the series one after the other using
browse_spectral_clustering(hc_ndtw.agg.single.v4, list_spo2.clean)
Note that the series are not fully ordered from best to worst nights; that is not the purpose of the spectral clustering, which just selects first the series that is most different from the other series. That said, there is some sensible ordering, and the worst nights are left at the beginning of the spectrum (at the left of the dendrogram): there are not many very bad nights for this patient, most nights are better than those, so these are the most different from the rest of the nights from the DTW distance point of view. These are those first 4:
#plot_4series(list_spo2.clean,
# 'p_17-02-06', 'p_17-01-23', 'p_17-02-12', 'p_17-02-01',
# "4 last series in spectral clustering")
plot(list_spo2.clean[['p_17-02-01']], type="l", xlab='p_17-02-01', ylim=c(75,100), ylab="spo2")
abline(a=90, b=0, col="red")
plot(list_spo2.clean[['p_17-02-12']], type="l", xlab='p_17-02-12', ylim=c(75,100), ylab="spo2")
abline(a=90, b=0, col="red")
plot(list_spo2.clean[['p_17-01-23']], type="l", xlab='p_17-01-23', ylim=c(75,100), ylab="spo2")
abline(a=90, b=0, col="red")
plot(list_spo2.clean[['p_17-02-06']], type="l", xlab='p_17-02-06', ylim=c(75,100), ylab="spo2")
abline(a=90, b=0, col="red")
Finally, the R code. For the data preparation, the first step is to download this RData file (13MB) and then load it into your R session.
load("<your download path>/Pulsioximeter2Subjects.RData")
There are two dataframes there: history.sub1 contains 39 nights of data from the patient, while history.sub2 contains just 2 nights of a healthy subject.
library(chron)
summary(history.sub1)
## date spo2 bpm pi
## Length:1406065 Min. : 60.00 Min. : 25.00 Min. : 0.06
## Class :character 1st Qu.: 95.00 1st Qu.: 63.00 1st Qu.: 0.93
## Mode :character Median : 97.00 Median : 79.00 Median : 1.40
## Mean : 96.53 Mean : 81.19 Mean : 1.74
## 3rd Qu.: 98.00 3rd Qu.: 98.00 3rd Qu.: 2.10
## Max. :100.00 Max. :162.00 Max. :20.00
## date_chron
## Min. :(16-12-07 21:20:35)
## 1st Qu.:(17-01-28 08:06:36)
## Median :(17-02-07 23:57:13)
## Mean :(17-02-08 12:34:53)
## 3rd Qu.:(17-02-18 03:03:03)
## Max. :(17-04-28 07:30:17)
summary(history.sub2)
## date spo2 bpm pi
## Length:23939 Min. :90.00 Min. :33.00 Length:23939
## Class :character 1st Qu.:95.00 1st Qu.:38.00 Class :character
## Mode :character Median :96.00 Median :40.00 Mode :character
## Mean :95.79 Mean :41.31
## 3rd Qu.:97.00 3rd Qu.:42.00
## Max. :99.00 Max. :83.00
## date_chron
## Min. :(17-04-18 17:46:48)
## 1st Qu.:(17-04-28 01:10:40)
## Median :(17-04-28 22:48:36)
## Mean :(17-04-28 12:49:17)
## 3rd Qu.:(17-04-29 02:10:45)
## Max. :(17-04-29 05:32:54)
The columns of the dataframes that contain the measures are spo2 for oxygen saturation, bpm for the pulse, and pi, a measure of the quality of the pulse signal called the perfusion index, which is only available for the patient. There is one data point per second for the patient, and one data point every two seconds for the healthy subject (the healthy subject used another pulse oximeter; the datasets are "equalized" in the code below). Note also that only the spo2 column is used in the current post, so the missing pi for the healthy subject is not a problem.
The dataset history.sub1, for the patient, contains some tests from December 2016, but the complete nights actually start on 19th Jan 2017 and finish on 25th Feb 2017, plus an additional night captured on the 27th of April 2017 (this last night will become important, since its curve looks just like any night of a healthy subject).
The dataset history.sub2, for the healthy subject, contains just two nights, the 27th and 28th of April 2017; the remaining data you see in the summary are also trials and tests. Both datasets were captured using the already mentioned pulse oximeter monitor repository, from a csv file with minimal postprocessing; see the function below:
prepare_radevents_csv <- function(file_name) {
h <- read.csv(file_name, header=TRUE, sep=",", stringsAsFactors=FALSE)
# add date column as chron
h$date_chron <- chron(sapply(strsplit(h$date, "T"), "[", 1),
substring(sapply(strsplit(h$date, "T"), "[", 2), 1, 8),
format=c('y-m-d', 'h:m:s'))
# remove rows with error
message("Number of rows in error:", nrow(h[h$spo2 == -1 | h$bpm == -1 | h$pi == -1,]),
appendLF = TRUE)
h_clean <- h[h$spo2 != -1 & h$bpm != -1 & h$pi != -1,]
return(h_clean)
}
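The only non-obvious part of the function is the date handling; on a single made-up timestamp in the stored format, the strsplit/substring combination behaves as follows:

```r
# hypothetical timestamp in the format the monitor stores: date "T" time
s <- "17-01-19T23:45:12.345Z"

date_part <- sapply(strsplit(s, "T"), "[", 1)                   # day only
time_part <- substring(sapply(strsplit(s, "T"), "[", 2), 1, 8)  # hh:mm:ss only

date_part  # "17-01-19"
time_part  # "23:45:12"
```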
The two datasets have to be reshaped to perform the clustering; the library expects a list of elements, where each element is a vector containing one series to compare, in our case the spo2 data for one night.
The following function takes the datasets as they come from the .RData file and converts them to the list of series, between the 2 dates provided as parameters. Each night is considered to start at 6pm, although of course both the patient and the healthy subject go to bed later. The only slightly tricky line of the function is the last one, using tapply (not so tricky anyway).
# creates list of series between 2 dates, for clustering
# output is a list in which one element contains the series for one day
assign_long_series_to_dates <- function(df, min_date_string, max_date_string) {
# input strings into chron types
min_date <- dates(min_date_string, format=c('y-m-d'))
max_date <- dates(max_date_string, format=c('y-m-d'))
# get date range from dataframe and sort it
lower <- chron(min_date, '18:00:00', format=c('y-m-d','h:m:s'))
upper <- chron(max_date + 1, '18:00:00', format=c('y-m-d','h:m:s'))
df_sorted_range <- df[df$date_chron >= lower & df$date_chron <= upper,]
df_sorted_range <- df_sorted_range[order(df_sorted_range$date_chron),]
# construct list of dates to which the data points will be assigned
# i.e. from 0am till the morning are actually assigned to the previous day
series_dates <- dates(df_sorted_range$date_chron - as.numeric(times("18:00:00")))
# assign to different rows according to `series_dates`
output_spo2 <- tapply(df_sorted_range$spo2, series_dates, list)
return(output_spo2)
}
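In case that last tapply line looks opaque: with FUN = list it groups the values of its first argument by the second one, and returns one vector per group. A toy sketch with made-up readings:

```r
# made-up spo2 readings tagged with the night they belong to
vals  <- c(96, 97, 95, 98, 94)
night <- c("d1", "d1", "d2", "d2", "d2")

by_night <- tapply(vals, night, list)
by_night[["d1"]]  # c(96, 97)
by_night[["d2"]]  # c(95, 98, 94)
```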
The two datasets are converted as follows, adding a p as the prefix for the patient, and an h as the prefix for the healthy subject:
list_spo2.sub1 <- assign_long_series_to_dates(history.sub1, '17-01-19', '17-04-28')
names(list_spo2.sub1) <- paste("p", names(list_spo2.sub1), sep="_")
list_spo2.sub2 <- assign_long_series_to_dates(history.sub2, '17-04-27', '17-04-28')
names(list_spo2.sub2) <- paste("h", names(list_spo2.sub2), sep="_")
list_spo2.clean <- c(list_spo2.sub1, list_spo2.sub2)
names(list_spo2.clean)
## [1] "p_17-01-19" "p_17-01-20" "p_17-01-21" "p_17-01-22" "p_17-01-23"
## [6] "p_17-01-24" "p_17-01-25" "p_17-01-26" "p_17-01-27" "p_17-01-28"
## [11] "p_17-01-29" "p_17-01-30" "p_17-01-31" "p_17-02-01" "p_17-02-02"
## [16] "p_17-02-03" "p_17-02-04" "p_17-02-05" "p_17-02-06" "p_17-02-07"
## [21] "p_17-02-08" "p_17-02-09" "p_17-02-10" "p_17-02-11" "p_17-02-12"
## [26] "p_17-02-13" "p_17-02-14" "p_17-02-15" "p_17-02-16" "p_17-02-17"
## [31] "p_17-02-18" "p_17-02-19" "p_17-02-20" "p_17-02-21" "p_17-02-22"
## [36] "p_17-02-23" "p_17-02-24" "p_17-02-25" "p_17-04-27" "h_17-04-27"
## [41] "h_17-04-28"
The dataset list_spo2.4clust.v4 thus combines 41 nights for the two subjects: 39 nights for p and 2 nights for h. However, the series' granularity is too high, with one data point every 1 or 2 seconds (depending on the pulse oximeter used by each person); one data point every 30 seconds is more than enough for the current analysis, and it makes everything run faster, produces clearer plots, etc. Using the function reinterpolate in the package dtwclust, one reduces the granularity very easily.
An additional issue is that any value of spo2 above approximately 90 is not very relevant: even the healthiest person shows some variation in saturation during the night, and the interest is only in the low values. The function interpol_transform.v4 below does both jobs.
library(dtwclust)
interpol_transform.v4 <- function(l, factor=30) {
# everything at or above 90 is flattened to 95
temp <- l
temp[temp >= 90] <- 95
# reinterpolate
temp <- reinterpolate(temp, length(temp)/factor)
return(temp)
}
# equalizing: subject 1 data comes with double rate with respect to subject 2
list_spo2.sub1_interpol.v4 <- sapply(list_spo2.sub1, interpol_transform.v4, factor=30)
list_spo2.sub2_interpol.v4 <- sapply(list_spo2.sub2, interpol_transform.v4, factor=15)
list_spo2.4clust.v4 <- c(list_spo2.sub1_interpol.v4, list_spo2.sub2_interpol.v4)
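As a side note, the downsampling itself can be sketched with base R's approx, without depending on the package (a linear-interpolation sketch, not necessarily identical point by point to what reinterpolate does):

```r
# linear downsampling to 1/factor of the original number of points
downsample <- function(x, factor = 30) {
  approx(seq_along(x), x, n = length(x) / factor)$y
}

series <- sin(seq(0, 10, length.out = 3000))
length(downsample(series, factor = 30))  # 100 points
```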
Note that after this transformation, the healthy subject's nights are going to look like a straight line at value 95. Now list_spo2.4clust.v4 may actually be used for clustering. The final preparation step is registering the nDTW distance function. This is very well explained in the dtwclust package's vignette (launch vignette("dtwclust")).
# Normalized asymmetric DTW distance
ndtw <- function(x, y, ...) {
dtw::dtw(x, y, step.pattern = asymmetric,
distance.only = TRUE, ...)$normalizedDistance
}
# Registering the function with 'proxy'
if (!proxy::pr_DB$entry_exists("nDTW"))
proxy::pr_DB$set_entry(FUN = ndtw, names=c("nDTW"),
loop = TRUE, type = "metric", distance = TRUE,
description = "Normalized asymmetric DTW")
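Once registered, the distance can be used by name wherever proxy distances are accepted (that is what distance = "nDTW" in the tsclust calls above relies on), or called directly. A quick sanity check (assuming the dtw package is installed):

```r
library(dtw)

# the same ndtw defined above, repeated here so the snippet is self-contained
ndtw <- function(x, y, ...) {
  dtw::dtw(x, y, step.pattern = asymmetric,
           distance.only = TRUE, ...)$normalizedDistance
}

x <- sin(seq(0, 2 * pi, length.out = 100))
y <- sin(seq(0, 2 * pi, length.out = 150))  # same shape, different length

ndtw(x, x)  # 0: a series is at distance zero from itself
ndtw(x, y)  # small, despite the different sampling rates
```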