Introduction & purpose

This post presents some promising results of an early alarm model for pulse oximeter data using Deep Learning, obtained by applying the equivalent of data augmentation for time series.

First of all, let's describe the purpose of such an early alarm predictive model: detecting that a desaturation is likely to happen in the next 2 minutes (a desaturation meaning a low blood oxygen concentration). Setting up this model is first of all a feasibility exercise; still, such a prediction would be useful in itself, at least as an initial step, for a system that improves the sleep quality of the patient: e.g. if the desaturation alarm were raised early enough, the carer could simply change the position of the patient (assuming a patient who is sleeping and has a very limited ability to move).

On the other hand, and importantly for the purposes of this blog, it is a nice practical case of implementing a Deep Learning model that combines CNN and LSTM layers. The task at hand is certainly cumbersome, but there are a number of steps to follow that are pretty much common knowledge in the Deep Learning community, and after some effort, quite a lot of CPU time, and a few tricks, the results are quite satisfying.

Dataset used

The dataset has already been presented in this previous post.

It consists of data from a pulse oximeter (i.e. a device for measuring blood oxygen concentration and pulse rate). The meter was connected to a sick patient for 39 nights, plus 2 additional nights that were recorded for a healthy subject, as a sanity check.

The data capture itself was done using the software in this repository.

References

There are countless references for Deep Learning. This and this are two parts of quite a good tutorial on predicting a time series with LSTMs; it may serve as a gentle introduction, also because the problem described in the tutorial is a good example of a time series prediction that can reasonably be tackled with Deep Learning.

There are many other examples around that are indeed good for practising the technique; however, the time series forecasting problem in such examples tends to be too difficult, e.g. predicting the stock exchange market. Those exercises usually attempt a prediction for just 1 timestep; although that prediction might be decent enough, everything becomes a lot harder when adding more timesteps (as in a real problem). A really useful example of this kind is this post, which tries to predict long-term interest rates for the USA.

There have been quite a few attempts to use Deep Learning to predict time series related to body signals in similar prediction setups; the implementation in this post is actually a refactor of this repository, called McFly, for Human Activity Recognition. The repository is an implementation of this paper, co-written by F Ordoñez (the youtube talk is in Spanish, sorry). This approach combines convolutional layers, which extract features from the raw signals, with LSTM layers, which model the temporal dynamics of the time series.

The refactored code used in this post is available here; it mostly adds a Data Generator to McFly, for setting up the data more easily and, importantly, for controlling how the data is fed to the model during training.

If the references in this list are still not enough, you can always fall back on a couple of golden references on DL, such as Chollet's bible or Andrew Ng's course.

Summary of results

Only the most relevant results (together with the code to reproduce them) are presented in this post, with the intention of keeping it brief and useful. Other details will be covered in more depth in follow-up posts, so that in the end, hopefully, the set of articles becomes a summary of the steps needed to build a Deep Learning model for time series prediction.

Assuming that the right architecture and hyperparameters have already been found, each of the following sections feeds the data to the model during training in a different way:

  • Using data as it is (naïve approach)
  • Rebalancing the data, since it is far more interesting to detect big desaturations than spurious variations of the oxygen levels
  • Augmenting the data, i.e. producing more data from the already existing data by means of simple variations

Finally, the post includes a brief wrap-up of what has been done, what is missing, and what will be covered in follow-up posts.

Deep Learning results for the most naïve approach

In this case, the data is fed to the model "as is". The data was generated in this previous post; for the purposes here, one can just use this csv file. The csv has 4 columns:

  • datetime
  • bpm: heart beats per minute
  • spo2: saturation level
  • name: a string identifying each time series, i.e. a certain patient on a certain night

Note that the raw data from the pulse oximeter comes out every second (every 2 seconds for other devices); as a pre-processing step, the data in the csv has been interpolated to one timestep every 30 seconds.
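
As an illustration, this kind of resampling can be done with pandas; the sketch below uses the same column names as the csv, but it is only a stand-in (the actual pre-processing was done in the previous post, so the details may differ):

import pandas as pd

# hypothetical raw capture: one (bpm, spo2) reading per second
raw = pd.DataFrame({
    "datetime": pd.date_range("2017-01-19 23:00", periods=3600, freq="s"),
    "bpm": 60.0,
    "spo2": 97.0,
})

# downsample to one timestep every 30 seconds, interpolating any gaps
resampled = (raw.set_index("datetime")
                .resample("30s")
                .mean()
                .interpolate(method="time")
                .reset_index())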

Fitting the model

Let's assume at this point that the right architecture has already been selected somehow. The first step for training the model is importing the libraries and custom utils.

In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pickle
from keras.losses import mean_absolute_percentage_error

from utils.generate_models import generate_models, generate_DeepConvLSTM_model
from utils.validate_models import find_best_architecture, evaluate_model, evaluate_plot, simple_prettify
from utils.data_generator import DataGenerator
from utils.get_dataset_pulsi import get_dataset_pulsi
from utils.validate_models import mape
  
Using TensorFlow backend.
  

Importing and pre-processing the data.

In [2]:
columns = np.array(['bpm', 'spo2'])
dataset_reduced_std, dataset_reduced = get_dataset_pulsi(columns,
                                                         filename='./utils/test_data/42nights.csv')
  

Setting up some parameters of the model: the target prediction is for 4 timesteps, i.e. 2 minutes, based on the data of the last 12 timesteps, i.e. the last 6 minutes.

In [3]:
window_size = 12
number_of_predictions = 4
target_variable = "spo2"
  

Some technical parameters: batch_size, the number of samples taken for each training step (where a sample is composed of a window of 12 timesteps plus the prediction of 4 timesteps), and the metric used for the optimization.
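
To make the sample layout concrete, here is a minimal numpy sketch of how one batch could be assembled from a single night; the variable names are illustrative, not the DataGenerator's internals:

import numpy as np

window_size, number_of_predictions, batch_size = 12, 4, 32

# one hypothetical night: ~7.5 hours of (bpm, spo2) at one timestep per 30s
night = np.random.rand(900, 2)
spo2 = night[:, 1]

# the first batch_size samples of that night
X = np.stack([night[i:i + window_size] for i in range(batch_size)])
y = np.stack([spo2[i + window_size:i + window_size + number_of_predictions]
              for i in range(batch_size)])
print(X.shape, y.shape)   # (32, 12, 2) (32, 4)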

In [4]:
batch_size = 32
metric = mean_absolute_percentage_error
  

Most of the job is done by the DataGenerators, which control how the data is passed to the training process; more in-depth details on data generators will be given in follow-up posts. For the generators instantiated below, the data is passed "as is", in batches of batch_size.
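
As a rough idea of the pattern, a Keras data generator is typically a keras.utils.Sequence that serves one batch per __getitem__ call. The toy class below is a minimal sketch of that pattern, not the repository's actual DataGenerator:

import numpy as np
from keras.utils import Sequence

class WindowGenerator(Sequence):
    """Serves (window, prediction) batches from one long (timesteps, channels)
    series; a toy stand-in for the real DataGenerator."""

    def __init__(self, series, window_size, number_of_predictions, batch_size):
        self.series = series
        self.window_size = window_size
        self.n_pred = number_of_predictions
        self.batch_size = batch_size
        self.n_samples = len(series) - window_size - number_of_predictions + 1

    def __len__(self):
        # number of batches served per epoch
        return self.n_samples // self.batch_size

    def __getitem__(self, idx):
        starts = range(idx * self.batch_size, (idx + 1) * self.batch_size)
        # inputs: windows of shape (batch_size, window_size, channels)
        X = np.stack([self.series[s:s + self.window_size] for s in starts])
        # targets: the next number_of_predictions values of spo2 (column 1)
        y = np.stack([self.series[s + self.window_size:
                                  s + self.window_size + self.n_pred, 1]
                      for s in starts])
        return X, y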

Some nights are reserved for the training data, others for validation (used during training), and a final test set is reserved for assessing the precision of the model.

In [5]:
train_names = np.array(['p_17-01-19', 'p_17-01-20', 'p_17-01-21', 'p_17-01-22', 'p_17-01-23', 'p_17-01-24',
                        'p_17-01-25', 'p_17-01-26', 'p_17-01-27', 'p_17-01-28', 'p_17-01-29', 'p_17-01-30',
                        'p_17-01-31', 'p_17-02-01', 'p_17-02-02', 'p_17-02-03', 'p_17-02-04', 'p_17-02-05',
                        'p_17-02-06', 'p_17-02-07', 'p_17-02-08', 'p_17-02-09', 'p_17-02-10'])
val_names = np.array(['p_17-02-11', 'p_17-02-12', 'p_17-02-13', 'p_17-02-14', 'p_17-02-15', 'p_17-02-16',
                      'p_17-02-17', 'p_17-02-18'])
test_names = np.array(['p_17-02-19', 'p_17-02-20', 'p_17-02-21', 'p_17-02-22', 'p_17-02-23', 'p_17-02-24',
                       'p_17-02-25', 'p_17-04-27'])
train_gen = DataGenerator(dataset_reduced_std, train_names,
                          "spo2", batch_size=batch_size,
                          number_of_predictions=number_of_predictions,
                          window_size=window_size,
                          step_prediction_dates=1,
                          rebalance_data=False, debug=False)
val_gen = DataGenerator(dataset_reduced_std, val_names,
                        "spo2", batch_size=batch_size,
                        number_of_predictions=number_of_predictions,
                        window_size=window_size,
                        step_prediction_dates=1,
                        rebalance_data=False)
test_gen = DataGenerator(dataset_reduced_std, test_names,
                         "spo2", batch_size=batch_size,
                         number_of_predictions=number_of_predictions,
                         window_size=window_size,
                         step_prediction_dates=1,
                         rebalance_data=False)
  

The following step fits the model.

Note that, first, a pre-fixed set of hyperparameter values is defined. Then, the model is instantiated. And finally, the call to find_best_architecture trains the model and checks its performance.

All of this would actually be done for different values of the model hyperparameters, in order to find the best architecture; once again, the subject will be covered in more depth in follow-up posts.
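
Schematically, such a search could look like the hypothetical random search below (in the repository, generate_models plays a similar role):

# a hypothetical random search over the hyperparameter space
import numpy as np

rng = np.random.default_rng(0)
candidate_hyperparameters = []
for _ in range(10):
    candidate_hyperparameters.append({
        "filters": [int(rng.integers(16, 65))],
        "lstm_dims": [int(rng.integers(32, 129))],
        "learning_rate": float(10 ** rng.uniform(-5, -3)),
        "regularization_rate": float(10 ** rng.uniform(-5, -3)),
    })
# each candidate would then be turned into a model and evaluated,
# e.g. through find_best_architecture as in the cell below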

However, there is a bit more that needs to be said about the architecture (a sketch of it follows the list below):

  • it uses 78 cells for the LSTM chain (lstm_dims below)
  • plus a convolutional layer of 33 filters for "feature engineering"
  • and the rest of the parameters have to do with regularization and dropout
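
As a rough idea of what a model of this kind looks like, here is a minimal Keras sketch; it is an illustration under the assumptions listed above, not the exact generate_DeepConvLSTM_model from the repository:

from keras.models import Sequential
from keras.layers import Conv1D, LSTM, Dense, Dropout
from keras.regularizers import l2
from keras.optimizers import Adam

def deep_conv_lstm_sketch(dim_length=12, dim_channels=2, output_dim=4,
                          filters=33, lstm_dim=78, learning_rate=0.0001,
                          regularization_rate=0.0002,
                          dropout_cnn=0.81, dropout_rnn=0.72):
    model = Sequential()
    # convolutional "feature engineering" over the input window
    model.add(Conv1D(filters, kernel_size=3, padding="same", activation="relu",
                     kernel_regularizer=l2(regularization_rate),
                     input_shape=(dim_length, dim_channels)))
    model.add(Dropout(dropout_cnn))
    # LSTM chain summarizing the window into one vector
    model.add(LSTM(lstm_dim, dropout=dropout_rnn))
    # one output per predicted timestep
    model.add(Dense(output_dim))
    model.compile(loss="mean_squared_error", optimizer=Adam(lr=learning_rate),
                  metrics=["mean_absolute_percentage_error"])
    return model
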
In [8]:
hyperparameters_losses = {}
regularization_rate_losses = 0.0002
hyperparameters_losses['regularization_rate'] = regularization_rate_losses
learning_rate_losses = 0.0001
hyperparameters_losses['learning_rate'] = learning_rate_losses
filters_losses = [33]
hyperparameters_losses['filters'] = filters_losses
lstm_dims_losses = [78]
hyperparameters_losses['lstm_dims'] = lstm_dims_losses

dropout_rnn_losses = 0.72
dropout_cnn_losses = 0.81

nrepochs_losses = 192  # not used in this run

dim_length = window_size
dim_channels = 2         # spo2 and bpm
output_dim = number_of_predictions

model_naive = generate_DeepConvLSTM_model(dim_length, dim_channels, output_dim,
                                          filters_losses, lstm_dims_losses, learning_rate_losses,
                                          regularization_rate_losses, dropout=None,
                                          dropout_rnn=dropout_rnn_losses, dropout_cnn=dropout_cnn_losses,
                                          metrics=[mean_absolute_percentage_error])
models_naive = [(model_naive, hyperparameters_losses)]

best_model_naive, best_params_naive, best_model_metrics_naive, best_params_metrics_naive, debug = \
    find_best_architecture(train_gen, val_gen, test_gen,
                           verbose=False, number_of_models=None, nr_epochs=100,  # early stopping is off here
                           early_stopping=False, batch_size=batch_size,
                           models=models_naive, metric=mean_absolute_percentage_error, use_testset=True,
                           debug=False, test_retrain=False, output_all=True)
  

Evaluating the model

The following are RMSE and MAPE numbers for the training, validation, and test sets.

In [9]:
train_predict, val_predict, test_predict = \
    evaluate_model(train_gen, val_gen, test_gen,
                   best_model_naive)
  
training error = [0.0036666165237008155, 32698.206502582976]
  validation error = [0.003674298139863139, 5.048769897371327]
  testing error = [0.0032407409764187104, 4.717288560739471]
  

Thus, a MAPE of around 5% for the test set. That figure may sound reasonable, but it is actually meaningless for the early alarm system, whose purpose is to detect a desaturation. MAPE is a good technical measure to fit the deep neural network, and a MAPE close to 0 would obviously do the task, but it is difficult to assess how good the system is if only the MAPE is known.
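
For reference, the MAPE is the mean absolute percentage error; a standard definition looks like the sketch below (the mape helper in utils.validate_models may differ in details such as scaling). Note, by the way, that a percentage error computed on normalized values close to 0 blows up the denominator, which is likely what the huge training figure above reflects.

import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))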

The evaluation measure should actually change, and possibly the task could even be reframed as a classification problem (since all that matters is detecting a big desaturation); anyhow, let's deal with this in a follow-up post.
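
As a sketch of what such a reframing could look like, the regression output can be thresholded into a binary alarm; both the rule and the threshold below are assumptions for illustration only:

import numpy as np

def desaturation_alarm(predicted_window, threshold=0.5):
    """Raise an alarm if any predicted (normalized) spo2 value
    falls below the desaturation threshold."""
    return bool(np.min(predicted_window) < threshold)

# e.g. a 4-step prediction dipping below the threshold triggers the alarm
desaturation_alarm(np.array([0.8, 0.6, 0.45, 0.4]))   # True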

For now, let's take a look at some predictions around a big desaturation. Indeed, the detector does not even notice that anything is happening.

In [10]:
f, axs = plt.subplots(5, 3, figsize=(21, 30))
initial_prediction = 368
for index in range(0, 15):
    evaluate_plot(dataset_reduced, test_gen, test_predict,
                  target_variable=target_variable, metric=mape,
                  prediction=initial_prediction + index,
                  add_title="time " + str(index),
                  ax=axs[index // 3, index % 3])
simple_prettify(target_variable, axs)
  

Deep Learning results when rebalancing data

Thinking about it, it is actually reasonable that the results are so poor: there is quite a small number of desaturations each night, only about a couple per night on average. In consequence, there is very little training data for the events that are actually interesting.

Feeding the data "as is" trains the model with a lot of cases that are more or less flat, and the few that are important go through the training process unnoticed.

Here, the DataGenerators become really useful, since it is relatively easy to rebalance the training data, showing the model a balanced number of interesting and uninteresting examples.

Fitting the data

Let's instantiate the generators again, but this time:

  • there is a threshold below which an event is considered an interesting (big) desaturation
  • the generator will always show the model the same number of normal and big-desaturation events

The threshold is set to 0.5 for the normalized version of the spo2 curves; i.e. since the normalization is min-max, a big desaturation is defined as an spo2 level below half of its range (in practice, anything below approximately 90).

Just take a look at the DataGenerator code if you want to see how this is done, although of course this could be the subject of an additional post.
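
A minimal sketch of the idea (not the DataGenerator's actual code): label each sample as interesting when its target window dips below the threshold, then oversample the interesting ones until both groups are shown equally often.

import numpy as np

def rebalance_indices(targets, threshold=0.5, seed=0):
    """targets: array of shape (n_samples, number_of_predictions) holding the
    normalized spo2 windows to be predicted. Returns sample indices in which
    big desaturations are oversampled up to parity with normal samples."""
    rng = np.random.default_rng(seed)
    is_big = targets.min(axis=1) < threshold   # dips below half of the range
    big = np.where(is_big)[0]
    normal = np.where(~is_big)[0]
    # draw (with replacement) as many big-desaturation samples as normal ones
    big_oversampled = rng.choice(big, size=len(normal), replace=True)
    return rng.permutation(np.concatenate([normal, big_oversampled]))

# with min-max normalization the 0.5 threshold is min + 0.5 * (max - min)
# in raw units, e.g. 90 if the spo2 range is roughly 80 to 100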

In [11]:
train_gen_rebalanced = DataGenerator(dataset_reduced_std, train_names,
                                     "spo2", batch_size=batch_size,
                                     number_of_predictions=number_of_predictions,
                                     window_size=window_size,
                                     step_prediction_dates=1,
                                     rebalance_data=True, rebalance_threshold=0.5,
                                     debug=False)
val_gen_rebalanced = DataGenerator(dataset_reduced_std, val_names,
                                   "spo2", batch_size=batch_size,
                                   number_of_predictions=number_of_predictions,
                                   window_size=window_size,
                                   step_prediction_dates=1,
                                   rebalance_data=True, rebalance_threshold=0.5,
                                   debug=False)
# note that the test set is *not* rebalanced, so that it reflects real data
test_gen_rebalanced = DataGenerator(dataset_reduced_std, test_names,
                                    "spo2", batch_size=batch_size,
                                    number_of_predictions=number_of_predictions,
                                    window_size=window_size,
                                    step_prediction_dates=1,
                                    rebalance_data=False, debug=False)
  

In the code cell below, a model with the same architecture as in the naïve approach above is fitted.

In [12]:
model_rebalcd = generate_DeepConvLSTM_model(dim_length, dim_channels, output_dim,
                                            filters_losses, lstm_dims_losses, learning_rate_losses,
                                            regularization_rate_losses, dropout=None,
                                            dropout_rnn=dropout_rnn_losses, dropout_cnn=dropout_cnn_losses,
                                            metrics=[mean_absolute_percentage_error])
models_rebalcd = [(model_rebalcd, hyperparameters_losses)]

best_model_rebalanced, best_params_rebalanced, best_model_metrics_rebalanced, best_params_metrics_rebalanced, debug = \
    find_best_architecture(train_gen_rebalanced, val_gen_rebalanced, test_gen_rebalanced,
                           verbose=False, number_of_models=None, nr_epochs=500,  # let early stopping decide
                           early_stopping=True, batch_size=batch_size,
                           models=models_rebalcd, metric=mean_absolute_percentage_error, use_testset=True,
                           debug=False, test_retrain=False, output_all=True)
  

Evaluating the model

These are the MAPE results: the test error has not really improved (it is in fact slightly worse than in the naïve approach), and the validation error is high, a sign of overfitting.

In [13]:
train_predict_rebalanced, val_predict_rebalanced, test_predict_rebalanced = \
    evaluate_model(train_gen_rebalanced, val_gen_rebalanced, test_gen_rebalanced,
                   best_model_rebalanced)
  
training error = [0.026577735581315747, 1491278.3583159319]
  validation error = [0.019838075322861023, 17.076708507858946]
  testing error = [0.004912695818288557, 5.775952235111922]
  

Let's take a look at the same area as above to visually evaluate the results for a desaturation: the model is still not even noticing the event.

In [14]:
f, axs = plt.subplots(5, 3, figsize=(21, 30))
initial_prediction = 368
for index in range(0, 15):
    evaluate_plot(dataset_reduced, test_gen_rebalanced, test_predict_rebalanced,
                  target_variable=target_variable, metric=mape,
                  prediction=initial_prediction + index,
                  add_title="time " + str(index),
                  ax=axs[index // 3, index % 3])
simple_prettify(target_variable, axs)
  

Deep Learning results when augmenting the data

In order to improve accuracy, data augmentation is a common practice for Deep Learning models. The concept is easy to understand in the context of e.g. image recognition, where one wants to detect a relevant object in a photo regardless of its position, relative size, etc.; data augmentation there means applying minor modifications to each of the images of the training set, such as shifts, rotations, zooms, changes of dimensions, etc. For the time series at hand, the augmentation used below consists of shifted copies of each night (15 per night, hence the _0 to _14 suffixes in the names), as sketched right after this paragraph.
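
A minimal sketch of that idea follows; the actual 42nights_shifted.csv was produced elsewhere, so details such as the exact shifts are assumptions:

import numpy as np
import pandas as pd

def shifted_copies(night, name, n_copies=15, max_shift_steps=14, seed=0):
    """Return n_copies of one night's DataFrame, each starting a random
    number of timesteps later, renamed with _0 ... _14 suffixes to mirror
    the names used below."""
    rng = np.random.default_rng(seed)
    copies = []
    for i in range(n_copies):
        shift = int(rng.integers(0, max_shift_steps + 1))
        copy = night.iloc[shift:].copy()
        copy["name"] = name + "_" + str(i)
        copies.append(copy)
    return pd.concat(copies, ignore_index=True)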

Fitting the data

In [15]:
dataset_aug_std, dataset_aug = get_dataset_pulsi(columns,
                                                 filename='./utils/test_data/42nights_shifted.csv')

# 15 shifted copies per night, named with suffixes _0 ... _14
train_names_aug = [n + "_" + str(i) for n in train_names for i in range(0, 15)]
val_names_aug = [n + "_" + str(i) for n in val_names for i in range(0, 15)]
test_names_aug = [n + "_" + str(i) for n in test_names for i in range(0, 15)]
  
In [16]:
%%script false
# this cell is skipped (%%script false); the pre-built generators are loaded
# from pickle files in the next cell instead
train_gen_aug = DataGenerator(dataset_aug_std, train_names_aug,
                              "spo2", batch_size=batch_size,
                              number_of_predictions=number_of_predictions,
                              window_size=window_size,
                              step_prediction_dates=1,
                              rebalance_data=True, rebalance_threshold=0.5,
                              debug=False)
val_gen_aug = DataGenerator(dataset_aug_std, val_names_aug,
                            "spo2", batch_size=batch_size,
                            number_of_predictions=number_of_predictions,
                            window_size=window_size,
                            step_prediction_dates=1,
                            rebalance_data=True, rebalance_threshold=0.5,
                            debug=False)
test_gen_aug = DataGenerator(dataset_aug_std, test_names_aug,
                             "spo2", batch_size=batch_size,
                             number_of_predictions=number_of_predictions,
                             window_size=window_size,
                             step_prediction_dates=1,
                             rebalance_data=False,
                             debug=False)
  
In [17]:
# load the previously built augmented (and rebalanced) generators
with open('./train_gen.pkl', 'rb') as f:
    train_gen_aug = pickle.load(f)

with open('./val_gen.pkl', 'rb') as f:
    val_gen_aug = pickle.load(f)

with open('./test_gen.pkl', 'rb') as f:
    test_gen_aug = pickle.load(f)
  
In [18]:
model_aug = generate_DeepConvLSTM_model(dim_length, dim_channels, output_dim,
                                        filters_losses, lstm_dims_losses, learning_rate_losses,
                                        regularization_rate_losses, dropout=None,
                                        dropout_rnn=dropout_rnn_losses, dropout_cnn=dropout_cnn_losses,
                                        metrics=[mean_absolute_percentage_error])
models_aug = [(model_aug, hyperparameters_losses)]

best_model_aug, best_params_aug, best_model_metrics_aug, best_params_metrics_aug, debug = \
    find_best_architecture(train_gen_aug, val_gen_aug, test_gen_aug,
                           verbose=False, number_of_models=None, nr_epochs=500,  # let early stopping decide
                           early_stopping=True, batch_size=batch_size,
                           models=models_aug, metric=mean_absolute_percentage_error, use_testset=True,
                           debug=False, test_retrain=False, output_all=True)
  
Evaluating the model

These are the errors now: the test MAPE improves to around 3.7%, the best of the three setups, although the validation error remains high.

In [19]:
train_predict_aug, val_predict_aug, test_predict_aug = \
    evaluate_model(train_gen_aug, val_gen_aug, test_gen_aug,
                   best_model_aug)
  
training error = [0.003009021183724756, 54896.83746286145]
  validation error = [0.022678880981605264, 16.961307646746473]
  testing error = [0.0028090036815085805, 3.6640849429978166]
  
Finally, let's take a look at the predictions around a desaturation for the augmented model.

In [61]:
for initial_prediction in range(27, 28):
    f, axs = plt.subplots(5, 3, figsize=(21, 30))
    for index in range(0, 15):
        evaluate_plot(dataset_aug, test_gen_aug, test_predict_aug,
                      target_variable=target_variable, metric=mape,
                      prediction=(initial_prediction * 15) + index,
                      add_title="time " + str(index),
                      ax=axs[index // 3, index % 3])
    simple_prettify(target_variable, axs)
    plt.show()