
How to Handle Missing Timesteps in Sequence Prediction Problems with Python

It is common to have missing observations from sequence data.

Data may be corrupt or unavailable, but it is also possible that your data has variable length sequences by definition. Those sequences with fewer timesteps may be considered to have missing values.

In this tutorial, you will discover how you can handle data with missing values for sequence prediction problems in Python with the Keras deep learning library.

After completing this tutorial, you will know:

  • How to remove rows that contain a missing timestep.
  • How to mark missing timesteps and force the network to learn their meaning.
  • How to mask missing timesteps and exclude them from calculations in the model.

Let’s get started.

Photo by Steve Corey, some rights reserved.

Overview

This tutorial is divided into 3 parts; they are:

  1. Echo Sequence Prediction Problem
  2. Handling Missing Sequence Data
  3. Learning With Missing Sequence Values

Environment

This tutorial assumes you have a Python SciPy environment installed. You can use either Python 2 or 3 with this example.

This tutorial assumes you have Keras (v2.0.4+) installed with either the TensorFlow (v1.1.0+) or Theano (v0.9+) backend.

This tutorial also assumes you have scikit-learn, Pandas, NumPy, and Matplotlib installed.

If you need help setting up your Python environment, see this post:

Echo Sequence Prediction Problem

The echo problem is a contrived sequence prediction problem where the objective is to remember and predict an observation at a fixed prior timestep, called a lag observation.

The simplest case is to predict the observation from the previous timestep, that is, to echo it back. For example:

```
Time 1: Input 45
Time 2: Input 23, Output 45
Time 3: Input 73, Output 23
...
```

The question is, what do we do about timestep 1?

We can implement the echo sequence prediction problem in Python.

This involves two steps: the generation of random sequences and the transformation of random sequences into a supervised learning problem.

Generate Random Sequence

We can generate sequences of random values between 0 and 1 using the random() function in the random module.

We can put this in a function called generate_sequence() that will generate a sequence of random floating point values for the desired number of timesteps.

This function is listed below.

```python
# generate a sequence of random values
def generate_sequence(n_timesteps):
    return [random() for _ in range(n_timesteps)]
```


Frame as Supervised Learning

Sequences must be framed as a supervised learning problem when using neural networks.

That means the sequence needs to be divided into input and output pairs.

The problem can be framed as making a prediction based on a function of the current and previous timesteps.

Or more formally:

```
y(t) = f(X(t), X(t-1))
```

Where y(t) is the desired output for the current timestep, f() is the function we are seeking to approximate with our neural network, and X(t) and X(t-1) are the observations for the current and previous timesteps.

The output could be equal to the previous observation, for example, y(t) = X(t-1), but it could as easily be y(t) = X(t). The model that we train on this problem does not know the true formulation and must learn this relationship.

This mimics real sequence prediction problems where we specify the model as a function of some fixed set of sequenced timesteps, but we don’t know the actual functional relationship from past observations to the desired output value.
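To make the two candidate formulations concrete, here is a minimal sketch on a toy sequence (the values are assumed for illustration only):

```python
# toy sequence of observations X(1), X(2), X(3)
X = [0.45, 0.23, 0.73]

# echo the previous timestep: y(t) = X(t-1)
y_prev = [X[t - 1] for t in range(1, len(X))]

# echo the current timestep: y(t) = X(t)
y_curr = [X[t] for t in range(1, len(X))]

print(y_prev)  # [0.45, 0.23]
print(y_curr)  # [0.23, 0.73]
```

The model only sees input/output pairs; it must discover from data which of these relationships holds.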

We can implement this framing of the echo problem as a supervised learning problem in Python.

The Pandas shift() function can be used to create a shifted version of the sequence that can be used to represent the observations at the prior timestep. This can be concatenated with the raw sequence to provide the X(t-1) and X(t) input values.

```python
df = DataFrame(sequence)
df = concat([df.shift(1), df], axis=1)
```

We can then take the values from the Pandas DataFrame as the input sequence (X) and use the first column as the output sequence (y).

```python
# specify input and output data
X, y = values, values[:, 0]
```

Putting this all together, we can define a function called generate_data() that takes the number of timesteps as an argument and returns the X, y data for sequence learning.

```python
# generate data for the lstm
def generate_data(n_timesteps):
    # generate sequence
    sequence = generate_sequence(n_timesteps)
    sequence = array(sequence)
    # create lag
    df = DataFrame(sequence)
    df = concat([df.shift(1), df], axis=1)
    values = df.values
    # specify input and output data
    X, y = values, values[:, 0]
    return X, y
```

Sequence Problem Demonstration

We can tie the generate_sequence() and generate_data() code together into a worked example.

The complete example is listed below.

```python
from random import random
from numpy import array
from pandas import concat
from pandas import DataFrame

# generate a sequence of random values
def generate_sequence(n_timesteps):
    return [random() for _ in range(n_timesteps)]

# generate data for the lstm
def generate_data(n_timesteps):
    # generate sequence
    sequence = generate_sequence(n_timesteps)
    sequence = array(sequence)
    # create lag
    df = DataFrame(sequence)
    df = concat([df.shift(1), df], axis=1)
    values = df.values
    # specify input and output data
    X, y = values, values[:, 0]
    return X, y

# generate sequence
n_timesteps = 10
X, y = generate_data(n_timesteps)
# print sequence
for i in range(n_timesteps):
    print(X[i], '=>', y[i])
```

Running this example generates a sequence, converts it to a supervised representation, and prints each X,y pair.

```
[        nan  0.18961404] => nan
[ 0.18961404  0.25956078] => 0.189614044109
[ 0.25956078  0.30322084] => 0.259560776929
[ 0.30322084  0.72581287] => 0.303220844801
[ 0.72581287  0.02916655] => 0.725812865047
[ 0.02916655  0.88711086] => 0.0291665472554
[ 0.88711086  0.34267107] => 0.88711086298
[ 0.34267107  0.3844453 ] => 0.342671068373
[ 0.3844453   0.89759621] => 0.384445299683
[ 0.89759621  0.95278264] => 0.897596208691
```

We can see that we have NaN values on the first row.

This is because we do not have a prior observation for the first value in the sequence. We have to fill that space with something.

But we cannot fit a model with NaN inputs.
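Before choosing a strategy, it is worth checking programmatically whether the framed data contains missing values. A minimal sketch, using NumPy's isnan() on a toy X with the NaN pattern produced above (values assumed for illustration):

```python
from numpy import array, isnan, nan

# a toy X as produced by the lag framing: NaN in the first row
X = array([[nan, 0.19],
           [0.19, 0.26],
           [0.26, 0.30]])

# check for missing values before fitting a model
if isnan(X).any():
    print('X contains NaN values; handle them before training')
```

If this check fires, the rows must be removed or the NaN values replaced, as covered next.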

Handling Missing Sequence Data

There are two main ways to handle missing sequence data.

They are to remove rows with missing data and to fill the missing timesteps with another value.

For more general methods for handling missing data, see the post:

The best approach for handling missing sequence data will depend on your problem and your chosen network configuration. I would recommend exploring each method and seeing what works best.

Remove Missing Sequence Data

In the case where we are echoing the observation in the previous timestep, the first row of data does not contain any useful information.

That is, in the example above, given the input:

```
[        nan  0.18961404]
```

and the output:

```
nan
```

There is nothing meaningful that can be learned or predicted.

The best case here is to delete this row.

We can do this during the formulation of the sequence as a supervised learning problem by removing all rows that contain a NaN value. Specifically, the dropna() function can be called prior to splitting the data into X and y components.

The complete example is listed below:

```python
from random import random
from numpy import array
from pandas import concat
from pandas import DataFrame

# generate a sequence of random values
def generate_sequence(n_timesteps):
    return [random() for _ in range(n_timesteps)]

# generate data for the lstm
def generate_data(n_timesteps):
    # generate sequence
    sequence = generate_sequence(n_timesteps)
    sequence = array(sequence)
    # create lag
    df = DataFrame(sequence)
    df = concat([df.shift(1), df], axis=1)
    # remove rows with missing values
    df.dropna(inplace=True)
    values = df.values
    # specify input and output data
    X, y = values, values[:, 0]
    return X, y

# generate sequence
n_timesteps = 10
X, y = generate_data(n_timesteps)
# print sequence
for i in range(len(X)):
    print(X[i], '=>', y[i])
```

Running the example results in 9 X,y pairs instead of 10, with the first row removed.

```
[ 0.60619475  0.24408238] => 0.606194746194
[ 0.24408238  0.44873712] => 0.244082383195
[ 0.44873712  0.92939547] => 0.448737123424
[ 0.92939547  0.74481645] => 0.929395472523
[ 0.74481645  0.69891311] => 0.744816453809
[ 0.69891311  0.8420314 ] => 0.69891310578
[ 0.8420314   0.58627624] => 0.842031399202
[ 0.58627624  0.48125348] => 0.586276240292
[ 0.48125348  0.75057094] => 0.481253484036
```

Replace Missing Sequence Data

When the echo problem is configured to echo the observation at the current timestep, the first row will contain meaningful information.

For example, we can change the definition of y from values[:, 0] to values[:, 1] and re-run the demonstration to produce a sample of this problem, as follows:

```
[        nan  0.50513289] => 0.505132894821
[ 0.50513289  0.22879667] => 0.228796667421
[ 0.22879667  0.66980995] => 0.669809946421
[ 0.66980995  0.10445146] => 0.104451463568
[ 0.10445146  0.70642423] => 0.70642422679
[ 0.70642423  0.10198636] => 0.101986362328
[ 0.10198636  0.49648033] => 0.496480332278
[ 0.49648033  0.06201137] => 0.0620113728356
[ 0.06201137  0.40653087] => 0.406530870804
[ 0.40653087  0.63299264] => 0.632992635565
```

We can see that the first row is given the input:

```
[        nan  0.50513289]
```

and the output:

```
0.505132894821
```

This output could be learned from the second value in the input.

The problem is, we still have a NaN value to handle.

Instead of removing the rows with NaN values, we can replace all NaN values with a specific value that does not appear naturally in the input, such as -1. To do this, we can use the fillna() Pandas function.

The complete example is listed below:
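This listing is a sketch, assuming the same structure as the earlier dropna() example, with fillna(-1) substituted for row removal and y taken from the second column (the current timestep):

```python
from random import random
from numpy import array
from pandas import concat
from pandas import DataFrame

# generate a sequence of random values
def generate_sequence(n_timesteps):
    return [random() for _ in range(n_timesteps)]

# generate data for the lstm
def generate_data(n_timesteps):
    # generate sequence
    sequence = generate_sequence(n_timesteps)
    sequence = array(sequence)
    # create lag
    df = DataFrame(sequence)
    df = concat([df.shift(1), df], axis=1)
    # replace missing values with -1
    df.fillna(-1, inplace=True)
    values = df.values
    # specify input and output data: echo the current timestep
    X, y = values, values[:, 1]
    return X, y

# generate sequence
n_timesteps = 10
X, y = generate_data(n_timesteps)
# print sequence
for i in range(n_timesteps):
    print(X[i], '=>', y[i])
```

Running this should print all 10 X, y pairs, with the NaN in the first input replaced by -1.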