How to Develop Autoregressive Forecasting Models for Multi-Step Air Pollution Time Series Forecasting
Real-world time series forecasting is challenging for a whole host of reasons, not least of which are problem features such as having multiple input variables, the requirement to predict multiple time steps, and the need to perform the same type of prediction for multiple physical sites.
The EMC Data Science Global Hackathon dataset, or the ‘Air Quality Prediction’ dataset for short, describes weather conditions at multiple sites and requires a prediction of air quality measurements over the subsequent three days.
Before diving into sophisticated machine learning and deep learning methods for time series forecasting, it is important to find the limits of classical methods, such as developing autoregressive models using the AR or ARIMA method.
In this tutorial, you will discover how to develop autoregressive models for multi-step time series forecasting for a multivariate air pollution time series.
After completing this tutorial, you will know:
- How to analyze and impute missing values for time series data.
- How to develop and evaluate an autoregressive model for multi-step time series forecasting.
- How to improve an autoregressive model using alternate data imputation methods.
Let’s get started.
Tutorial Overview
This tutorial is divided into five parts; they are:
- Problem Description
- Model Evaluation
- Data Analysis
- Develop an Autoregressive Model
- Autoregressive Model with Global Impute Strategy
Problem Description
The Air Quality Prediction dataset describes weather conditions at multiple sites and requires a prediction of air quality measurements over the subsequent three days.
Specifically, weather observations such as temperature, pressure, wind speed, and wind direction are provided hourly for eight days for multiple sites. The objective is to predict air quality measurements for the next three days at multiple sites. The forecast lead times are not contiguous; instead, specific lead times must be forecast over the 72-hour forecast period. They are:
```
+1, +2, +3, +4, +5, +10, +17, +24, +48, +72
```
Further, the dataset is divided into disjoint but contiguous chunks of data, with eight days of data followed by three days that require a forecast.
Not all observations are available at all sites or chunks and not all output variables are available at all sites and chunks. There are large portions of missing data that must be addressed.
The dataset was used as the basis for a short duration machine learning competition (or hackathon) on the Kaggle website in 2012.
Submissions for the competition were evaluated against the true observations that were withheld from participants and scored using Mean Absolute Error (MAE). Submissions required the value of -1,000,000 to be specified in those cases where a forecast was not possible due to missing data. In fact, a template of where to insert missing values was provided and required to be adopted for all submissions (what a pain).
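As a rough illustration of this scoring scheme, the sketch below computes MAE over only those positions where both an observation and a forecast exist; the array names and the masked_mae() helper are assumptions made for illustration, not part of the competition's official scorer.

```python
from numpy import isnan

# illustrative sketch only: 'actual' and 'predicted' are assumed to be
# aligned NumPy arrays of values, with NaN marking positions where no
# observation or forecast is available
def masked_mae(actual, predicted):
	# consider only positions where both an observation and a forecast exist
	mask = ~isnan(actual) & ~isnan(predicted)
	return abs(actual[mask] - predicted[mask]).mean()
```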
A winning entrant achieved an MAE of 0.21058 on the withheld test set (private leaderboard) using random forest on lagged observations. A write-up of this solution is available in the post:
In this tutorial, we will explore how to develop autoregressive models for the problem, using naive forecasts as a baseline to determine whether a model has skill on the problem or not.
Model Evaluation
Before we can evaluate any forecasting methods, we must develop a test harness.
This includes at least how the data will be prepared and how forecasts will be evaluated.
Load Dataset
The first step is to download the dataset and load it into memory.
The dataset can be downloaded for free from the Kaggle website. You may have to create an account and log in, in order to be able to download the dataset.
Download the entire dataset (e.g. via “Download All”) to your workstation and unzip the archive in your current working directory so that you have a folder named ‘AirQualityPrediction‘.
Our focus will be the ‘TrainingData.csv‘ file that contains the training dataset, specifically data in chunks where each chunk is eight contiguous days of observations and target variables.
We can load the data file into memory using the Pandas read_csv() function and specify the header row on line 0.
```python
# load dataset
dataset = read_csv('AirQualityPrediction/TrainingData.csv', header=0)
```
We can group data by the ‘chunkID’ variable (column index 1).
First, let’s get a list of the unique chunk identifiers.
```python
chunk_ids = unique(values[:, 1])
```
We can then collect all rows for each chunk identifier and store them in a dictionary for easy access.
```python
chunks = dict()
# group rows by chunk id
for chunk_id in chunk_ids:
	selection = values[:, chunk_ix] == chunk_id
	chunks[chunk_id] = values[selection, :]
```
The to_chunks() function below takes a NumPy array of the loaded data and returns a dictionary mapping each chunk_id to the rows for that chunk.
```python
# split the dataset by 'chunkID', return a dict of id to rows
def to_chunks(values, chunk_ix=1):
	chunks = dict()
	# get the unique chunk ids
	chunk_ids = unique(values[:, chunk_ix])
	# group rows by chunk id
	for chunk_id in chunk_ids:
		selection = values[:, chunk_ix] == chunk_id
		chunks[chunk_id] = values[selection, :]
	return chunks
```
The complete example that loads the dataset and splits it into chunks is listed below.
```python
# load data and split into chunks
from numpy import unique
from pandas import read_csv

# split the dataset by 'chunkID', return a dict of id to rows
def to_chunks(values, chunk_ix=1):
	chunks = dict()
	# get the unique chunk ids
	chunk_ids = unique(values[:, chunk_ix])
	# group rows by chunk id
	for chunk_id in chunk_ids:
		selection = values[:, chunk_ix] == chunk_id
		chunks[chunk_id] = values[selection, :]
	return chunks

# load dataset
dataset = read_csv('AirQualityPrediction/TrainingData.csv', header=0)
# group data by chunks
values = dataset.values
chunks = to_chunks(values)
print('Total Chunks: %d' % len(chunks))
```
Running the example prints the number of chunks in the dataset.
```
Total Chunks: 208
```
Data Preparation
Now that we know how to load the data and split it into chunks, we can separate the chunks into train and test datasets.
Each chunk covers an interval of eight days of hourly observations, although the number of actual observations within each chunk may vary widely.
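If you want to confirm how widely the chunk lengths vary, a quick check over the chunks dictionary built in the previous section (a sketch for inspection only, not part of the tutorial's pipeline) might look like this:

```python
# report the spread of observation counts across chunks
sizes = [len(rows) for rows in chunks.values()]
print('chunks: min rows=%d, max rows=%d' % (min(sizes), max(sizes)))
```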
We can split each chunk into the first five days of observations for training and the last three for test.
Each observation has a column called ‘position_within_chunk‘ that varies from 1 to 192 (8 days * 24 hours). We can therefore take all rows with a value in this column that is less than or equal to 120 (5 * 24) as training data and any rows with a value greater than 120 as test data.
Further, any chunks that don’t have any observations in the train or test split can be dropped as not viable.
When modeling the series with autoregressive models, we are only interested in the target variables, and none of the input meteorological variables. Therefore, we can remove the input data so that the train and test data are comprised only of the 39 target variables for each chunk, as well as the position within chunk and hour of observation.
The split_train_test() function below implements this behavior; given a dictionary of chunks, it will split each into a list of train and test chunk data.
```python
# split each chunk into train/test sets
def split_train_test(chunks, row_in_chunk_ix=2):
	train, test = list(), list()
	# first 5 days of hourly observations for train
	cut_point = 5 * 24
	# enumerate chunks
	for k, rows in chunks.items():
		# split chunk rows by 'position_within_chunk'
		train_rows = rows[rows[:, row_in_chunk_ix] <= cut_point, :]
		test_rows = rows[rows[:, row_in_chunk_ix] > cut_point, :]
		if len(train_rows) == 0 or len(test_rows) == 0:
			print('>dropping chunk=%d: train=%s, test=%s' % (k, train_rows.shape, test_rows.shape))
			continue
		# store with chunk id, position in chunk, hour and all targets
		indices = [1, 2, 5] + [x for x in range(56, train_rows.shape[1])]
		train.append(train_rows[:, indices])
		test.append(test_rows[:, indices])
	return train, test
```
We do not require the entire test dataset; instead, we only require the observations at specific lead times over the three-day period, specifically the lead times:
```
+1, +2, +3, +4, +5, +10, +17, +24, +48, +72
```
Each lead time is relative to the end of the training period; for example, lead time +1 corresponds to a ‘position_within_chunk‘ value of 121.
First, we can put these lead times into a function for easy reference:
```python
# return a list of relative forecast lead times
def get_lead_times():
	return [1, 2, 3, 4, 5, 10, 17, 24, 48, 72]
```
Next, we can reduce the test dataset down to just the data at the preferred lead times.
We can do that by looking at the ‘position_within_chunk‘ column and using the lead time as an offset from the end of the training dataset, e.g. 120 + 1, 120 + 2, etc.
If we find a matching row in the test set, it is saved, otherwise a row of NaN observations is generated.
The function to_forecasts() below implements this and returns a NumPy array with one row for each forecast lead time for each chunk.
```python
# convert the rows in a test chunk to forecasts
def to_forecasts(test_chunks, row_in_chunk_ix=1):
	# get lead times
	lead_times = get_lead_times()
	# first 5 days of hourly observations for train
	cut_point = 5 * 24
	forecasts = list()
	# enumerate each chunk
	for rows in test_chunks:
		chunk_id = rows[0, 0]
		# enumerate each lead time
		for tau in lead_times:
			# determine the row in chunk we want for the lead time
			offset = cut_point + tau
			# retrieve data for the lead time using row number in chunk
			row_for_tau = rows[rows[:, row_in_chunk_ix] == offset, :]
			# check if we have data
			if len(row_for_tau) == 0:
				# create a mock row [chunk, position, hour] + [nan...]
				row = [chunk_id, offset, nan] + [nan for _ in range(39)]
				forecasts.append(row)
			else:
				# store the forecast row
				forecasts.append(row_for_tau[0])
	return array(forecasts)
```
We can tie all of this together and split the dataset into train and test sets and save the results to new files.
The complete code example is listed below.
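The sketch below reassembles the functions defined in the previous sections into one script, assuming the training chunks are flattened into rows and the prepared datasets are saved as ‘naive_train.csv‘ and ‘naive_test.csv‘ in the dataset directory; the flattening step and the output filenames are assumptions made for illustration.

```python
# split data into train and test sets and save the results
from numpy import unique
from numpy import nan
from numpy import array
from numpy import savetxt
from pandas import read_csv

# split the dataset by 'chunkID', return a dict of id to rows
def to_chunks(values, chunk_ix=1):
	chunks = dict()
	# get the unique chunk ids
	chunk_ids = unique(values[:, chunk_ix])
	# group rows by chunk id
	for chunk_id in chunk_ids:
		selection = values[:, chunk_ix] == chunk_id
		chunks[chunk_id] = values[selection, :]
	return chunks

# split each chunk into train/test sets
def split_train_test(chunks, row_in_chunk_ix=2):
	train, test = list(), list()
	# first 5 days of hourly observations for train
	cut_point = 5 * 24
	# enumerate chunks
	for k, rows in chunks.items():
		# split chunk rows by 'position_within_chunk'
		train_rows = rows[rows[:, row_in_chunk_ix] <= cut_point, :]
		test_rows = rows[rows[:, row_in_chunk_ix] > cut_point, :]
		if len(train_rows) == 0 or len(test_rows) == 0:
			print('>dropping chunk=%d: train=%s, test=%s' % (k, train_rows.shape, test_rows.shape))
			continue
		# store with chunk id, position in chunk, hour and all targets
		indices = [1, 2, 5] + [x for x in range(56, train_rows.shape[1])]
		train.append(train_rows[:, indices])
		test.append(test_rows[:, indices])
	return train, test

# return a list of relative forecast lead times
def get_lead_times():
	return [1, 2, 3, 4, 5, 10, 17, 24, 48, 72]

# convert the rows in a test chunk to forecasts
def to_forecasts(test_chunks, row_in_chunk_ix=1):
	# get lead times
	lead_times = get_lead_times()
	# first 5 days of hourly observations for train
	cut_point = 5 * 24
	forecasts = list()
	# enumerate each chunk
	for rows in test_chunks:
		chunk_id = rows[0, 0]
		# enumerate each lead time
		for tau in lead_times:
			# determine the row in chunk we want for the lead time
			offset = cut_point + tau
			# retrieve data for the lead time using row number in chunk
			row_for_tau = rows[rows[:, row_in_chunk_ix] == offset, :]
			if len(row_for_tau) == 0:
				# create a mock row [chunk, position, hour] + [nan...]
				row = [chunk_id, offset, nan] + [nan for _ in range(39)]
				forecasts.append(row)
			else:
				# store the forecast row
				forecasts.append(row_for_tau[0])
	return array(forecasts)

# load dataset
dataset = read_csv('AirQualityPrediction/TrainingData.csv', header=0)
# group data by chunks
values = dataset.values
chunks = to_chunks(values)
# split into train/test chunks
train, test = split_train_test(chunks)
# flatten training chunks to rows (assumed representation)
train_rows = array([row for rows in train for row in rows])
print('Train Rows: %s' % str(train_rows.shape))
# reduce the test chunks to forecast rows at the chosen lead times
test_rows = to_forecasts(test)
print('Test Rows: %s' % str(test_rows.shape))
# save datasets (filenames are illustrative)
savetxt('AirQualityPrediction/naive_train.csv', train_rows, delimiter=',')
savetxt('AirQualityPrediction/naive_test.csv', test_rows, delimiter=',')
```

Running a script like this prints the shapes of the prepared train and test arrays and writes the two CSV files into the ‘AirQualityPrediction‘ directory for use in the following sections.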