How to Develop Autoregressive Forecasting Models for Multi-Step Air Pollution Time Series Forecasting
Real-world time series forecasting is challenging for a whole host of reasons, not least of which are problem features such as having multiple input variables, the requirement to predict multiple time steps, and the need to perform the same type of prediction for multiple physical sites.
The EMC Data Science Global Hackathon dataset, or the ‘Air Quality Prediction’ dataset for short, describes weather conditions at multiple sites and requires a prediction of air quality measurements over the subsequent three days.
Before diving into sophisticated machine learning and deep learning methods for time series forecasting, it is important to find the limits of classical methods, such as developing autoregressive models using the AR or ARIMA method.
In this tutorial, you will discover how to develop autoregressive models for multi-step time series forecasting for a multivariate air pollution time series.
After completing this tutorial, you will know:
- How to analyze and impute missing values for time series data.
- How to develop and evaluate an autoregressive model for multi-step time series forecasting.
- How to improve an autoregressive model using alternate data imputation methods.
Let’s get started.
Tutorial Overview
This tutorial is divided into five parts; they are:
- Problem Description
- Model Evaluation
- Data Analysis
- Develop an Autoregressive Model
- Autoregressive Model with Global Impute Strategy
Problem Description
The Air Quality Prediction dataset describes weather conditions at multiple sites and requires a prediction of air quality measurements over the subsequent three days.
Specifically, weather observations such as temperature, pressure, wind speed, and wind direction are provided hourly for eight days for multiple sites. The objective is to predict air quality measurements for the next three days at multiple sites. The forecast lead times are not contiguous; instead, specific lead times must be forecast over the 72-hour forecast period. They are:
```
+1, +2, +3, +4, +5, +10, +17, +24, +48, +72
```
Further, the dataset is divided into disjoint but contiguous chunks of data, with eight days of data followed by three days that require a forecast.
Not all observations are available at all sites or chunks and not all output variables are available at all sites and chunks. There are large portions of missing data that must be addressed.
The dataset was used as the basis for a short duration machine learning competition (or hackathon) on the Kaggle website in 2012.
Submissions for the competition were evaluated against the true observations that were withheld from participants and scored using Mean Absolute Error (MAE). Submissions required the value of -1,000,000 to be specified in those cases where a forecast was not possible due to missing data. In fact, a template of where to insert missing values was provided and required to be adopted for all submissions (what a pain).
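As a rough illustration of this scoring scheme, the sketch below computes MAE over only those positions where both an observation and a forecast exist; the array names and the masked_mae() helper are assumptions made for illustration, not part of the competition's official scorer.

```python
from numpy import isnan

# illustrative sketch only: 'actual' and 'predicted' are assumed to be
# aligned NumPy arrays of values, with NaN marking positions where no
# observation or forecast is available
def masked_mae(actual, predicted):
	# consider only positions where both an observation and a forecast exist
	mask = ~isnan(actual) & ~isnan(predicted)
	return abs(actual[mask] - predicted[mask]).mean()
```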
A winning entrant achieved an MAE of 0.21058 on the withheld test set (private leaderboard) using random forest on lagged observations. A write-up of this solution is available in the post:
In this tutorial, we will explore how to develop autoregressive models for the problem, using naive forecasts as a baseline to determine whether a model has skill on the problem or not.
Model Evaluation
Before we can evaluate any forecasting methods, we must develop a test harness.
This includes at least how the data will be prepared and how forecasts will be evaluated.
Load Dataset
The first step is to download the dataset and load it into memory.
The dataset can be downloaded for free from the Kaggle website. You may have to create an account and log in, in order to be able to download the dataset.
Download the entire dataset (e.g. via “Download All”) to your workstation and unzip the archive in your current working directory so that you have a folder named ‘AirQualityPrediction‘.
Our focus will be the ‘TrainingData.csv‘ file that contains the training dataset, specifically data in chunks where each chunk is eight contiguous days of observations and target variables.
We can load the data file into memory using the Pandas read_csv() function and specify the header row on line 0.
```python
# load dataset
dataset = read_csv('AirQualityPrediction/TrainingData.csv', header=0)
```
We can group data by the ‘chunkID’ variable (column index 1).
First, let’s get a list of the unique chunk identifiers.
```python
chunk_ids = unique(values[:, 1])
```
We can then collect all rows for each chunk identifier and store them in a dictionary for easy access.
```python
chunks = dict()
# group rows by chunk id
for chunk_id in chunk_ids:
	selection = values[:, chunk_ix] == chunk_id
	chunks[chunk_id] = values[selection, :]
```
The to_chunks() function below takes a NumPy array of the loaded data and returns a dictionary mapping each chunk_id to the rows for that chunk.
```python
# split the dataset by 'chunkID', return a dict of id to rows
def to_chunks(values, chunk_ix=1):
	chunks = dict()
	# get the unique chunk ids
	chunk_ids = unique(values[:, chunk_ix])
	# group rows by chunk id
	for chunk_id in chunk_ids:
		selection = values[:, chunk_ix] == chunk_id
		chunks[chunk_id] = values[selection, :]
	return chunks
```
The complete example that loads the dataset and splits it into chunks is listed below.
```python
# load data and split into chunks
from numpy import unique
from pandas import read_csv

# split the dataset by 'chunkID', return a dict of id to rows
def to_chunks(values, chunk_ix=1):
	chunks = dict()
	# get the unique chunk ids
	chunk_ids = unique(values[:, chunk_ix])
	# group rows by chunk id
	for chunk_id in chunk_ids:
		selection = values[:, chunk_ix] == chunk_id
		chunks[chunk_id] = values[selection, :]
	return chunks

# load dataset
dataset = read_csv('AirQualityPrediction/TrainingData.csv', header=0)
# group data by chunks
values = dataset.values
chunks = to_chunks(values)
print('Total Chunks: %d' % len(chunks))
```
Running the example prints the number of chunks in the dataset.
```
Total Chunks: 208
```
Data Preparation
Now that we know how to load the data and split it into chunks, we can separate the chunks into train and test datasets.
Each chunk covers an interval of eight days of hourly observations, although the number of actual observations within each chunk may vary widely.
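If you want to confirm how widely the chunk lengths vary, a quick check over the chunks dictionary built in the previous section (a sketch for inspection only, not part of the tutorial's pipeline) might look like this:

```python
# report the spread of observation counts across chunks
sizes = [len(rows) for rows in chunks.values()]
print('chunks: min rows=%d, max rows=%d' % (min(sizes), max(sizes)))
```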
We can split each chunk into the first five days of observations for training and the last three for test.
Each observation has a column called ‘position_within_chunk‘ that varies from 1 to 192 (8 days * 24 hours). We can therefore take all rows with a value in this column that is less than or equal to 120 (5 * 24) as training data and any rows with a value greater than 120 as test data.
Further, any chunks that don’t have any observations in the train or test split can be dropped as not viable.
When modeling the series with autoregressive models, we are only interested in the target variables, and none of the input meteorological variables. Therefore, we can remove the input data so that the train and test data are comprised only of the 39 target variables for each chunk, as well as the position within chunk and hour of observation.
The split_train_test() function below implements this behavior; given a dictionary of chunks, it will split each into a list of train and test chunk data.
```python
# split each chunk into train/test sets
def split_train_test(chunks, row_in_chunk_ix=2):
	train, test = list(), list()
	# first 5 days of hourly observations for train
	cut_point = 5 * 24
	# enumerate chunks
	for k, rows in chunks.items():
		# split chunk rows by 'position_within_chunk'
		train_rows = rows[rows[:, row_in_chunk_ix] <= cut_point, :]
		test_rows = rows[rows[:, row_in_chunk_ix] > cut_point, :]
		if len(train_rows) == 0 or len(test_rows) == 0:
			print('>dropping chunk=%d: train=%s, test=%s' % (k, train_rows.shape, test_rows.shape))
			continue
		# store with chunk id, position in chunk, hour and all targets
		indices = [1, 2, 5] + [x for x in range(56, train_rows.shape[1])]
		train.append(train_rows[:, indices])
		test.append(test_rows[:, indices])
	return train, test
```
We do not require the entire test dataset; instead, we only require the observations at specific lead times over the three-day period, specifically the lead times:
```
+1, +2, +3, +4, +5, +10, +17, +24, +48, +72
```
Each lead time is relative to the end of the training period; for example, lead time +1 corresponds to a ‘position_within_chunk‘ value of 121.
First, we can put these lead times into a function for easy reference:
```python
# return a list of relative forecast lead times
def get_lead_times():
	return [1, 2, 3, 4, 5, 10, 17, 24, 48, 72]
```
Next, we can reduce the test dataset down to just the data at the preferred lead times.
We can do that by looking at the ‘position_within_chunk‘ column and using the lead time as an offset from the end of the training dataset, e.g. 120 + 1, 120 + 2, etc.
If we find a matching row in the test set, it is saved, otherwise a row of NaN observations is generated.
The function to_forecasts() below implements this and returns a NumPy array with one row for each forecast lead time for each chunk.
```python
# convert the rows in a test chunk to forecasts
def to_forecasts(test_chunks, row_in_chunk_ix=1):
	# get lead times
	lead_times = get_lead_times()
	# first 5 days of hourly observations for train
	cut_point = 5 * 24
	forecasts = list()
	# enumerate each chunk
	for rows in test_chunks:
		chunk_id = rows[0, 0]
		# enumerate each lead time
		for tau in lead_times:
			# determine the row in chunk we want for the lead time
			offset = cut_point + tau
			# retrieve data for the lead time using row number in chunk
			row_for_tau = rows[rows[:, row_in_chunk_ix] == offset, :]
			# check if we have data
			if len(row_for_tau) == 0:
				# create a mock row [chunk, position, hour] + [nan...]
				row = [chunk_id, offset, nan] + [nan for _ in range(39)]
				forecasts.append(row)
			else:
				# store the forecast row
				forecasts.append(row_for_tau[0])
	return array(forecasts)
```
We can tie all of this together and split the dataset into train and test sets and save the results to new files.
The complete code example is listed below.
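The sketch below reassembles the functions defined in the previous sections into one script, assuming the training chunks are flattened into rows and the prepared datasets are saved as ‘naive_train.csv‘ and ‘naive_test.csv‘ in the dataset directory; the flattening step and the output filenames are assumptions made for illustration.

```python
# split data into train and test sets and save the results
from numpy import unique
from numpy import nan
from numpy import array
from numpy import savetxt
from pandas import read_csv

# split the dataset by 'chunkID', return a dict of id to rows
def to_chunks(values, chunk_ix=1):
	chunks = dict()
	# get the unique chunk ids
	chunk_ids = unique(values[:, chunk_ix])
	# group rows by chunk id
	for chunk_id in chunk_ids:
		selection = values[:, chunk_ix] == chunk_id
		chunks[chunk_id] = values[selection, :]
	return chunks

# split each chunk into train/test sets
def split_train_test(chunks, row_in_chunk_ix=2):
	train, test = list(), list()
	# first 5 days of hourly observations for train
	cut_point = 5 * 24
	# enumerate chunks
	for k, rows in chunks.items():
		# split chunk rows by 'position_within_chunk'
		train_rows = rows[rows[:, row_in_chunk_ix] <= cut_point, :]
		test_rows = rows[rows[:, row_in_chunk_ix] > cut_point, :]
		if len(train_rows) == 0 or len(test_rows) == 0:
			print('>dropping chunk=%d: train=%s, test=%s' % (k, train_rows.shape, test_rows.shape))
			continue
		# store with chunk id, position in chunk, hour and all targets
		indices = [1, 2, 5] + [x for x in range(56, train_rows.shape[1])]
		train.append(train_rows[:, indices])
		test.append(test_rows[:, indices])
	return train, test

# return a list of relative forecast lead times
def get_lead_times():
	return [1, 2, 3, 4, 5, 10, 17, 24, 48, 72]

# convert the rows in a test chunk to forecasts
def to_forecasts(test_chunks, row_in_chunk_ix=1):
	# get lead times
	lead_times = get_lead_times()
	# first 5 days of hourly observations for train
	cut_point = 5 * 24
	forecasts = list()
	# enumerate each chunk
	for rows in test_chunks:
		chunk_id = rows[0, 0]
		# enumerate each lead time
		for tau in lead_times:
			# determine the row in chunk we want for the lead time
			offset = cut_point + tau
			# retrieve data for the lead time using row number in chunk
			row_for_tau = rows[rows[:, row_in_chunk_ix] == offset, :]
			if len(row_for_tau) == 0:
				# create a mock row [chunk, position, hour] + [nan...]
				row = [chunk_id, offset, nan] + [nan for _ in range(39)]
				forecasts.append(row)
			else:
				# store the forecast row
				forecasts.append(row_for_tau[0])
	return array(forecasts)

# load dataset
dataset = read_csv('AirQualityPrediction/TrainingData.csv', header=0)
# group data by chunks
values = dataset.values
chunks = to_chunks(values)
# split into train/test chunks
train, test = split_train_test(chunks)
# flatten training chunks to rows (assumed representation)
train_rows = array([row for rows in train for row in rows])
print('Train Rows: %s' % str(train_rows.shape))
# reduce the test chunks to forecast rows at the chosen lead times
test_rows = to_forecasts(test)
print('Test Rows: %s' % str(test_rows.shape))
# save datasets (filenames are illustrative)
savetxt('AirQualityPrediction/naive_train.csv', train_rows, delimiter=',')
savetxt('AirQualityPrediction/naive_test.csv', test_rows, delimiter=',')
```

Running a script like this prints the shapes of the prepared train and test arrays and writes the two CSV files into the ‘AirQualityPrediction‘ directory for use in the following sections.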