Time Series Forecast Case Study with Python: Annual Water Usage in Baltimore
Time series forecasting is a process, and the only way to get good forecasts is to practice this process.
In this tutorial, you will discover how to forecast the annual water usage in Baltimore with Python.
Working through this tutorial will provide you with a framework for the steps and the tools for working through your own time series forecasting problems.
After completing this tutorial, you will know:
- How to confirm your Python environment and carefully define a time series forecasting problem.
- How to create a test harness for evaluating models, develop a baseline forecast, and better understand your problem with the tools of time series analysis.
- How to develop an autoregressive integrated moving average model, save it to file, and later load it to make predictions for new time steps.
Let’s get started.
Overview
In this tutorial, we will work through a time series forecasting project from end-to-end, from downloading the dataset and defining the problem to training a final model and making predictions.
This project is not exhaustive, but shows how you can get good results quickly by working through a time series forecasting problem systematically.
The steps of this project that we will work through are as follows.
- Environment.
- Problem Description.
- Test Harness.
- Persistence.
- Data Analysis.
- ARIMA Models.
- Model Validation.
This will provide a template for working through a time series prediction problem that you can use on your own dataset.
1. Environment
This tutorial assumes an installed and working SciPy environment and dependencies, including:
- SciPy
- NumPy
- Matplotlib
- Pandas
- scikit-learn
- statsmodels
If you need help installing Python and the SciPy environment on your workstation, consider the Anaconda distribution that manages much of it for you.
This script will help you check your installed versions of these libraries.
```python
# scipy
import scipy
print('scipy: %s' % scipy.__version__)
# numpy
import numpy
print('numpy: %s' % numpy.__version__)
# matplotlib
import matplotlib
print('matplotlib: %s' % matplotlib.__version__)
# pandas
import pandas
print('pandas: %s' % pandas.__version__)
# scikit-learn
import sklearn
print('sklearn: %s' % sklearn.__version__)
# statsmodels
import statsmodels
print('statsmodels: %s' % statsmodels.__version__)
```
The results on my workstation used to write this tutorial are as follows:
```
scipy: 0.18.1
numpy: 1.11.2
matplotlib: 1.5.3
pandas: 0.19.1
sklearn: 0.18.1
statsmodels: 0.6.1
```
2. Problem Description
The problem is to predict annual water usage.
The dataset provides the annual water usage in Baltimore from 1885 to 1963, or 79 years of data.
The values are in the units of liters per capita per day, and there are 79 observations.
The dataset is credited to Hipel and McLeod, 1994.
Download the dataset as a CSV file and place it in your current working directory with the filename “water.csv“.
3. Test Harness
We must develop a test harness to investigate the data and evaluate candidate models.
This involves two steps:
- Defining a Validation Dataset.
- Developing a Method for Model Evaluation.
3.1 Validation Dataset
The dataset is not current. This means that we cannot easily collect updated data to validate the model.
Therefore, we will pretend that it is 1953 and withhold the last 10 years of data from analysis and model selection.
This final decade of data will be used to validate the final model.
The code below will load the dataset as a Pandas Series and split into two, one for model development (dataset.csv) and the other for validation (validation.csv).
```python
from pandas import Series
series = Series.from_csv('water.csv', header=0)
split_point = len(series) - 10
dataset, validation = series[0:split_point], series[split_point:]
print('Dataset %d, Validation %d' % (len(dataset), len(validation)))
dataset.to_csv('dataset.csv')
validation.to_csv('validation.csv')
```
Running the example creates two files and prints the number of observations in each.
```
Dataset 69, Validation 10
```
The specific contents of these files are:
- dataset.csv: Observations from 1885 to 1953 (69 observations).
- validation.csv: Observations from 1954 to 1963 (10 observations).
The validation dataset is about 12% of the original dataset.
Note that the saved datasets do not have a header line; therefore we do not need any special handling when working with these files later.
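As an aside, the `Series.from_csv` function used in the listings below was removed in pandas 1.0. If you are on a newer pandas, the headerless files can be loaded with `read_csv` instead. A minimal sketch, with a few illustrative year,value rows standing in for the file contents:

```python
from io import StringIO
from pandas import read_csv

# Illustrative rows in the same year,value layout as dataset.csv;
# in practice, pass the filename instead of this in-memory buffer.
data = StringIO('1885,356\n1886,386\n1887,397\n')

# header=None because the saved files have no header line;
# index_col=0 makes the year the index; squeeze() collapses the
# single remaining data column into a Series.
series = read_csv(data, header=None, index_col=0).squeeze('columns')
print(series)
```

With the real file, the load line becomes `series = read_csv('dataset.csv', header=None, index_col=0).squeeze('columns')`.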
3.2. Model Evaluation
Model evaluation will only be performed on the data in dataset.csv prepared in the previous section.
Model evaluation involves two elements:
- Performance Measure.
- Test Strategy.
3.2.1 Performance Measure
We will evaluate the performance of predictions using the root mean squared error (RMSE). This will give more weight to predictions that are grossly wrong and will have the same units as the original data.
Any transforms to the data must be reversed before the RMSE is calculated and reported to make the performance between different methods directly comparable.
We can calculate the RMSE using the helper function from the scikit-learn library mean_squared_error() that calculates the mean squared error between a list of expected values (the test set) and the list of predictions. We can then take the square root of this value to give us a RMSE score.
For example:
```python
from sklearn.metrics import mean_squared_error
from math import sqrt
...
test = ...
predictions = ...
mse = mean_squared_error(test, predictions)
rmse = sqrt(mse)
print('RMSE: %.3f' % rmse)
```
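As a concrete illustration of reversing a transform before scoring: if a model were fit on log-transformed data, its raw outputs would be on the log scale and must be inverted with `exp()` before computing the RMSE. A minimal sketch with made-up values (the numbers are illustrative, not from the dataset):

```python
from math import sqrt, log, exp
from sklearn.metrics import mean_squared_error

# Illustrative values only: pretend the model was fit on
# log-transformed data, so its predictions are on the log scale.
test = [562.0, 598.0, 575.0]
log_predictions = [log(560.0), log(590.0), log(580.0)]

# Invert the transform (exp undoes log) before scoring, so the RMSE
# is in the original units of liters per capita per day.
predictions = [exp(p) for p in log_predictions]
rmse = sqrt(mean_squared_error(test, predictions))
print('RMSE: %.3f' % rmse)
```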
3.2.2 Test Strategy
Candidate models will be evaluated using walk-forward validation.
This is because the problem definition requires a rolling-forecast type of model, where one-step forecasts are made using all available data.
The walk-forward validation will work as follows:
- The first 50% of the dataset will be held back to train the model.
- The remaining 50% of the dataset will be iterated over to test the model.
- For each step in the test dataset:
- A model will be trained.
- A one-step prediction made and the prediction stored for later evaluation.
- The actual observation from the test dataset will be added to the training dataset for the next iteration.
- The predictions made during the enumeration of the test dataset will be evaluated and an RMSE score reported.
Given the small size of the data, we will allow a model to be re-trained given all available data prior to each prediction.
We can write the code for the test harness using simple NumPy and Python code.
First, we can split the dataset into train and test sets directly. We are careful to always convert a loaded dataset to float32 in case the loaded data still contains string or integer types.
```python
# prepare data
X = series.values
X = X.astype('float32')
train_size = int(len(X) * 0.50)
train, test = X[0:train_size], X[train_size:]
```
Next, we can iterate over the time steps in the test dataset. The train dataset is stored in a Python list as we need to easily append a new observation each iteration and NumPy array concatenation feels like overkill.
The prediction made by the model is called yhat for convention, as the outcome or observation is referred to as y and yhat (a ‘y‘ with a mark above) is the mathematical notation for the prediction of the y variable.
The prediction and observation are printed at each step as a sanity check, in case there are issues with the model.
```python
# walk-forward validation
history = [x for x in train]
predictions = list()
for i in range(len(test)):
    # predict
    yhat = ...
    predictions.append(yhat)
    # observation
    obs = test[i]
    history.append(obs)
    print('>Predicted=%.3f, Expected=%3.f' % (yhat, obs))
```
4. Persistence
The first step before getting bogged down in data analysis and modeling is to establish a baseline of performance.
This will provide both a template for evaluating models using the proposed test harness and a performance measure by which all more elaborate predictive models can be compared.
The baseline prediction for time series forecasting is called the naive forecast, or persistence.
This is where the observation from the previous time step is used as the prediction for the observation at the next time step.
We can plug this directly into the test harness defined in the previous section.
The complete code listing is provided below.
```python
from pandas import Series
from sklearn.metrics import mean_squared_error
from math import sqrt
# load data
series = Series.from_csv('dataset.csv')
# prepare data
X = series.values
X = X.astype('float32')
train_size = int(len(X) * 0.50)
train, test = X[0:train_size], X[train_size:]
# walk-forward validation
history = [x for x in train]
predictions = list()
for i in range(len(test)):
    # predict
    yhat = history[-1]
    predictions.append(yhat)
    # observation
    obs = test[i]
    history.append(obs)
    print('>Predicted=%.3f, Expected=%3.f' % (yhat, obs))
# report performance
mse = mean_squared_error(test, predictions)
rmse = sqrt(mse)
print('RMSE: %.3f' % rmse)
```
Running the test harness prints the prediction and observation for each iteration of the test dataset.
The example ends by printing the RMSE for the model.
In this case, we can see that the persistence model achieved an RMSE of 21.975. This means that on average, the model was wrong by about 22 liters per capita per day for each prediction made.
```
...
>Predicted=613.000, Expected=598
>Predicted=598.000, Expected=575
>Predicted=575.000, Expected=564
>Predicted=564.000, Expected=549
>Predicted=549.000, Expected=538
RMSE: 21.975
```
We now have a baseline prediction method and performance; now we can start digging into our data.
5. Data Analysis
We can use summary statistics and plots of the data to quickly learn more about the structure of the prediction problem.
In this section, we will look at the data from four perspectives:
- Summary Statistics.
- Line Plot.
- Density Plots.
- Box and Whisker Plot.
5.1. Summary Statistics
Summary statistics provide a quick look at the limits of observed values. They can help to give a quick idea of what we are working with.
The example below calculates and prints summary statistics for the time series.
```python
from pandas import Series
series = Series.from_csv('dataset.csv')
print(series.describe())
```
Running the example provides a number of summary statistics to review.
Some observations from these statistics include:
- The number of observations (count) matches our expectation, meaning we are handling the data correctly.
- The mean is about 500, which we might consider our level in this series.
- The standard deviation and percentiles suggest a reasonably tight spread around the mean.
```
count     69.000000
mean     500.478261
std       73.901685
min      344.000000
25%      458.000000
50%      492.000000
75%      538.000000
max      662.000000
```
5.2. Line Plot
A line plot of a time series dataset can provide a lot of insight into the problem.
The example below creates and shows a line plot of the dataset.
```python
from pandas import Series
from matplotlib import pyplot
series = Series.from_csv('dataset.csv')
series.plot()
pyplot.show()
```
Run the example and review the plot. Note any obvious temporal structures in the series.
Some observations from the plot include:
- There looks to be an increasing trend in water usage over time.
- There do not appear to be any obvious outliers, although there are some large fluctuations.
- There is a downward trend for the last few years of the series.
There may be some benefit in explicitly modeling the trend component and removing it. You may also explore using differencing with one or two levels in order to make the series stationary.
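First-order differencing replaces each value with its change from the previous time step, which removes a linear trend; applying it again gives second-order differencing. A minimal sketch in plain Python, with illustrative values rather than the actual dataset:

```python
# Illustrative observations only (not the real water usage values).
values = [344.0, 360.0, 385.0, 397.0, 420.0]

# First-order difference: change from the previous time step.
diff1 = [values[i] - values[i - 1] for i in range(1, len(values))]

# Second-order difference: the same operation applied to diff1.
diff2 = [diff1[i] - diff1[i - 1] for i in range(1, len(diff1))]

print(diff1)  # each entry is one step's change
print(diff2)
```

Note that each level of differencing shortens the series by one observation, and any forecasts made on the differenced scale must be inverted before RMSE is calculated.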
5.3. Density Plot
Reviewing plots of the density of observations can provide further insight into the structure of the data.
The example below creates a histogram and density plot of the observations without any temporal structure.
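A sketch of this example, following the loading and plotting style used elsewhere in this tutorial; the in-memory values below are illustrative stand-ins, and in practice you would load dataset.csv as in the earlier listings:

```python
from pandas import Series
from matplotlib import pyplot

# Illustrative values standing in for the dataset; in practice use
# series = Series.from_csv('dataset.csv')
series = Series([356.0, 386.0, 397.0, 420.0, 458.0, 492.0, 538.0, 662.0])

pyplot.figure(1)
pyplot.subplot(211)
series.hist()            # histogram of observed values
pyplot.subplot(212)
series.plot(kind='kde')  # kernel density estimate of the same values
pyplot.show()
```

Reviewing the two plots together can suggest whether the distribution is roughly Gaussian or skewed, which in turn hints at whether a power transform might help.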