A Simple Example of Predicting Bike Rentals

阿新 • Published: 2018-12-15

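The bike dataset gives the number of bike rentals for every day of August 2015, counted in 3-hour intervals. The task is to predict the number of rentals for a given date and time.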

1. Loading the data

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import mglearn

citibike = mglearn.datasets.load_citibike()

print("Citibike data:\n{}".format(citibike.head()))

The following plots the number of rentals over the whole month:

plt.figure(figsize=(10, 3))
xticks = pd.date_range(start=citibike.index.min(), end=citibike.index.max(),
                       freq='D')
plt.plot(citibike, linewidth=1)
# set the ticks after plotting so they use the same date units as the plotted axis
plt.xticks(xticks, xticks.strftime("%a %m-%d"), rotation=90, ha="left")
plt.xlabel("Date")
plt.ylabel("Rentals")

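When evaluating a prediction task on a time series, we usually want to learn from the past and predict the future. When splitting the data into training and test sets, we therefore use all data before a particular date as the training set and all data after that date as the test set. Here we use the first 184 data points (the first 23 days) as the training set and the remaining 64 data points (the last 8 days) as the test set.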
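The split sizes above can also be derived from a cutoff date rather than a hard-coded count. The following is a minimal sketch, assuming a synthetic 3-hourly index for August 2015 stands in for the real `citibike.index`; the names `index`, `cutoff`, `n_train_sketch` and `n_test_sketch` are illustrative and do not appear in the original post.

```python
import pandas as pd

# Hypothetical stand-in for citibike.index: one point every 3 hours in August 2015.
index = pd.date_range(start="2015-08-01 00:00", end="2015-08-31 21:00",
                      freq=pd.Timedelta(hours=3))

# Everything strictly before the cutoff is training data; the rest is test data.
cutoff = pd.Timestamp("2015-08-24")
n_train_sketch = int((index < cutoff).sum())
n_test_sketch = len(index) - n_train_sketch

print(n_train_sketch, n_test_sketch)
```

With 8 intervals per day, 23 training days and 8 test days give the 184 and 64 points quoted above.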

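In this prediction task, the only feature we use is the date and time at which a given number of rentals was recorded. A common way to store dates on a computer is POSIX time, the number of seconds elapsed since 00:00:00 UTC on January 1, 1970.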
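To make the POSIX representation concrete, here is a small sketch that converts a single timestamp to seconds since the epoch and back. The timestamp is an arbitrary example, and the names `ts`, `posix_seconds` and `restored` are illustrative; the timestamp is assumed to be in UTC because it carries no timezone.

```python
import pandas as pd

# An example timestamp (treated as UTC, since it has no timezone attached).
ts = pd.Timestamp("2015-08-01 03:00:00")

# pandas stores timestamps as nanoseconds since the epoch;
# integer division by 10**9 turns that into whole seconds.
posix_seconds = ts.value // 10**9

# Converting back recovers the original timestamp.
restored = pd.to_datetime(posix_seconds, unit="s")

print(posix_seconds, restored)
```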

2. Training a model on the POSIX time feature

First, extract the target values and the feature:

# extract the target values (number of rentals)
y = citibike.values
# convert to POSIX time by dividing by 10**9
X = citibike.index.astype("int64").values.reshape(-1, 1) // 10**9

Next, define a function eval_on_features(features, target, regressor) that splits the data into a training set and a test set, builds the model, and visualizes the result.

# use the first 184 data points for training, the rest for testing
n_train = 184

# function to evaluate and plot a regressor on a given feature set
def eval_on_features(features, target, regressor):
    # split the given features into a training and a test set
    X_train, X_test = features[:n_train], features[n_train:]
    # also split the target array 
    y_train, y_test = target[:n_train], target[n_train:]
    regressor.fit(X_train, y_train)
    print("Test-set R^2: {:.2f}".format(regressor.score(X_test, y_test)))
    y_pred = regressor.predict(X_test)
    y_pred_train = regressor.predict(X_train)
    plt.figure(figsize=(10, 3))

    # one tick per day (8 three-hour intervals per day); xticks is the global date range
    plt.xticks(range(0, len(features), 8), xticks.strftime("%a %m-%d"), rotation=90,
               ha="left")

    plt.plot(range(n_train), y_train, label="train")
    plt.plot(range(n_train, len(y_test) + n_train), y_test, '-', label="test")
    plt.plot(range(n_train), y_pred_train, '--', label="prediction train")

    plt.plot(range(n_train, len(y_test) + n_train), y_pred, '--',
             label="prediction test")
    plt.legend(loc=(1.01, 0))
    plt.xlabel("Date")
    plt.ylabel("Rentals")

Then train a random forest on the POSIX time feature:

from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=100, random_state=0)
eval_on_features(X, y, regressor)

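The predictions on the training set are quite good, as is typical for random forests. On the test set, however, the prediction is a constant line, which means the model learned nothing useful. The problem lies in the combination of this feature with a random forest. The POSIX time values in the test set fall outside the range seen in the training set: every test point has a later timestamp than any training point. Trees, and therefore random forests, cannot extrapolate to feature ranges outside the training set. As a result, the model simply predicts the target value of the closest training point, which is the last time at which data was observed.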
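The extrapolation problem can be reproduced on a toy problem. This is a sketch on synthetic data, not the citibike data: the target grows linearly with a single time-like feature, and the forest is then queried beyond the training range. The names `forest`, `X_future` and `pred_future` are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target grows linearly with one time-like feature.
X_train = np.arange(0, 100).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Every query lies beyond the largest training value (99).
X_future = np.array([[150], [200], [1000]])
pred_future = forest.predict(X_future)

print(pred_future)    # the same value for all three queries
print(y_train.max())  # predictions never exceed the largest training target
```

Each tree sends every out-of-range query to the same rightmost leaf, so the forest returns one constant value no matter how far into the future the query lies.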

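Looking at the plot of rentals in the training set, two factors stand out as very important: the time of day and the day of the week. We therefore examine these two features.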

3. Using the time of day as a feature

X_hour = citibike.index.hour.values.reshape(-1, 1)
eval_on_features(X_hour, y, regressor)

4. Adding the day of the week as a feature

X_hour_week = np.hstack([citibike.index.dayofweek.values.reshape(-1, 1),
                         citibike.index.hour.values.reshape(-1, 1)])
eval_on_features(X_hour_week, y, regressor)

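The model now predicts well. What it has learned is probably the mean number of rentals for each combination of weekday and time of day over the first 23 days of August. This does not really require a model as complex as a random forest, so a simpler model such as LinearRegression is worth trying.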
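The original post breaks off before showing this comparison. The following sketch illustrates what it could look like, under loud assumptions: the data is synthetic (an invented daily profile with quieter weekends), not the citibike data, so the printed scores say nothing about the real dataset. The names `profile`, `X_demo`, `y_demo` and `scores` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for the citibike data: 3-hourly counts for August 2015.
index = pd.date_range(start="2015-08-01 00:00", end="2015-08-31 21:00",
                      freq=pd.Timedelta(hours=3))
rng = np.random.RandomState(0)

# Invented daily profile (one value per 3-hour slot), scaled down on weekends.
profile = np.array([2, 1, 10, 30, 40, 45, 25, 8], dtype=float)
weekend = np.where(index.dayofweek.values >= 5, 0.6, 1.0)
y_demo = profile[index.hour.values // 3] * weekend + rng.normal(scale=1.0, size=len(index))

# The same two features as in the text: day of week and hour, as plain integers.
X_demo = np.column_stack([index.dayofweek.values, index.hour.values])

n_train = 184  # first 23 days for training, last 8 days for testing
models = [("forest", RandomForestRegressor(n_estimators=100, random_state=0)),
          ("linear", LinearRegression())]
scores = {}
for name, model in models:
    model.fit(X_demo[:n_train], y_demo[:n_train])
    scores[name] = model.score(X_demo[n_train:], y_demo[n_train:])
    print("{}: test-set R^2 = {:.2f}".format(name, scores[name]))
```

On the real data, the same comparison would be `eval_on_features(X_hour_week, y, LinearRegression())`. Because a linear model treats the integer-coded hour and weekday as continuous quantities, it can only fit a straight-line trend in each, which is worth keeping in mind when reading its score.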
