04】tsfresh: a package for extracting time series features
Install
- Assuming a Python development environment is already installed on your PC:
## Install directly with pip
pip install tsfresh
## Verify that the installation succeeded
from tsfresh import extract_features
Dependencies:
requests>=2.9.1
numpy>=1.10.4
pandas>=0.20.3
scipy>=0.17.0
statsmodels>=0.8.0 ## based on the statsmodels framework
patsy>=0.4.1
scikit-learn>=0.17.1
future>=0.16.0
six>=1.10.0
tqdm>=4.10.0
ipaddress>=1.0.18; python_version <= '2.7'
dask>=0.15.2
distributed>=1.18.3
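Basic steps
- Prepare the data: the time series to be processed; in the women's clothing project this is the time and GMV data;
- Feature extraction: extract_features
- Feature filtering: filter out meaningless values (NaN) and keep the meaningful features; dimensionality reduction;
- Feature extraction and filtering in one step: extract_relevant_features(timeseries, y, column_id='id', column_sort='time')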
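The steps above can be illustrated with a small sketch that uses only the Python standard library. It is not tsfresh code: toy_extract and toy_filter are hypothetical stand-ins for extract_features and the filtering step, and the toy data and feature names are assumptions made for illustration. The stand-in filter only drops NaN and constant columns, which is far simpler than a relevance test against a target y.

```python
import math
import statistics

# Long-format toy data: (id, time, value), one row per observation.
rows = [
    ("a", 0, 1.0), ("a", 1, 2.0), ("a", 2, 3.0),
    ("b", 0, 5.0), ("b", 1, 5.0), ("b", 2, 5.0),
    ("c", 0, 2.0), ("c", 1, 4.0), ("c", 2, 9.0),
]


def group_by_id(rows):
    """Step 1: collect the values of each series, ordered by time."""
    series = {}
    for sid, _t, value in sorted(rows, key=lambda r: (r[0], r[1])):
        series.setdefault(sid, []).append(value)
    return series


def autocorr_lag1(values):
    """Lag-1 autocorrelation; NaN when the series has zero variance."""
    mean = statistics.mean(values)
    denom = sum((v - mean) ** 2 for v in values)
    if denom == 0:
        return math.nan
    num = sum((a - mean) * (b - mean) for a, b in zip(values, values[1:]))
    return num / denom


def toy_extract(series):
    """Step 2 (stand-in): compute a few hand-picked features per id."""
    return {
        sid: {
            "mean": statistics.mean(values),
            "maximum": max(values),
            "length": float(len(values)),
            "autocorr_lag1": autocorr_lag1(values),
        }
        for sid, values in series.items()
    }


def toy_filter(features):
    """Step 3 (stand-in): drop features that contain NaN or never vary."""
    names = list(next(iter(features.values())))
    keep = []
    for name in names:
        column = [f[name] for f in features.values()]
        if any(math.isnan(x) for x in column):
            continue  # meaningless for at least one series
        if len(set(column)) == 1:
            continue  # identical for every series, carries no information
        keep.append(name)
    return {sid: {n: f[n] for n in keep} for sid, f in features.items()}


features = toy_extract(group_by_id(rows))
filtered = toy_filter(features)
print(sorted(next(iter(filtered.values()))))
```

Here "length" is dropped because it is the same for every id, and "autocorr_lag1" is dropped because it is undefined for the constant series "b".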
Examples
Examples in the source code
- https://github.com/blue-yonder/tsfresh/tree/master/notebooks
available tasks
- time series classification
- compression
- forecasting
Time Series Forecasting - jupyter notebook
- tsfresh.utilities.dataframe_functions.make_forecasting_frame(x, kind, max_timeshift, rolling_direction)
- x (np.array or pd.Series) – the singular time series, i.e. the historical data;
- kind (str) – the kind of the time series;
- max_timeshift (int) – the maximum length of each rolled window;
- rolling_direction (int) – The sign decides, if to roll backwards (if sign is positive) or forwards in "time";
- Returns: time series container df, target vector y;
Explanation: df_shift, y = make_forecasting_frame(class_df_all['y'], kind="gmv", max_timeshift=24, rolling_direction=1)
The sliding process of make_forecasting_frame() is shown in the figure above. Suppose the returned target vector has len(y) = 59 (class_df_all['y'] holds 60 points, and the first point has no history to predict it from) and max_timeshift = 10. The number of rows in df_shift is then:
- (max_timeshift + 1)*(max_timeshift/2) + (len(y) - max_timeshift)*max_timeshift
When rolling_direction = 1, the returned df_shift is a combined data set of 545 rows, built as follows:
id = 1: feature_matrix, time = 0
id = 2: feature_matrix, time = 0,1
id = 3: feature_matrix, time = 0,1,2
... ...
id = 10: feature_matrix, time = 0,1,2,3,4,5,6,7,8,9
id = 11: feature_matrix, time = 0,1,2,3,4,5,6,7,8,9 ## max_timeshift = 10 caps the length at 10
id = 12: feature_matrix, time = 0,1,2,3,4,5,6,7,8,9 ## max_timeshift = 10 caps the length at 10
... ...
id = 59: feature_matrix, time = 0,1,2,3,4,5,6,7,8,9
So: 545 = (1+10)*10/2 + (59-10)*10
When rolling_direction = -1, the process is as follows:
id = 1: feature_matrix, time = 0,1,2,3,4,5,6,7,8,9 ## max_timeshift = 10 caps the length at 10
id = 2: feature_matrix, time = 0,1,2,3,4,5,6,7,8,9
id = 3: feature_matrix, time = 0,1,2,3,4,5,6,7,8,9
... ...
id = 58: feature_matrix, time = 0,1
id = 59: feature_matrix, time = 0
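The row count can be checked with a short sketch that does not call tsfresh at all. It only simulates the windowing described above, taking 59 windows and max_timeshift = 10 as in the formula 545 = (1+10)*10/2 + (59-10)*10. The function names window_lengths and closed_form are hypothetical, and the assumption that window k holds min(k, max_timeshift) values comes from the listing, not from the library source.

```python
def window_lengths(n_windows, max_timeshift):
    """Length of each rolled window, in id order.

    Assumes window k (1-based) holds min(k, max_timeshift) values,
    i.e. windows grow by one value until they hit the cap.
    """
    return [min(k, max_timeshift) for k in range(1, n_windows + 1)]


def closed_form(n_windows, max_timeshift):
    """Row count from the formula in the text."""
    growing = (max_timeshift + 1) * max_timeshift // 2
    capped = (n_windows - max_timeshift) * max_timeshift
    return growing + capped


lengths = window_lengths(59, 10)
total_rows = sum(lengths)
print(total_rows, closed_form(59, 10))
```

Rolling in the other direction only reverses the order of the window lengths, so under the same assumption the total is unchanged.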
- extract_features, feature extraction: extract features from the rolled df_shift data obtained above:
X = extract_features(df_shift, column_id="id", column_sort="time", column_value="value", impute_function=impute, show_warnings=False) ## does not work in Spyder, but works in Jupyter Notebook;
- Resulting features: [59 rows x 794 columns] --> 794 feature dimensions, 59 samples (the 794 features are defined by class ComprehensiveFCParameters)
- Objects that extract_features accepts as input:
1) a pandas.DataFrame containing the different time series;
2) a dictionary of pandas.DataFrame each containing one type of time series;
- extract_relevant_features: filters out part of the features
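Questions on the approach
Regression model
- Input: feature vector (feature)
- Output: predicted value (regression value)
- Question: GMV is the target value; if the data consists only of (ds, gmv), is a regression model not applicable?
- Analysis: the input of a regression model is a feature vector. To predict the GMV of the next 2 months, we need the feature vector for each of those 2 months, feed it to the model, and obtain the corresponding prediction.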
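One possible way to obtain feature vectors for future months, sketched here under assumptions, is to build each feature vector from the most recent window of history and to feed every prediction back into that history. The script in these notes does not do this. window_features, toy_model and forecast are hypothetical names, and toy_model is a hand-written rule standing in for a fitted regressor.

```python
import statistics


def window_features(history, width):
    """Features for the next, not yet observed point, from past values only."""
    window = history[-width:]
    return [window[-1], statistics.mean(window), max(window) - min(window)]


def toy_model(features):
    """Hypothetical stand-in for a fitted regressor.

    Predicts the last value plus half of its distance from the window mean.
    """
    last, mean, _spread = features
    return last + 0.5 * (last - mean)


def forecast(history, steps, width=3):
    """Predict `steps` future points, appending each prediction to the history."""
    extended = list(history)
    predictions = []
    for _ in range(steps):
        y_hat = toy_model(window_features(extended, width))
        predictions.append(y_hat)
        extended.append(y_hat)  # the prediction becomes history for the next step
    return predictions


gmv = [100.0, 110.0, 120.0, 130.0]
next_two_months = forecast(gmv, steps=2)
print(next_two_months)
```

The second prediction depends on the first, so errors can accumulate as the horizon grows.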
Script - 20180717
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tsfresh import extract_features
from tsfresh.utilities.dataframe_functions import make_forecasting_frame
from sklearn.ensemble import AdaBoostRegressor
from tsfresh.utilities.dataframe_functions import impute
import warnings
warnings.filterwarnings('ignore')
## load dataset
month_list = ["Jan","Feb","Mar","Apr","May","June",
              "July","Aug","Sept","Oct","Nov","Dec"]
## cate_id and cate_name must be set to the category being analysed before running
all_leaf_class_name_dict = {cate_id: cate_name}
df = pd.read_csv('./cate_by_month_histroy.csv', header=0, encoding='gbk')
df.columns = ['ds', 'cate_id', 'cate_name', 'y']
class_df_all = df[df.cate_name.str.startswith(cate_name)].reset_index(drop=True)
class_df_all = class_df_all[['ds', 'y']]
class_df_all = class_df_all[:60]
# print(class_df_all.head())
## plot
fig = plt.figure(facecolor='white')
ax = fig.add_subplot(111)
ax.plot(class_df_all['ds'], class_df_all['y'], label='gmv')
for tick in ax.get_xticklabels():
    tick.set_rotation(90)
fig.set_size_inches(18, 8)
plt.legend()
## make_forecasting_frame
df_shift, y = make_forecasting_frame(class_df_all['y'], kind="gmv", max_timeshift=24, rolling_direction=1)
# print(df_shift)
# print(y)
## extract_features
X = extract_features(df_shift, column_id="id", column_sort="time", column_value="value",
                     impute_function=impute, show_warnings=False)
## regression model
ada = AdaBoostRegressor()
## fit on all samples and predict the same samples (in-sample fit)
ada.fit(X, y)
y_pred = ada.predict(X)
print(X.shape)
## walk-forward alternative: fit on the first i samples, then predict sample i
# y_pred = [y.iloc[0]] * len(y)
# for i in range(1, len(y)):
#     ada.fit(X.iloc[:i], y[:i])
#     y_pred[i] = ada.predict(X.iloc[[i]])[0]
y_pred = pd.Series(data=y_pred, index=y.index)
plt.figure(figsize=(15, 6))
plt.plot(y, label="true")
plt.plot(y_pred, label="predicted")
plt.legend()
plt.show()
Summary of issues
- ImportError: cannot import name 'is_list_like': https://stackoverflow.com/questions/50394873/import-pandas-datareader-gives-importerror-cannot-import-name-is-list-like
- extract_features: in Anaconda Spyder, execution stalls when it reaches the extract_features call (an issue with the IDE?), as shown in the figure below:
- extract_features: in Jupyter Notebook it runs without problems, as shown in the figure below: