pandas Data Processing in Practice, Part 4 (Time Series with date_range, Binning with cut, Grouping with GroupBy)
Time series:
Key function
pandas.date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize=False, name=None, closed=None, **kwargs)
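Parameters:
start: str or datetime-like, optional
end: str or datetime-like, optional
periods: integer, optional
freq: str or DateOffset, default 'D'
tz: str or tzinfo, optional
normalize: bool, default False
name: str, default None
closed: {None, 'left', 'right'}, optional
**kwargs
Returns a fixed-frequency DatetimeIndex.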
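As a minimal sketch of how these parameters combine (the variable names are illustrative, not from the original session), the calls below build ranges from start/end, from start/periods, with a weekly frequency, and with normalize=True:

```python
import pandas as pd

# start + end: every calendar day in the range (the default freq is 'D')
days = pd.date_range(start="2018-01-01", end="2018-01-07")

# start + periods: a fixed number of timestamps
three_days = pd.date_range(start="2018-01-01", periods=3)

# freq sets the step; 'W' is weekly and is anchored on Sunday by default
weeks = pd.date_range(start="2018-01-01", periods=3, freq="W")

# normalize=True snaps the start time to midnight before generating the range
normalized = pd.date_range(start="2018-01-01 15:30", periods=2, normalize=True)

print(days)
print(three_days)
print(weeks)       # 2018-01-01 is a Monday, so the first entry is 2018-01-07
print(normalized)
```

Since freq='W' is anchored on Sunday, the weekly range starts at the first Sunday on or after the start date rather than at the start date itself.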
Several ways to build a time series, and how to resample it:
# Assumes numpy (np), pandas (pd) and pandas.Series were imported earlier in the session
In [104]: from datetime import datetime # import the datetime class
...: t1 = datetime(2009,10,20) # define a single point in time directly
...:
...:
In [105]: t1
Out[105]: datetime.datetime(2009, 10, 20, 0, 0)
In [106]: # build the dates from a list
...: date_list = [
...: datetime(2018,10,1),
...: datetime(2018,10,2),
...: datetime(2018,10,5),
...: datetime(2018,10,7)
...: ]
In [107]: date_list
Out[107]:
[datetime.datetime(2018, 10, 1, 0, 0),
 datetime.datetime(2018, 10, 2, 0, 0),
 datetime.datetime(2018, 10, 5, 0, 0),
 datetime.datetime(2018, 10, 7, 0, 0)]
In [108]: s1 = Series(np.random.randn(4),index=date_list) # assign values to the time series
In [109]: s1
Out[109]:
2018-10-01 0.433032
2018-10-02 -1.180358
2018-10-05 -1.583058
2018-10-07 -1.200917
dtype: float64
In [110]: s1.values
Out[110]: array([ 0.43303189, -1.1803582 , -1.58305798, -1.20091707])
In [111]: s1.index
Out[111]: DatetimeIndex(['2018-10-01', '2018-10-02', '2018-10-05', '2018-10-07'], dtype='datetime64[ns]', freq=None)
In [112]: # generate a time series quickly: pd.date_range
In [113]: data_list_new = pd.date_range('2018-01-01',periods=100,freq='H') # 100 hourly timestamps starting at 2018-01-01 00:00
In [114]: len(data_list_new)
Out[114]: 100
In [115]: s2 = Series(np.random.rand(100),index=data_list_new)
In [116]: s2.head()
Out[116]:
2018-01-01 00:00:00 0.891556
2018-01-01 01:00:00 0.953536
2018-01-01 02:00:00 0.321705
2018-01-01 03:00:00 0.150378
2018-01-01 04:00:00 0.180122
Freq: H, dtype: float64
In [117]: t_range = pd.date_range('20180101','20181231')
In [118]: t_range
Out[118]:
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08',
               '2018-01-09', '2018-01-10',
               ...
               '2018-12-22', '2018-12-23', '2018-12-24', '2018-12-25',
               '2018-12-26', '2018-12-27', '2018-12-28', '2018-12-29',
               '2018-12-30', '2018-12-31'],
              dtype='datetime64[ns]', length=365, freq='D')
In [119]: s1 = Series(np.random.randn(len(t_range)),index=t_range)
In [120]: s1.head()
Out[120]:
2018-01-01 0.442134
2018-01-02 1.726818
2018-01-03 -1.157719
2018-01-04 1.179449
2018-01-05 0.974630
Freq: D, dtype: float64
In [121]: # resample the time series
In [122]: s1['2018-01'].mean() # mean of all January values
Out[122]: 0.03117062119001378
In [123]: s1_month = s1.resample('M').mean() # downsample to one mean per month
In [124]: s1_month.index
Out[124]:
DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30',
               '2018-05-31', '2018-06-30', '2018-07-31', '2018-08-31',
               '2018-09-30', '2018-10-31', '2018-11-30', '2018-12-31'],
              dtype='datetime64[ns]', freq='M')
In [125]: s1.resample('H').bfill().head() # upsample to hourly; each new slot is back-filled with the next day's value
Out[125]:
2018-01-01 00:00:00 0.442134
2018-01-01 01:00:00 1.726818
2018-01-01 02:00:00 1.726818
2018-01-01 03:00:00 1.726818
2018-01-01 04:00:00 1.726818
Freq: H, dtype: float64
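The resampling steps can also be checked on deterministic data, where the results are easy to verify by hand. This is a small sketch rather than part of the original session: the series values and variable names are made up, and it uses the 'MS' (month start) and '12h' frequency strings, which differ from the 'M' and 'H' used in the session.

```python
import numpy as np
import pandas as pd

# Deterministic daily series for Jan-Feb 2018 with values 0, 1, 2, ...
idx = pd.date_range("2018-01-01", "2018-02-28")
s = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# Downsample: one mean per calendar month ('MS' labels each bin by its first day)
monthly = s.resample("MS").mean()
print(monthly)  # January holds 0..30 (mean 15.0), February holds 31..58 (mean 44.5)

# Upsample: daily -> every 12 hours; the new slots have no data until filled
first_two_days = s.iloc[:2]
back = first_two_days.resample("12h").bfill()  # fill from the next known value
fwd = first_two_days.resample("12h").ffill()   # fill from the previous known value
print(back)
print(fwd)
```

Back-filling copies the next observation into each gap, which explains why the hourly output in the session repeats the 2018-01-02 value from 01:00 onward; forward-filling would repeat the 2018-01-01 value instead.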
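Data binning:
pd.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise')
pd.cut turns scattered values into interval categories. Student scores from 0 to 100, for example, can be split into (0, 59], (59, 79], (79, 89] and (89, 100]; ages can be binned the same way. The result has the same length as the input, with each value replaced by the interval it falls into, and the intervals can be given custom labels. Examples follow: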
Binning student scores
In [1]: import numpy as np
...: import pandas as pd
...: from pandas import Series,DataFrame
...:
...:
In [2]: score_list = np.random.randint(0,100,size=100) # 100 random student scores; randint excludes the upper bound, so the values range from 0 to 99
In [3]: score_list
Out[3]:
array([56, 80, 89, 3, 45, 56, 65, 48, 12, 20, 13, 37, 1, 85, 64, 50, 72,
43, 8, 15, 9, 16, 63, 41, 68, 98, 2, 18, 78, 83, 54, 90, 81, 64,
98, 48, 52, 67, 1, 7, 24, 98, 83, 57, 57, 36, 90, 48, 59, 72, 4,
8, 2, 26, 16, 91, 26, 9, 66, 92, 22, 3, 91, 72, 90, 28, 74, 88,
89, 79, 13, 91, 57, 98, 63, 68, 63, 73, 33, 33, 99, 55, 18, 87, 60,
53, 24, 77, 85, 70, 57, 58, 75, 86, 88, 43, 52, 4, 71, 16])
In [4]: bins = [0,59,79,89,100] # bin edges, giving the intervals (0,59],(59,79],(79,89],(89,100]
In [5]: score_cut = pd.cut(score_list,bins) # split the scores according to bins with pd.cut()
In [18]: len(score_cut) # still 100 entries, one per score, each now assigned to a bin; labels can be added
Out[18]: 100
In [6]: score_cut # the result is a pandas.core.arrays.categorical.Categorical
Out[6]:
[(0, 59], (79, 89], (79, 89], (0, 59], (0, 59], ..., (0, 59], (0, 59], (0, 59], (59, 79], (0, 59]]
Length: 100
Categories (4, interval[int64]): [(0, 59] < (59, 79] < (79, 89] < (89, 100]]
In [7]: type(score_cut)
Out[7]: pandas.core.arrays.categorical.Categorical
In [8]: pd.value_counts(score_cut) # number of students in each interval
Out[8]:
(0, 59] 54
(59, 79] 22
(89, 100] 12
(79, 89] 12
dtype: int64
# in preparation for the processing that follows
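One detail worth checking is what happens at the bin edges. The sketch below uses a handful of hand-picked scores (not the random data above) to show that the default right-closed intervals leave a score of exactly 0 unbinned, and that include_lowest=True pulls it into the first bin:

```python
import pandas as pd

scores = [0, 59, 60, 79, 80, 100]   # hand-picked edge cases
bins = [0, 59, 79, 89, 100]

# Default: right-closed intervals (0, 59], (59, 79], ... so 0 falls outside every bin
default_cut = pd.cut(scores, bins)
print(default_cut.codes)   # -1 marks a value that landed in no bin (NaN)

# include_lowest=True makes the first interval contain its left edge as well
with_lowest = pd.cut(scores, bins, include_lowest=True)
print(with_lowest.codes)

# labels replaces the interval notation with one name per bin
labeled = pd.cut(scores, bins, include_lowest=True,
                 labels=["low", "ok", "good", "great"])
print(list(labeled))
```

With the random scores in the session no value happened to be 0, so the issue does not show up there, but it would on another run.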
Binning a DataFrame column
We reuse the data from above.
In [9]: df = DataFrame() # create an empty DataFrame
In [10]: df['score_list'] = score_list # fill in the scores
In [11]: df.head() # show the first 5 rows
Out[11]:
score_list
0 56
1 80
2 89
3 3
4 45
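In [12]: df['name'] = [pd.util.testing.rands(3) for i in range(100)]
...: # pd.util.testing.rands() generates a random string, used here as a student name (pandas.util.testing is deprecated in newer pandas releases)
In [13]: df.head() # show the first 5 students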
Out[13]:
score_list name
0 56 puk
1 80 VUL
2 89 cwz
3 3 uVb
4 45 sRN
In [15]: # add the binning result as a column, labelling the score ranges low, ok, good, great
In [16]: df['Categories'] = pd.cut(df['score_list'],bins,labels=['low','ok','good','great'])
In [17]: df.head(10)
Out[17]:
score_list name Categories
0 56 puk low
1 80 VUL good
2 89 cwz good
3 3 uVb low
4 45 sRN low
5 56 3vM low
6 65 wp8 ok
7 48 lSF low
8 12 AkT low
9 20 tgb low
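Once the bins are stored as a column, they can drive further summaries. The following is a rough sketch on a tiny hand-made DataFrame (the column names score and grade are illustrative assumptions) that counts students per grade and computes the mean score in each grade:

```python
import pandas as pd

df = pd.DataFrame({"score": [56, 80, 89, 3, 45, 65, 95]})
bins = [0, 59, 79, 89, 100]
df["grade"] = pd.cut(df["score"], bins, labels=["low", "ok", "good", "great"])

# Count per grade, ordered by bin rather than by frequency
counts = df["grade"].value_counts().sort_index()
print(counts)

# Mean score per grade; observed=False keeps categories with no rows too
means = df.groupby("grade", observed=False)["score"].mean()
print(means)
```

Because the column is categorical and ordered, sorting by index follows the bin order (low, ok, good, great) instead of alphabetical order.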
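Grouping with GroupBy
DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, observed=False, **kwargs)
This function handles grouping: rows are split into groups according to a feature of interest, so that each group can be pulled out and processed on its own. Take the following data, for example: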
date city temperature wind
0 03/01/2016 BJ 8 5
1 17/01/2016 BJ 12 2
2 31/01/2016 BJ 19 2
3 14/02/2016 BJ -3 3
4 28/02/2016 BJ 19 2
5 13/03/2016 BJ 5 3
6 27/03/2016 SH -4 4
7 10/04/2016 SH 19 3
8 24/04/2016 SH 20 3
9 08/05/2016 SH 17 3
10 22/05/2016 SH 4 2
11 05/06/2016 SH -10 4
12 19/06/2016 SH 0 5
13 03/07/2016 SH -9 5
14 17/07/2016 GZ 10 2
15 31/07/2016 GZ -1 5
16 14/08/2016 GZ 1 5
17 28/08/2016 GZ 25 4
18 11/09/2016 SZ 20 1
19 25/09/2016 SZ -10 4
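The data holds weather records for four cities, but in this flat form it is awkward to work with, for example to compute each city's mean, maximum and minimum or to plot them. We can therefore group by 'city', process each group, and use the attributes of the grouped object for further work. Some of those attributes are:
gb.agg gb.boxplot gb.cummin gb.describe gb.filter gb.get_group gb.height gb.last gb.median gb.ngroups gb.plot gb.rank gb.std gb.transform
gb.aggregate gb.count gb.cumprod gb.dtype gb.first gb.groups gb.hist gb.max gb.min gb.nth gb.prod gb.resample gb.sum gb.var
gb.apply gb.cummax gb.cumsum gb.fillna gb.gender gb.head gb.indices gb.mean gb.name gb.ohlc gb.quantile gb.size gb.tail gb.weight
As the list shows, there are many attribute functions for processing the data, including plotting. (Entries such as gb.gender, gb.height, gb.name and gb.weight are not GroupBy methods; they look like column names of a different DataFrame exposed as attributes.) A concrete data-processing example follows: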
In [59]: import numpy as np
...: import pandas as pd
...: from pandas import Series,DataFrame
...:
...:
In [60]: df = pd.read_csv('city_weather.csv')
In [61]: df.head()
Out[61]:
date city temperature wind
0 03/01/2016 BJ 8 5
1 17/01/2016 BJ 12 2
2 31/01/2016 BJ 19 2
3 14/02/2016 BJ -3 3
4 28/02/2016 BJ 19 2
In [62]: gb = df.groupby(df['city']) # group by city, giving the groups BJ, GZ, SH, SZ
gb.<tab> # many attributes are available
gb.agg gb.boxplot gb.cummin gb.describe gb.filter gb.get_group gb.height gb.last gb.median gb.ngroups gb.plot gb.rank gb.std gb.transform
gb.aggregate gb.count gb.cumprod gb.dtype gb.first gb.groups gb.hist gb.max gb.min gb.nth gb.prod gb.resample gb.sum gb.var
gb.apply gb.cummax gb.cumsum gb.fillna gb.gender gb.head gb.indices gb.mean gb.name gb.ohlc gb.quantile gb.size gb.tail gb.weight
In [65]: gb.groups # the group keys and the row index of each group
Out[65]:
{'BJ': Int64Index([0, 1, 2, 3, 4, 5], dtype='int64'),
'GZ': Int64Index([14, 15, 16, 17], dtype='int64'),
'SH': Int64Index([6, 7, 8, 9, 10, 11, 12, 13], dtype='int64'),
'SZ': Int64Index([18, 19], dtype='int64')}
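The grouped object can also be iterated directly. As a minimal sketch, using five rows copied from the table above instead of reading city_weather.csv, the loop below visits each (key, sub-DataFrame) pair, and get_group fetches a single group:

```python
import pandas as pd

# A few rows copied from the weather table, so the example needs no CSV file
df = pd.DataFrame({
    "city": ["BJ", "BJ", "SH", "SH", "GZ"],
    "temperature": [8, 12, -4, 19, 10],
    "wind": [5, 2, 4, 3, 2],
})
gb = df.groupby("city")

# Iterating yields (key, sub-DataFrame) pairs; keys are sorted by default
for city, group in gb:
    print(city, len(group), group["temperature"].mean())

# get_group returns the sub-DataFrame for one key
bj = gb.get_group("BJ")
print(bj)

# size gives the number of rows in each group
sizes = gb.size()
print(sizes)
```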
In [67]: gb.get_group('BJ').mean() # mean temperature and wind for BJ (newer pandas releases need mean(numeric_only=True) to skip the string columns)
Out[67]:
temperature 10.000000
wind 2.833333
dtype: float64
In [69]: gb.max() # per-city maximum of every column; date is compared as a string here, not chronologically
Out[69]:
date temperature wind
city
BJ 31/01/2016 19 5
GZ 31/07/2016 25 5
SH 27/03/2016 20 5
SZ 25/09/2016 20 4
gb.plot()
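Several statistics can be computed per city in one call with agg. The sketch below again uses a few rows copied from the table rather than the CSV file, and it parses the date strings first (assuming the day/month/year layout seen in the table) so that the maximum date is the latest one rather than the largest string:

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["03/01/2016", "17/01/2016", "31/01/2016", "27/03/2016", "10/04/2016"],
    "city": ["BJ", "BJ", "BJ", "SH", "SH"],
    "temperature": [8, 12, 19, -4, 19],
    "wind": [5, 2, 2, 4, 3],
})

# Parse day/month/year strings so that dates compare chronologically
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y")

gb = df.groupby("city")

# Mean, max and min temperature per city in a single table
summary = gb["temperature"].agg(["mean", "max", "min"])
print(summary)

# Latest record per city; with raw strings SH would report 27/03/2016 instead
latest = gb["date"].max()
print(latest)

# Selecting the numeric columns explicitly avoids averaging strings or dates
means = gb[["temperature", "wind"]].mean()
print(means)
```

Selecting columns before aggregating keeps the result independent of how a given pandas version treats non-numeric columns.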
For other features, see the official pandas documentation.