
pandas Data Processing in Practice, Part 4 (Time Series with date_range, Binning with cut, Grouping with GroupBy)

Time series:

Key function

pandas.date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize=False, name=None, closed=None, **kwargs)

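Parameters:

start: str or datetime-like, optional

Left bound for generating dates.

end: str or datetime-like, optional

Right bound for generating dates.

periods: int, optional

Number of periods to generate.

freq: str or DateOffset, default 'D' (calendar daily)

Frequency strings can have multiples, e.g. '5H'.

tz: str or tzinfo, optional

Time zone name for returning a localized DatetimeIndex, e.g. 'Asia/Hong_Kong'. By default, the resulting DatetimeIndex is timezone-naive.

normalize: bool, default False

Normalize the start/end dates to midnight before generating the date range.

name: str, default None

Name of the resulting DatetimeIndex.

closed: {None, 'left', 'right'}, optional

Make the interval closed with respect to the given frequency on the 'left', on the 'right', or on both sides (None, the default). Newer pandas releases replace this parameter with inclusive.

**kwargs

For compatibility. Has no effect on the result.

Returns a fixed-frequency DatetimeIndex.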
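The parameters described above can be tried out with a short sketch like the following. The variable names are illustrative only, and the 5-hour frequency is passed as a DateOffset object (pd.offsets.Hour(5)) rather than a string, on the assumption that this spelling is accepted across pandas versions.

```python
import pandas as pd

# start + end with the default daily frequency ('D')
daily = pd.date_range(start="2018-01-01", end="2018-01-10")

# start + periods with a frequency multiple: one timestamp every 5 hours
five_hourly = pd.date_range(start="2018-01-01", periods=4, freq=pd.offsets.Hour(5))

# tz produces a timezone-aware index; without it the index is timezone-naive
hk = pd.date_range(start="2018-01-01", periods=3, tz="Asia/Hong_Kong")

# normalize=True moves the start to midnight before generating the range
midnight = pd.date_range(start="2018-01-01 13:45", periods=2, normalize=True)

# name labels the resulting DatetimeIndex
named = pd.date_range(start="2018-01-01", periods=3, name="day")

print(daily)
print(five_hourly)
print(hk)
print(midnight)
print(named)
```

The start/end pair and the start/periods pair are two different ways of fixing the length of the range; freq sets the spacing in both cases.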

Several ways to create a time series, and how to resample one:

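In [104]: from datetime import datetime  # standard-library datetime type
     ...: import numpy as np
     ...: import pandas as pd
     ...: from pandas import Series
     ...: t1 = datetime(2009,10,20) # define a single point in time directly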

In [105]: t1
Out[105]: datetime.datetime(2009, 10, 20, 0, 0)

In [106]: # build the dates as a list of datetime objects
     ...: date_list = [
     ...:     datetime(2018,10,1),
     ...:     datetime(2018,10,2),
     ...:     datetime(2018,10,5),
     ...:     datetime(2018,10,7)
     ...: ]

In [107]: date_list
Out[107]:
[datetime.datetime(2018, 10, 1, 0, 0),
 datetime.datetime(2018, 10, 2, 0, 0),
 datetime.datetime(2018, 10, 5, 0, 0),
 datetime.datetime(2018, 10, 7, 0, 0)]

In [108]: s1 = Series(np.random.randn(4),index=date_list) # use the dates as the index of a Series

In [109]: s1
Out[109]:
2018-10-01    0.433032
2018-10-02   -1.180358
2018-10-05   -1.583058
2018-10-07   -1.200917
dtype: float64

In [110]: s1.values
Out[110]: array([ 0.43303189, -1.1803582 , -1.58305798, -1.20091707])

In [111]: s1.index
Out[111]: DatetimeIndex(['2018-10-01', '2018-10-02', '2018-10-05', '2018-10-07'], dtype='datetime64[ns]', freq=None)
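A Series indexed by dates, like the one above, can be looked up by timestamp and sliced by date strings. The following is a minimal sketch with fixed values instead of random ones, so the results are reproducible; the names s, one and window are illustrative.

```python
from datetime import datetime

import pandas as pd

dates = [
    datetime(2018, 10, 1),
    datetime(2018, 10, 2),
    datetime(2018, 10, 5),
    datetime(2018, 10, 7),
]
# A list of datetime objects is converted to a DatetimeIndex automatically
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=dates)

# Look up a single value by its timestamp
one = s.loc[pd.Timestamp("2018-10-05")]

# Slice by date strings; both end points are included
window = s.loc["2018-10-02":"2018-10-05"]

print(type(s.index).__name__)
print(one)
print(window)
```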

In [112]: # quickly generate a time series index with pd.date_range

In [113]: data_list_new = pd.date_range('2018-01-01',periods=100,freq='H') # 100 hourly timestamps starting at 2018-01-01 00:00

In [114]: len(data_list_new)
Out[114]: 100

In [115]: s2 = Series(np.random.rand(100),index=data_list_new)

In [116]: s2.head()
Out[116]:
2018-01-01 00:00:00    0.891556
2018-01-01 01:00:00    0.953536
2018-01-01 02:00:00    0.321705
2018-01-01 03:00:00    0.150378
2018-01-01 04:00:00    0.180122
Freq: H, dtype: float64

In [117]: t_range = pd.date_range('20180101','20181231')

In [118]: t_range
Out[118]:
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08',
               '2018-01-09', '2018-01-10',
               ...
               '2018-12-22', '2018-12-23', '2018-12-24', '2018-12-25',
               '2018-12-26', '2018-12-27', '2018-12-28', '2018-12-29',
               '2018-12-30', '2018-12-31'],
              dtype='datetime64[ns]', length=365, freq='D')

In [119]: s1 = Series(np.random.randn(len(t_range)),index=t_range)

In [120]: s1.head()
Out[120]:
2018-01-01    0.442134
2018-01-02    1.726818
2018-01-03   -1.157719
2018-01-04    1.179449
2018-01-05    0.974630
Freq: D, dtype: float64

In [121]: # resample the time series

In [122]: s1['2018-01'].mean()
Out[122]: 0.03117062119001378

In [123]: s1_month = s1.resample('M').mean() # downsample to one mean value per month

In [124]: s1_month.index
Out[124]:
DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30',
               '2018-05-31', '2018-06-30', '2018-07-31', '2018-08-31',
               '2018-09-30', '2018-10-31', '2018-11-30', '2018-12-31'],
              dtype='datetime64[ns]', freq='M')

In [125]: s1.resample('H').bfill().head()
Out[125]:
2018-01-01 00:00:00    0.442134
2018-01-01 01:00:00    1.726818
2018-01-01 02:00:00    1.726818
2018-01-01 03:00:00    1.726818
2018-01-01 04:00:00    1.726818
Freq: H, dtype: float64
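The two resampling directions used above (downsampling to monthly means, upsampling with a backward fill) can be checked on deterministic data. This is a sketch under the assumption that passing DateOffset objects (pd.offsets.MonthEnd(), pd.offsets.Hour(12)) as the resampling rule is equivalent to the string aliases used in the session; the values 0, 1, 2, ... replace the random numbers so the results can be verified by hand.

```python
import numpy as np
import pandas as pd

# 90 daily values 0.0 .. 89.0 covering January through March 2018
idx = pd.date_range("2018-01-01", "2018-03-31")
s = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# Partial string indexing selects all of January
jan_mean = s.loc["2018-01"].mean()

# Downsampling: one mean per month, labelled with the month-end date
monthly = s.resample(pd.offsets.MonthEnd()).mean()

# Upsampling to 12-hour steps creates new, empty slots.
# bfill() fills each slot with the NEXT known value,
# ffill() fills it with the PREVIOUS known value.
half_daily_bfill = s.resample(pd.offsets.Hour(12)).bfill()
half_daily_ffill = s.resample(pd.offsets.Hour(12)).ffill()

print(jan_mean)
print(monthly)
print(half_daily_bfill.head(3))
print(half_daily_ffill.head(3))
```

The backward fill explains the pattern in the session output above: every hour after midnight on 2018-01-01 takes the value of 2018-01-02.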

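Data binning:

pd.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise')

This function converts scattered values into binned (interval) data. For example, student scores from 0 to 100 can be split into (0, 59], (59, 79], (79, 89] and (89, 100]; ages can be binned in the same way. With the default right=True, each bin includes its right edge but not its left edge. The result has one entry per original value, each replaced by the bin it falls into, and the bins can optionally be given labels. An example follows:

   Binning student scores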
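The edge behaviour of pd.cut is easiest to see on scores that sit exactly on the bin boundaries. The sketch below uses hand-picked scores (not the random data from the example) and the same bin edges; the label names are the ones this article uses later.

```python
import pandas as pd

scores = [0, 59, 60, 79, 80, 89, 90, 100]
bins = [0, 59, 79, 89, 100]

# Default: right-closed bins (0, 59], (59, 79], (79, 89], (89, 100].
# A score of exactly 0 lies outside (0, 59] and becomes NaN.
default = pd.cut(scores, bins)

# include_lowest=True makes the first bin include its left edge, so 0 is kept
with_lowest = pd.cut(scores, bins, include_lowest=True)

# labels replaces the interval notation with readable names
labeled = pd.cut(
    scores, bins, labels=["low", "ok", "good", "great"], include_lowest=True
)

print(default)
print(with_lowest)
print(list(labeled))
```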

In [1]: import numpy as np
   ...: import pandas as pd
   ...: from pandas import Series,DataFrame
   ...:
   ...:

In [2]: score_list = np.random.randint(0,100,size=100) # randomly generate 100 student scores; the upper
                                                       # bound is exclusive, so scores range from 0 to 99

In [3]: score_list
Out[3]:
array([56, 80, 89,  3, 45, 56, 65, 48, 12, 20, 13, 37,  1, 85, 64, 50, 72,
       43,  8, 15,  9, 16, 63, 41, 68, 98,  2, 18, 78, 83, 54, 90, 81, 64,
       98, 48, 52, 67,  1,  7, 24, 98, 83, 57, 57, 36, 90, 48, 59, 72,  4,
        8,  2, 26, 16, 91, 26,  9, 66, 92, 22,  3, 91, 72, 90, 28, 74, 88,
       89, 79, 13, 91, 57, 98, 63, 68, 63, 73, 33, 33, 99, 55, 18, 87, 60,
       53, 24, 77, 85, 70, 57, 58, 75, 86, 88, 43, 52,  4, 71, 16])

In [4]: bins = [0,59,79,89,100] # bin edges, giving (0,59], (59,79], (79,89], (89,100]

In [5]: score_cut = pd.cut(score_list,bins) # split the scores according to bins with pd.cut()

In [18]: len(score_cut)  # still 100 entries, one per score, each now assigned to a bin; labels can be added
Out[18]: 100

In [6]: score_cut # the result is a pandas.core.arrays.categorical.Categorical
Out[6]:
[(0, 59], (79, 89], (79, 89], (0, 59], (0, 59], ..., (0, 59], (0, 59], (0, 59], (59, 79], (0, 59]]
Length: 100
Categories (4, interval[int64]): [(0, 59] < (59, 79] < (79, 89] < (89, 100]]

In [7]: type(score_cut)
Out[7]: pandas.core.arrays.categorical.Categorical

In [8]: pd.value_counts(score_cut) # count the number of students in each bin
Out[8]:
(0, 59]      54
(59, 79]     22
(89, 100]    12
(79, 89]     12
dtype: int64
# This prepares the data for later processing

Binning data in a DataFrame

We reuse the data from above.

In [9]: df = DataFrame() # create an empty DataFrame

In [10]: df['score_list'] = score_list # fill in the scores
 
In [11]: df.head() # view the first 5 rows
Out[11]:
   score_list
0          56
1          80
2          89
3           3
4          45

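In [12]: df['name'] = [pd.util.testing.rands(3) for i in range(100)]
    ...: # pd.util.testing.rands() generates a random string, used here as each student's name
    ...: # (pandas.util.testing is deprecated in later pandas releases)

In [13]: df.head() # show the first 5 students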
Out[13]:
   score_list name
0          56  puk
1          80  VUL
2          89  cwz
3           3  uVb
4          45  sRN

In [14]: # add the binning result as a column

In [15]: # add the binning result as a column, labelling the score bands low, ok, good, great

In [16]: df['Categories'] = pd.cut(df['score_list'],bins,labels=['low','ok','good','great'])

In [17]: df.head(10)
Out[17]:
   score_list name Categories
0          56  puk        low
1          80  VUL       good
2          89  cwz       good
3           3  uVb        low
4          45  sRN        low
5          56  3vM        low
6          65  wp8         ok
7          48  lSF        low
8          12  AkT        low
9          20  tgb        low
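Once the bins are stored as a column, they can be counted or used as a grouping key. The following is a small sketch on seven hand-picked scores; the column names score and grade are illustrative, and observed=True is passed explicitly so that only bins that actually occur are reported.

```python
import pandas as pd

df = pd.DataFrame({"score": [56, 80, 89, 3, 45, 65, 95]})
bins = [0, 59, 79, 89, 100]
df["grade"] = pd.cut(df["score"], bins, labels=["low", "ok", "good", "great"])

# Number of students in each grade
counts = df["grade"].value_counts()

# Mean score within each grade, using the binned column as the group key
mean_by_grade = df.groupby("grade", observed=True)["score"].mean()

print(counts)
print(mean_by_grade)
```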

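Grouping with GroupBy

DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, observed=False, **kwargs)

This function handles grouping: the rows are split into groups according to one or more features of interest, so that each group can be processed separately. For example: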

	date	city	temperature	wind
0	03/01/2016	BJ	8	5
1	17/01/2016	BJ	12	2
2	31/01/2016	BJ	19	2
3	14/02/2016	BJ	-3	3
4	28/02/2016	BJ	19	2
5	13/03/2016	BJ	5	3
6	27/03/2016	SH	-4	4
7	10/04/2016	SH	19	3
8	24/04/2016	SH	20	3
9	08/05/2016	SH	17	3
10	22/05/2016	SH	4	2
11	05/06/2016	SH	-10	4
12	19/06/2016	SH	0	5
13	03/07/2016	SH	-9	5
14	17/07/2016	GZ	10	2
15	31/07/2016	GZ	-1	5
16	14/08/2016	GZ	1	5
17	28/08/2016	GZ	25	4
18	11/09/2016	SZ	20	1
19	25/09/2016	SZ	-10	4

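The data contain weather records for four cities. In this flat table it is awkward to compute per-city statistics such as the mean, maximum and minimum, or to draw plots. We can therefore group the rows by 'city' and then work with the attributes and methods of the resulting GroupBy object, some of which are: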

gb.agg        gb.boxplot    gb.cummin     gb.describe   gb.filter     gb.get_group  gb.height     gb.last       gb.median     gb.ngroups    gb.plot       gb.rank       gb.std        gb.transform
gb.aggregate  gb.count      gb.cumprod    gb.dtype      gb.first      gb.groups     gb.hist       gb.max        gb.min        gb.nth        gb.prod       gb.resample   gb.sum        gb.var
gb.apply      gb.cummax     gb.cumsum     gb.fillna     gb.gender     gb.head       gb.indices    gb.mean       gb.name       gb.ohlc       gb.quantile   gb.size       gb.tail       gb.weight

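As the list shows, many attributes and methods are available for processing the grouped data, including plotting. (Entries such as gb.gender, gb.height and gb.weight are column names exposed as attributes; they appear to come from a different dataset than the weather table.) Concrete data-processing examples follow: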


In [59]: import numpy as np
    ...: import pandas as pd
    ...: from pandas import Series,DataFrame
    ...:
    ...:

In [60]: df = pd.read_csv('city_weather.csv')

In [61]: df.head()
Out[61]:
         date city  temperature  wind
0  03/01/2016   BJ            8     5
1  17/01/2016   BJ           12     2
2  31/01/2016   BJ           19     2
3  14/02/2016   BJ           -3     3
4  28/02/2016   BJ           19     2

In [62]: gb = df.groupby(df['city']) # group by city, giving the groups BJ, GZ, SH, SZ

gb.<tab> # many attributes and methods are available

gb.agg        gb.boxplot    gb.cummin     gb.describe   gb.filter     gb.get_group  gb.height     gb.last       gb.median     gb.ngroups    gb.plot       gb.rank       gb.std        gb.transform
gb.aggregate  gb.count      gb.cumprod    gb.dtype      gb.first      gb.groups     gb.hist       gb.max        gb.min        gb.nth        gb.prod       gb.resample   gb.sum        gb.var
gb.apply      gb.cummax     gb.cumsum     gb.fillna     gb.gender     gb.head       gb.indices    gb.mean       gb.name       gb.ohlc       gb.quantile   gb.size       gb.tail       gb.weight

In [65]: gb.groups # the group keys and the row index of each group's members
Out[65]:
{'BJ': Int64Index([0, 1, 2, 3, 4, 5], dtype='int64'),
 'GZ': Int64Index([14, 15, 16, 17], dtype='int64'),
 'SH': Int64Index([6, 7, 8, 9, 10, 11, 12, 13], dtype='int64'),
 'SZ': Int64Index([18, 19], dtype='int64')}

In [67]: gb.get_group('BJ').mean() # mean temperature and wind for BJ (recent pandas versions need mean(numeric_only=True) here because of the string columns)
Out[67]:
temperature    10.000000
wind            2.833333
dtype: float64
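The same statistics can be computed for every city in one step instead of one group at a time. Since city_weather.csv is not included here, this sketch rebuilds the city, temperature and wind columns from the table shown earlier; the variable names are illustrative.

```python
import pandas as pd

# Columns copied from the 20-row weather table above
df = pd.DataFrame({
    "city": ["BJ"] * 6 + ["SH"] * 8 + ["GZ"] * 4 + ["SZ"] * 2,
    "temperature": [8, 12, 19, -3, 19, 5, -4, 19, 20, 17,
                    4, -10, 0, -9, 10, -1, 1, 25, 20, -10],
    "wind": [5, 2, 2, 3, 2, 3, 4, 3, 3, 3,
             2, 4, 5, 5, 2, 5, 5, 4, 1, 4],
})

gb = df.groupby("city")

# Number of rows in each group
sizes = gb.size()

# Several aggregations of one column at once
temp_stats = gb["temperature"].agg(["mean", "max", "min"])

# Mean of the numeric columns for every city
means = gb[["temperature", "wind"]].mean()

# A GroupBy object can also be iterated as (key, sub-DataFrame) pairs
for city, group in gb:
    print(city, len(group))

print(sizes)
print(temp_stats)
print(means)
```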

In [69]: gb.max() # per-city maximum of each column; date is a string column, so its maximum is lexicographic
Out[69]:
            date  temperature  wind
city
BJ    31/01/2016           19     5
GZ    31/07/2016           25     5
SH    27/03/2016           20     5
SZ    25/09/2016           20     4
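In the output above, the maximum date for SH is 27/03/2016 even though SH has later records, because the dates are compared as day-first strings. One possible fix, sketched here on the SH and SZ rows of the table, is to parse the column with pd.to_datetime before grouping; the column name parsed is illustrative.

```python
import pandas as pd

# date and city columns for the SH and SZ rows of the weather table
df = pd.DataFrame({
    "date": ["27/03/2016", "10/04/2016", "24/04/2016", "08/05/2016",
             "22/05/2016", "05/06/2016", "19/06/2016", "03/07/2016",
             "11/09/2016", "25/09/2016"],
    "city": ["SH"] * 8 + ["SZ"] * 2,
})

# Maximum of the raw strings: compared character by character
string_max = df.groupby("city")["date"].max()

# Parse day/month/year strings into real timestamps, then take the maximum
df["parsed"] = pd.to_datetime(df["date"], format="%d/%m/%Y")
parsed_max = df.groupby("city")["parsed"].max()

print(string_max)
print(parsed_max)
```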

gb.plot() # draw a plot for each group

 

 

For other features, see the official pandas documentation.