Group-wise operations and transformations on DataFrames
阿新 • Posted 2018-12-16
Preface
Suppose we want to add a column to a DataFrame holding the mean of each index group. One way is to aggregate first and then merge:
>>> k1_means = df.groupby('key1').mean().add_prefix('mean_')
>>> k1_means
      mean_data1  mean_data2
key1
a      -0.380460   -0.332537
b      -0.314586   -0.605574
>>> pd.merge(df, k1_means, left_on='key1', right_index=True)
      data1     data2 key1 key2  mean_data1  mean_data2
0 -0.291328  0.257737    a  one   -0.380460   -0.332537
1 -1.390843 -1.081238    a  two   -0.380460   -0.332537
4  0.540790 -0.174112    a  one   -0.380460   -0.332537
2  0.574857  0.202979    b  one   -0.314586   -0.605574
3 -1.204029 -1.414127    b  two   -0.314586   -0.605574
This time, let's use the transform method on the GroupBy object. transform applies a function to each group and then places the results in the appropriate locations:
>>> people = DataFrame(np.random.randn(5, 5),
...                    columns=['a', 'b', 'c', 'd', 'e'],
...                    index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
>>> key = ['one', 'two', 'one', 'two', 'one']  # not shown in the original; inferred from the output below
>>> people.groupby(key).mean()
            a         b         c         d         e
one  0.684081  0.110111 -0.122685 -0.392944  0.676586
two  0.295614 -0.488849  0.111023 -0.452018 -0.593795
>>> people.groupby(key).transform(np.mean)
               a         b         c         d         e
Joe     0.684081  0.110111 -0.122685 -0.392944  0.676586
Steve   0.295614 -0.488849  0.111023 -0.452018 -0.593795
Wes     0.684081  0.110111 -0.122685 -0.392944  0.676586
Jim     0.295614 -0.488849  0.111023 -0.452018 -0.593795
Travis  0.684081  0.110111 -0.122685 -0.392944  0.676586
Suppose instead you want to subtract the mean from each group. To do that, create a demeaning function and pass it to transform:
>>> def demean(arr):
...     return arr - arr.mean()
...
>>> demeaned = people.groupby(key).transform(demean)
>>> demeaned
               a         b         c         d         e
Joe    -0.779960  0.893851 -1.448675 -0.091887 -0.162785
Steve  -0.323736  0.072072  0.659981 -0.131960 -0.498387
Wes     0.305050 -1.817776  0.450697 -0.454107 -0.952844
Jim     0.323736 -0.072072 -0.659981  0.131960  0.498387
Travis  0.474909  0.923925  0.997978  0.545994  1.115629
You can check that the group means of demeaned are now zero.
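That check can be sketched as follows (using a freshly generated frame in place of the `people` shown above, so the values differ):

```python
import numpy as np
import pandas as pd

# Stand-in for `people`; names and group labels are from the text
people = pd.DataFrame(np.random.randn(5, 3), columns=['a', 'b', 'c'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
key = ['one', 'two', 'one', 'two', 'one']

# Subtract each group's mean from its rows
demeaned = people.groupby(key).transform(lambda arr: arr - arr.mean())

# Every group mean is now zero, up to floating-point error
print(np.allclose(demeaned.groupby(key).mean(), 0))  # True
```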
apply: a general 'split-apply-combine'
Suppose you want to select the five rows with the highest tip_pct values within each group. First, write a function that selects the rows with the largest values in a given column:
>>> def top(df, n=5, columns='tip_pct'):
...     return df.sort_values(by=columns)[-n:]
>>> top(tips,n=6)
total_bill tip smoker day time size tip_pct
109 14.31 4.00 Yes Sat Dinner 2 0.279525
183 23.17 6.50 Yes Sun Dinner 4 0.280535
232 11.61 3.39 No Sat Dinner 2 0.291990
67 3.07 1.00 Yes Sat Dinner 1 0.325733
178 9.60 4.00 Yes Sun Dinner 2 0.416667
172 7.25 5.15 Yes Sun Dinner 2 0.710345
The top function is called on each piece of the DataFrame, and the results are then glued together with pandas.concat:
>>> tips.groupby(['smoker','day']).apply(top)
total_bill tip smoker day time size tip_pct
smoker day
No Fri 99 12.46 1.50 No Fri Dinner 2 0.120385
94 22.75 3.25 No Fri Dinner 2 0.142857
91 22.49 3.50 No Fri Dinner 2 0.155625
223 15.98 3.00 No Fri Lunch 3 0.187735
Sat 228 13.28 2.72 No Sat Dinner 2 0.204819
108 18.24 3.76 No Sat Dinner 2 0.206140
110 14.00 3.00 No Sat Dinner 2 0.214286
20 17.92 4.08 No Sat Dinner 2 0.227679
232 11.61 3.39 No Sat Dinner 2 0.291990
Sun 46 22.23 5.00 No Sun Dinner 2 0.224921
17 16.29 3.71 No Sun Dinner 3 0.227747
6 8.77 2.00 No Sun Dinner 2 0.228050
185 20.69 5.00 No Sun Dinner 5 0.241663
51 10.29 2.60 No Sun Dinner 2 0.252672
Thur 81 16.66 3.40 No Thur Lunch 2 0.204082
139 13.16 2.75 No Thur Lunch 2 0.208967
87 18.28 4.00 No Thur Lunch 2 0.218818
88 24.71 5.85 No Thur Lunch 2 0.236746
149 7.51 2.00 No Thur Lunch 2 0.266312
Yes Fri 226 10.09 2.00 Yes Fri Lunch 2 0.198216
100 11.35 2.50 Yes Fri Dinner 2 0.220264
222 8.58 1.92 Yes Fri Lunch 1 0.223776
221 13.42 3.48 Yes Fri Lunch 2 0.259314
93 16.32 4.30 Yes Fri Dinner 2 0.263480
Sat 171 15.81 3.16 Yes Sat Dinner 2 0.199873
63 18.29 3.76 Yes Sat Dinner 4 0.205577
214 28.17 6.50 Yes Sat Dinner 3 0.230742
109 14.31 4.00 Yes Sat Dinner 2 0.279525
67 3.07 1.00 Yes Sat Dinner 1 0.325733
Sun 174 16.82 4.00 Yes Sun Dinner 2 0.237812
181 23.33 5.65 Yes Sun Dinner 2 0.242177
183 23.17 6.50 Yes Sun Dinner 4 0.280535
178 9.60 4.00 Yes Sun Dinner 2 0.416667
172 7.25 5.15 Yes Sun Dinner 2 0.710345
Thur 204 20.53 4.00 Yes Thur Lunch 4 0.194837
205 16.47 3.23 Yes Thur Lunch 3 0.196114
191 19.81 4.19 Yes Thur Lunch 2 0.211509
200 18.71 4.00 Yes Thur Lunch 3 0.213789
194 16.58 4.00 Yes Thur Lunch 2 0.241255
To suppress the group keys, pass group_keys=False:
>>> tips.groupby('smoker',group_keys=False).apply(top)
total_bill tip smoker day time size tip_pct
88 24.71 5.85 No Thur Lunch 2 0.236746
185 20.69 5.00 No Sun Dinner 5 0.241663
51 10.29 2.60 No Sun Dinner 2 0.252672
149 7.51 2.00 No Thur Lunch 2 0.266312
232 11.61 3.39 No Sat Dinner 2 0.291990
109 14.31 4.00 Yes Sat Dinner 2 0.279525
183 23.17 6.50 Yes Sun Dinner 4 0.280535
67 3.07 1.00 Yes Sat Dinner 1 0.325733
178 9.60 4.00 Yes Sun Dinner 2 0.416667
172 7.25 5.15 Yes Sun Dinner 2 0.710345
Quantile and bucket analysis
>>> frame = DataFrame({'data1':np.random.randn(1000),'data2':np.random.randn(1000)})
>>> factor = pd.cut(frame.data1,4)
>>> factor[:10]
0 (-1.6, -0.026]
1 (-1.6, -0.026]
2 (1.548, 3.123]
3 (-1.6, -0.026]
4 (-0.026, 1.548]
5 (-0.026, 1.548]
6 (-1.6, -0.026]
7 (-0.026, 1.548]
8 (-0.026, 1.548]
9 (-1.6, -0.026]
Name: data1, dtype: category
Categories (4, interval[float64]): [(-3.181, -1.6] < (-1.6, -0.026] < (-0.026, 1.548] <
(1.548, 3.123]]
>>> def get_stats(group):
... return {'min':group.min(),'max':group.max(),'count':group.count(),'mean':group.mean()}
...
>>> grouped = frame.data2.groupby(factor)
>>> grouped.apply(get_stats).unstack()
count max mean min
data1
(-3.181, -1.6] 47.0 1.560586 0.067778 -3.094980
(-1.6, -0.026] 431.0 2.920156 -0.031899 -2.778233
(-0.026, 1.548] 460.0 2.339734 -0.057856 -2.739892
(1.548, 3.123] 62.0 1.728365 -0.143399 -2.449822
>>> grouping = pd.qcut(frame.data1,10,labels=False)
>>> grouped = frame.data2.groupby(grouping)
>>> grouped.apply(get_stats).unstack()
count max mean min
data1
0 100.0 2.248114 0.069002 -3.094980
1 100.0 1.923236 -0.237785 -2.743977
2 100.0 2.920156 0.115480 -2.778233
3 100.0 2.481512 -0.060810 -2.581747
4 100.0 2.793314 0.030760 -2.595131
5 100.0 2.337741 -0.142877 -2.332392
6 100.0 2.339734 -0.046468 -2.589412
7 100.0 2.275533 -0.008744 -2.588843
8 100.0 1.901215 -0.095933 -2.739892
9 100.0 2.229256 -0.083296 -2.449822
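As an aside, the same per-bucket summary can be had without writing get_stats at all, by passing a list of aggregation names to agg; a sketch, with a freshly generated frame (so the numbers differ from those above):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'data1': np.random.randn(1000),
                      'data2': np.random.randn(1000)})
factor = pd.cut(frame.data1, 4)  # four equal-width buckets

# agg with a list of function names yields one column per statistic,
# equivalent to the get_stats/unstack approach above
stats = frame.data2.groupby(factor).agg(['min', 'max', 'count', 'mean'])
print(stats.shape)  # (4, 4)
```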
Example: filling missing values with group-specific values
Filling NA values with the mean:
>>> s = Series(np.random.randn(6))
>>> s[::2] = np.nan
>>> s
0 NaN
1 -1.430336
2 NaN
3 0.937739
4 NaN
5 0.236223
dtype: float64
>>> s.fillna(s.mean())
0 -0.085458
1 -1.430336
2 -0.085458
3 0.937739
4 -0.085458
5 0.236223
dtype: float64
Suppose you want the fill value to vary by group. Simply group the data and use apply with a function that calls fillna on each data chunk:
>>> states = ['Ohio','New York','Vermont','Florida','Oregon','Nevada','California','Idaho']
>>> group_key = ['East']*4 + ['West'] *4
>>> data = Series(np.random.randn(8),index=states)
>>> data[['Vermont','Nevada','Idaho']] = np.nan
>>> data
Ohio -0.734886
New York 1.573174
Vermont NaN
Florida -1.172843
Oregon 0.988466
Nevada NaN
California -1.872393
Idaho NaN
dtype: float64
>>> data.groupby(group_key).mean()
East -0.111518
West -0.441964
dtype: float64
We can use the group means to fill the NA values:
>>> fill_mean = lambda g: g.fillna(g.mean())
>>> data.groupby(group_key).apply(fill_mean)
Ohio -0.734886
New York 1.573174
Vermont -0.111518
Florida -1.172843
Oregon 0.988466
Nevada -0.441964
California -1.872393
Idaho -0.441964
dtype: float64
We can also predefine a fill value for each group in code:
>>> fill_values = {'East':0.5,'West':-1}
>>> fill_func = lambda g: g.fillna(fill_values[g.name])
>>> data.groupby(group_key).apply(fill_func)
Ohio -0.734886
New York 1.573174
Vermont 0.500000
Florida -1.172843
Oregon 0.988466
Nevada -1.000000
California -1.872393
Idaho -1.000000
dtype: float64
Example: random sampling and permutation
One way to draw a random sample without replacement is np.random.permutation(N), where N is the size of the full dataset.
>>> suits = ['H','S','C','D']
>>> card_val = (list(range(1, 11)) + [10]*3) * 4
>>> card_val
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10]
>>> base_names = ['A'] + list(range(2, 11)) + ['J', 'K', 'Q']
>>> base_names
['A', 2, 3, 4, 5, 6, 7, 8, 9, 10, 'J', 'K', 'Q']
>>> carda = []
>>> for suit in suits:
... carda.extend(str(num) + suit for num in base_names)
>>> deck = Series(card_val,index=carda)
>>> deck[:13]
AH 1
2H 2
3H 3
4H 4
5H 5
6H 6
7H 7
8H 8
9H 9
10H 10
JH 10
KH 10
QH 10
dtype: int64
>>> def draw(deck,n=5):
... return deck.take(np.random.permutation(len(deck))[:n])
...
>>> draw(deck)
3H 3
KS 10
QC 10
JS 10
10C 10
dtype: int64
>>> get_suit = lambda card: card[-1]
>>> deck.groupby(get_suit).apply(draw,n=2)
C 9C 9
JC 10
D 4D 4
AD 1
H 4H 4
10H 10
S 3S 3
KS 10
dtype: int64
>>> deck.groupby(get_suit,group_keys=False).apply(draw,n=2)
7C 7
4C 4
AD 1
5D 5
9H 9
4H 4
6S 6
KS 10
dtype: int64
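In more recent pandas (1.1+), the same per-group sampling is built in as GroupBy.sample; a sketch, rebuilding the deck from the text:

```python
import pandas as pd

# Rebuild the deck from the text
suits = ['H', 'S', 'C', 'D']
card_val = (list(range(1, 11)) + [10]*3) * 4
base_names = ['A'] + list(range(2, 11)) + ['J', 'K', 'Q']
cards = []
for suit in suits:
    cards.extend(str(num) + suit for num in base_names)
deck = pd.Series(card_val, index=cards)

# Draw two random cards from each suit without replacement
hand = deck.groupby(lambda card: card[-1]).sample(n=2)
print(len(hand))  # 8
```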
Example: group weighted average and correlation
>>> df = DataFrame({'category':['a','a','a','a','b','b','b','b'],'data':np.random.randn(8),'weights':np.random.rand(8)})
>>> df
category data weights
0 a -1.493554 0.300840
1 a -2.008278 0.693407
2 a 1.006548 0.736280
3 a -1.226051 0.128157
4 b -0.981050 0.327538
5 b -0.487632 0.201700
6 b -1.262182 0.201121
7 b -0.205049 0.206801
>>> grouped = df.groupby('category')
>>> get_wavg = lambda g: np.average(g['data'],weights = g['weights'])
>>> grouped.apply(get_wavg)
category
a -0.676769
b -0.763949
dtype: float64
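The heading also promises group correlations; a sketch of how that looks, using a hypothetical frame with two data columns:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: two measurements per row, grouped by category
df = pd.DataFrame({'category': list('aaaabbbb'),
                   'data1': np.random.randn(8),
                   'data2': np.random.randn(8)})

# Correlation of data1 with data2, computed within each category
corrs = df.groupby('category').apply(lambda g: g['data1'].corr(g['data2']))
print(corrs.index.tolist())  # ['a', 'b']
```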