1. 程式人生 > >pandas中的分組技術

pandas中的分組技術

red 他會 axis 數據 了吧 組成 區別 con cal

目錄

  • 1 分組操作
    • 1.1 按照列進行分組
    • 1.2 按照字典進行分組
    • 1.3 根據函數進行分組
    • 1.4 按照list組合
    • 1.5 按照索引級別進行分組
  • 2 分組運算
    • 2.1 agg
    • 2.2 transform
    • 2.3 apply
  • 3 利用groupby技術多進程處理DataFrame

我們在這裏要講一個很常用的技術, 就是所謂的分組技術, 這個在數據庫中是非常常用的, 要去求某些分組的統計量, 那麽我們需要知道在pandas裏面, 這些分組技術是怎麽實現的.

分組操作

我們這裏要來聊聊在pandas中實現分組運算, 大致上可以按照列, 字典或者Series, 函數, 索引級別進行分組, 我們會逐漸來介紹.

按照列進行分組

import pandas as pd
from pandas import DataFrame, Series
import numpy as np

sep = "---------------------------------------------------------------------------"
data = DataFrame({"key1": [‘a‘, ‘a‘, ‘b‘, ‘b‘, ‘a‘], "key2": [‘one‘, ‘two‘, ‘one‘, ‘two‘, ‘one‘], ‘data1‘: np.random.randn(5), ‘data2‘
: np.random.randn(5)}) print(data)
      data1     data2 key1 key2
0  0.733951  0.000379    a  one
1  1.039029  0.852930    a  two
2  0.921413 -1.644942    b  one
3  0.294560  0.521525    b  two
4  0.286072 -0.074574    a  one

data1按照key1分組為:

groups = data[‘data1‘].groupby(data[‘key1‘])

我們發現得到了一個SeriesGroupBy 對象, 現在我們對這個對象進行叠代

:

for name, group in groups:
    print(name)
    print(sep)
    print(group)
    print(sep)
a
---------------------------------------------------------------------------
0    0.733951
1    1.039029
4    0.286072
Name: data1, dtype: float64
---------------------------------------------------------------------------
b
---------------------------------------------------------------------------
2    0.921413
3    0.294560
Name: data1, dtype: float64
---------------------------------------------------------------------------

我們發現, groups有(key, Series)對組成, key根據什麽來分組的元素, Series(DataFrame)是分組的元素, Series(DataFrame)的name還是原來的列名.

對你分組進行叠代, 用:

for name, group in groups

groups = data.groupby(data[‘key1‘])
for name, group in groups:
    print(name)
    print(sep)
    print(group)
    print(sep)
a
---------------------------------------------------------------------------
      data1     data2 key1 key2
0  0.733951  0.000379    a  one
1  1.039029  0.852930    a  two
4  0.286072 -0.074574    a  one
---------------------------------------------------------------------------
b
---------------------------------------------------------------------------
      data1     data2 key1 key2
2  0.921413 -1.644942    b  one
3  0.294560  0.521525    b  two
---------------------------------------------------------------------------

groupby就是按照某個值來分組, 無論是對series還是dataframe, 都成立.

我們可以在分好組的對象上調用統計函數.

data.groupby(data[‘key1‘]).mean()
data1 data2
key1
a 0.686351 0.259578
b 0.607986 -0.561709

在每個分組上分別對每個每一列求均值, 如果是非數字列, 或默認剔除.

作業1:在每個分組上分別對每個每一行求均值.

提示: data.groupby(data[‘key1‘]).mean(axis=1)是行不通的.

對於多個列進行分組, 分組的key是對應分組元素的元組.

作業2:對DataFrame用多個列進行分組.

下面其我們來看一個語法糖:

data.groupby([data[‘key1‘], data[‘key2‘]])
<pandas.core.groupby.DataFrameGroupBy object at 0x000001D080230278>

它等價於:

data.groupby([‘key1‘, ‘key2‘])
<pandas.core.groupby.DataFrameGroupBy object at 0x000001D080230630>

我們來驗證一下:

groups =data.groupby([data[‘key1‘], data[‘key2‘]])
for name, group in groups:
    print(name)
    print(sep)
    print(group)
    print(sep)
(‘a‘, ‘one‘)
---------------------------------------------------------------------------
      data1     data2 key1 key2
0  0.733951  0.000379    a  one
4  0.286072 -0.074574    a  one
---------------------------------------------------------------------------
(‘a‘, ‘two‘)
---------------------------------------------------------------------------
      data1    data2 key1 key2
1  1.039029  0.85293    a  two
---------------------------------------------------------------------------
(‘b‘, ‘one‘)
---------------------------------------------------------------------------
      data1     data2 key1 key2
2  0.921413 -1.644942    b  one
---------------------------------------------------------------------------
(‘b‘, ‘two‘)
---------------------------------------------------------------------------
     data1     data2 key1 key2
3  0.29456  0.521525    b  two
---------------------------------------------------------------------------
groups = data.groupby([‘key1‘, ‘key2‘])
for name, group in groups:
    print(name)
    print(sep)
    print(group)
    print(sep)
(‘a‘, ‘one‘)
---------------------------------------------------------------------------
      data1     data2 key1 key2
0  0.733951  0.000379    a  one
4  0.286072 -0.074574    a  one
---------------------------------------------------------------------------
(‘a‘, ‘two‘)
---------------------------------------------------------------------------
      data1    data2 key1 key2
1  1.039029  0.85293    a  two
---------------------------------------------------------------------------
(‘b‘, ‘one‘)
---------------------------------------------------------------------------
      data1     data2 key1 key2
2  0.921413 -1.644942    b  one
---------------------------------------------------------------------------
(‘b‘, ‘two‘)
---------------------------------------------------------------------------
     data1     data2 key1 key2
3  0.29456  0.521525    b  two
---------------------------------------------------------------------------

我們發現輸出結果是一模一樣, 總結一下:

data.groupby([data[‘key1‘], data[‘key2‘]])等價於data.groupby([‘key1‘, ‘key2‘])

進一步:

data[‘data1‘].groupby([data[‘key1‘], data[‘key2‘]])等價於data.groupby([‘key1‘, ‘key2‘])[‘data1‘]

作業3: 驗證data[‘data1‘].groupby([data[‘key1‘], data[‘key2‘]])等價於data.groupby([‘key1‘, ‘key2‘])[‘data1‘].

data.groupby([‘key1‘, ‘key2‘])[‘data1‘]
<pandas.core.groupby.SeriesGroupBy object at 0x000001D0FCD95D68>
data.groupby([‘key1‘, ‘key2‘])[[‘data1‘]]
<pandas.core.groupby.DataFrameGroupBy object at 0x000001D080232898>

我不知道大家發現沒有, 這兩個返回的數據類型是有區別的, 我們仔細來看看:

data[[‘data1‘]] # 這是一個DataFrame
data1
0 0.733951
1 1.039029
2 0.921413
3 0.294560
4 0.286072
data[‘data1‘] # 這是一個Series
0    0.733951
1    1.039029
2    0.921413
3    0.294560
4    0.286072
Name: data1, dtype: float64

那麽這裏的區別就不言而喻了吧

groups = data.groupby([‘key1‘, ‘key2‘])[[‘data1‘]]

for name, group in groups:
    print(name)
    print(sep)
    print(group)
    print(sep)
(‘a‘, ‘one‘)
---------------------------------------------------------------------------
<class ‘pandas.core.frame.DataFrame‘>
---------------------------------------------------------------------------
(‘a‘, ‘two‘)
---------------------------------------------------------------------------
<class ‘pandas.core.frame.DataFrame‘>
---------------------------------------------------------------------------
(‘b‘, ‘one‘)
---------------------------------------------------------------------------
<class ‘pandas.core.frame.DataFrame‘>
---------------------------------------------------------------------------
(‘b‘, ‘two‘)
---------------------------------------------------------------------------
<class ‘pandas.core.frame.DataFrame‘>
---------------------------------------------------------------------------

結果是一樣的.

data.groupby([‘key1‘, ‘key2‘])[[‘data1‘]].mean()
data1
key1 key2
a one 0.510012
two 1.039029
b one 0.921413
two 0.294560
data.groupby([‘key1‘, ‘key2‘])[‘data1‘].mean()
key1  key2
a     one     0.510012
      two     1.039029
b     one     0.921413
      two     0.294560
Name: data1, dtype: float64

在做數據聚合的時候就發現了不同,

[[‘data1‘]]得到的是一個DataFrame, 而[‘data1‘]得到的是Series.

按照字典進行分組

我們來看一個按照字典進行分組的例子:

data = DataFrame(np.random.randn(5, 5), columns=[‘a‘, ‘b‘, ‘c‘, ‘d‘, ‘e‘], index=[‘joe‘, ‘steve‘, ‘wes‘, ‘jim‘, ‘Travis‘])
data
a b c d e
joe -0.089597 1.239307 2.173063 -0.519295 -1.783812
steve 0.539109 0.724553 -0.041899 0.787494 0.394633
wes -0.055417 0.384068 -0.594006 -0.451587 0.722761
jim -0.056767 0.398863 2.140669 -1.060791 -0.953756
Travis 0.245142 -0.468819 -0.863372 -0.151966 1.185567
# 定義一個分組的字典, a, b, c --> red, d, e --> blue
mapping = {‘a‘:‘red‘, ‘b‘:‘red‘, ‘c‘: ‘red‘, ‘d‘:‘blue‘, ‘e‘: ‘blue‘}
data.groupby(mapping, axis=1).mean()   # 對每一個分組求平均
blue red
joe -1.151554 1.107591
steve 0.591063 0.407255
wes 0.135587 -0.088452
jim -1.007273 0.827589
Travis 0.516800 -0.362350

作業4:自己設計一個index的mapping, 按axis=0進行分組.

根據函數進行分組

話不多說, 直接來看例子:

data.groupby(len).mean()
a b c d e
3 -0.067260 0.674079 1.239909 -0.677224 -0.671602
5 0.539109 0.724553 -0.041899 0.787494 0.394633
6 0.245142 -0.468819 -0.863372 -0.151966 1.185567

我們發現, 字典和函數都是作用到索引上的.

按照list組合

這個例子非常簡單:

data.groupby([‘1‘, ‘1‘, ‘1‘, ‘2‘, ‘2‘]).mean()
a b c d e
1 0.131365 0.782643 0.512386 -0.061130 -0.222139
2 0.094188 -0.034978 0.638649 -0.606378 0.115905

他會自動判斷是按照列還是list.

按照索引級別進行分組

作業5: 自己學習按索引級別進行分組.

分組運算

分組運算主要設計到3個函數, agg, transform和apply.

我們一個一個來看.

agg

data = DataFrame({"key1": [‘a‘, ‘a‘, ‘b‘, ‘b‘, ‘a‘], "key2": [‘one‘, ‘two‘, ‘one‘, ‘two‘, ‘one‘], ‘data1‘: np.random.randn(5), ‘data2‘: np.random.randn(5)})
data
data1 data2 key1 key2
0 0.441278 -0.848457 a one
1 1.843375 -0.522482 a two
2 -1.435176 -0.191682 b one
3 -2.700772 -0.832993 b two
4 -1.430386 -1.910834 a one
data.groupby("key2").agg(np.mean)
data1 data2
key2
one -0.808095 -0.983658
two -0.428699 -0.677738

當然, 這個等價於:

data.groupby("key2").mean()
data1 data2
key2
one -0.808095 -0.983658
two -0.428699 -0.677738

原理: 聚合函數會在group後的每個切片上(group後的每一行或每一列)計算出值.

我們也可以自定義函數:

data.groupby("key2").agg(lambda x: x.max() - x.min())
data1 data2
key2
one 1.876454 1.719153
two 4.544147 0.310511

他會在每個分組的每個列上調用這個函數.

data.groupby("key2").agg([np.mean, np.max,np.min])
data1 data2
mean amax amin mean amax amin
key2
one -0.808095 0.441278 -1.435176 -0.983658 -0.191682 -1.910834
two -0.428699 1.843375 -2.700772 -0.677738 -0.522482 -0.832993
data.groupby("key2").agg([("平均值:", np.mean), ("最大值",np.max), ("最小值",np.min)]).rename({"one": "第一組", "two":"第二組"})
data1 data2
平均值: 最大值 最小值 平均值: 最大值 最小值
key2
第一組 -0.808095 0.441278 -1.435176 -0.983658 -0.191682 -1.910834
第二組 -0.428699 1.843375 -2.700772 -0.677738 -0.522482 -0.832993
# 對不同列用不同的分組函數 
data.groupby("key2").agg({"data1":[("平均值:", np.mean), ("最大值",np.max)], "data2":[("最小值",np.min)]}).rename({"one": "第一組", "two":"第二組"})
data2 data1
最小值 平均值: 最大值
key2
第一組 -1.910834 -0.808095 0.441278
第二組 -0.832993 -0.428699 1.843375

transform

transform是一個矢量化的函數, 如果最後我們得到的值和分組切片不一致, 會進行廣播:

data
data1 data2 key1 key2
0 0.441278 -0.848457 a one
1 1.843375 -0.522482 a two
2 -1.435176 -0.191682 b one
3 -2.700772 -0.832993 b two
4 -1.430386 -1.910834 a one
data.groupby("key1").transform(np.mean)
data1 data2
0 0.284756 -1.093924
1 0.284756 -1.093924
2 -2.067974 -0.512338
3 -2.067974 -0.512338
4 0.284756 -1.093924

仔細看, 0,1, 4一組, 2,3一組, 發生了廣播.

現在有個需求,按分組減去均值.

data.groupby("key1").transform(lambda x: x - x.mean())
data1 data2
0 0.156523 0.245468
1 1.558619 0.571442
2 0.632798 0.320656
3 -0.632798 -0.320656
4 -1.715142 -0.816910

a, b分組的各列都減去了他們的均值, 不信, 來看:

data.groupby("key1").transform(lambda x: x - x.mean()).groupby([1, 1, 0,0, 1]).mean()
data1 data2
0 1.110223e-16 -5.551115e-17
1 7.401487e-17 -1.110223e-16

apply

這個函數是transform的加強版, transform只能返回和原來切片大小一樣大的, 但apply是可以任意的. 其實我們之前就用過apply函數, 我們知道, apply是作用在列(行)上的, applymap是作用在函數上的.

data = DataFrame({"key1": [‘a‘, ‘a‘, ‘b‘, ‘b‘, ‘a‘], "key2": [‘one‘, ‘two‘, ‘one‘, ‘two‘, ‘one‘], ‘data1‘: np.random.randn(5), ‘data2‘: np.random.randn(5)})
data
data1 data2 key1 key2
0 -0.312694 0.073574 a one
1 -0.902065 -0.854249 a two
2 -0.440915 0.228551 b one
3 -0.406243 -0.878505 b two
4 1.812926 -0.114598 a one

如果我們要找出one, 和two分組中選出data2最大的前兩個呢?

data.groupby(‘key2‘).apply(lambda x: x.sort_values(by=‘data2‘)[-2:])
data1 data2 key1 key2
key2
one 0 -0.312694 0.073574 a one
2 -0.440915 0.228551 b one
two 3 -0.406243 -0.878505 b two
1 -0.902065 -0.854249 a two

去掉group層次索引:

data.groupby(‘key2‘, group_keys=False).apply(lambda x: x.sort_values(by=‘data2‘)[-2:])
data1 data2 key1 key2
0 -0.312694 0.073574 a one
2 -0.440915 0.228551 b one
3 -0.406243 -0.878505 b two
1 -0.902065 -0.854249 a two

總結一下: apply就是把分完組的切片挨個(按行, 按列, 或者整體)調用我們的函數, 最後再把結果合並起來.

利用groupby技術多進程處理DataFrame

我們這裏要教大家用一種groupby技術, 來實現對DataFrame並行處理.

pip install joblib

因為我們windows系統的限制, 我們的代碼是在linux上運行的:


import math
from joblib import Parallel, delayed
from pandas import DataFrame
import pandas as pd
import numpy as np
import time

begin = time.time()
test = DataFrame(np.random.randn(10000000, 10))
test_other = test.copy()
groups = test.groupby(lambda x: x % 8)

def func(x):
    return x.applymap(lambda y: math.pow(y, 4))

pd.concat(Parallel(n_jobs=8)(delayed(func)(group) for name, group in groups))
print(time.time() - begin)


begin = time.time()
test_other.applymap(lambda x: math.pow(x, 4))
print(time.time() - begin)

運算結果為:
23.35878014564514
62.76386260986328

速度大概提升了2.5倍, 還是很不錯的.

pandas中的分組技術