Data Analysis Tools: Pandas (Part 2) (Repost)

I. Statistical Computation and Description in Pandas

Example code:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5,4), columns = ['a', 'b', 'c', 'd'])
print(df)

Output:

          a         b         c         d
0  1.469682  1.948965  1.373124 -0.564129
1 -1.466670 -0.494591  0.467787 -2.007771
2  1.368750  0.532142  0.487862 -1.130825
3 -0.758540 -0.479684  1.239135  1.073077
4 -0.007470  0.997034  2.669219  0.742070

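1. Common statistical computations

sum, mean, max, min, ...

axis=0 computes the statistic down each column (one result per column); axis=1 computes it across each row (one result per row)

skipna excludes missing values; it defaults to True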
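The effect of axis and skipna is easier to see on fixed data than on random numbers. The following is a minimal sketch; the frame `demo` and its values are made up for illustration and are not part of the original article.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in column 'a'.
demo = pd.DataFrame({'a': [1.0, 2.0, np.nan],
                     'b': [10.0, 20.0, 30.0]})

col_sums = demo.sum()                            # axis=0 (default): one value per column
row_sums = demo.sum(axis=1)                      # axis=1: one value per row, NaN skipped
row_min_strict = demo.min(axis=1, skipna=False)  # NaN propagates into the result

print(col_sums)
print(row_sums)
print(row_min_strict)
```

With skipna=False, the last row's minimum is NaN because that row contains a missing value.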

Example code:

print('--------sum-------')
print(df.sum())
print('--------max-------')
print(df.max())
print('--------min------')
print(df.min(axis=1, skipna=False))

Output:

2. Common descriptive statistics

describe generates several summary statistics at once
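As a quick sanity check, each row of a describe() result can be compared with the corresponding standalone method. This is a sketch on a made-up single-column frame (`scores` is a hypothetical name), not the article's random data.

```python
import pandas as pd

# Hypothetical single-column frame with easy-to-check values.
scores = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0]})
summary = scores.describe()

print(summary)
# Each row of the summary agrees with the standalone method of the same name.
print(summary.loc['mean', 'x'] == scores['x'].mean())
print(summary.loc['50%', 'x'] == scores['x'].median())
```

Note that the std row is the sample standard deviation, the same value that Series.std() returns by default.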

Example code:

print('----describe-------')
print(df.describe())

Output:

----describe-------
              a         b         c         d
count  5.000000  5.000000  5.000000  5.000000
mean   0.110632  0.306937 -0.081782 -0.382677
std    1.578243  0.767683  0.902212  1.251316
min   -1.491773 -0.682076 -1.475137 -2.151721
25%   -0.946890  0.020311 -0.478508 -0.619704
50%    0.091479  0.068958  0.275529 -0.543112
75%    0.290808  0.845925  0.595664  0.106920
max    2.609536  1.281567  0.673540  1.294231

Common methods:

[Figure: table of common statistical methods (image not available)]

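II. Grouping and Aggregation in Pandas

(1) Grouping (groupby)

  • Split a dataset into groups, then run statistical analysis on each group
  • SQL can filter data and perform grouped aggregation
  • pandas can use groupby to perform more complex grouped operations
  • A grouped computation follows split -> apply -> combine:
    1. Split: the criterion by which the data is grouped
    2. Apply: the computation run on each group
    3. Combine: the per-group results are merged into one result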
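The three steps above can be spelled out by hand and compared with a single groupby call. This is only an illustrative sketch on made-up data; it is not how pandas implements groupby internally.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({'key': ['a', 'b', 'a', 'b'],
                   'value': [1, 2, 3, 4]})

# 1. Split: collect the values belonging to each key.
pieces = {k: df.loc[df['key'] == k, 'value'] for k in df['key'].unique()}
# 2. Apply: reduce each piece to a single number.
sums = {k: piece.sum() for k, piece in pieces.items()}
# 3. Combine: assemble the per-group results into one Series.
manual = pd.Series(sums).sort_index()

# groupby expresses the same three steps in one call.
grouped = df.groupby('key')['value'].sum()

print(manual)
print(grouped)
```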

Example code:

import pandas as pd
import numpy as np

dict_obj = {'key1' : ['a', 'b', 'a', 'b', 
                      'a', 'b', 'a', 'a'],
            'key2' : ['one', 'one', 'two', 'three',
                      'two', 'two', 'one', 'three'],
            'data1': np.random.randn(8),
            'data2': np.random.randn(8)}
df_obj = pd.DataFrame(dict_obj)
print(df_obj)

Output:

      data1     data2 key1   key2
0  0.974685 -0.672494    a    one
1 -0.214324  0.758372    b    one
2  1.508838  0.392787    a    two
3  0.522911  0.630814    b  three
4  1.347359 -0.177858    a    two
5 -0.264616  1.017155    b    two
6 -0.624708  0.450885    a    one
7 -1.019229 -1.143825    a  three

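1. GroupBy objects

DataFrameGroupBy, SeriesGroupBy

1.1. Grouping

groupby() performs the grouping. The resulting GroupBy object has not computed anything yet; it only holds the intermediate data that describes the groups

Group by column name: obj.groupby('label')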
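To see what a GroupBy object holds before any computation runs, its ngroups and groups attributes can be inspected. A minimal sketch on made-up data (the frame below is hypothetical):

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({'key1': ['a', 'b', 'a'],
                   'data1': [1.0, 2.0, 3.0]})

g = df.groupby('key1')   # no aggregation has run yet

print(g.ngroups)         # number of distinct groups
print(g.groups)          # mapping: group label -> row labels in that group

# The actual computation happens only when an aggregation is requested.
result = g['data1'].sum()
print(result)
```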

Example code:

# Group the DataFrame by key1
print(type(df_obj.groupby('key1')))

# Group the data1 column of the DataFrame by key1
print(type(df_obj['data1'].groupby(df_obj['key1'])))

Output:

<class 'pandas.core.groupby.DataFrameGroupBy'>
<class 'pandas.core.groupby.SeriesGroupBy'>

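1.2. Grouped computation

Run a computation such as mean() on a GroupBy object, whether it was grouped by one key or by several

Non-numeric columns are left out of the computation. Older pandas versions dropped them silently; pandas 2.0 and later raise a TypeError for mean() unless numeric_only=True is passed, so the code below sets it explicitly

Example code:

# Grouped computation on the DataFrame
grouped1 = df_obj.groupby('key1')
print(grouped1.mean(numeric_only=True))

# Grouped computation on a single column
grouped2 = df_obj['data1'].groupby(df_obj['key1'])
print(grouped2.mean())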

Output:

         data1     data2
key1                    
a     0.437389 -0.230101
b     0.014657  0.802114
key1
a    0.437389
b    0.014657
Name: data1, dtype: float64

size() returns the number of elements in each group
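size() counts every row in a group, whereas count() counts only non-missing values, so the two differ as soon as a group contains NaN. A short sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical data: group 'a' has one missing value.
df = pd.DataFrame({'key': ['a', 'a', 'b'],
                   'val': [1.0, np.nan, 3.0]})

sizes = df.groupby('key').size()            # rows per group, NaN included
counts = df.groupby('key')['val'].count()   # non-NaN values per group

print(sizes)
print(counts)
```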

Example code:

# Group sizes
print(grouped1.size())
print(grouped2.size())

Output:

key1
a    5
b    3
dtype: int64
key1
a    5
b    3
dtype: int64

1.3. Grouping by a custom key

obj.groupby(self_def_key)

The custom key can be a list, or a list of several keys

obj.groupby(['label1', 'label2']) -> a result with a multi-level index
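Grouping by two keys yields a result indexed by a MultiIndex, which unstack() can pivot into a wide table. The sketch below uses made-up data so that the numbers are easy to verify by hand.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b'],
                   'key2': ['one', 'two', 'one', 'one'],
                   'val': [1, 2, 3, 4]})

by_two = df.groupby(['key1', 'key2'])['val'].sum()
print(by_two)                 # indexed by (key1, key2) pairs
print(by_two.index.nlevels)   # two index levels

# Move the inner level (key2) into the columns.
wide = by_two.unstack()
print(wide)                   # the missing (b, two) combination becomes NaN
```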

Example code:

# Group by a custom key given as a list
self_def_key = [0, 1, 2, 3, 3, 4, 5, 7]
print(df_obj.groupby(self_def_key).size())

# Group by a custom key given as a list of several keys
self_key = [df_obj['key1'], df_obj['key2']]
print(df_obj.groupby(self_key).size())

# Multi-level grouping by several columns
grouped2 = df_obj.groupby(['key1', 'key2'])
print(grouped2.size())

# Multi-level grouping follows the order of the keys
grouped3 = df_obj.groupby(['key2', 'key1'])
print(grouped3.mean())
# unstack turns the multi-level index result into a DataFrame with a single-level row index
print(grouped3.mean().unstack())

2. GroupBy objects support iteration

Each iteration yields a tuple (group_name, group_data)

This is useful for running a specific computation on each group's data
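As an illustration of using the iteration for a per-group computation, the following sketch collects the value range (max minus min) of each group into a plain dict. The data and the name `ranges` are made up for this example.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({'key': ['a', 'b', 'a', 'b'],
                   'val': [1, 5, 4, 7]})

ranges = {}
for name, group in df.groupby('key'):
    # name is the group label; group is the sub-DataFrame for that label.
    ranges[name] = group['val'].max() - group['val'].min()

print(ranges)
```

For simple reductions like this one, a built-in aggregation is usually shorter; explicit iteration is mainly useful when each group needs custom handling.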

2.1. Single-level grouping

Example code:

# Single-level grouping by key1
for name, data in grouped1:
    print(name)
    print(data)

Output:

a
      data1     data2 key1   key2
0  0.974685 -0.672494    a    one
2  1.508838  0.392787    a    two
4  1.347359 -0.177858    a    two
6 -0.624708  0.450885    a    one
7 -1.019229 -1.143825    a  three

b
      data1     data2 key1   key2
1 -0.214324  0.758372    b    one
3  0.522911  0.630814    b  three
5 -0.264616  1.017155    b    two

2.2. Multi-level grouping

Example code:

# Multi-level grouping by key1 and key2
for group_name, group_data in grouped2:
    print(group_name)
    print(group_data)

Output:

('a', 'one')
      data1     data2 key1 key2
0  0.974685 -0.672494    a  one
6 -0.624708  0.450885    a  one

('a', 'three')
      data1     data2 key1   key2
7 -1.019229 -1.143825    a  three

('a', 'two')
      data1     data2 key1 key2
2  1.508838  0.392787    a  two
4  1.347359 -0.177858    a  two

('b', 'one')
      data1     data2 key1 key2
1 -0.214324  0.758372    b  one

('b', 'three')
      data1     data2 key1   key2
3  0.522911  0.630814    b  three

('b', 'two')
      data1     data2 key1 key2
5 -0.264616  1.017155    b  two

3. Converting a GroupBy object to a list or dict

Example code:

# Convert the GroupBy object to a list
print(list(grouped1))

# Convert the GroupBy object to a dict
print(dict(list(grouped1)))

Output:

[('a',       data1     data2 key1   key2
0  0.974685 -0.672494    a    one
2  1.508838  0.392787    a    two
4  1.347359 -0.177858    a    two
6 -0.624708  0.450885    a    one
7 -1.019229 -1.143825    a  three), 
('b',       data1     data2 key1   key2
1 -0.214324  0.758372    b    one
3  0.522911  0.630814    b  three
5 -0.264616  1.017155    b    two)]

{'a':       data1     data2 key1   key2
0  0.974685 -0.672494    a    one
2  1.508838  0.392787    a    two
4  1.347359 -0.177858    a    two
6 -0.624708  0.450885    a    one
7 -1.019229 -1.143825    a  three, 
'b':       data1     data2 key1   key2
1 -0.214324  0.758372    b    one
3  0.522911  0.630814    b  three
5 -0.264616  1.017155    b    two}
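Once the groups are in a dict, a single group can be looked up by its label. When only one group is needed, get_group() gives the same rows without materializing every group. A small sketch on made-up data:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({'key': ['a', 'b', 'a'],
                   'val': [1, 2, 3]})
grouped = df.groupby('key')

# Dict of {group label: sub-DataFrame}.
pieces = dict(list(grouped))
print(pieces['a'])

# Fetch one group directly.
same = grouped.get_group('a')
print(same)
```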

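3.1. Grouping columns by data type

Example code:

# Inspect the dtype of each column
print(df_obj.dtypes)

# Group the columns by data type
# (groupby(..., axis=1) is deprecated as of pandas 2.1)
print(df_obj.groupby(df_obj.dtypes, axis=1).size())
print(df_obj.groupby(df_obj.dtypes, axis=1).sum())

Output: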

data1    float64
data2    float64
key1      object
key2      object
dtype: object

float64    2
object     2
dtype: int64

    float64  object
0  0.302191    a one
1  0.544048    b one
2  1.901626    a two
3  1.153725  b three
4  1.169501    a two
5  0.752539    b two
6 -0.173823    a one
7 -2.163054  a three

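3.2. Other grouping methods

Example code:

df_obj2 = pd.DataFrame(np.random.randint(1, 10, (5,5)),
                       columns=['a', 'b', 'c', 'd', 'e'],
                       index=['A', 'B', 'C', 'D', 'E'])
# .ix was removed in pandas 1.0; use positional .iloc instead.
# Cast b, c, d to float first so these columns can hold NaN.
df_obj2 = df_obj2.astype({'b': float, 'c': float, 'd': float})
df_obj2.iloc[1, 1:4] = np.nan
print(df_obj2)

Output: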

   a    b    c    d  e
A  7  2.0  4.0  5.0  8
B  4  NaN  NaN  NaN  1
C  3  2.0  5.0  4.0  6
D  3  1.0  9.0  7.0  3
E  6  1.0  6.0  8.0  1

3.3. Grouping by a dict

Example code:

# Group columns by a dict that maps column names to group labels
mapping_dict = {'a':'Python', 'b':'Python', 'c':'Java', 'd':'C', 'e':'Java'}
print(df_obj2.groupby(mapping_dict, axis=1).size())
print(df_obj2.groupby(mapping_dict, axis=1).count()) # number of non-NaN values
print(df_obj2.groupby(mapping_dict, axis=1).sum())

Output:

C         1
Java      2
Python    2
dtype: int64

   C  Java  Python
A  1     2       2
B  0     1       1
C  1     2       2
D  1     2       2
E  1     2       2

     C  Java  Python
A  5.0  12.0     9.0
B  NaN   1.0     4.0
C  4.0  11.0     5.0
D  7.0  12.0     4.0
E  8.0   7.0     7.0
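In recent pandas releases (2.1 and later, to the best of my knowledge) the axis=1 argument of DataFrame.groupby is deprecated. One way to group columns without it is to transpose, group the rows, and transpose back. The sketch below assumes an all-numeric frame and uses made-up data and names.

```python
import pandas as pd

# Hypothetical all-numeric frame for illustration.
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]},
                  index=['A', 'B'])
mapping = {'a': 'Python', 'b': 'Python', 'c': 'Java'}

# Transpose so that column names become row labels, group those labels
# through the dict, aggregate, then transpose back.
by_lang = df.T.groupby(mapping).sum().T
print(by_lang)
```

Transposing a frame with mixed dtypes converts everything to object, so this pattern fits best when all columns share a numeric type.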

3.4. Grouping by a function; the function receives each row-index or column-index label as its argument

Example code:

# Group by a function
df_obj3 = pd.DataFrame(np.random.randint(1, 10, (5,5)),
                       columns=['a', 'b', 'c', 'd', 'e'],
                       index=['AA', 'BBB', 'CC', 'D', 'EE'])
#df_obj3

def group_key(idx):
    """
        idx is a column-index or row-index label
    """
    #return idx
    return len(idx)

print(df_obj3.groupby(group_key).size())

# The custom function above is equivalent to
#df_obj3.groupby(len).size()

Output:

1    1
2    3
3    1
dtype: int64

3.5. Grouping by index level

Example code:

# Group by index level
columns = pd.MultiIndex.from_arrays([['Python', 'Java', 'Python', 'Java', 'Python'],
                                     ['A', 'A', 'B', 'C', 'B']], names=['language', 'index'])
df_obj4 = pd.DataFrame(np.random.randint(1, 10, (5, 5)), columns=columns)
print(df_obj4)

# Group by the language level
print(df_obj4.groupby(level='language', axis=1).sum())
# Group by the index level
print(df_obj4.groupby(level='index', axis=1).sum())

Output:

language Python Java Python Java Python
index         A    A      B    C      B
0             2    7      8    4      3
1             5    2      6    1      2
2             6    4      4    5      2
3             4    7      4    3      1
4             7    4      3    4      8

language  Java  Python
0           11      13
1            3      13
2            9      12
3           10       9
4            8      18

index   A   B  C
0       9  11  4
1       7   8  1
2      10   6  5
3      11   5  3
4      11  11  4

(2) Aggregation

  • The process of reducing an array to a scalar, for example mean() or count()
  • Commonly used to compute over grouped data
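Several such reductions can be applied to each group in one call by passing a list of function names to agg(). The sketch below uses made-up data; each group's values are reduced to one scalar per function.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({'key': ['a', 'b', 'a', 'b'],
                   'val': [1, 5, 4, 7]})

# One row per group, one column per aggregation.
stats = df.groupby('key')['val'].agg(['sum', 'mean', 'max'])
print(stats)
```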

Example code:

dict_obj = {'key1' : ['a', 'b', 'a', 'b', 
                      'a', 'b', 'a', 'a'],
            'key2' : ['one', 'one', 'two', 'three',
                      'two', 'two', 'one', 'three'],
            'data1': np.random.randint(1,10, 8),
            'data2': np.random.randint(1,10, 8)}
df_obj5 = pd.DataFrame(dict_obj)
print(df_obj5)

Output:

   data1  data2 key1   key2
0      3      7    a    one
1      1      5    b    one
2      7      4    a    two
3      2      4    b  three
4      6      4    a    two
5      9      9    b    two
6      3      5    a    one
7      8      4    a  three

1. Built-in aggregation functions

sum(), mean(), max(), min(), count(), size(), describe()

Example code:

# numeric_only=True keeps the string column key2 out of sum() and mean()
print(df_obj5.groupby('key1').sum(numeric_only=True))
print(df_obj5.groupby('key1').max())
print(df_obj5.groupby('key1').min())
print(df_obj5.groupby('key1').mean(numeric_only=True))
print(df_obj5.groupby('key1').size())
print(df_obj5.groupby('key1').count())
print(df_obj5.groupby('key1').describe())

Output:

      data1  data2
key1              
a        27     24
b        12     18

      data1  data2 key2
key1                   
a         8      7  two
b         9      9  two

      data1  data2 key2
key1                   
a         3      4  one
b         1      4  one

      data1  data2
key1              
a       5.4    4.8
b       4.0    6.0

key1
a    5
b    3
dtype: int64

      data1  data2  key2
key1                    
a         5      5     5
b         3      3     3

               data1     data2
key1                          
a    count  5.000000  5.000000
     mean   5.400000  4.800000
     std    2.302173  1.303840
     min    3.000000  4.000000
     25%    3.000000  4.000000
     50%    6.000000  4.000000
     75%    7.000000  5.000000
     max    8.000000  7.000000
b    count  3.000000  3.000000
     mean   4.000000  6.000000
     std    4.358899  2.645751
     min    1.000000  4.000000
     25%    1.500000  4.500000
     50%    2.000000  5.000000
     75%    5.500000  7.000000
     max    9.000000  9.000000

2. Custom aggregation functions