Data Analysis Tools: Pandas (Part 2) [Repost]
阿新 • Published: 2018-12-17
I. Statistical Computation and Description in Pandas
Example code:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(5,4), columns = ['a', 'b', 'c', 'd'])
print(df)
Output:
a b c d
0 1.469682 1.948965 1.373124 -0.564129
1 -1.466670 -0.494591 0.467787 -2.007771
2 1.368750 0.532142 0.487862 -1.130825
3 -0.758540 -0.479684 1.239135 1.073077
4 -0.007470 0.997034 2.669219 0.742070
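1. Common statistical computations
sum, mean, max, min…
axis=0 computes one value per column (the default); axis=1 computes one value per row
skipna excludes missing values; it defaults to True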
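To make the effect of axis and skipna concrete, here is a minimal sketch on a small fixed table. The frame `demo` and its values are made up for illustration and are not from the original post.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed data with one missing value in column b
demo = pd.DataFrame({'a': [1.0, 2.0, 3.0],
                     'b': [4.0, np.nan, 6.0]})

col_sums = demo.sum()                             # axis=0 (default): one value per column
row_sums = demo.sum(axis=1)                       # axis=1: one value per row, NaN skipped
row_sums_strict = demo.sum(axis=1, skipna=False)  # NaN propagates into the row result

print(col_sums)
print(row_sums)
print(row_sums_strict)
```

With skipna=False, the second row comes out as NaN because it contains a missing value; the other rows are unchanged.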
Example code:
print('--------sum-------')
print(df.sum())
print('--------max-------')
print(df.max())
print('--------min------' )
print(df.min(axis=1, skipna=False))
Output:
2. Common descriptive statistics
describe() produces several summary statistics in one call
Example code:
print('----describe-------')
print(df.describe())
Output:
----describe-------
a b c d
count 5.000000 5.000000 5.000000 5.000000
mean 0.110632 0.306937 -0.081782 -0.382677
std 1.578243 0.767683 0.902212 1.251316
min -1.491773 -0.682076 -1.475137 -2.151721
25% -0.946890 0.020311 -0.478508 -0.619704
50% 0.091479 0.068958 0.275529 -0.543112
75% 0.290808 0.845925 0.595664 0.106920
max 2.609536 1.281567 0.673540 1.294231
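As a quick check of what describe() reports, the sketch below compares two of its rows with the corresponding single-statistic methods. The column `x` and its values are assumptions chosen so the numbers are easy to verify by hand.

```python
import pandas as pd

# Hypothetical data: five evenly spaced values
demo = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 5.0]})

summary = demo.describe()
print(summary)

# Each row of the summary matches the corresponding single-statistic method
print(summary.loc['mean', 'x'] == demo['x'].mean())
print(summary.loc['50%', 'x'] == demo['x'].median())
```

The 50% row is the median, and the 25% and 75% rows are the first and third quartiles.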
Notes on commonly used methods:
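II. Grouping and Aggregation in Pandas
(1) Grouping (groupby)
- Split a dataset into groups, then compute statistics for each group
- SQL can filter data and perform grouped aggregation
- pandas can use groupby for more complex grouped computations
- A grouped computation has three steps: split -> apply -> combine
  - Split: the key that decides which group each row belongs to
  - Apply: the computation that runs on each group
  - Combine: merge the per-group results into a single result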
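The three steps above can be sketched by hand and compared with a single groupby call. This is only an illustration of the idea, not how pandas implements it internally; the frame `demo` and its column names are made up.

```python
import pandas as pd

# Hypothetical data: two groups, two rows each
demo = pd.DataFrame({'key': ['a', 'b', 'a', 'b'],
                     'val': [1, 2, 3, 4]})

# split: one sub-frame per distinct key
pieces = {k: demo[demo['key'] == k] for k in demo['key'].unique()}
# apply: reduce each piece to a single number
results = {k: piece['val'].sum() for k, piece in pieces.items()}
# combine: gather the per-group results into one Series
manual = pd.Series(results).sort_index()

# groupby performs the same three steps in one call
auto = demo.groupby('key')['val'].sum()

print(manual)
print(auto)
```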
Example code:
import pandas as pd
import numpy as np
dict_obj = {'key1' : ['a', 'b', 'a', 'b',
'a', 'b', 'a', 'a'],
'key2' : ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'data1': np.random.randn(8),
'data2': np.random.randn(8)}
df_obj = pd.DataFrame(dict_obj)
print(df_obj)
Output:
data1 data2 key1 key2
0 0.974685 -0.672494 a one
1 -0.214324 0.758372 b one
2 1.508838 0.392787 a two
3 0.522911 0.630814 b three
4 1.347359 -0.177858 a two
5 -0.264616 1.017155 b two
6 -0.624708 0.450885 a one
7 -1.019229 -1.143825 a three
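1. GroupBy objects:
DataFrameGroupBy, SeriesGroupBy
1.1. Grouping
groupby() only sets up the grouping. The GroupBy object it returns has not computed anything yet; it just holds the intermediate grouping information.
Group by column name: obj.groupby('label')
Example code:
# Group the DataFrame by key1
print(type(df_obj.groupby('key1')))
# Group the data1 column of the DataFrame by key1
print(type(df_obj['data1'].groupby(df_obj['key1'])))
Output: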
<class 'pandas.core.groupby.DataFrameGroupBy'>
<class 'pandas.core.groupby.SeriesGroupBy'>
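1.2. Computing on groups
Apply an aggregation such as mean() to a GroupBy object, whether it is grouped by one key or by several.
Non-numeric columns are left out of the result. The pandas version used for the output below dropped them automatically; since pandas 2.0, mean() raises a TypeError on string columns unless numeric_only=True is passed.
Example code:
# Grouped computation
# numeric_only=True skips the string column key2 (required since pandas 2.0)
grouped1 = df_obj.groupby('key1')
print(grouped1.mean(numeric_only=True))
grouped2 = df_obj['data1'].groupby(df_obj['key1'])
print(grouped2.mean())
Output: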
data1 data2
key1
a 0.437389 -0.230101
b 0.014657 0.802114
key1
a 0.437389
b 0.014657
Name: data1, dtype: float64
size() returns the number of rows in each group
Example code:
# size
print(grouped1.size())
print(grouped2.size())
Output:
key1
a 5
b 3
dtype: int64
key1
a 5
b 3
dtype: int64
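size() is easy to confuse with count(). As a small sketch on made-up data: size() counts every row in a group, while count() only counts non-NaN values.

```python
import numpy as np
import pandas as pd

# Hypothetical data: group 'a' contains one missing value
demo = pd.DataFrame({'key': ['a', 'a', 'b'],
                     'val': [1.0, np.nan, 3.0]})
grouped = demo.groupby('key')

sizes = grouped.size()            # rows per group, NaN included
counts = grouped['val'].count()   # non-NaN values per group

print(sizes)
print(counts)
```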
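1.3. Grouping by a custom key
obj.groupby(self_def_key)
The custom key can be a list, or a list of several keys
obj.groupby(['label1', 'label2']) -> result with a multi-level index
Example code:
# Group by a custom key given as a list (one entry per row)
self_def_key = [0, 1, 2, 3, 3, 4, 5, 7]
print(df_obj.groupby(self_def_key).size())
# Group by a custom key given as a list of Series
self_key = [df_obj['key1'], df_obj['key2']]
print(df_obj.groupby(self_key).size())
# Group by several columns, producing a multi-level index
grouped2 = df_obj.groupby(['key1', 'key2'])
print(grouped2.size())
# The index levels follow the order of the keys
grouped3 = df_obj.groupby(['key2', 'key1'])
print(grouped3.mean())
# unstack moves the inner index level into the columns, leaving a single-level row index
print(grouped3.mean().unstack())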
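The example above is shown without output, so here is a smaller sketch on fixed, made-up values that shows the shape of a multi-key result and what unstack() does to it.

```python
import pandas as pd

# Hypothetical data: every (key1, key2) pair occurs exactly once
demo = pd.DataFrame({'key1': ['a', 'a', 'b', 'b'],
                     'key2': ['one', 'two', 'one', 'two'],
                     'val': [1, 2, 3, 4]})

# Grouping by two columns gives a Series with a two-level index
nested = demo.groupby(['key1', 'key2'])['val'].sum()
print(nested)

# unstack turns the inner level (key2) into columns
wide = nested.unstack()
print(wide)
```

After unstack(), key1 labels the rows and the key2 values label the columns.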
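2. GroupBy objects support iteration
Each iteration yields a tuple (group_name, group_data)
This is useful for running a custom computation on each group
2.1. Single-level grouping
Example code:
# Single-level grouping by key1
for name, data in grouped1:
    print(name)
    print(data)
Output: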
a
data1 data2 key1 key2
0 0.974685 -0.672494 a one
2 1.508838 0.392787 a two
4 1.347359 -0.177858 a two
6 -0.624708 0.450885 a one
7 -1.019229 -1.143825 a three
b
data1 data2 key1 key2
1 -0.214324 0.758372 b one
3 0.522911 0.630814 b three
5 -0.264616 1.017155 b two
2.2. Multi-level grouping
Example code:
# Multi-level grouping by key1 and key2
for group_name, group_data in grouped2:
    print(group_name)
    print(group_data)
Output:
('a', 'one')
data1 data2 key1 key2
0 0.974685 -0.672494 a one
6 -0.624708 0.450885 a one
('a', 'three')
data1 data2 key1 key2
7 -1.019229 -1.143825 a three
('a', 'two')
data1 data2 key1 key2
2 1.508838 0.392787 a two
4 1.347359 -0.177858 a two
('b', 'one')
data1 data2 key1 key2
1 -0.214324 0.758372 b one
('b', 'three')
data1 data2 key1 key2
3 0.522911 0.630814 b three
('b', 'two')
data1 data2 key1 key2
5 -0.264616 1.017155 b two
3. Converting a GroupBy object to a list or dict
Example code:
# Convert the GroupBy object to a list
print(list(grouped1))
# Convert the GroupBy object to a dict
print(dict(list(grouped1)))
Output:
[('a', data1 data2 key1 key2
0 0.974685 -0.672494 a one
2 1.508838 0.392787 a two
4 1.347359 -0.177858 a two
6 -0.624708 0.450885 a one
7 -1.019229 -1.143825 a three),
('b', data1 data2 key1 key2
1 -0.214324 0.758372 b one
3 0.522911 0.630814 b three
5 -0.264616 1.017155 b two)]
{'a': data1 data2 key1 key2
0 0.974685 -0.672494 a one
2 1.508838 0.392787 a two
4 1.347359 -0.177858 a two
6 -0.624708 0.450885 a one
7 -1.019229 -1.143825 a three,
'b': data1 data2 key1 key2
1 -0.214324 0.758372 b one
3 0.522911 0.630814 b three
5 -0.264616 1.017155 b two}
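One practical use of the dict form is fetching a single group's rows by name. A minimal sketch, with a made-up frame `demo`:

```python
import pandas as pd

# Hypothetical data: group 'a' has two rows, group 'b' has one
demo = pd.DataFrame({'key': ['a', 'b', 'a'],
                     'val': [10, 20, 30]})

# Each (group_name, group_data) pair becomes one dict entry
groups = dict(list(demo.groupby('key')))

print(groups['a'])
```

Each value is an ordinary DataFrame that keeps the row labels of the original frame.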
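3.1. Grouping columns by data type
Example code:
# Inspect the dtype of each column
print(df_obj.dtypes)
# Group the columns by dtype (axis=1 groups columns instead of rows;
# the axis argument of groupby is deprecated since pandas 2.1)
print(df_obj.groupby(df_obj.dtypes, axis=1).size())
print(df_obj.groupby(df_obj.dtypes, axis=1).sum())
Output: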
data1 float64
data2 float64
key1 object
key2 object
dtype: object
float64 2
object 2
dtype: int64
float64 object
0 0.302191 a one
1 0.544048 b one
2 1.901626 a two
3 1.153725 b three
4 1.169501 a two
5 0.752539 b two
6 -0.173823 a one
7 -2.163054 a three
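3.2. Other ways to group
Example code:
df_obj2 = pd.DataFrame(np.random.randint(1, 10, (5,5)),
columns=['a', 'b', 'c', 'd', 'e'],
index=['A', 'B', 'C', 'D', 'E'])
# .ix was removed in pandas 1.0 and np.NaN in NumPy 2.0; use .iloc and np.nan.
# Cast b, c, d to float first so that they can hold NaN.
df_obj2 = df_obj2.astype({'b': float, 'c': float, 'd': float})
df_obj2.iloc[1, 1:4] = np.nan
print(df_obj2)
Output: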
a b c d e
A 7 2.0 4.0 5.0 8
B 4 NaN NaN NaN 1
C 3 2.0 5.0 4.0 6
D 3 1.0 9.0 7.0 3
E 6 1.0 6.0 8.0 1
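3.3. Grouping with a dict
Example code:
# Group columns with a dict that maps each column name to a group name
# (the axis argument of groupby is deprecated since pandas 2.1)
mapping_dict = {'a':'Python', 'b':'Python', 'c':'Java', 'd':'C', 'e':'Java'}
print(df_obj2.groupby(mapping_dict, axis=1).size())
print(df_obj2.groupby(mapping_dict, axis=1).count()) # number of non-NaN values
print(df_obj2.groupby(mapping_dict, axis=1).sum())
Output: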
C 1
Java 2
Python 2
dtype: int64
C Java Python
A 1 2 2
B 0 1 1
C 1 2 2
D 1 2 2
E 1 2 2
C Java Python
A 5.0 12.0 9.0
B NaN 1.0 4.0
C 4.0 11.0 5.0
D 7.0 12.0 4.0
E 8.0 7.0 7.0
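Since groupby's axis argument is deprecated in recent pandas releases, one way to get the same column-wise grouping is to transpose, group the rows, and transpose back. The sketch below uses a made-up frame and mapping; treat it as one possible rewrite rather than the original author's method.

```python
import pandas as pd

# Hypothetical data: three columns mapped to two group names
demo = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]},
                    index=['A', 'B'])
mapping = {'a': 'Python', 'b': 'Python', 'c': 'Java'}

# Transpose so columns become rows, group those rows by the mapping,
# sum within each group, then transpose back
by_language = demo.T.groupby(mapping).sum().T

print(by_language)
```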
3.4. Grouping with a function; the function receives each row label or column label
Example code:
# Group with a function
df_obj3 = pd.DataFrame(np.random.randint(1, 10, (5,5)),
columns=['a', 'b', 'c', 'd', 'e'],
index=['AA', 'BBB', 'CC', 'D', 'EE'])
#df_obj3
def group_key(idx):
    """
    idx is a column label or a row label
    """
    #return idx
    return len(idx)
print(df_obj3.groupby(group_key).size())
# The custom function above is equivalent to
#df_obj3.groupby(len).size()
Output:
1 1
2 3
3 1
dtype: int64
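Because the example above uses random data, here is a deterministic sketch of the same idea with made-up values: the function is called on each index label, and its return value becomes the group key.

```python
import pandas as pd

# Hypothetical data: index labels of length 2, 3, 2 and 1
demo = pd.DataFrame({'val': [1, 2, 3, 4]},
                    index=['AA', 'BBB', 'CC', 'D'])

# len is applied to each index label; rows with equal label length share a group
by_len = demo.groupby(len)['val'].sum()

print(by_len)
```

Here 'AA' and 'CC' land in the same group because both labels have length 2.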
3.5. Grouping by index level
Example code:
# Group by index level
# (the axis argument of groupby is deprecated since pandas 2.1)
columns = pd.MultiIndex.from_arrays([['Python', 'Java', 'Python', 'Java', 'Python'],
['A', 'A', 'B', 'C', 'B']], names=['language', 'index'])
df_obj4 = pd.DataFrame(np.random.randint(1, 10, (5, 5)), columns=columns)
print(df_obj4)
# Group by the language level
print(df_obj4.groupby(level='language', axis=1).sum())
# Group by the index level
print(df_obj4.groupby(level='index', axis=1).sum())
Output:
language Python Java Python Java Python
index A A B C B
0 2 7 8 4 3
1 5 2 6 1 2
2 6 4 4 5 2
3 4 7 4 3 1
4 7 4 3 4 8
language Java Python
0 11 13
1 3 13
2 9 12
3 10 9
4 8 18
index A B C
0 9 11 4
1 7 8 1
2 10 6 5
3 11 5 3
4 11 11 4
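Level-based grouping works the same way when the MultiIndex is on the rows, in which case no axis argument is needed. A minimal sketch with made-up values, reusing the level names language and index from the example above:

```python
import pandas as pd

# Hypothetical row MultiIndex with the same level names as above
index = pd.MultiIndex.from_arrays(
    [['Python', 'Java', 'Python', 'Java'],
     ['A', 'A', 'B', 'B']],
    names=['language', 'index'])
demo = pd.Series([1, 2, 3, 4], index=index)

by_language = demo.groupby(level='language').sum()   # collapse the 'index' level
by_index = demo.groupby(level='index').sum()         # collapse the 'language' level

print(by_language)
print(by_index)
```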
(2) Aggregation
- The process of reducing an array to a scalar, e.g. mean(), count()
- Commonly applied to data after grouping
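As a small illustration of "array in, scalar out", the sketch below aggregates a whole Series first and then the same values per group. The data is made up, and agg() with a list of function names is just one convenient way to request several aggregations at once.

```python
import pandas as pd

# Aggregating a whole Series gives one scalar
values = pd.Series([3, 1, 7, 2])
overall_mean = values.mean()
print(overall_mean)

# Aggregating after grouping gives one scalar per group
demo = pd.DataFrame({'key': ['a', 'b', 'a', 'b'],
                     'val': [3, 1, 7, 2]})
per_group = demo.groupby('key')['val'].agg(['mean', 'count'])
print(per_group)
```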
Example code:
dict_obj = {'key1' : ['a', 'b', 'a', 'b',
'a', 'b', 'a', 'a'],
'key2' : ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'data1': np.random.randint(1,10, 8),
'data2': np.random.randint(1,10, 8)}
df_obj5 = pd.DataFrame(dict_obj)
print(df_obj5)
Output:
data1 data2 key1 key2
0 3 7 a one
1 1 5 b one
2 7 4 a two
3 2 4 b three
4 6 4 a two
5 9 9 b two
6 3 5 a one
7 8 4 a three
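1. Built-in aggregation functions
sum(), mean(), max(), min(), count(), size(), describe()
Example code:
# numeric_only=True keeps the string column key2 out of sum() and mean()
# (needed since pandas 2.0 to reproduce the output below)
print(df_obj5.groupby('key1').sum(numeric_only=True))
print(df_obj5.groupby('key1').max())
print(df_obj5.groupby('key1').min())
print(df_obj5.groupby('key1').mean(numeric_only=True))
print(df_obj5.groupby('key1').size())
print(df_obj5.groupby('key1').count())
print(df_obj5.groupby('key1').describe())
Output: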
data1 data2
key1
a 27 24
b 12 18
data1 data2 key2
key1
a 8 7 two
b 9 9 two
data1 data2 key2
key1
a 3 4 one
b 1 4 one
data1 data2
key1
a 5.4 4.8
b 4.0 6.0
key1
a 5
b 3
dtype: int64
data1 data2 key2
key1
a 5 5 5
b 3 3 3
data1 data2
key1
a count 5.000000 5.000000
mean 5.400000 4.800000
std 2.302173 1.303840
min 3.000000 4.000000
25% 3.000000 4.000000
50% 6.000000 4.000000
75% 7.000000 5.000000
max 8.000000 7.000000
b count 3.000000 3.000000
mean 4.000000 6.000000
std 4.358899 2.645751
min 1.000000 4.000000
25% 1.500000 4.500000
50% 2.000000 5.000000
75% 5.500000 7.000000
max 9.000000 9.000000