pandas Data Aggregation
阿新 • Published 2018-11-23
1. apply
- Series
Series.apply(func, convert_dtype=True, args=(), **kwds)
- func: the aggregation function; apply calls func on every element of the Series
>>> import pandas as pd
>>> import numpy as np
>>> series = pd.Series([20, 21, 12], index=['London',
...                    'New York', 'Helsinki'])
>>> series
London      20
New York    21
Helsinki    12
dtype: int64
>>> def square(x):
...     return x**2
>>> series.apply(square)
London      400
New York    441
Helsinki    144
dtype: int64
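Extra positional arguments can be forwarded to func through args. A minimal sketch (the power function and its exponent parameter p are illustrative, not from the original):

```python
import pandas as pd

series = pd.Series([20, 21, 12], index=['London', 'New York', 'Helsinki'])

# func may take extra arguments; forward them via args (or **kwds)
def power(x, p):
    return x ** p

# Equivalent to squaring each element
result = series.apply(power, args=(2,))
```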
- DataFrame
DataFrame.apply(func, axis=0, broadcast=None, raw=False, reduce=None, result_type=None, args=(), **kwds)
- func: as above
- axis: the axis to aggregate along; each call to func receives one Series
  axis=0: apply walks over every column of the DataFrame and assembles the results into a Series
  axis=1: apply walks over every row of the DataFrame and assembles the results into a Series
>>> df = pd.DataFrame([[4, 9],] * 3, columns=['A', 'B'])
>>> df
   A  B
0  4  9
1  4  9
2  4  9
>>> df.apply(np.sum, axis=0)
A    12
B    27
dtype: int64
>>> df.apply(np.sum, axis=1)
0    13
1    13
2    13
dtype: int64
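Because each axis=1 call receives a whole row as a Series, func can combine several columns. A short sketch (the ratio computation is an illustrative example, not from the original):

```python
import pandas as pd

df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])

# With axis=1 each call to func gets one row as a Series,
# so different columns can be combined element-wise
ratio = df.apply(lambda row: row['B'] / row['A'], axis=1)
```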
2. iloc, loc
- iloc: index by integer position
- loc: index by label (string)
- Series
>>> s2 = pd.Series(['a', 'b', 'c'], index=['one', 'two', 'three'])
>>> s2
one a
two b
three c
dtype: object
# loc indexes by label
>>> s2.loc['one']
'a'
# iloc indexes by integer position
>>> s2.iloc[0]
'a'
# [] indexes by integer position here (the index labels are strings)
>>> s2[0]
'a'
- DataFrame
>>> s3 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=['one', 'two', 'three'])
>>> s3
a b
one 1 4
two 2 5
three 3 6
# loc and iloc both index rows
>>> s3.iloc[0]
a 1
b 4
Name: one, dtype: int64
>>> s3.loc['one']
a 1
b 4
Name: one, dtype: int64
# [] indexes columns
>>> s3['a']
one 1
two 2
three 3
Name: a, dtype: int64
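Both indexers also accept a (row, column) pair for scalar lookup, and both support slicing; note that loc slices are inclusive of the end label while iloc slices follow Python's half-open convention. A short sketch:

```python
import pandas as pd

s3 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]},
                  index=['one', 'two', 'three'])

# Scalar lookup with a (row, column) pair
v1 = s3.loc['one', 'a']   # by label
v2 = s3.iloc[0, 0]        # by position

# loc slices include the end label; iloc slices exclude the end position
part_loc = s3.loc['one':'two']   # rows 'one' and 'two'
part_iloc = s3.iloc[0:2]         # rows 0 and 1 -- the same two rows
```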
3. merge
DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
- how: the join type ('inner': inner join, 'outer': full outer join, 'left': left outer join, 'right': right outer join)
- on: the key column(s) to join on (defaults to the columns that share a name in both tables)
- left_on: the join key column(s) in the left table
- right_on: the join key column(s) in the right table
- left_index: whether to use the left table's index as the join key
- right_index: whether to use the right table's index as the join key
>>> A >>> B
lkey value rkey value
0 foo 1 0 foo 5
1 bar 2 1 bar 6
2 baz 3 2 qux 7
3 foo 4 3 bar 8
>>> A.merge(B, left_on='lkey', right_on='rkey', how='outer')
lkey value_x rkey value_y
0 foo 1 foo 5
1 foo 4 foo 5
2 bar 2 bar 6
3 bar 2 bar 8
4 baz 3 NaN NaN
5 NaN NaN qux 7
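The tables A and B above can be rebuilt to reproduce this outer join; passing indicator=True (a parameter from the signature above, not used in the original example) adds a `_merge` column recording whether each row came from the left table, the right table, or both:

```python
import pandas as pd

# Reconstruction of the A and B frames shown above
A = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
                  'value': [1, 2, 3, 4]})
B = pd.DataFrame({'rkey': ['foo', 'bar', 'qux', 'bar'],
                  'value': [5, 6, 7, 8]})

# indicator=True appends a _merge column: 'left_only', 'right_only', or 'both'
merged = A.merge(B, left_on='lkey', right_on='rkey', how='outer',
                 indicator=True)
```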
4. sort_values, sort_index
DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
- axis: the axis to sort along; axis=0 sorts rows, axis=1 sorts columns
- by: the sort key(s); column name(s) when axis=0, index label(s) when axis=1
- ascending: True for ascending, False for descending
- inplace: whether to sort the table in place
>>> p1 = pd.DataFrame({'col1': [1, 2, 3],
... 'col2': [4, 5, 6],
... 'col3': [7, 8, 9]})
>>> p1
col1 col2 col3
0 1 4 7
1 2 5 8
2 3 6 9
>>> p1.sort_values(by=0, ascending=False, axis=1)
col3 col2 col1
0 7 4 1
1 8 5 2
2 9 6 3
>>> p1.sort_values(by='col1', ascending=False, axis=0)
col1 col2 col3
2 3 6 9
1 2 5 8
0 1 4 7
DataFrame.sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, by=None)
- axis: the axis whose index to sort; axis=0 sorts the row index, axis=1 sorts the column index
>>> p1
col1 col2 col3
0 1 4 7
1 2 5 8
2 3 6 9
>>> p1.sort_index(axis=0, ascending=False)
col1 col2 col3
2 3 6 9
1 2 5 8
0 1 4 7
>>> p1.sort_index(axis=1, ascending=False)
col3 col2 col1
0 7 4 1
1 8 5 2
2 9 6 3
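by also accepts a list of keys, with a matching list for ascending, to sort on several columns at once. A small sketch (this two-key frame is illustrative, not from the original):

```python
import pandas as pd

df = pd.DataFrame({'col1': [2, 1, 2],
                   'col2': [9, 5, 3]})

# Sort by col1 descending, breaking ties with col2 ascending
out = df.sort_values(by=['col1', 'col2'], ascending=[False, True])
```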
5. idxmax
- Series
Series.idxmax(axis=0, skipna=True, *args, **kwargs)
- Returns the index label of the maximum value (the first occurrence if the maximum is tied)
>>> s = pd.Series(data=[1, None, 4, 3, 4],
... index=['A', 'B', 'C', 'D', 'E'])
>>> s
A 1.0
B NaN
C 4.0
D 3.0
E 4.0
dtype: float64
>>> s.idxmax()
'C'
- DataFrame
DataFrame.idxmax(axis=0, skipna=True)
- Returns a Series of the index labels of the maxima along the given axis
>>> p2
col1 col2 col3
0 1 6 2
1 2 1 8
2 3 0 1
>>> p2.idxmax()
col1 2
col2 0
col3 1
dtype: int64
>>> p2.idxmax(axis=1)
0 col2
1 col3
2 col1
dtype: object
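The p2 frame used above (and in the sections below) can be rebuilt as follows; note that skipna=True, the default, ignores NaN values just as the Series example with index 'B' showed:

```python
import pandas as pd

# Reconstruction of the p2 frame shown above
p2 = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [6, 1, 0],
                   'col3': [2, 8, 1]})

col_max = p2.idxmax()        # row label of each column's maximum
row_max = p2.idxmax(axis=1)  # column label of each row's maximum
```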
6. A newly constructed feature column should be cast to an explicit numpy dtype
result['citable doc per person'] = (result['Citable documents'] / result['Population']).astype(np.float64)
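A self-contained sketch of the same pattern; this toy result frame stands in for the dataset used in the original line:

```python
import pandas as pd
import numpy as np

# Hypothetical stand-in for the result frame referenced above
result = pd.DataFrame({'Citable documents': [100, 250],
                       'Population': [50, 125]})

# Dividing two int columns already yields float, but astype makes
# the intended numpy dtype explicit
result['citable doc per person'] = (result['Citable documents']
                                    / result['Population']).astype(np.float64)
```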
7. corr
- Series
Series.corr(other, method='pearson', min_periods=None)
- Computes the correlation coefficient between two Series
>>> p2
col1 col2 col3
0 1 6 2
1 2 1 8
2 3 0 1
>>> p2.col2.corr(p2.col3)
-0.23281119015753007
- DataFrame
DataFrame.corr(method='pearson', min_periods=1)
- Computes the pairwise correlation coefficients between the columns of the DataFrame
>>> p2
col1 col2 col3
0 1 6 2
1 2 1 8
2 3 0 1
>>> p2.corr()
col1 col2 col3
col1 1.000000 -0.933257 -0.132068
col2 -0.933257 1.000000 -0.232811
col3 -0.132068 -0.232811 1.000000
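With the default method='pearson', corr computes the Pearson coefficient, so it should agree with numpy's np.corrcoef; a quick cross-check on the p2 frame from above:

```python
import pandas as pd
import numpy as np

# Reconstruction of the p2 frame shown above
p2 = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [6, 1, 0],
                   'col3': [2, 8, 1]})

# Pearson correlation via pandas and via numpy
r_pandas = p2['col2'].corr(p2['col3'])
r_numpy = np.corrcoef(p2['col2'], p2['col3'])[0, 1]
```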
8. map
- Series
Series.map(arg, na_action=None)
- Accepts a function or a dict-like object as the argument and returns the values mapped through it (keys missing from a dict map to NaN)
>>> p2
col1 col2 col3
0 1 6 2
1 2 1 8
2 3 0 1
>>> mapp1 = {1: 10, 2: 12}
>>> p2.col1.map(mapp1)
0 10.0
1 12.0
2 NaN
Name: col1, dtype: float64
>>> p2
col1 col2 col3
0 1 6 2
1 2 1 8
2 3 0 1
>>> p2.col2.map(lambda x: x+1)
0 7
1 2
2 1
Name: col2, dtype: int64
9. agg
DataFrameGroupBy.agg(arg, *args, **kwargs)
- arg: string function name / function / list of functions / dict of column names -> functions (or list of functions)
>>> df
A B C
0 1 1 0.362838
1 1 2 0.227877
2 2 3 1.267767
3 2 4 -0.562860
>>> df.groupby('A').agg(['min', 'max'])
B C
min max min max
A
1 1 2 0.227877 0.362838
2 3 4 -0.562860 1.267767
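The dict form lets each column get its own aggregation(s); the result then carries a MultiIndex on the columns. A sketch on a frame similar to the one above (the C values here are simplified stand-ins for the random floats):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2],
                   'B': [1, 2, 3, 4],
                   'C': [0.36, 0.23, 1.27, -0.56]})

# Dict of column name -> aggregation(s); the output columns become
# a MultiIndex like ('B', 'sum'), ('C', 'min'), ('C', 'max')
out = df.groupby('A').agg({'B': 'sum', 'C': ['min', 'max']})
```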
10. cut, qcut
- cut
pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise')
- Bins continuous values into equal-width intervals, turning them into discrete categories
>>> pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3)
[(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], (0.994, 3.0]]
Categories (3, interval[float64]): [(0.994, 3.0] < (3.0, 5.0] < (5.0, 7.0]]
- qcut
pandas.qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise')
- Bins values into equal-frequency (quantile-based) intervals
>>> pd.qcut(range(5), 4)
[(-0.001, 1.0], (-0.001, 1.0], (1.0, 2.0], (2.0, 3.0], (3.0, 4.0]]
Categories (4, interval[float64]): [(-0.001, 1.0] < (1.0, 2.0] < (2.0, 3.0] < (3.0, 4.0]]
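The labels parameter from the signatures above replaces the interval categories with custom names (or, as labels=False, with integer bin ids). A sketch reusing the data from the two examples (the 'low'/'mid'/'high' names are illustrative):

```python
import pandas as pd
import numpy as np

values = np.array([1, 7, 5, 4, 6, 3])

# labels= gives the three equal-width bins custom names
binned = pd.cut(values, 3, labels=['low', 'mid', 'high'])

# labels=False returns the integer bin id of each value instead
quartiles = pd.qcut(range(5), 4, labels=False)
```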