
pandas Data Aggregation

1. apply

  • Series
  • Series.apply(func, convert_dtype=True, args=(), **kwds)
  • func: the aggregation function; it is called automatically on every element of the Series
>>> import pandas as pd
>>> import numpy as np

>>> series = pd.Series([20, 21, 12], index=['London',
... 'New York','Helsinki'])
>>> series
London      20
New York    21
Helsinki    12
dtype: int64

>>> def square(x):
...     return x**2
>>> series.apply(square)
London      400
New York    441
Helsinki    144
dtype: int64
  • DataFrame
  • DataFrame.apply(func, axis=0, broadcast=None, raw=False, reduce=None, result_type=None, args=(), **kwds)
  • func: as above
  • axis: the axis along which to apply the function

         Each call to func receives one Series:

         axis = 0: apply walks over each column of the DataFrame and combines the results into a Series

         axis = 1: apply walks over each row of the DataFrame and combines the results into a Series

>>> df = pd.DataFrame([[4, 9],] * 3, columns=['A', 'B'])
>>> df
   A  B
0  4  9
1  4  9
2  4  9

>>> df.apply(np.sum, axis=0)
A    12
B    27
dtype: int64

>>> df.apply(np.sum, axis=1)
0    13
1    13
2    13
dtype: int64

 

2. iloc, loc

  • iloc: integer-position based indexing
  • loc: label-based indexing
  • Series
>>> s2 = pd.Series(['a', 'b', 'c'], index=['one', 'two', 'three'])
>>> s2
one      a
two      b
three    c
dtype: object
# loc indexes by label
>>> s2.loc['one']
'a'
# iloc indexes by integer position
>>> s2.iloc[0]
'a'
# [] with an integer falls back to positional indexing here
# (deprecated for non-integer indexes in recent pandas)
>>> s2[0]
'a'
  • DataFrame
>>> s3 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=['one', 'two', 'three'])
>>> s3
       a  b
one    1  4
two    2  5
three  3  6
# loc and iloc both index rows
>>> s3.iloc[0]
a    1
b    4
Name: one, dtype: int64
>>> s3.loc['one']
a    1
b    4
Name: one, dtype: int64
# [] indexes columns
>>> s3['a']
one      1
two      2
three    3
Name: a, dtype: int64
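Both accessors also take a (row, column) pair and support slicing; note that `loc` slices are inclusive of the end label while `iloc` follows normal half-open Python semantics. A short sketch reusing the frame above:

```python
import pandas as pd

s3 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]},
                  index=['one', 'two', 'three'])

# (row, column) pairs select a single cell.
cell = s3.loc['two', 'b']        # label-based
same_cell = s3.iloc[1, 1]        # position-based

# loc slicing includes the end label; iloc does not include the end position.
first_two = s3.loc['one':'two', 'a']
also_first_two = s3.iloc[0:2, 0]
```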

 

3. merge

  • DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
  • how: join type ('inner': inner join, 'outer': full outer join, 'left': left outer join, 'right': right outer join)
  • on: column name(s) to join on (columns with the same name in both frames are found automatically)
  • left_on: column(s) in the left frame to join on
  • right_on: column(s) in the right frame to join on
  • left_index: whether to use the left frame's index as the join key
  • right_index: whether to use the right frame's index as the join key
>>> A              >>> B
    lkey value         rkey value
0   foo  1         0   foo  5
1   bar  2         1   bar  6
2   baz  3         2   qux  7
3   foo  4         3   bar  8

>>> A.merge(B, left_on='lkey', right_on='rkey', how='outer')
   lkey  value_x  rkey  value_y
0  foo   1        foo   5
1  foo   4        foo   5
2  bar   2        bar   6
3  bar   2        bar   8
4  baz   3        NaN   NaN
5  NaN   NaN      qux   7
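The `left_index`/`right_index` and `indicator` parameters from the signature have no example above. A small sketch (the frames `left` and `right` here are made up for illustration):

```python
import pandas as pd

left = pd.DataFrame({'value': [1, 2]}, index=['foo', 'bar'])
right = pd.DataFrame({'other': [5, 6]}, index=['bar', 'qux'])

# Join on both indexes instead of columns; indicator=True adds a
# '_merge' column recording which side each row came from
# ('left_only', 'right_only', or 'both').
merged = left.merge(right, left_index=True, right_index=True,
                    how='outer', indicator=True)
```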

 

4. sort_values, sort_index

  • DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
  • axis: the axis to sort along; axis=0 sorts rows, axis=1 sorts columns
  • by: the key(s) to sort by; column labels when axis=0, index labels when axis=1
  • ascending: True for ascending order, False for descending
  • inplace: modify the original DataFrame instead of returning a copy
>>> p1 = pd.DataFrame({'col1': [1, 2, 3], 
... 'col2': [4, 5, 6],
... 'col3': [7, 8, 9]})
>>> p1
   col1  col2  col3
0     1     4     7
1     2     5     8
2     3     6     9
>>> p1.sort_values(by=0, ascending=False, axis=1)
   col3  col2  col1
0     7     4     1
1     8     5     2
2     9     6     3
>>> p1.sort_values(by='col1', ascending=False, axis=0)
   col1  col2  col3
2     3     6     9
1     2     5     8
0     1     4     7
  • DataFrame.sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, by=None)
  • axis: the axis whose index is sorted; axis=0 sorts the row index, axis=1 sorts the column index
>>> p1
   col1  col2  col3
0     1     4     7
1     2     5     8
2     3     6     9
>>> p1.sort_index(axis=0, ascending=False)
   col1  col2  col3
2     3     6     9
1     2     5     8
0     1     4     7
>>> p1.sort_index(axis=1, ascending=False)
   col3  col2  col1
0     7     4     1
1     8     5     2
2     9     6     3
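`by` also accepts a list of keys, `ascending` can vary per key, and `na_position` (from the signature above) controls where NaN keys land. A sketch with a made-up frame containing a missing value:

```python
import numpy as np
import pandas as pd

p = pd.DataFrame({'col1': [2, 1, 2, np.nan],
                  'col2': [9, 8, 7, 6]})

# Sort by col1 ascending, break ties with col2 descending;
# na_position='first' puts the NaN key row at the top
# ('last' is the default).
out = p.sort_values(by=['col1', 'col2'], ascending=[True, False],
                    na_position='first')
```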

 

5. idxmax

  • Series
  • Series.idxmax(axis=0, skipna=True, *args, **kwargs)
  • Returns the index label of the maximum value
>>> s = pd.Series(data=[1, None, 4, 3, 4],
...               index=['A', 'B', 'C', 'D', 'E'])
>>> s
A    1.0
B    NaN
C    4.0
D    3.0
E    4.0
dtype: float64
>>> s.idxmax()
'C'
  • DataFrame
  • DataFrame.idxmax(axis=0, skipna=True)
  • Returns a Series of the index labels of the maxima along the given axis
>>> p2
   col1  col2  col3
0     1     6     2
1     2     1     8
2     3     0     1
>>> p2.idxmax()
col1    2
col2    0
col3    1
dtype: int64
>>> p2.idxmax(axis=1)
0    col2
1    col3
2    col1
dtype: object
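A common follow-up is to feed the label returned by `idxmax` back into `loc` to pull out the whole row holding a column's maximum; a sketch reusing the `p2` data above:

```python
import pandas as pd

p2 = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [6, 1, 0],
                   'col3': [2, 8, 1]})

# idxmax gives the row label of col3's maximum (row 1 here);
# loc then retrieves that entire row.
best_row = p2.loc[p2['col3'].idxmax()]
```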

 

6. Newly constructed features should be converted to a numpy dtype

result['citable doc per person'] = (result['Citable documents'] / result['Population']).astype(np.float64)
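The `result` frame above comes from elsewhere; a self-contained sketch of the same pattern with a hypothetical stand-in frame:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the `result` frame referenced above.
result = pd.DataFrame({'Citable documents': [100, 200],
                       'Population': [50, 80]})

# astype pins the new feature to an explicit numpy dtype.
ratio = (result['Citable documents'] / result['Population']).astype(np.float64)
```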

 

7. corr

  • Series
  • Series.corr(other, method='pearson', min_periods=None)
  • Computes the correlation coefficient between two Series
>>> p2
   col1  col2  col3
0     1     6     2
1     2     1     8
2     3     0     1
>>> p2.col2.corr(p2.col3)
-0.23281119015753007
  • DataFrame
  • DataFrame.corr(method='pearson', min_periods=1)
  • Computes pairwise correlation between the columns of the DataFrame
>>> p2
   col1  col2  col3
0     1     6     2
1     2     1     8
2     3     0     1
>>> p2.corr()
          col1      col2      col3
col1  1.000000 -0.933257 -0.132068
col2 -0.933257  1.000000 -0.232811
col3 -0.132068 -0.232811  1.000000
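`method` also accepts the rank-based measures `'spearman'` and `'kendall'`; a sketch reusing the `p2` data. Since `col2` decreases exactly as `col1` increases, their Spearman correlation is -1:

```python
import pandas as pd

p2 = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [6, 1, 0],
                   'col3': [2, 8, 1]})

# Spearman correlates the ranks of the values rather than the
# values themselves, so any monotone relationship scores +/-1.
spearman = p2.corr(method='spearman')
```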

 

8. map

  • Series
  • Takes a function or a dict-like object as its argument and returns the values mapped through it

>>> p2
   col1  col2  col3
0     1     6     2
1     2     1     8
2     3     0     1
>>> mapp1 = {1: 10, 2: 12}
>>> p2.col1.map(mapp1)
0    10.0
1    12.0
2     NaN
Name: col1, dtype: float64

>>> p2
   col1  col2  col3
0     1     6     2
1     2     1     8
2     3     0     1
>>> p2.col2.map(lambda x: x+1)
0    7
1    2
2    1
Name: col2, dtype: int64
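Besides functions and dicts, `map` also accepts another Series (values are looked up in its index), and `na_action='ignore'` skips missing values instead of passing them to the mapper. A sketch with made-up data:

```python
import pandas as pd

s = pd.Series(['cat', 'dog', None])

# A Series argument acts like a lookup table keyed by its index;
# unmatched values become NaN.
sizes = pd.Series({'cat': 'small', 'dog': 'medium'})
mapped = s.map(sizes)

# na_action='ignore' keeps NaN/None out of the mapping function,
# which would otherwise crash on None.upper().
upper = s.map(lambda x: x.upper(), na_action='ignore')
```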

 

9. agg

  • DataFrameGroupBy.agg(arg, *args, **kwargs)
  • func: a string function name, a function, a list of functions, or a dict mapping column names to functions (or lists of functions)
>>> df
   A  B         C
0  1  1  0.362838
1  1  2  0.227877
2  2  3  1.267767
3  2  4 -0.562860
>>> df.groupby('A').agg(['min', 'max'])
    B             C
  min max       min       max
A
1   1   2  0.227877  0.362838
2   3   4 -0.562860  1.267767
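The dict form from the bullet above has no example; it lets each column get its own aggregation(s). A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2],
                   'B': [1, 2, 3, 4],
                   'C': [0.5, 0.25, 1.0, -0.5]})

# Per-column aggregations: sum B, but take both min and max of C.
# The result has a column MultiIndex like ('C', 'min').
out = df.groupby('A').agg({'B': 'sum', 'C': ['min', 'max']})
```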

 

10. cut, qcut

  • cut
  • pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise')
  • Discretizes continuous values into equal-width intervals
>>> pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3)
[(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], (0.994, 3.0]]
Categories (3, interval[float64]): [(0.994, 3.0] < (3.0, 5.0] < (5.0, 7.0]]
  • qcut
  • pandas.qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise')
  • Discretizes values into equal-frequency (quantile-based) bins
>>> pd.qcut(range(5), 4)
[(-0.001, 1.0], (-0.001, 1.0], (1.0, 2.0], (2.0, 3.0], (3.0, 4.0]]
Categories (4, interval[float64]): [(-0.001, 1.0] < (1.0, 2.0] < (2.0, 3.0] < (3.0, 4.0]]
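`bins` can also be explicit edges rather than a bin count, and `labels` replaces the interval notation with readable category names. A sketch with made-up age data:

```python
import numpy as np
import pandas as pd

ages = np.array([3, 17, 25, 40, 70])

# Explicit edges give bins (0, 18], (18, 65], (65, 100]
# (right-closed by default); labels names each bin.
groups = pd.cut(ages, bins=[0, 18, 65, 100],
                labels=['child', 'adult', 'senior'])
```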