
pandas Data Aggregation

1. apply

  • Series
  • Series.apply(func, convert_dtype=True, args=(), **kwds)
  • func: the aggregation function; it is called automatically on every element of the Series
>>> import pandas as pd
>>> import numpy as np

>>> series = pd.Series([20, 21, 12], index=['London',
... 'New York','Helsinki'])
>>> series
London      20
New York    21
Helsinki    12
dtype: int64

>>> def square(x):
...     return x**2
>>> series.apply(square)
London      400
New York    441
Helsinki    144
dtype: int64
  • DataFrame
  • DataFrame.apply(func, axis=0, broadcast=None, raw=False, reduce=None, result_type=None, args=(), **kwds)
  • func: as above
  • axis: the axis along which to apply the function

         Each call to func receives one Series:

         axis = 0: apply walks over each column of the DataFrame and combines the results into a Series

         axis = 1: apply walks over each row of the DataFrame and combines the results into a Series

>>> df = pd.DataFrame([[4, 9],] * 3, columns=['A', 'B'])
>>> df
   A  B
0  4  9
1  4  9
2  4  9

>>> df.apply(np.sum, axis=0)
A    12
B    27
dtype: int64

>>> df.apply(np.sum, axis=1)
0    13
1    13
2    13
dtype: int64

 

2. iloc, loc

  • iloc: integer-position based indexing
  • loc: label-based indexing
  • Series
>>> s2 = pd.Series(['a', 'b', 'c'], index=['one', 'two', 'three'])
>>> s2
one      a
two      b
three    c
dtype: object
# loc indexes by label
>>> s2.loc['one']
'a'
# iloc indexes by integer position
>>> s2.iloc[0]
'a'
# [] with an integer falls back to positional indexing here
# (deprecated for non-integer indexes in recent pandas)
>>> s2[0]
'a'
  • DataFrame
>>> s3 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=['one', 'two', 'three'])
>>> s3
       a  b
one    1  4
two    2  5
three  3  6
# loc and iloc both index rows
>>> s3.iloc[0]
a    1
b    4
Name: one, dtype: int64
>>> s3.loc['one']
a    1
b    4
Name: one, dtype: int64
# [] indexes columns
>>> s3['a']
one      1
two      2
three    3
Name: a, dtype: int64
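Both accessors also take a (row, column) pair and support slicing; note that `loc` slices are inclusive of the end label while `iloc` follows normal half-open Python semantics. A short sketch reusing the frame above:

```python
import pandas as pd

s3 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]},
                  index=['one', 'two', 'three'])

# (row, column) pairs select a single cell.
cell = s3.loc['two', 'b']        # label-based
same_cell = s3.iloc[1, 1]        # position-based

# loc slicing includes the end label; iloc does not include the end position.
first_two = s3.loc['one':'two', 'a']
also_first_two = s3.iloc[0:2, 0]
```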

 

3. merge

  • DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
  • how: join type ('inner': inner join, 'outer': full outer join, 'left': left outer join, 'right': right outer join)
  • on: column name(s) to join on (columns with the same name in both frames are found automatically)
  • left_on: column(s) in the left frame to join on
  • right_on: column(s) in the right frame to join on
  • left_index: whether to use the left frame's index as the join key
  • right_index: whether to use the right frame's index as the join key
>>> A              >>> B
    lkey value         rkey value
0   foo  1         0   foo  5
1   bar  2         1   bar  6
2   baz  3         2   qux  7
3   foo  4         3   bar  8

>>> A.merge(B, left_on='lkey', right_on='rkey', how='outer')
   lkey  value_x  rkey  value_y
0  foo   1        foo   5
1  foo   4        foo   5
2  bar   2        bar   6
3  bar   2        bar   8
4  baz   3        NaN   NaN
5  NaN   NaN      qux   7
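The `left_index`/`right_index` and `indicator` parameters from the signature have no example above. A small sketch (the frames `left` and `right` here are made up for illustration):

```python
import pandas as pd

left = pd.DataFrame({'value': [1, 2]}, index=['foo', 'bar'])
right = pd.DataFrame({'other': [5, 6]}, index=['bar', 'qux'])

# Join on both indexes instead of columns; indicator=True adds a
# '_merge' column recording which side each row came from
# ('left_only', 'right_only', or 'both').
merged = left.merge(right, left_index=True, right_index=True,
                    how='outer', indicator=True)
```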

 

4. sort_values, sort_index

  • DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
  • axis: the axis to sort along; axis=0 sorts rows, axis=1 sorts columns
  • by: the key(s) to sort by; column labels when axis=0, index labels when axis=1
  • ascending: True for ascending order, False for descending
  • inplace: modify the original DataFrame instead of returning a copy
>>> p1 = pd.DataFrame({'col1': [1, 2, 3], 
... 'col2': [4, 5, 6],
... 'col3': [7, 8, 9]})
>>> p1
   col1  col2  col3
0     1     4     7
1     2     5     8
2     3     6     9
>>> p1.sort_values(by=0, ascending=False, axis=1)
   col3  col2  col1
0     7     4     1
1     8     5     2
2     9     6     3
>>> p1.sort_values(by='col1', ascending=False, axis=0)
   col1  col2  col3
2     3     6     9
1     2     5     8
0     1     4     7
  • DataFrame.sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, by=None)
  • axis: the axis whose index is sorted; axis=0 sorts the row index, axis=1 sorts the column index
>>> p1
   col1  col2  col3
0     1     4     7
1     2     5     8
2     3     6     9
>>> p1.sort_index(axis=0, ascending=False)
   col1  col2  col3
2     3     6     9
1     2     5     8
0     1     4     7
>>> p1.sort_index(axis=1, ascending=False)
   col3  col2  col1
0     7     4     1
1     8     5     2
2     9     6     3
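`by` also accepts a list of keys, `ascending` can vary per key, and `na_position` (from the signature above) controls where NaN keys land. A sketch with a made-up frame containing a missing value:

```python
import numpy as np
import pandas as pd

p = pd.DataFrame({'col1': [2, 1, 2, np.nan],
                  'col2': [9, 8, 7, 6]})

# Sort by col1 ascending, break ties with col2 descending;
# na_position='first' puts the NaN key row at the top
# ('last' is the default).
out = p.sort_values(by=['col1', 'col2'], ascending=[True, False],
                    na_position='first')
```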

 

5. idxmax

  • Series
  • Series.idxmax(axis=0, skipna=True, *args, **kwargs)
  • Returns the index label of the maximum value
>>> s = pd.Series(data=[1, None, 4, 3, 4],
...               index=['A', 'B', 'C', 'D', 'E'])
>>> s
A    1.0
B    NaN
C    4.0
D    3.0
E    4.0
dtype: float64
>>> s.idxmax()
'C'
  • DataFrame
  • DataFrame.idxmax(axis=0, skipna=True)
  • Returns a Series of the index labels of the maxima along the given axis
>>> p2
   col1  col2  col3
0     1     6     2
1     2     1     8
2     3     0     1
>>> p2.idxmax()
col1    2
col2    0
col3    1
dtype: int64
>>> p2.idxmax(axis=1)
0    col2
1    col3
2    col1
dtype: object
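A common follow-up is to feed the label returned by `idxmax` back into `loc` to pull out the whole row holding a column's maximum; a sketch reusing the `p2` data above:

```python
import pandas as pd

p2 = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [6, 1, 0],
                   'col3': [2, 8, 1]})

# idxmax gives the row label of col3's maximum (row 1 here);
# loc then retrieves that entire row.
best_row = p2.loc[p2['col3'].idxmax()]
```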

 

6. Newly constructed features should be converted to a numpy dtype

result['citable doc per person'] = (result['Citable documents'] / result['Population']).astype(np.float64)
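The `result` frame above comes from elsewhere; a self-contained sketch of the same pattern with a hypothetical stand-in frame:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the `result` frame referenced above.
result = pd.DataFrame({'Citable documents': [100, 200],
                       'Population': [50, 80]})

# astype pins the new feature to an explicit numpy dtype.
ratio = (result['Citable documents'] / result['Population']).astype(np.float64)
```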

 

7. corr

  • Series
  • Series.corr(other, method='pearson', min_periods=None)
  • Computes the correlation coefficient between two Series
>>> p2
   col1  col2  col3
0     1     6     2
1     2     1     8
2     3     0     1
>>> p2.col2.corr(p2.col3)
-0.23281119015753007
  • DataFrame
  • DataFrame.corr(method='pearson', min_periods=1)
  • Computes pairwise correlation between the columns of the DataFrame
>>> p2
   col1  col2  col3
0     1     6     2
1     2     1     8
2     3     0     1
>>> p2.corr()
          col1      col2      col3
col1  1.000000 -0.933257 -0.132068
col2 -0.933257  1.000000 -0.232811
col3 -0.132068 -0.232811  1.000000
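`method` also accepts the rank-based measures `'spearman'` and `'kendall'`; a sketch reusing the `p2` data. Since `col2` decreases exactly as `col1` increases, their Spearman correlation is -1:

```python
import pandas as pd

p2 = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [6, 1, 0],
                   'col3': [2, 8, 1]})

# Spearman correlates the ranks of the values rather than the
# values themselves, so any monotone relationship scores +/-1.
spearman = p2.corr(method='spearman')
```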

 

8. map

  • Series
  • Takes a function or a dict-like object as its argument and returns the values mapped through it

>>> p2
   col1  col2  col3
0     1     6     2
1     2     1     8
2     3     0     1
>>> mapp1 = {1: 10, 2: 12}
>>> p2.col1.map(mapp1)
0    10.0
1    12.0
2     NaN
Name: col1, dtype: float64

>>> p2
   col1  col2  col3
0     1     6     2
1     2     1     8
2     3     0     1
>>> p2.col2.map(lambda x: x+1)
0    7
1    2
2    1
Name: col2, dtype: int64
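Besides functions and dicts, `map` also accepts another Series (values are looked up in its index), and `na_action='ignore'` skips missing values instead of passing them to the mapper. A sketch with made-up data:

```python
import pandas as pd

s = pd.Series(['cat', 'dog', None])

# A Series argument acts like a lookup table keyed by its index;
# unmatched values become NaN.
sizes = pd.Series({'cat': 'small', 'dog': 'medium'})
mapped = s.map(sizes)

# na_action='ignore' keeps NaN/None out of the mapping function,
# which would otherwise crash on None.upper().
upper = s.map(lambda x: x.upper(), na_action='ignore')
```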

 

9. agg

  • DataFrameGroupBy.agg(arg, *args, **kwargs)
  • func: a string function name, a function, a list of functions, or a dict mapping column names to functions (or lists of functions)
>>> df
   A  B         C
0  1  1  0.362838
1  1  2  0.227877
2  2  3  1.267767
3  2  4 -0.562860
>>> df.groupby('A').agg(['min', 'max'])
    B             C
  min max       min       max
A
1   1   2  0.227877  0.362838
2   3   4 -0.562860  1.267767
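The dict form from the bullet above has no example; it lets each column get its own aggregation(s). A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2],
                   'B': [1, 2, 3, 4],
                   'C': [0.5, 0.25, 1.0, -0.5]})

# Per-column aggregations: sum B, but take both min and max of C.
# The result has a column MultiIndex like ('C', 'min').
out = df.groupby('A').agg({'B': 'sum', 'C': ['min', 'max']})
```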

 

10. cut, qcut

  • cut
  • pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise')
  • Discretizes continuous values into equal-width intervals
>>> pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3)
[(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], (0.994, 3.0]]
Categories (3, interval[float64]): [(0.994, 3.0] < (3.0, 5.0] < (5.0, 7.0]]
  • qcut
  • pandas.qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise')
  • Discretizes values into equal-frequency (quantile-based) bins
>>> pd.qcut(range(5), 4)
[(-0.001, 1.0], (-0.001, 1.0], (1.0, 2.0], (2.0, 3.0], (3.0, 4.0]]
Categories (4, interval[float64]): [(-0.001, 1.0] < (1.0, 2.0] < (2.0, 3.0] < (3.0, 4.0]]
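`bins` can also be explicit edges rather than a bin count, and `labels` replaces the interval notation with readable category names. A sketch with made-up age data:

```python
import numpy as np
import pandas as pd

ages = np.array([3, 17, 25, 40, 70])

# Explicit edges give bins (0, 18], (18, 65], (65, 100]
# (right-closed by default); labels names each bin.
groups = pd.cut(ages, bins=[0, 18, 65, 100],
                labels=['child', 'adult', 'senior'])
```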