pandas Data Processing in Practice, Part 5 (pivot tables with pivot_table; grouping and pivot tables in practice with Grouper and pivot_table)

阿新 • Published: 2018-12-13

Pivot tables:

DataFrame.pivot_table(values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')

Creates a spreadsheet-style pivot table as a DataFrame. The levels in the pivot table are stored in MultiIndex objects on the index and columns of the resulting DataFrame.


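Parameters:

values: column to aggregate, optional

index: column, Grouper, array, or list of the previous

If an array is passed, it must be the same length as the data. The list can contain any of the other types (except list). These are the keys to group by on the pivot table index. If an array is passed, it is used in the same manner as column values.

columns: column, Grouper, array, or list of the previous

If an array is passed, it must be the same length as the data. The list can contain any of the other types (except list). These are the keys to group by on the pivot table columns. If an array is passed, it is used in the same manner as column values.

aggfunc: function, list of functions, dict, default numpy.mean

If a list of functions is passed, the resulting pivot table has hierarchical columns whose top level is the function names (inferred from the function objects themselves). If a dict is passed, each key is a column to aggregate and each value is a function or a list of functions.

fill_value: scalar, default None

Value used to replace missing values.

margins: boolean, default False

Add all rows/columns (for example, for subtotals and grand totals).

dropna: boolean, default True

Do not include columns whose entries are all NaN.

margins_name: string, default 'All'

Name of the row/column that contains the totals when margins is True.

Returns:

table: DataFrame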
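To make the aggfunc description concrete, here is a minimal sketch of the list and dict forms. The frame is a cut-down version of the example used later in this post; the column E is hypothetical and only there so the dict form has a second column to aggregate.

```python
import pandas as pd

# Small illustrative frame; column "E" is made up for this sketch.
df = pd.DataFrame({
    "A": ["foo"] * 5 + ["bar"] * 4,
    "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
    "E": [2, 4, 5, 5, 6, 6, 8, 9, 9],
})

# List of functions: the columns become hierarchical, with the
# function name on the top level and the value column below it.
by_list = pd.pivot_table(df, values="D", index="A", aggfunc=["sum", "mean"])

# Dict: each key is a column to aggregate, each value the function to apply.
by_dict = pd.pivot_table(df, values=["D", "E"], index="A",
                         aggfunc={"D": "sum", "E": "max"})

print(by_list)
print(by_dict)
```

String names such as "sum" are used here instead of np.sum; both forms are accepted, and the string form avoids needing a NumPy import.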

In [7]: df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
   ...:                           "bar", "bar", "bar", "bar"],
   ...:                     "B": ["one", "one", "one", "two", "two",
   ...:                           "one", "one", "two", "two"],
   ...:                     "C": ["small", "large", "large", "small",
   ...:                           "small", "large", "small", "small",
   ...:                           "large"],
   ...:                     "D": [1, 2, 2, 3, 3, 4, 5, 6, 7]})
   ...:

In [8]: df
Out[8]:
     A    B      C  D
0  foo  one  small  1
1  foo  one  large  2
2  foo  one  large  2
3  foo  two  small  3
4  foo  two  small  3
5  bar  one  large  4
6  bar  one  small  5
7  bar  two  small  6
8  bar  two  large  7

In [9]: table = pd.pivot_table(df, values='D', index=['A','B'], columns=['C'], aggfunc=np.sum)
# The pivot table uses A and B as the index, C as the columns, and the sum of D as the cell values
In [10]: table
Out[10]:
C        large  small
A   B
bar one    4.0    5.0
    two    7.0    6.0
foo one    4.0    1.0
    two    NaN    6.0
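The NaN in the output above is the (foo, two, large) combination, which has no rows in the data. As a small sketch of the fill_value and margins parameters described earlier, the same table can be rebuilt with zeros in place of NaN and with an 'All' row and column of totals. The frame is retyped here so the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "foo", "foo", "foo", "foo", "bar", "bar", "bar", "bar"],
    "B": ["one", "one", "one", "two", "two", "one", "one", "two", "two"],
    "C": ["small", "large", "large", "small", "small",
          "large", "small", "small", "large"],
    "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
})

# fill_value=0 replaces the missing (foo, two, large) cell;
# margins=True appends a totals row and a totals column named "All".
table = pd.pivot_table(df, values="D", index=["A", "B"], columns="C",
                       aggfunc="sum", fill_value=0, margins=True)
print(table)
```

The grand total in the bottom-right corner should equal df["D"].sum(), which is a quick sanity check on any pivot built with margins.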

Here is another example:

import numpy as np
import pandas as pd
from pandas import Series,DataFrame

df = pd.read_excel('sales-funnel.xlsx')

df.head() # look at the first five rows

	Account	Name	Rep	Manager	Product	Quantity	Price	Status
0	714466	Trantow-Barrows	Craig Booker	Debra Henley	CPU	1	30000	presented
1	714466	Trantow-Barrows	Craig Booker	Debra Henley	Software	1	10000	presented
2	714466	Trantow-Barrows	Craig Booker	Debra Henley	Maintenance	2	5000	pending
3	737550	Fritsch, Russel and Anderson	Craig Booker	Debra Henley	CPU	1	35000	declined
4	146832	Kiehn-Spinka	Daniel Hilton	Debra Henley	CPU	2	65000	won

# Build a pivot table
# Looking at the data, we care about the total amount each customer spent. How do we reshape the table?
pd.pivot_table(df,index=['Name'],aggfunc='sum')
	Account	Price	Quantity
Name			
Barton LLC	740150	35000	1
Fritsch, Russel and Anderson	737550	35000	1
Herman LLC	141962	65000	2
Jerde-Hilpert	412290	5000	2
Kassulke, Ondricka and Metz	307599	7000	3
Keeling LLC	688981	100000	5
Kiehn-Spinka	146832	65000	2
Koepp Ltd	1459666	70000	4
Kulas Inc	437790	50000	3
Purdy-Kunde	163416	30000	1
Stokes LLC	478688	15000	2
Trantow-Barrows	2143398	45000	4

pd.pivot_table(df,index=['Name','Rep','Manager'])


Account	Price	Quantity
Name	Rep	Manager			
Barton LLC	John Smith	Debra Henley	740150.0	35000.0	1.000000
Fritsch, Russel and Anderson	Craig Booker	Debra Henley	737550.0	35000.0	1.000000
Herman LLC	Cedric Moss	Fred Anderson	141962.0	65000.0	2.000000
Jerde-Hilpert	John Smith	Debra Henley	412290.0	5000.0	2.000000
Kassulke, Ondricka and Metz	Wendy Yule	Fred Anderson	307599.0	7000.0	3.000000
Keeling LLC	Wendy Yule	Fred Anderson	688981.0	100000.0	5.000000
Kiehn-Spinka	Daniel Hilton	Debra Henley	146832.0	65000.0	2.000000
Koepp Ltd	Wendy Yule	Fred Anderson	729833.0	35000.0	2.000000
Kulas Inc	Daniel Hilton	Debra Henley	218895.0	25000.0	1.500000
Purdy-Kunde	Cedric Moss	Fred Anderson	163416.0	30000.0	1.000000
Stokes LLC	Cedric Moss	Fred Anderson	239344.0	7500.0	1.000000
Trantow-Barrows	Craig Booker	Debra Henley	714466.0	15000.0	1.333333
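Both calls above aggregate every numeric column, so the Account ID gets summed or averaged even though that number means nothing. A possible refinement is to pass values explicitly. The sketch below retypes the five rows shown by df.head() so it runs without sales-funnel.xlsx; the choice of Manager and Rep as the index is just for illustration.

```python
import pandas as pd

# The five rows printed by df.head() earlier, entered by hand.
sample = pd.DataFrame({
    "Account": [714466, 714466, 714466, 737550, 146832],
    "Name": ["Trantow-Barrows", "Trantow-Barrows", "Trantow-Barrows",
             "Fritsch, Russel and Anderson", "Kiehn-Spinka"],
    "Rep": ["Craig Booker"] * 4 + ["Daniel Hilton"],
    "Manager": ["Debra Henley"] * 5,
    "Product": ["CPU", "Software", "Maintenance", "CPU", "CPU"],
    "Quantity": [1, 1, 2, 1, 2],
    "Price": [30000, 10000, 5000, 35000, 65000],
    "Status": ["presented", "presented", "pending", "declined", "won"],
})

# Restrict the aggregation to the columns where a sum is meaningful.
totals = pd.pivot_table(sample, index=["Manager", "Rep"],
                        values=["Price", "Quantity"], aggfunc="sum")
print(totals)
```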

Using grouping and pivot tables:

The data for this exercise is a set of flight delay records.


In [15]: import numpy as np
    ...: import pandas as pd
    ...: from pandas import Series,DataFrame
    ...:

In [16]: df = pd.read_csv('usa_flights.csv')

In [17]: df.head()
Out[17]:
       flight_date unique_carrier         ...          security_delay actual_elapsed_time
0  02/01/2015 0:00             AA         ...                     NaN               381.0
1  03/01/2015 0:00             AA         ...                     NaN               358.0
2  04/01/2015 0:00             AA         ...                     NaN               385.0
3  05/01/2015 0:00             AA         ...                     NaN               389.0
4  06/01/2015 0:00             AA         ...                     0.0               424.0

[5 rows x 14 columns]

In [18]: df.shape # check the dimensions of the data
Out[18]: (201664, 14)

In [22]: df.columns # check the column labels
Out[22]:
Index(['flight_date', 'unique_carrier', 'flight_num', 'origin', 'dest',
       'arr_delay', 'cancelled', 'distance', 'carrier_delay', 'weather_delay',
       'late_aircraft_delay', 'nas_delay', 'security_delay',
       'actual_elapsed_time'],
      dtype='object')

Task 1: 1. Sort by arr_delay to see the 10 flights with the longest delays

In [23]: df.sort_values('arr_delay',ascending=False).head(10)
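As a side note, nlargest is an alternative to sorting the whole frame when only the top few rows are needed. The sketch below uses a tiny made-up frame (not the real usa_flights.csv) and takes the top 3 instead of the top 10 to keep it short.

```python
import pandas as pd

# Made-up rows; None stands in for a missing arr_delay.
flights = pd.DataFrame({
    "unique_carrier": ["AA", "DL", "WN", "UA", "AS"],
    "arr_delay": [12.0, None, 95.0, -3.0, 40.0],
})

# Approach from the post: full sort, then take the head.
top_sorted = flights.sort_values("arr_delay", ascending=False).head(3)

# Alternative: ask directly for the n largest values.
top_nlargest = flights.nlargest(3, "arr_delay")

print(top_sorted)
print(top_nlargest)
```

In this sketch both approaches return the same rows; sort_values places missing values last by default, so they do not reach the top of a descending sort.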

2. Compute the proportion of delayed versus on-time flights

In [24]: df['cancelled'].value_counts() # count cancelled (1) and non-cancelled (0) flights
Out[24]:
0    196873
1      4791
Name: cancelled, dtype: int64

In [25]: df['delayed'] = df['arr_delay'].apply(lambda x: x>0) # boolean flag: True if the flight arrived late (arr_delay > 0)

In [26]: df.head()

In [27]: delay_data = df['delayed'].value_counts()

In [28]: delay_data
Out[28]:
False    103037
True      98627
Name: delayed, dtype: int64

In [29]: delay_data[0]
Out[29]: 103037

In [30]: delay_data[1] / (delay_data[0] + delay_data[1])
Out[30]: 0.4890659711202793
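The same ratio can be obtained more directly, since the mean of a boolean column is the share of True values. The sketch below uses made-up numbers and also shows one caveat: a missing arr_delay compares as False, so such rows are counted as "not delayed" unless they are excluded. Whether the real file has missing arr_delay values (for example, for cancelled flights) is an assumption here; the post does not show it.

```python
import pandas as pd

# Made-up delays; None stands in for a missing arr_delay.
flights = pd.DataFrame({"arr_delay": [12.0, -4.0, 0.0, 30.0, None]})

# Vectorized comparison instead of apply(lambda x: x > 0).
delayed = flights["arr_delay"] > 0

# Share of delayed flights over all rows (missing values count as False).
share_all = delayed.mean()

# Share of delayed flights over rows with a known arr_delay only.
known = flights["arr_delay"].notna()
share_known = delayed[known].mean()

# value_counts can also return proportions instead of raw counts.
proportions = delayed.value_counts(normalize=True)

print(share_all, share_known)
print(proportions)
```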

3. Delays for each airline


In [31]: delay_group = df.groupby(['unique_carrier','delayed'])

In [32]: df_delay = delay_group.size().unstack()

In [33]: df_delay
Out[33]:
delayed         False  True
unique_carrier
AA               8912   9841
AS               3527   2104
B6               4832   4401
DL              17719   9803
EV              10596  11371
F9               1103   1848
HA               1351   1354
MQ               4692   8060
NK               1550   2133
OO               9977  10804
UA               7885   8624
US               7850   6353
VX               1254    781
WN              21789  21150
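Since this post is about pivot tables, it is worth noting that the groupby/size/unstack table above can also be written as a pivot_table call. The sketch below uses a few made-up rows and assumes flight_num has no missing values, so that counting it gives the number of flights. It also computes a per-carrier delay rate, which is easier to compare across airlines of different sizes than raw counts.

```python
import pandas as pd

# Made-up rows with the same column names as the flight data.
flights = pd.DataFrame({
    "unique_carrier": ["AA", "AA", "AA", "DL", "DL", "WN"],
    "flight_num": [1, 2, 3, 4, 5, 6],
    "arr_delay": [10.0, -2.0, 5.0, -1.0, -7.0, 20.0],
})
flights["delayed"] = flights["arr_delay"] > 0

# Approach from the post, with empty combinations filled with 0.
by_group = (flights.groupby(["unique_carrier", "delayed"])
            .size().unstack(fill_value=0))

# Equivalent pivot table: count flights per carrier and delayed flag.
by_pivot = pd.pivot_table(flights, values="flight_num",
                          index="unique_carrier", columns="delayed",
                          aggfunc="count", fill_value=0)

# Share of delayed flights per carrier.
delay_rate = flights.groupby("unique_carrier")["delayed"].mean()

print(by_pivot)
print(delay_rate)
```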

In [34]: import matplotlib.pyplot as plt

In [35]: df_delay.plot()
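The title mentions Grouper, and the index parameter of pivot_table accepts one, so here is a minimal sketch of grouping the flights by week as well as by carrier. The rows are made up, and the date format (day/month/year) is an assumption inferred from the consecutive dates in the df.head() output above.

```python
import pandas as pd

# Made-up rows using the same date layout as the flight data.
flights = pd.DataFrame({
    "flight_date": ["02/01/2015 0:00", "03/01/2015 0:00",
                    "05/01/2015 0:00", "06/01/2015 0:00"],
    "unique_carrier": ["AA", "AA", "AA", "DL"],
    "arr_delay": [10.0, 30.0, -4.0, 8.0],
})

# Grouper needs real datetimes; day-first format is assumed here.
flights["flight_date"] = pd.to_datetime(flights["flight_date"],
                                        format="%d/%m/%Y %H:%M")

# Mean arrival delay per weekly bin (rows) and carrier (columns).
weekly = pd.pivot_table(flights, values="arr_delay",
                        index=pd.Grouper(key="flight_date", freq="W"),
                        columns="unique_carrier", aggfunc="mean")
print(weekly)
```

Carriers with no flights in a given week show up as NaN, the same way missing combinations did in the first pivot table of this post.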
