1. 程式人生 > >pandas資料處理實踐五(透視表pivot_table、分組和透視表實戰Grouper和pivot_table)

pandas資料處理實踐五(透視表pivot_table、分組和透視表實戰Grouper和pivot_table)

透視表:

DataFrame.pivot_tablevalues = Noneindex = Nonecolumns = Noneaggfunc ='mean'fill_value = Nonemargin = Falsedropna = Truemargins_name ='All' 

建立一個電子表格樣式的資料透視表作為DataFrame。資料透視表中的級別將儲存在結果DataFrame的索引和列上的MultiIndex物件

引數:

values : 要聚合的列,可選

index:列,Grouper,陣列或前一個列表

如果傳遞陣列,則它必須與資料的長度相同。該列表可以包含任何其他型別(列表除外)。在資料透視表索引上分組的鍵。如果傳遞陣列,則其使用方式與列值相同。

columns:列,Grouper,陣列或前一個列表

如果傳遞陣列,則它必須與資料的長度相同。該列表可以包含任何其他型別(列表除外)。在資料透視表列上分組的鍵。如果傳遞陣列,則其使用方式與列值相同。

aggfunc:function,function of list,dict,default numpy.mean

如果傳遞的函式列表,生成的資料透視表將具有分層列,其頂層是函式名稱(從函式物件本身推斷)如果傳遞dict,則鍵是要聚合的列,值是函式或函式列表

fill_value:標量,預設無

用於替換缺失值的值

margin:boolean,預設為False

新增所有行/列(例如,對於小計/總計)

dropna:布林值,預設為True

不要包含條目都是NaN的列

margins_name:string,預設為'All'

當margin為True時,將包含總計的行/列的名稱。

返回:

table : DataFrame

In [7]: df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",^M
   ...:                           "bar", "bar", "bar", "bar"],^M
   ...:                     "B": ["one", "one", "one", "two", "two",^M
   ...:                           "one", "one", "two", "two"],^M
   ...:                     "C": ["small", "large", "large", "small",^M
   ...:                           "small", "large", "small", "small",^M
   ...:                           "large"],^M
   ...:                     "D": [1, 2, 2, 3, 3, 4, 5, 6, 7]})
   ...:

In [8]: df
Out[8]:
     A    B      C  D
0  foo  one  small  1
1  foo  one  large  2
2  foo  one  large  2
3  foo  two  small  3
4  foo  two  small  3
5  bar  one  large  4
6  bar  one  small  5
7  bar  two  small  6
8  bar  two  large  7

In [9]: table = pd.pivot_table(df, values='D', index=['A','B'], columns=['C'], aggfunc=np.sum)
# 通過透視表,以A,B為索引物件,以c作為列,把D作為值填充
In [10]: table
Out[10]:
C        large  small
A   B
bar one    4.0    5.0
    two    7.0    6.0
foo one    4.0    1.0
    two    NaN    6.0

再舉一個例子:

import numpy as np
import pandas as pd
from pandas import Series,DataFrame

df = pd.read_excel('sales-funnel.xlsx')

df.head() # 檢視前五行資料

	Account	Name	Rep	Manager	Product	Quantity	Price	Status
0	714466	Trantow-Barrows	Craig Booker	Debra Henley	CPU	1	30000	presented
1	714466	Trantow-Barrows	Craig Booker	Debra Henley	Software	1	10000	presented
2	714466	Trantow-Barrows	Craig Booker	Debra Henley	Maintenance	2	5000	pending
3	737550	Fritsch, Russel and Anderson	Craig Booker	Debra Henley	CPU	1	35000	declined
4	146832	Kiehn-Spinka	Daniel Hilton	Debra Henley	CPU	2	65000	won

# 生成透視表
# 從資料的顯示來看,我們對顧客  購買的總價錢感興趣,如何轉換表格呢?
pd.pivot_table(df,index=['Name'],aggfunc='sum')
	Account	Price	Quantity
Name			
Barton LLC	740150	35000	1
Fritsch, Russel and Anderson	737550	35000	1
Herman LLC	141962	65000	2
Jerde-Hilpert	412290	5000	2
Kassulke, Ondricka and Metz	307599	7000	3
Keeling LLC	688981	100000	5
Kiehn-Spinka	146832	65000	2
Koepp Ltd	1459666	70000	4
Kulas Inc	437790	50000	3
Purdy-Kunde	163416	30000	1
Stokes LLC	478688	15000	2
Trantow-Barrows	2143398	45000	4

pd.pivot_table(df,index=['Name','Rep','Manager'])


Account	Price	Quantity
Name	Rep	Manager			
Barton LLC	John Smith	Debra Henley	740150.0	35000.0	1.000000
Fritsch, Russel and Anderson	Craig Booker	Debra Henley	737550.0	35000.0	1.000000
Herman LLC	Cedric Moss	Fred Anderson	141962.0	65000.0	2.000000
Jerde-Hilpert	John Smith	Debra Henley	412290.0	5000.0	2.000000
Kassulke, Ondricka and Metz	Wendy Yule	Fred Anderson	307599.0	7000.0	3.000000
Keeling LLC	Wendy Yule	Fred Anderson	688981.0	100000.0	5.000000
Kiehn-Spinka	Daniel Hilton	Debra Henley	146832.0	65000.0	2.000000
Koepp Ltd	Wendy Yule	Fred Anderson	729833.0	35000.0	2.000000
Kulas Inc	Daniel Hilton	Debra Henley	218895.0	25000.0	1.500000
Purdy-Kunde	Cedric Moss	Fred Anderson	163416.0	30000.0	1.000000
Stokes LLC	Cedric Moss	Fred Anderson	239344.0	7500.0	1.000000
Trantow-Barrows	Craig Booker	Debra Henley	714466.0	15000.0	1.333333

分組和透視表的使用:

本試驗的資料是飛機延誤


In [15]: import numpy as np^M
    ...: import pandas as pd^M
    ...: from pandas import Series,DataFrame
    ...:
    ...:

In [16]: df = pd.read_csv('usa_flights.csv')

In [17]:

In [17]: df.head()
Out[17]:
       flight_date unique_carrier         ...          security_delay actual_elapsed_time
0  02/01/2015 0:00             AA         ...                     NaN               381.0
1  03/01/2015 0:00             AA         ...                     NaN               358.0
2  04/01/2015 0:00             AA         ...                     NaN               385.0
3  05/01/2015 0:00             AA         ...                     NaN               389.0
4  06/01/2015 0:00             AA         ...                     0.0               424.0

[5 rows x 14 columns]

In [18]: df.shape # 檢視資料的維度
Out[18]: (201664, 14)

In [22]: df.columns # 檢視資料的列標籤
Out[22]:
Index(['flight_date', 'unique_carrier', 'flight_num', 'origin', 'dest',
       'arr_delay', 'cancelled', 'distance', 'carrier_delay', 'weather_delay',
       'late_aircraft_delay', 'nas_delay', 'security_delay',
       'actual_elapsed_time'],
      dtype='object')

任務一:1.通過arr_delay排序觀察延誤時間最長top10

In [23]: df.sort_values('arr_delay',ascending=False).head(10)

2.計算延誤和沒有延誤的比例 

In [24]: df['cancelled'].value_counts() # 計算取消航班和正常航班的總次數
Out[24]:
0    196873
1      4791
Name: cancelled, dtype: int64

In [25]: df['delayed'] = df['arr_delay'].apply(lambda x: x>0) #把延誤的轉為數值量

In [26]: df.head()

In [27]: delay_data = df['delayed'].value_counts()

In [28]: delay_data
Out[28]:
False    103037
True      98627
Name: delayed, dtype: int64

In [29]: delay_data[0]
Out[29]: 103037

In [30]: delay_data[1] / (delay_data[0] + delay_data[1])
Out[30]: 0.4890659711202793

3.每個航空公司的延誤情況


In [31]: delay_group = df.groupby(['unique_carrier','delayed'])

In [32]: df_delay = delay_group.size().unstack()

In [33]: df_delay
Out[33]:
delayed         False  True
unique_carrier
AA               8912   9841
AS               3527   2104
B6               4832   4401
DL              17719   9803
EV              10596  11371
F9               1103   1848
HA               1351   1354
MQ               4692   8060
NK               1550   2133
OO               9977  10804
UA               7885   8624
US               7850   6353
VX               1254    781
WN              21789  21150

In [34]: import matplotlib.pyplot as plt

In [35]: df_delay.plot()