Understanding pandas groupby and its as_index parameter
We'll use the following table to illustrate:
print(dfoff)
   User_id  Merchant_id  Coupon_id Discount_rate  Distance  Date_received        Date
0  1439408         2632        NaN           NaN       0.0            NaN  20160217.0
1  1439408         4663    11002.0        150:20       1.0     20160528.0         NaN
2  1439408         2632     8591.0          20:1       0.0     20160217.0         NaN
3  1439408         2632     1078.0          20:1       0.0     20160319.0         NaN
4  1439408         2632     8591.0          20:1       0.0     20160613.0         NaN
Suppose we want to group by Date_received (the date a coupon was received) and count. We can do it like this:
couponbydate = dfoff[dfoff['Date_received'].notnull()][['Date_received', 'Date']].groupby(['Date_received']).count()
print(couponbydate.columns)
print(couponbydate.head(5))
Index(['Date'], dtype='object')
               Date
Date_received
20160101.0       74
20160102.0       67
20160103.0       74
20160104.0       98
20160105.0      107
as_index defaults to True.
Now set it to False and see what changes:
couponbydate = dfoff[dfoff['Date_received'].notnull()][['Date_received', 'Date']].groupby(['Date_received'],as_index=False).count()
print(couponbydate.columns)
print(couponbydate.head(5))
Index(['Date_received', 'Date'], dtype='object')
   Date_received  Date
0     20160101.0    74
1     20160102.0    67
2     20160103.0    74
3     20160104.0    98
4     20160105.0   107
Comparing the two results, the role of as_index should be clear at a glance. As its name suggests, it decides whether the group key becomes the index. In the first result Date_received is the index, so the table has only one column (Date). In the second result Date_received is not the index, so pandas falls back to the default 0, 1, 2, ... integer index and Date_received becomes an ordinary column in the table.
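The same effect can be seen on a tiny stand-in DataFrame (the values below are made up purely to show the mechanics); note that calling reset_index() on the default result reproduces the as_index=False shape:

```python
import pandas as pd

# A tiny hypothetical stand-in for dfoff, just to show the mechanics
df = pd.DataFrame({
    'Date_received': [20160101.0, 20160101.0, 20160102.0],
    'Date':          [20160110.0, None,       20160111.0],
})

# as_index=True (the default): the group key becomes the index
g_true = df.groupby('Date_received').count()
print(g_true.columns.tolist())   # only ['Date'] remains as a column

# as_index=False: the group key stays as an ordinary column
g_false = df.groupby('Date_received', as_index=False).count()
print(g_false.columns.tolist())  # ['Date_received', 'Date']

# reset_index() on the default result matches the as_index=False result
print(g_true.reset_index().equals(g_false))  # True
```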
In fact, counting the Date_received groups this way is not quite correct; it is used here only to illustrate as_index. If you are curious why it is wrong, read on.
---------------------------------------------------------------------------------------------------------------------------------------------------------
Finally, a note on combining grouping with aggregation functions (count is one of them):
If no aggregation function is applied:
couponbydate = dfoff[dfoff['Date_received'].notnull()][['Date_received', 'Date']].groupby(['Date_received'],as_index=False)
print(couponbydate)
<pandas.core.groupby.DataFrameGroupBy object at 0x7f90d445d710>
As you can see, the variable couponbydate is a GroupBy object. It only holds some intermediate data about the group key ['Date_received']; no computation has happened yet. We can then apply aggregation functions to this intermediate, grouped data to perform specific calculations, such as count, mean, and so on.
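A minimal sketch of this lazy behaviour, using invented data (the values are hypothetical): the GroupBy object itself computes nothing, and each aggregation method triggers an actual calculation.

```python
import pandas as pd

# Hypothetical miniature data, just to exercise the GroupBy object
df = pd.DataFrame({
    'Date_received': [20160101.0, 20160101.0, 20160102.0],
    'Distance':      [0.0, 1.0, 2.0],
})

g = df.groupby('Date_received', as_index=False)
print(type(g))    # a DataFrameGroupBy object; no computation has run yet

# Each aggregation method performs the actual per-group calculation
print(g.count())  # non-null counts per group
print(g.mean())   # per-group means
print(g.agg({'Distance': ['min', 'max']}))  # several aggregations at once
```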
Now take a direct look at the full result after counting:
couponbydate = dfoff[dfoff['Date_received'].notnull()].groupby(['Date_received'],as_index=False).count()
print(couponbydate.head(5))
   Date_received  User_id  Merchant_id  Coupon_id  Discount_rate  Distance  Date
0     20160101.0      554          554        554            554       447    74
1     20160102.0      542          542        542            542       439    67
2     20160103.0      536          536        536            536       429    74
3     20160104.0      577          577        577            577       474    98
4     20160105.0      691          691        691            691       579   107
First, notice that since the group key Date_received is not used as the index, it appears as the first column.
What do the numbers under all the columns after Date_received mean? They are the counts. For example, in the first row Date_received is 20160101.0, and the 554 means that 20160101.0 appears 554 times in the data. It has nothing to do with which column the number sits under, whether User_id or Merchant_id, which is why most of the numbers in each row are identical.
At this point you may wonder about the exceptions: what about the 447 and 74 in the first row? Shouldn't they also equal 554? The cause is missing values (NaN). Concretely:
print(dfoff[dfoff['Date_received']==20160101.0][['Date_received','Date']])
Only part of the result is shown here, but you can see that in the rows where Date_received == 20160101.0, Date is mostly NaN. How many such rows are there?
print(dfoff[dfoff['Date_received']==20160101.0].shape[0])
print(dfoff[(dfoff['Date_received']==20160101.0)&(dfoff['Date'].notnull())].shape[0])
print(dfoff[(dfoff['Date_received']==20160101.0)&(dfoff['Date'].isnull())].shape[0])
554
74
480
There are 480 of them, which means 74 rows are non-null. This is exactly why groupby shows 74 rather than 554 under the Date column. We can verify with the Merchant_id column as well:
print(dfoff[dfoff['Date_received']==20160101.0].shape[0])
print(dfoff[(dfoff['Date_received']==20160101.0)&(dfoff['Merchant_id'].notnull())].shape[0])
554
554
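The behaviour above is precisely the difference between count(), which skips NaN, and size(), which counts every row in the group. A minimal sketch with hypothetical data mimicking the situation (one group, mostly missing Date values):

```python
import pandas as pd

# Toy frame: same group key in every row, three of four Date values missing
df = pd.DataFrame({
    'Date_received': [20160101.0] * 4,
    'Date':          [20160110.0, None, None, None],
})

g = df.groupby('Date_received')
print(g['Date'].count())  # counts only non-null values -> 1
print(g.size())           # counts every row in the group -> 4
```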
So, looking back, the counting method given at the beginning is not correct (it actually counts rows where both 'Date_received' and 'Date' are non-null):
couponbydate = dfoff[dfoff['Date_received'].notnull()][['Date_received', 'Date']].groupby(['Date_received'],as_index=False).count()
print(couponbydate.columns)
print(couponbydate.head(5))
Index(['Date_received', 'Date'], dtype='object')
Date_received Date
0 20160101.0 74
1 20160102.0 67
2 20160103.0 74
3 20160104.0 98
4 20160105.0 107
What we actually want is simply the number of times each date truly appears in Date_received. The correct version looks like this:
couponbydate = dfoff[dfoff['Date_received'].notnull()][['Date_received', 'Coupon_id']].groupby(['Date_received'], as_index=False).count()
couponbydate.columns = ['Date_received','count']
print(couponbydate.head(5))
Date_received count
0 20160101.0 554
1 20160102.0 542
2 20160103.0 536
3 20160104.0 577
4 20160105.0 691
As you can see, 20160101.0 appears 554 times, not 74.
Some might object: isn't this also wrong? Isn't it counting on the basis that both 'Date_received' and 'Coupon_id' are non-null? Couldn't there be rows where Date_received is 20160101.0 but Coupon_id is NaN, which would then be left out of the count? That is a fair point!
The reason this version is correct comes from the meaning of the data. Date_received is the date a coupon was received and Coupon_id is the coupon's id, so whenever Date_received holds a date, Coupon_id cannot be null. Date, on the other hand, is the date a coupon was used; even when Date_received holds a date, Date can still be null, because receiving a coupon does not mean using it. The same count can also be obtained with:
couponbydate = dfoff[dfoff['Date_received'].notnull()][['Date_received', 'Discount_rate']].groupby(['Date_received'], as_index=False).count()
couponbydate.columns = ['Date_received','count']
because Discount_rate is the discount rate: every Date_received necessarily corresponds to a non-null Discount_rate.
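If you would rather not rely on any particular column being non-null, size() counts rows regardless of NaN in the other columns. A sketch with made-up values (the DataFrame and numbers here are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    'Date_received': [20160101.0, 20160101.0, 20160102.0],
    'Coupon_id':     [1.0, None, 2.0],  # even a NaN Coupon_id row is still a received row
})

# size() counts every row per group, so no auxiliary column is needed
couponbydate = (df[df['Date_received'].notnull()]
                  .groupby('Date_received')
                  .size()
                  .reset_index(name='count'))
print(couponbydate)
```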
Of course, there is also one more way: the value_counts() method:
print(dfoff['Date_received'].value_counts())
20160129.0    71658
20160125.0    65904
20160124.0    39481
20160131.0    35427
20160128.0    34334
20160207.0    33319
20160130.0    33226
20160126.0    26027
20160123.0    24045
20160521.0    19859
20160127.0    18893
20160203.0    17494
20160201.0    16371
20160520.0    14796
20160204.0    14450
20160326.0    13719
20160525.0    13576
20160327.0    13341
20160522.0    13299
20160528.0    13276
20160325.0    11265
20160202.0    11253
20160523.0    11008
20160524.0    10998
20160519.0    10215
20160321.0     9923
20160322.0     9826
20160323.0     9754
20160518.0     9440
20160324.0     9283
               ...
20160106.0      808
20160315.0      788
20160311.0      786
20160308.0      786
20160225.0      780
20160112.0      773
20160304.0      768
20160107.0      746
20160316.0      715
20160111.0      712
20160215.0      701
20160317.0      700
20160309.0      694
20160105.0      691
20160216.0      685
20160221.0      670
20160219.0      669
20160222.0      667
20160224.0      633
20160314.0      619
20160217.0      618
20160307.0      609
20160302.0      578
20160104.0      577
20160303.0      566
20160101.0      554
20160223.0      554
20160218.0      543
20160102.0      542
20160103.0      536
Name: Date_received, Length: 167, dtype: int64
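One caveat worth knowing: value_counts() drops NaN automatically and sorts by count in descending order by default, while the groupby results earlier are ordered by date; sort_index() makes the two directly comparable. A small sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame({'Date_received': [20160102.0, 20160101.0, 20160101.0, None]})

# value_counts ignores NaN and sorts by count, largest first
vc = df['Date_received'].value_counts()
print(vc)

# sort_index() reorders by date, matching the groupby output order
print(vc.sort_index())
```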
Below is a small exercise, taken from a Tianchi project and adapted from someone else's improved version:
First, count the number of coupons issued each day and the number actually used:
import pandas as pd
import matplotlib.pyplot as plt

# Unique received dates; x == x is False for NaN, so this drops NaN before sorting
data_received_type = dfoff['Date_received'].unique()
data_received_type = sorted(data_received_type[data_received_type == data_received_type])

# Coupons received per day (Coupon_id is never null when Date_received is present)
couponbydate = dfoff[dfoff['Date_received'].notnull()][['Date_received', 'Coupon_id']].groupby(['Date_received'], as_index=False).count()
couponbydate.columns = ['date_received', 'count']

# Coupons used per day (both the received date and the usage date are present)
buybydate = dfoff[(dfoff['Date_received'].notnull()) & (dfoff['Date'].notnull())][['Date_received', 'Date']].groupby(['Date_received'], as_index=False).count()
buybydate.columns = ['date_buy', 'count']

plt.figure(figsize=(12, 8))
date_received_dt = pd.to_datetime(data_received_type, format='%Y%m%d')
plt.bar(date_received_dt, couponbydate['count'], label='number of coupons received')
plt.bar(date_received_dt, buybydate['count'], label='number of coupons used')
plt.yscale('log')
plt.ylabel('Count')
plt.xlabel('Date')
plt.legend()
plt.show()
-------------------------------------------------------------------------------------------------------------------------------------------------------------
To summarize:
as_index decides whether the group key becomes the index. If it does, the resulting table no longer contains that column; otherwise it does.
In the grouped result, the first column holds the distinct key values, and the numbers under the remaining columns are the counts of each key value. The counts are unrelated to the columns themselves, so every column shows the same numbers. For a simple count, take the group key plus any one other column; each row then gives a key value and its count.
There is one special case: if the chosen column contains NaN, groupby does not include those rows in the count.
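The summary above can be condensed into one small self-contained demo (the coupon-style data below is made up): three equivalent ways to count receipts per day, plus the NaN pitfall.

```python
import pandas as pd

# Hypothetical coupon-style data: every received coupon has an id,
# but the usage Date can be missing
df = pd.DataFrame({
    'Date_received': [20160101.0, 20160101.0, 20160101.0, 20160102.0],
    'Coupon_id':     [10.0, 11.0, 12.0, 13.0],
    'Date':          [20160105.0, None, None, None],
})

# Three equivalent ways to count coupons received per day
by_count = df.groupby('Date_received')['Coupon_id'].count()  # NaN-free column
by_size  = df.groupby('Date_received').size()                # row count per group
by_vc    = df['Date_received'].value_counts().sort_index()

print(by_count.tolist(), by_size.tolist(), by_vc.tolist())

# The pitfall: counting the NaN-containing 'Date' column instead
# gives usage counts, not receipt counts
print(df.groupby('Date_received')['Date'].count().tolist())
```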