
Understanding pandas groupby and its as_index parameter

The following table will be used for illustration:

print(dfoff)
   User_id  Merchant_id  Coupon_id Discount_rate  Distance  Date_received  \
0  1439408         2632        NaN           NaN       0.0            NaN   
1  1439408         4663    11002.0        150:20       1.0     20160528.0   
2  1439408         2632     8591.0          20:1       0.0     20160217.0   
3  1439408         2632     1078.0          20:1       0.0     20160319.0   
4  1439408         2632     8591.0          20:1       0.0     20160613.0   

         Date  
0  20160217.0  
1         NaN  
2         NaN  
3         NaN  
4         NaN  

Suppose we want to group by Date_received (the date a coupon was received) and count. We can do this:

couponbydate = dfoff[dfoff['Date_received'].notnull()][['Date_received', 'Date']].groupby(['Date_received']).count()
print(couponbydate.columns)
print(couponbydate.head(5))
Index(['Date'], dtype='object')
               Date
Date_received      
20160101.0       74
20160102.0       67
20160103.0       74
20160104.0       98
20160105.0      107

as_index defaults to True.

Now set it to False and see what changes:

couponbydate = dfoff[dfoff['Date_received'].notnull()][['Date_received', 'Date']].groupby(['Date_received'],as_index=False).count()
print(couponbydate.columns)
print(couponbydate.head(5))
Index(['Date_received', 'Date'], dtype='object')
   Date_received  Date
0     20160101.0    74
1     20160102.0    67
2     20160103.0    74
3     20160104.0    98
4     20160105.0   107

Comparing the two results should make it obvious: as_index does exactly what its name says, namely whether to use the group key as the index. When it is the index (the first result), the table really has only one column, Date. When it is not, the default integer index 0, 1, 2, ... is used instead, and Date_received remains an ordinary column in the table.
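The contrast can be reproduced on a tiny hypothetical frame (not the dfoff table above); note that `as_index=False` gives the same result as resetting the index afterwards:

```python
import pandas as pd

# A tiny hypothetical frame, just to illustrate the as_index behaviour
df = pd.DataFrame({
    'Date_received': [20160101.0, 20160101.0, 20160102.0],
    'Date':          [20160105.0, None,       20160106.0],
})

# as_index=True (the default): the group key becomes the index,
# so the result has a single column, Date
g_true = df.groupby('Date_received').count()
print(list(g_true.columns))   # ['Date']
print(g_true.index.name)      # Date_received

# as_index=False: the group key stays as an ordinary column
g_false = df.groupby('Date_received', as_index=False).count()
print(list(g_false.columns))  # ['Date_received', 'Date']

# as_index=False is equivalent to resetting the index afterwards
assert g_false.equals(g_true.reset_index())
```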

Counting Date_received this way is actually not correct, by the way; it was used here only to illustrate as_index. If you are curious why it is incorrect, read on.

---------------------------------------------------------------------------------------------------------------------------------------------------------

Finally, a note on what to watch out for when combining groupby with aggregation functions (count is one of them):

Without applying any aggregation function:

couponbydate = dfoff[dfoff['Date_received'].notnull()][['Date_received', 'Date']].groupby(['Date_received'],as_index=False)
print(couponbydate)
<pandas.core.groupby.DataFrameGroupBy object at 0x7f90d445d710>

As you can see, the variable couponbydate is a GroupBy object. It only holds some intermediate data about the group key ['Date_received']; no computation has been performed yet. We can then apply aggregation functions to this intermediate (grouped) data to run specific computations, such as count, mean, and so on.
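This lazy behaviour can be seen on a small hypothetical frame: the same GroupBy object can feed several different aggregations.

```python
import pandas as pd

# Hypothetical data, just to show the GroupBy object in action
df = pd.DataFrame({'key': ['a', 'a', 'b'],
                   'val': [1.0, 3.0, 5.0]})

grouped = df.groupby('key')       # lazy: no computation has happened yet
print(type(grouped).__name__)     # DataFrameGroupBy

# The same intermediate grouping data can feed different aggregations
print(grouped.count())            # non-null counts per group
print(grouped.mean())             # group means: a -> 2.0, b -> 5.0
print(grouped.sum())              # group sums:  a -> 4.0, b -> 5.0
```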

Now let's look at the full result after counting:

couponbydate = dfoff[dfoff['Date_received'].notnull()].groupby(['Date_received'],as_index=False).count()
print(couponbydate.head(5))
   Date_received  User_id  Merchant_id  Coupon_id  Discount_rate  Distance  \
0     20160101.0      554          554        554            554       447   
1     20160102.0      542          542        542            542       439   
2     20160103.0      536          536        536            536       429   
3     20160104.0      577          577        577            577       474   
4     20160105.0      691          691        691            691       579   

   Date  
0    74  
1    67  
2    74  
3    98  
4   107  

First, notice that since the group key Date_received is not used as the index, it appears as the first column.

Apart from the first column, Date_received, what do the numbers under all the other columns mean? They are simply the counts. For example, the first row's Date_received is 20160101.0, and the 554 next to it means that 20160101.0 appears 554 times in the whole dataset. That's all; it has nothing to do with which column it sits under, be it User_id or Merchant_id, which is why most of the numbers in each row are identical.

At this point you may wonder about the exceptions: why are the 447 and 74 in the first row (under Distance and Date) not 554 as well? The main cause is missing values (NaN). Concretely:

print(dfoff[dfoff['Date_received']==20160101.0][['Date_received','Date']])

Only part of the result is shown here, but you can see that in the rows where Date_received == 20160101.0, Date is mostly NaN. How many such rows are there?

print(dfoff[dfoff['Date_received']==20160101.0].shape[0])
print(dfoff[(dfoff['Date_received']==20160101.0)&(dfoff['Date'].notnull())].shape[0])
print(dfoff[(dfoff['Date_received']==20160101.0)&(dfoff['Date'].isnull())].shape[0])
554
74
480

There are 480 such rows, which means only 74 are non-null. That is exactly why the grouped count shows 74 rather than 554 under Date. We can verify with the Merchant_id column as well:

print(dfoff[dfoff['Date_received']==20160101.0].shape[0])
print(dfoff[(dfoff['Date_received']==20160101.0)&(dfoff['Merchant_id'].notnull())].shape[0])
554
554
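The NaN-skipping behaviour of count() can be reproduced on a hypothetical miniature of the situation above: one group of 4 rows where Merchant_id is fully non-null but Date has two NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature: 4 rows for one date, 2 of them with a NaN Date
df = pd.DataFrame({
    'Date_received': [20160101.0] * 4,
    'Merchant_id':   [1, 2, 3, 4],
    'Date':          [20160105.0, np.nan, np.nan, 20160107.0],
})

g = df.groupby('Date_received').count()
print(g)
# count() tallies only non-null cells, so Merchant_id shows the full
# group size (4) while Date shows just the 2 non-null rows
```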

So, looking back at the counting method given at the beginning, it is not correct (it actually counts rows where Date_received is non-null AND Date is also non-null):

couponbydate = dfoff[dfoff['Date_received'].notnull()][['Date_received', 'Date']].groupby(['Date_received'],as_index=False).count()
print(couponbydate.columns)
print(couponbydate.head(5))
Index(['Date_received', 'Date'], dtype='object')
   Date_received  Date
0     20160101.0    74
1     20160102.0    67
2     20160103.0    74
3     20160104.0    98
4     20160105.0   107

What we want, though, is simply the number of times each date truly appears in Date_received. The correct version is:

couponbydate = dfoff[dfoff['Date_received'].notnull()][['Date_received', 'Coupon_id']].groupby(['Date_received'], as_index=False).count()
couponbydate.columns = ['Date_received','count']
   Date_received  count
0     20160101.0    554
1     20160102.0    542
2     20160103.0    536
3     20160104.0    577
4     20160105.0    691

As you can see, 20160101.0 appears 554 times, not 74.

Someone may object: isn't this just as wrong? Isn't it counting rows where Date_received is non-null AND Coupon_id is also non-null? Couldn't there be rows where Date_received is 20160101.0 but Coupon_id is NaN, which would be left out? Yes, that objection is valid!

The reason this version is nevertheless correct lies in the meaning of the data. Date_received is the date a coupon was received and Coupon_id is the coupon's id, so whenever Date_received holds a date, Coupon_id is necessarily non-null. Date, on the other hand, is the date the coupon was used; even when Date_received holds a date, Date can still be NaN, because receiving a coupon does not mean using it. The same count could also be obtained with:

couponbydate = dfoff[dfoff['Date_received'].notnull()][['Date_received', 'Discount_rate']].groupby(['Date_received'], as_index=False).count()
couponbydate.columns = ['Date_received','count']

because Discount_rate is the discount rate, and every non-null Date_received necessarily comes with a non-null Discount_rate.
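A column-independent alternative (not used in the post above) is `groupby().size()`, which counts rows per group whether or not any cell is NaN, so it does not rely on finding a column that happens to be fully non-null. A minimal sketch with hypothetical data:

```python
import numpy as np
import pandas as pd

# Hypothetical data with a deliberately NaN Coupon_id in the first group
df = pd.DataFrame({
    'Date_received': [20160101.0, 20160101.0, 20160102.0],
    'Coupon_id':     [1.0, np.nan, 2.0],
})

# size() counts rows per group regardless of NaN
sizes = df.groupby('Date_received').size()
print(sizes)   # 20160101.0 -> 2, 20160102.0 -> 1

# count() on Coupon_id would undercount the first group (1 instead of 2)
counts = df.groupby('Date_received')['Coupon_id'].count()
print(counts)
```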

Of course, besides all that there is yet another way: the value_counts() method:

print(dfoff['Date_received'].value_counts())
20160129.0    71658
20160125.0    65904
20160124.0    39481
20160131.0    35427
20160128.0    34334
20160207.0    33319
20160130.0    33226
20160126.0    26027
20160123.0    24045
20160521.0    19859
20160127.0    18893
20160203.0    17494
20160201.0    16371
20160520.0    14796
20160204.0    14450
20160326.0    13719
20160525.0    13576
20160327.0    13341
20160522.0    13299
20160528.0    13276
20160325.0    11265
20160202.0    11253
20160523.0    11008
20160524.0    10998
20160519.0    10215
20160321.0     9923
20160322.0     9826
20160323.0     9754
20160518.0     9440
20160324.0     9283
              ...  
20160106.0      808
20160315.0      788
20160311.0      786
20160308.0      786
20160225.0      780
20160112.0      773
20160304.0      768
20160107.0      746
20160316.0      715
20160111.0      712
20160215.0      701
20160317.0      700
20160309.0      694
20160105.0      691
20160216.0      685
20160221.0      670
20160219.0      669
20160222.0      667
20160224.0      633
20160314.0      619
20160217.0      618
20160307.0      609
20160302.0      578
20160104.0      577
20160303.0      566
20160101.0      554
20160223.0      554
20160218.0      543
20160102.0      542
20160103.0      536
Name: Date_received, Length: 167, dtype: int64
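Note that value_counts() drops NaN by default and sorts by count in descending order; sorting by the index instead reproduces the date-ordered layout of the groupby result. A small sketch with hypothetical values:

```python
import pandas as pd

# Hypothetical series with one NaN, standing in for Date_received
s = pd.Series([20160101.0, 20160101.0, 20160102.0, None],
              name='Date_received')

vc = s.value_counts()      # NaN is dropped; sorted by count, descending
print(vc)

# sort_index() reorders by date, matching the groupby-style layout
print(vc.sort_index())
```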

Below is a small hands-on exercise, from a Tianchi competition project, adapted from someone else's improved version:

First, count the number of coupons issued each day and the number actually used:

import pandas as pd
import matplotlib.pyplot as plt

# Unique received dates; NaN != NaN, so the self-comparison mask drops NaN
data_received_type = dfoff['Date_received'].unique()
data_received_type = sorted(data_received_type[data_received_type == data_received_type])

# Coupons issued per day (Coupon_id is non-null whenever Date_received is)
couponbydate = dfoff[dfoff['Date_received'].notnull()][['Date_received', 'Coupon_id']].groupby(['Date_received'], as_index=False).count()
couponbydate.columns = ['date_received', 'count']

# Coupons actually used per day (a non-null Date means the coupon was used)
buybydate = dfoff[(dfoff['Date_received'].notnull()) & (dfoff['Date'].notnull())][['Date_received', 'Date']].groupby(['Date_received'], as_index=False).count()
buybydate.columns = ['date_buy', 'count']

plt.figure(figsize=(12, 8))
date_received_dt = pd.to_datetime(data_received_type, format='%Y%m%d')

plt.bar(date_received_dt, couponbydate['count'], label='number of coupon received')
plt.bar(date_received_dt, buybydate['count'], label='number of coupon used')
plt.yscale('log')
plt.ylabel('Count')
plt.xlabel('Date')
plt.legend()
plt.show()

-------------------------------------------------------------------------------------------------------------------------------------------------------------

Finally, to summarize:

as_index controls whether the group key becomes the index; if it does, the resulting table no longer has that column, and vice versa.

In a grouped count, the first column holds the group key's distinct values, and the numbers under the following columns are the counts of each key value; those numbers are independent of which column they sit under, so all columns show the same values. To count, just take the group key plus any one other column, and each row of the result gives a key value and its count.

There is one caveat to the above: if the column you pick happens to contain NaN, groupby's count() does not include those rows.