Discretizing data with pandas
阿新 • Published: 2018-12-18
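In day-to-day work we often need to discretize a field, that is, split it into buckets (bins). A simple example is dividing ages into a few intervals. The cut function in pandas handles this well.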
#Import the libraries and build the data set
import pandas as pd
import numpy as np
index = pd.Index(data=["Tom", "Bob", "Mary", "James"], name="name")
data = {
    "age": [15, 28, 23, 37],
    "city": ["Hangzhou", "ShangHai", "Hefei", "Luan"],
    "sex": ["male", "female", "female", "male"]
}
user_info = pd.DataFrame(data=data, index=index)
In [48]:user_info
Out[48]:
       age      city     sex
name
Tom     15  Hangzhou    male
Bob     28  ShangHai  female
Mary    23     Hefei  female
James   37      Luan    male
#Split age in user_info into three age groups
pd.cut(user_info.age,3)
Out[51]:
name
Tom      (14.978, 22.333]
Bob      (22.333, 29.667]
Mary     (22.333, 29.667]
James      (29.667, 37.0]
Name: age, dtype: category
Categories (3, interval[float64]): [(14.978, 22.333] < (22.333, 29.667] < (29.667, 37.0]]
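The interval edges in the output above can be inspected directly. The following is a minimal sketch, assuming only the four ages from the example; it uses the `retbins=True` option of `pd.cut` to return the computed edges alongside the binned values. The variable names `ages`, `edges`, and `widths` are introduced here for illustration only.

```python
import pandas as pd

# The four ages from the example data set
ages = pd.Series([15, 28, 23, 37], index=["Tom", "Bob", "Mary", "James"], name="age")

# retbins=True additionally returns the array of bin edges that cut computed
binned, edges = pd.cut(ages, 3, retbins=True)

# Width of each bin, to check how evenly the range was split
widths = edges[1:] - edges[:-1]

print(edges)
print(widths)
```

The leftmost edge comes out slightly below the minimum age (14.978 rather than 15) because pandas extends the range by 0.1% so that the smallest value still falls inside the first right-closed interval. The first bin is therefore marginally wider than the other two.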
The result shows that cut splits the age range into bins of equal width. We can also supply our own bin edges, and in that case give each interval a label:
#Define custom intervals and cut on them
qujian=[5,15,25,40]
pd.cut(user_info.age,qujian)
Out[55]:
name
Tom       (5, 15]
Bob      (25, 40]
Mary     (15, 25]
James    (25, 40]
Name: age, dtype: category
Categories (3, interval[int64]): [(5, 15] < (15, 25] < (25, 40]]
#Assign labels to the intervals
pd.cut(user_info.age,qujian,labels=['child','youth','middle'])
Out[56]:
name
Tom       child
Bob      middle
Mary      youth
James    middle
Name: age, dtype: category
Categories (3, object): [child < youth < middle]
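With custom edges, two details are worth checking: which side of each interval is closed, and what happens to values outside the edges. The sketch below is an illustration under assumed inputs; the `ages` values are hypothetical and chosen to sit on and beyond the edges, and they are not taken from the article's data set.

```python
import pandas as pd

# Hypothetical ages chosen to hit the boundaries (not from the article's data)
ages = pd.Series([3, 15, 25, 40, 41])
edges = [5, 15, 25, 40]

# Default right=True: intervals are (5, 15], (15, 25], (25, 40]
right_closed = pd.cut(ages, edges)

# right=False: intervals are [5, 15), [15, 25), [25, 40)
left_closed = pd.cut(ages, edges, right=False)

# Values that fall outside every interval become NaN
print(pd.DataFrame({"age": ages, "right_closed": right_closed, "left_closed": left_closed}))
```

In this sketch, 15 lands in the first interval by default but in the second when `right=False`, and 40 is only covered when intervals are right-closed. Values outside the edges (3 and 41) are NaN in both cases, so it may be worth choosing the outermost edges with the data's real range in mind.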
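Now suppose we want to count how many records fall into each interval, as a first step toward counting male and female per interval. We can combine cut with groupby, as follows: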
#First cut the age column of user_info into the intervals
pdd=pd.cut(user_info['age'],qujian)
user_info['age'].groupby(pdd).count()
Out[66]:
age
(5, 15] 1
(15, 25] 1
(25, 40] 2
Name: age, dtype: int64
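The output gives the number of records in each age interval. Note that this count is not yet broken down by sex; for that, sex has to be added as a second grouping key.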
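To break the counts down by male and female within each interval, one option is to group on both the binned ages and the sex column. The sketch below is one possible approach rather than the author's own code; it reuses the article's data and `qujian` edges, while `age_group`, `counts`, and `table` are names introduced here for illustration.

```python
import pandas as pd

# Rebuild the relevant columns of the article's data set
index = pd.Index(data=["Tom", "Bob", "Mary", "James"], name="name")
user_info = pd.DataFrame(
    {"age": [15, 28, 23, 37], "sex": ["male", "female", "female", "male"]},
    index=index,
)

qujian = [5, 15, 25, 40]
age_group = pd.cut(user_info["age"], qujian, labels=["child", "youth", "middle"])

# Group by age bin and sex together; size() counts the rows in each pair.
# observed=True keeps only the bin/sex pairs that actually occur.
counts = user_info.groupby([age_group, "sex"], observed=True).size()

# Pivot sex into columns, filling pairs that never occur with 0
table = counts.unstack(fill_value=0)
print(table)
```

Passing `observed=True` explicitly avoids depending on the default handling of categorical groupers, which differs between pandas versions. `pd.crosstab(age_group, user_info["sex"])` would be a reasonable alternative for the same table.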