12 - Pandas: Discretization and Binning (Equal-Width cut(), Equal-Frequency qcut())
阿新 • Published: 2020-07-30
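When working with continuous data, it is often convenient for analysis to discretize it, that is, to split it into "bins" so that each value is assigned to a small interval.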
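Pandas provides two functions for this:
cut() --> equal-width discretization (bins defined by their edges)
qcut() --> equal-frequency discretization (bins defined by quantiles)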
1. cut(): equal-width discretization. The bin edges are chosen so that every interval has the same width.
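As a minimal sketch of what "equal width" means, passing an integer as `bins` asks pandas to split the observed range into that many intervals of the same width. The `values` Series is made-up illustration data, not the dataset used in this post.

```python
import pandas as pd

# Hypothetical data, not the COVID-19 dataset used in this post.
values = pd.Series([1, 5, 12, 18, 25, 33, 40])

# An integer `bins` requests that many equal-width intervals over min..max.
# retbins=True also returns the numeric bin edges that pandas computed.
binned, edges = pd.cut(values, bins=4, retbins=True)

# Width of each interval. The first one is slightly wider because pandas
# lowers the first edge by 0.1% of the range so the minimum is included.
widths = (edges[1:] - edges[:-1]).tolist()

print(binned.cat.categories)
print(widths)
```

With explicit edges, as in the rest of this section, the widths are whatever you choose, so equal spacing is up to you.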
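This section uses the same example as the post on sorting and random permutation: the COVID-19 dataset.
We work on the cumulative confirmed cases column (total_confirm). First check its maximum and minimum to get a sense of how many groups to use; here the data is split into 7 groups. Note that these bins only cover (0, 70000], while the maximum is 677146, so any value above 70000 falls outside every bin and comes back as NaN (row 2 in the output below is one such case):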
>>> df['total_confirm'].max()
677146
>>> df['total_confirm'].min()
1
>>> bins = [0,10000,20000,30000,40000,50000,60000,70000]
>>> pd.cut(df['total_confirm'],bins)[:8]
0        (0.0, 10000.0]
1        (0.0, 10000.0]
2                   NaN
3    (10000.0, 20000.0]
4        (0.0, 10000.0]
5        (0.0, 10000.0]
6    (10000.0, 20000.0]
7        (0.0, 10000.0]
Name: total_confirm, dtype: category
Categories (7, interval[int64]): [(0, 10000] < (10000, 20000] < (20000, 30000] < (30000, 40000] < (40000, 50000] < (50000, 60000] < (60000, 70000]]
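The NaN in the output above comes from a value that lies beyond the last bin edge. The following sketch reproduces that behaviour on a small made-up Series (`counts` is hypothetical, not the real dataset) and shows one possible way to keep such values: appending an open-ended last edge of `float("inf")`.

```python
import pandas as pd

# Hypothetical case counts, not the COVID-19 dataset used in this post.
counts = pd.Series([500, 15000, 69999, 70000, 250000])

bins = [0, 10000, 20000, 30000, 40000, 50000, 60000, 70000]

# Intervals are right-closed, so 70000 still belongs to (60000, 70000].
# 250000 is above the last edge, matches no interval, and becomes NaN.
capped = pd.cut(counts, bins)
print(capped.isna().tolist())

# One possible fix (an assumption about what you want): add an open-ended
# final bin so that everything above 70000 is kept in its own group.
open_ended = pd.cut(counts, bins + [float("inf")])
print(open_ended.isna().tolist())
```

Whether a catch-all bin is appropriate depends on the analysis; for heavily skewed data like this, quantile-based binning with qcut() is another option.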
The labels parameter replaces these intervals with names of your choice; you must supply one label per bin:
>>> pd.cut(df['total_confirm'],bins=bins,labels=['A','B','C','D','E','F','G'])[:8]
0      A
1      A
2    NaN
3      B
4      A
5      A
6      B
7      A
Name: total_confirm, dtype: category
Categories (7, object): [A < B < C < D < E < F < G]
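A small sketch of the two common uses of `labels`, on made-up data (`scores` and the label names are hypothetical): a list of names gives a named category per bin, while `labels=False` returns the integer position of each bin instead.

```python
import pandas as pd

# Hypothetical scores, chosen only to illustrate the labels parameter.
scores = pd.Series([5, 35, 65, 95])
bins = [0, 30, 60, 100]  # 4 edges define 3 bins, so 3 labels are needed

# Named categories instead of interval labels.
named = pd.cut(scores, bins, labels=["low", "mid", "high"])

# labels=False returns the zero-based bin index for each value.
codes = pd.cut(scores, bins, labels=False)

print(named.tolist())
print(codes.tolist())
```

The integer form can be handy when the bin number feeds directly into a later computation and readable names are not needed.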
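2. qcut(): equal-frequency discretization. The bin edges are sample quantiles, so each interval holds roughly the same number of samples.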
# Split into 8 equal-frequency intervals
>>> bs = pd.qcut(df['total_confirm'],8)
>>> bs[:5]
0        (380.5, 979.5]
1    (2720.75, 8321.25]
2   (8321.25, 677146.0]
3   (8321.25, 677146.0]
4     (979.5, 2720.75]
Name: total_confirm, dtype: category
Categories (8, interval[float64]): [(0.999, 12.0] < (12.0, 35.0] < (35.0, 122.375] < (122.375, 380.5] < (380.5, 979.5] < (979.5, 2720.75] < (2720.75, 8321.25] < (8321.25, 677146.0]]
# Check the number of samples in each interval
>>> bs.value_counts()
(0.999, 12.0]          28
(8321.25, 677146.0]    26
(979.5, 2720.75]       26
(2720.75, 8321.25]     25
(380.5, 979.5]         25
(122.375, 380.5]       25
(12.0, 35.0]           25
(35.0, 122.375]        24
Name: total_confirm, dtype: int64
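The counts show that the intervals do not hold exactly the same number of samples. "Equal frequency" here means approximately equal, not perfectly equal: the 204 rows cannot be divided evenly into 8 bins, and duplicate values that sit on a quantile edge all land in the same bin.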
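To illustrate how tied values make the bins uneven, here is a minimal sketch on a made-up Series (`values` is hypothetical and deliberately contains repeated 2s). The three tied values sit exactly on the first quantile edge, so they all fall into the first bin.

```python
import pandas as pd

# Hypothetical data with repeated values, not the COVID-19 dataset.
values = pd.Series([1, 2, 2, 2, 3, 4, 5, 6, 7, 8])

# Ask for 4 quantile-based bins; 10 rows cannot split evenly into 4,
# and the tied 2s cannot be separated from each other.
quartiles = pd.qcut(values, 4)

# Count samples per bin, listed in bin order.
counts = quartiles.value_counts().sort_index()
print(counts)
```

If so many values are tied that two quantile edges coincide, `qcut` raises an error by default; passing `duplicates="drop"` is one way to merge the colliding edges, at the cost of getting fewer bins than requested.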