
12 - Pandas: discretization and binning (equal-width cut(), equal-frequency qcut())

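  When working with continuous data, it is often convenient for analysis to discretize it, that is, to split it into "bins" so that each value is assigned to a small interval.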

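  In Pandas: cut()  ---> equal-width binning (bin edges are set by value)

             qcut() ---> equal-frequency binning (bin edges are set by sample quantiles)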

1. cut(): equal-width discretization, where the bins are set so that every interval has the same width

  This uses the same example as the section on sorting and random permutation, namely the COVID-19 dataset.

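  Here we work on the cumulative confirmed cases column. First check its maximum and minimum to get a feel for how many groups to use; in this example the data is divided into 7 groups of width 10000. Note that these bins only cover (0, 70000], so any value outside that range (such as row 2 below) comes back as NaN: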

>>> df['total_confirm'].max()
677146
>>> df['total_confirm'].min()
1
>>> bins = [0,10000,20000,30000,40000,50000,60000,70000]
>>> pd.cut(df['total_confirm'],bins)[:8]
0        (0.0, 10000.0]
1        (0.0, 10000.0]
2                   NaN
3    (10000.0, 20000.0]
4        (0.0, 10000.0]
5        (0.0, 10000.0]
6    (10000.0, 20000.0]
7        (0.0, 10000.0]
Name: total_confirm, dtype: category
Categories (7, interval[int64]): [(0, 10000] < (10000, 20000] < (20000, 30000] < (30000, 40000] <
                                  (40000, 50000] < (50000, 60000] < (60000, 70000]]
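
  The NaN in the output above is easier to understand on a small series. The following is a minimal sketch that assumes a handful of made-up values in place of df['total_confirm'] (they are not the article's data). It shows that intervals are right-closed by default, and that passing an integer instead of a list lets pandas compute equal-width bins over the data range.

```python
import pandas as pd

# Hypothetical values standing in for df['total_confirm'] (not the article's data).
s = pd.Series([5, 9500, 150000, 12000, 70000, 0])

bins = [0, 10000, 20000, 30000, 40000, 50000, 60000, 70000]
binned = pd.cut(s, bins)

# Intervals are right-closed by default: 70000 falls in (60000, 70000],
# while 0 (on the open left edge) and 150000 (above the last edge) become NaN.
print(binned)
print(binned.isna().tolist())

# Passing an integer asks pandas for that many equal-width bins
# spanning the range of the data, so no value is left out.
auto = pd.cut(s, 3)
print(auto.cat.categories)
```

  If the largest values should be kept, one option is to extend the last edge in bins up to (or beyond) the column maximum.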

  The labels parameter replaces these intervals with other strings. One label is needed per interval:

>>> pd.cut(df['total_confirm'],bins=bins,labels=['A','B','C','D','E','F','G'])[:8]
0      A
1      A
2    NaN
3      B
4      A
5      A
6      B
7      A
Name: total_confirm, dtype: category
Categories (7, object): [A < B < C < D < E < F < G]
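
  As a small sketch of the labels parameter, again on made-up values rather than the article's data: a list of strings names each interval, while labels=False returns the integer position of the bin instead.

```python
import pandas as pd

# Hypothetical values; every one of them lies inside (0, 70000].
s = pd.Series([5, 9500, 12000, 65000])
bins = [0, 10000, 20000, 30000, 40000, 50000, 60000, 70000]

# One label per interval: 8 edges define 7 intervals, so 7 labels.
named = pd.cut(s, bins=bins, labels=list("ABCDEFG"))
print(named.tolist())

# labels=False returns the zero-based index of the bin instead of a category.
codes = pd.cut(s, bins=bins, labels=False)
print(codes.tolist())
```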

2. qcut(): equal-frequency discretization, where every interval holds (roughly) the same number of samples

# Split into 8 equal-frequency intervals
>>> bs = pd.qcut(df['total_confirm'],8)
>>> bs[:5]
0         (380.5, 979.5]
1     (2720.75, 8321.25]
2    (8321.25, 677146.0]
3    (8321.25, 677146.0]
4       (979.5, 2720.75]
Name: total_confirm, dtype: category
Categories (8, interval[float64]): [(0.999, 12.0] < (12.0, 35.0] < (35.0, 122.375] <
                                    (122.375, 380.5] < (380.5, 979.5] < (979.5, 2720.75] <
                                    (2720.75, 8321.25] < (8321.25, 677146.0]]
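
  The interval edges above are simply the sample quantiles of the column. A minimal sketch of this, assuming a made-up series of the integers 1 to 16 rather than the article's data, uses retbins=True to get the edges back and compares them with Series.quantile:

```python
import pandas as pd

# Hypothetical data: 16 distinct values, so 4 bins can hold 4 values each.
s = pd.Series(range(1, 17))

# retbins=True also returns the bin edges that qcut computed.
binned, edges = pd.qcut(s, 4, retbins=True)

# The same edges obtained directly as the 0%, 25%, 50%, 75% and 100% quantiles.
expected = s.quantile([0, 0.25, 0.5, 0.75, 1.0]).to_numpy()

print(edges)
print(expected)
print(binned.value_counts().sort_index().tolist())
```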

# Check the number of samples in each interval
>>> bs.value_counts()
(0.999, 12.0]          28
(8321.25, 677146.0]    26
(979.5, 2720.75]       26
(2720.75, 8321.25]     25
(380.5, 979.5]         25
(122.375, 380.5]       25
(12.0, 35.0]           25
(35.0, 122.375]        24
Name: total_confirm, dtype: int64

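The counts show that the intervals do not hold exactly the same number of samples. "Equal frequency" here therefore means approximately equal counts, not ideally equal ones: qcut places the bin edges at sample quantiles, so when the sample count (204 here) is not divisible by the number of bins, or when tied values sit on a bin edge, the counts can only come out roughly equal.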
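
The effect of tied values can be reproduced on a tiny series. This is a sketch on made-up numbers, not the article's data: ten values split into two quantile bins would ideally give 5 and 5, but the repeated 3s all have to go into the same bin.

```python
import pandas as pd

# Hypothetical data with a run of tied values around the median.
tied = pd.Series([1, 2, 3, 3, 3, 3, 4, 5, 6, 7])

# The median is 3, so the edges are [1, 3, 7]; every 3 lands in the first bin.
tied_counts = pd.qcut(tied, 2).value_counts().sort_index().tolist()
print(tied_counts)  # [6, 4]

# Without ties, and with a count divisible by the number of bins,
# the split is exactly even.
distinct = pd.Series(range(1, 17))
even_counts = pd.qcut(distinct, 8).value_counts().sort_index().tolist()
print(even_counts)  # [2, 2, 2, 2, 2, 2, 2, 2]
```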