
Machine Learning Programming Basics

Preprocessing the Black Friday shopping dataset
Main steps:
  • 1. Import packages and the dataset
  • 2. Handle missing data
  • 3. Feature engineering
  • 4. Handle categorical columns
  • 5. Separate the independent and dependent variables
  • 6. Split into training and test sets
  • 7. Feature scaling
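Taken together, the steps above can be condensed into one helper function. The sketch below is not part of the original walkthrough; it assumes the column names of this Black Friday dataset and simply mirrors the cells that follow:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def preprocess(df):
    """Condensed sketch of steps 2-7, assuming this dataset's columns."""
    # 2. Missing data: drop the two sparse columns, fill Marital_Status
    df = df.drop(['Product_Category_2', 'Product_Category_3'], axis=1)
    df['Marital_Status'] = df['Marital_Status'].fillna(0)
    # 3. Feature engineering: drop IDs, make city tenure numeric
    df = df.drop(['User_ID', 'Product_ID'], axis=1)
    df['Stay_In_Current_City_Years'] = (
        df['Stay_In_Current_City_Years'].replace('4+', 4).astype('int64'))
    # 4. Categorical columns: cast integer-coded categories, one-hot encode
    for col in ['Occupation', 'Marital_Status', 'Product_Category_1']:
        df[col] = df[col].astype('object')
    df = pd.get_dummies(df, drop_first=True)
    # 5. Separate the independent and dependent variables
    y = df.pop('Purchase').values
    x = df.values
    # 6. Train/test split
    x_tr, x_te, y_tr, y_te = train_test_split(
        x, y, test_size=0.3, random_state=205)
    # 7. Feature scaling (keep sc_y so predictions can be inverted later)
    sc_x, sc_y = StandardScaler(), StandardScaler()
    x_tr = sc_x.fit_transform(x_tr)
    x_te = sc_x.transform(x_te)
    y_tr = np.ravel(sc_y.fit_transform(y_tr.reshape(-1, 1)))
    return x_tr, x_te, y_tr, y_te, sc_y
```

Returning `sc_y` matters: predictions made on the scaled target must be mapped back with `sc_y.inverse_transform` before they are interpretable as purchase amounts.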

 

Dataset link: https://www.heywhale.com/mw/dataset/622f197b8a84f900178aa2c7/file

 

1. Import packages and the dataset

In [2]:
# Import packages
import numpy as np
import pandas as pd

In [3]:
# Import the dataset
data = pd.read_csv('BlackFriday.csv')
data.head(5)
Out[3]:
  User_ID Product_ID Gender Age Occupation City_Category Stay_In_Current_City_Years Marital_Status Product_Category_1 Product_Category_2 Product_Category_3 Purchase
0 1000001 P00069042 F 0-17 10 A 2 0.0 3 NaN NaN 8370
1 1000001 P00248942 F 0-17 10 A 2 0.0 1 6.0 14.0 15200
2 1000001 P00087842 F 0-17 10 A 2 NaN 12 NaN NaN 1422
3 1000001 P00085442 F 0-17 10 A 2 0.0 12 14.0 NaN 1057
4 1000002 P00285442 M 55+ 16 C 4+ 0.0 8 NaN NaN 7969
 

2. Handle missing data

In [4]:
# Handle missing data
# Check for missing values
null_df = data.isnull().sum()
null_df
Out[4]:
User_ID                           0
Product_ID                        0
Gender                            0
Age                               0
Occupation                        0
City_Category                     0
Stay_In_Current_City_Years        0
Marital_Status                    3
Product_Category_1                0
Product_Category_2            15721
Product_Category_3            34817
Purchase                          0
dtype: int64

 

Marital_Status has 3 missing values; in this business context a missing value defaults to unmarried, so fill it with 0. Product_Category_2 has 15721 missing values and Product_Category_3 has 34817; neither column matters for this analysis, so both are dropped.

In [5]:
# Drop the two columns with many missing values
data = data.drop(['Product_Category_2', 'Product_Category_3'], axis = 1) 
In [6]:
# Fill the missing values in Marital_Status
data['Marital_Status'] = data['Marital_Status'].fillna(0)
In [7]:
# Check for missing values again
null_df = data.isnull().sum()
null_df 
Out[7]:
User_ID                       0
Product_ID                    0
Gender                        0
Age                           0
Occupation                    0
City_Category                 0
Stay_In_Current_City_Years    0
Marital_Status                0
Product_Category_1            0
Purchase                      0
dtype: int64

3. Feature engineering

Feature engineering is the process of transforming raw data into features that better represent the underlying problem.

In [8]:
# Feature engineering
# Drop columns with no predictive value
data = data.drop(['User_ID', 'Product_ID'], axis = 1)
In [9]:
# Clean up the Stay_In_Current_City_Years column: map '4+' to 4, then make it numeric
data['Stay_In_Current_City_Years'] = data['Stay_In_Current_City_Years'].replace('4+', 4)
data['Stay_In_Current_City_Years'] = data['Stay_In_Current_City_Years'].astype('int64')

 

4. Handle categorical columns

In [10]:
# Handle categorical columns
# Inspect the column dtypes
print(data.dtypes)
Gender                         object
Age                            object
Occupation                      int64
City_Category                  object
Stay_In_Current_City_Years      int64
Marital_Status                float64
Product_Category_1              int64
Purchase                        int64
dtype: object

Given the business context, Occupation, Marital_Status, and Product_Category_1 are really categorical columns, so they need to be converted.
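A quick way to see why the cast matters: by default `pd.get_dummies` encodes only `object`/`category` columns, so integer-coded categories like Occupation would otherwise pass through unchanged. A minimal illustration with made-up column names:

```python
import pandas as pd

toy = pd.DataFrame({'occ_code': [1, 2, 1], 'city': ['A', 'B', 'A']})
# Only the object column 'city' is one-hot encoded; 'occ_code' passes through
before = pd.get_dummies(toy).columns.tolist()
print(before)
# After casting to object, the integer codes are treated as categories too
toy['occ_code'] = toy['occ_code'].astype('object')
after = pd.get_dummies(toy).columns.tolist()
print(after)
```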

In [11]:
# Convert the dtypes
data['Product_Category_1'] = data['Product_Category_1'].astype('object')
data['Occupation'] = data['Occupation'].astype('object')
data['Marital_Status'] = data['Marital_Status'].astype('object')
In [12]:
# Inspect the column dtypes again
print(data.dtypes)
Gender                        object
Age                           object
Occupation                    object
City_Category                 object
Stay_In_Current_City_Years     int64
Marital_Status                object
Product_Category_1            object
Purchase                       int64
dtype: object
In [13]:
# Label encoding & one-hot encoding
data = pd.get_dummies(data, drop_first = True) 
data.head(5)
Out[13]:
  Stay_In_Current_City_Years Purchase Gender_M Age_18-25 Age_26-35 Age_36-45 Age_46-50 Age_51-55 Age_55+ Occupation_1 ... Product_Category_1_9 Product_Category_1_10 Product_Category_1_11 Product_Category_1_12 Product_Category_1_13 Product_Category_1_14 Product_Category_1_15 Product_Category_1_16 Product_Category_1_17 Product_Category_1_18
0 2 8370 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 2 15200 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 2 1422 0 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 0
3 2 1057 0 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 0
4 4 7969 1 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 49 columns
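The `drop_first=True` argument drops the first dummy of each categorical column to avoid the "dummy variable trap" (perfect multicollinearity): the dropped level is implied whenever all remaining dummies are 0. A small illustration:

```python
import pandas as pd

city = pd.DataFrame({'City_Category': ['A', 'B', 'C', 'A']})
# 'City_Category_A' is dropped; a row of all zeros means category A
dummies = pd.get_dummies(city, drop_first=True)
print(dummies.columns.tolist())  # → ['City_Category_B', 'City_Category_C']
```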

 

5. Separate the independent and dependent variables

In [14]:
# Separate the independent and dependent variables
y = data['Purchase'].values
print(y.shape)
data = data.drop(['Purchase'], axis = 1)
x = data.values
print(x.shape)
(50000,)
(50000, 48)

 

6. Split into training and test sets

In [15]:
# Split into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 205)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
(35000, 48)
(15000, 48)
(35000,)
(15000,)

The independent variables are stored in x_train and x_test, and the dependent variable in y_train and y_test.

 

7. Feature scaling

In [16]:
y_train.shape
Out[16]:
(35000,)
In [17]:
# StandardScaler expects 2-D input, so reshape y into a column vector
a = y_train.reshape(-1, 1)
a.shape
Out[17]:
(35000, 1)
In [18]:
# Feature scaling
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
sc_y = StandardScaler()
y_train = np.ravel(sc_y.fit_transform(y_train.reshape(-1, 1)))
In [19]:
x_train[:3,:]
Out[19]:
array([[ 0.10343173, -1.77050054, -0.4873769 ,  1.24137799, -0.49566581,
        -0.2961413 , -0.27386969, -0.20083646, -0.30534824, -0.22524545,
        -0.18825963, -0.38533409, -0.1388486 , -0.18621612, -0.33838987,
        -0.05537615, -0.10670196, -0.16153035, -0.14868325, -0.24336363,
        -0.12529891, -0.22657272, -0.1519627 , -0.2198004 , -0.28316115,
        -0.11100179, -0.12739959, -0.26453064, -0.84608185, -0.66227816,
        -0.83770833, -0.2140971 , -0.19340603, -0.14727185,  1.59734051,
        -0.19412034, -0.08274392, -0.51575763, -0.02726553, -0.09620978,
        -0.21851165, -0.08396038, -0.09756185, -0.05216966, -0.10900638,
        -0.13139418, -0.03339953, -0.0726975 ],
       [-1.43966733,  0.56481203,  2.05180016, -0.80555641, -0.49566581,
        -0.2961413 , -0.27386969, -0.20083646, -0.30534824, -0.22524545,
        -0.18825963,  2.5951506 , -0.1388486 , -0.18621612, -0.33838987,
        -0.05537615, -0.10670196, -0.16153035, -0.14868325, -0.24336363,
        -0.12529891, -0.22657272, -0.1519627 , -0.2198004 , -0.28316115,
        -0.11100179, -0.12739959, -0.26453064,  1.18191875, -0.66227816,
        -0.83770833, -0.2140971 , -0.19340603, -0.14727185, -0.62604059,
        -0.19412034, -0.08274392,  1.93889522, -0.02726553, -0.09620978,
        -0.21851165, -0.08396038, -0.09756185, -0.05216966, -0.10900638,
        -0.13139418, -0.03339953, -0.0726975 ],
       [-0.6681178 ,  0.56481203, -0.4873769 ,  1.24137799, -0.49566581,
        -0.2961413 , -0.27386969, -0.20083646, -0.30534824, -0.22524545,
        -0.18825963, -0.38533409, -0.1388486 , -0.18621612, -0.33838987,
        -0.05537615, -0.10670196, -0.16153035,  6.7257073 , -0.24336363,
        -0.12529891, -0.22657272, -0.1519627 , -0.2198004 , -0.28316115,
        -0.11100179, -0.12739959, -0.26453064,  1.18191875, -0.66227816,
         1.19373291, -0.2140971 , -0.19340603, -0.14727185,  1.59734051,
        -0.19412034, -0.08274392, -0.51575763, -0.02726553, -0.09620978,
        -0.21851165, -0.08396038, -0.09756185, -0.05216966, -0.10900638,
        -0.13139418, -0.03339953, -0.0726975 ]])
In [20]:
y_train[:3]
Out[20]:
array([-0.12211265, -1.46218147, -0.78499224])

After scaling, all the values in x_train and y_train lie within a similar range.
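Because the target was scaled with its own `sc_y` scaler, any model trained on it will predict in the scaled space; `sc_y.inverse_transform` maps those predictions back to purchase amounts. A self-contained sketch with toy values standing in for y_train:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy purchase amounts standing in for y_train
y = np.array([8370.0, 15200.0, 1422.0])
sc = StandardScaler()
y_scaled = np.ravel(sc.fit_transform(y.reshape(-1, 1)))
# Map scaled values (or scaled predictions) back to the original units
y_back = np.ravel(sc.inverse_transform(y_scaled.reshape(-1, 1)))
print(np.allclose(y_back, y))  # → True
```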

 

Conclusion: data preprocessing follows a fairly fixed set of methods, and Python's rich ecosystem of libraries makes the work straightforward. Preprocessing the raw data produced x_train, y_train, x_test, and y_test. In the next chapter, the first two will be used to train the model and the last two to evaluate it.