Machine Learning: Programming Fundamentals 1
By 阿新 • Published: 2022-03-14
Preprocessing Black Friday Data
Main workflow:
- 1. Import packages and the dataset
- 2. Handle missing data
- 3. Feature engineering
- 4. Handle categorical columns
- 5. Get the independent and dependent variables
- 6. Split into training and test sets
- 7. Feature scaling
1. Import Packages and the Dataset
In [2]:
# Import packages
import numpy as np
import pandas as pd
In [3]:
# Import the dataset
data = pd.read_csv('BlackFriday.csv')
data.head(5)
Out[3]:
|   | User_ID | Product_ID | Gender | Age | Occupation | City_Category | Stay_In_Current_City_Years | Marital_Status | Product_Category_1 | Product_Category_2 | Product_Category_3 | Purchase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000001 | P00069042 | F | 0-17 | 10 | A | 2 | 0.0 | 3 | NaN | NaN | 8370 |
| 1 | 1000001 | P00248942 | F | 0-17 | 10 | A | 2 | 0.0 | 1 | 6.0 | 14.0 | 15200 |
| 2 | 1000001 | P00087842 | F | 0-17 | 10 | A | 2 | NaN | 12 | NaN | NaN | 1422 |
| 3 | 1000001 | P00085442 | F | 0-17 | 10 | A | 2 | 0.0 | 12 | 14.0 | NaN | 1057 |
| 4 | 1000002 | P00285442 | M | 55+ | 16 | C | 4+ | 0.0 | 8 | NaN | NaN | 7969 |
2. Handle Missing Data
In [4]:
# Handle missing data
# Detect missing values
null_df = data.isnull().sum()
null_df
Out[4]:
User_ID 0
Product_ID 0
Gender 0
Age 0
Occupation 0
City_Category 0
Stay_In_Current_City_Years 0
Marital_Status 3
Product_Category_1 0
Product_Category_2 15721
Product_Category_3 34817
Purchase 0
dtype: int64
Marital_Status has 3 missing values; given the business context, a missing value defaults to unmarried, so fill it with 0. Product_Category_2 has 15,721 missing values; given the business context, the column is not important, so drop it. Product_Category_3 has 34,817 missing values; for the same reason, drop it.
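The fill-versus-drop decision can be sketched on a tiny made-up frame (the column names here are illustrative, not from the dataset): a column with a few NaNs and a sensible business default gets filled, while a sparse, low-value column gets dropped.

```python
import numpy as np
import pandas as pd

# Toy frame: 'status' has few NaNs, 'extra' is mostly NaN
df = pd.DataFrame({
    'status': [0.0, 1.0, np.nan, 0.0],
    'extra':  [np.nan, np.nan, np.nan, 5.0],
})

df = df.drop(['extra'], axis=1)           # sparse and unimportant: drop
df['status'] = df['status'].fillna(0)     # business default (e.g. "unmarried"): fill

print(df['status'].isnull().sum())  # 0
```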
In [5]:
# Drop the 2 columns with missing values
data = data.drop(['Product_Category_2', 'Product_Category_3'], axis = 1)
In [6]:
# Fill the column with missing values
data['Marital_Status'] = data['Marital_Status'].fillna(0)
In [7]:
# Check for missing values again
null_df = data.isnull().sum()
null_df
Out[7]:
User_ID 0
Product_ID 0
Gender 0
Age 0
Occupation 0
City_Category 0
Stay_In_Current_City_Years 0
Marital_Status 0
Product_Category_1 0
Purchase 0
dtype: int64
3. Feature Engineering
Feature engineering is the process of transforming raw data into features that better express the essence of the problem.
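As a minimal illustration of the idea (the `age_range` column and the lower-bound mapping below are made up for demonstration, not what this notebook does with its Age column), a raw range string can be reshaped into an ordinal number a model can compare:

```python
import pandas as pd

df = pd.DataFrame({'age_range': ['0-17', '18-25', '55+']})
# Strip the trailing '+', split on '-', and keep the lower bound as an integer
df['age_lower'] = (df['age_range']
                   .str.rstrip('+')
                   .str.split('-')
                   .str[0]
                   .astype('int64'))
print(df['age_lower'].tolist())  # [0, 18, 55]
```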
In [8]:
# Feature engineering
# Drop columns that are not useful for modeling
data = data.drop(['User_ID', 'Product_ID'], axis = 1)
In [9]:
# Clean the Stay_In_Current_City_Years column
data['Stay_In_Current_City_Years'] = data['Stay_In_Current_City_Years'].replace('4+', 4)
data['Stay_In_Current_City_Years'] = data['Stay_In_Current_City_Years'].astype('int64')
4. Handle Categorical Columns
In [10]:
# Handle categorical columns
# Check variable types
print(data.dtypes)
Gender object
Age object
Occupation int64
City_Category object
Stay_In_Current_City_Years int64
Marital_Status float64
Product_Category_1 int64
Purchase int64
dtype: object
Given the business context, Occupation, Marital_Status, and Product_Category_1 should be categorical columns, so they need to be converted.
In [11]:
# Convert variable types
data['Product_Category_1'] = data['Product_Category_1'].astype('object')
data['Occupation'] = data['Occupation'].astype('object')
data['Marital_Status'] = data['Marital_Status'].astype('object')
In [12]:
# Check variable types again
print(data.dtypes)
Gender object
Age object
Occupation object
City_Category object
Stay_In_Current_City_Years int64
Marital_Status object
Product_Category_1 object
Purchase int64
dtype: object
In [13]:
# One-hot encode the categorical columns
data = pd.get_dummies(data, drop_first = True)
data.head(5)
Out[13]:
|   | Stay_In_Current_City_Years | Purchase | Gender_M | Age_18-25 | Age_26-35 | Age_36-45 | Age_46-50 | Age_51-55 | Age_55+ | Occupation_1 | ... | Product_Category_1_9 | Product_Category_1_10 | Product_Category_1_11 | Product_Category_1_12 | Product_Category_1_13 | Product_Category_1_14 | Product_Category_1_15 | Product_Category_1_16 | Product_Category_1_17 | Product_Category_1_18 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 8370 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | 15200 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | 1422 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 2 | 1057 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 4 | 7969 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 49 columns
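`drop_first=True` drops one dummy per categorical column so the remaining dummies are not perfectly collinear (the dummy-variable trap); the dropped level becomes the baseline. A sketch on a toy column (the `city` column is made up):

```python
import pandas as pd

df = pd.DataFrame({'city': ['A', 'B', 'C', 'A']})

full = pd.get_dummies(df)                      # one dummy per level
reduced = pd.get_dummies(df, drop_first=True)  # level 'A' becomes the baseline

print(list(full.columns))     # ['city_A', 'city_B', 'city_C']
print(list(reduced.columns))  # ['city_B', 'city_C']
```

A row of all zeros in `reduced` therefore means "city A", so no information is lost.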
5. Get the Independent and Dependent Variables
In [14]:
# Get the independent and dependent variables
y = data['Purchase'].values
print(y.shape)
data = data.drop(['Purchase'], axis = 1)
x = data.values
print(x.shape)
(50000,)
(50000, 48)
6. Split into Training and Test Sets
In [15]:
# Split into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 205)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
(35000, 48)
(15000, 48)
(35000,)
(15000,)
The independent variables are stored in x_train and x_test, and the dependent variable in y_train and y_test.
7. Feature Scaling
In [16]:
y_train.shape
Out[16]:
(35000,)
In [17]:
a = y_train.reshape(-1, 1)
a.shape
Out[17]:
(35000, 1)
In [18]:
# Feature scaling
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
sc_y = StandardScaler()
y_train = np.ravel(sc_y.fit_transform(y_train.reshape(-1, 1)))
In [19]:
x_train[:3,:]
Out[19]:
array([[ 0.10343173, -1.77050054, -0.4873769 , 1.24137799, -0.49566581,
-0.2961413 , -0.27386969, -0.20083646, -0.30534824, -0.22524545,
-0.18825963, -0.38533409, -0.1388486 , -0.18621612, -0.33838987,
-0.05537615, -0.10670196, -0.16153035, -0.14868325, -0.24336363,
-0.12529891, -0.22657272, -0.1519627 , -0.2198004 , -0.28316115,
-0.11100179, -0.12739959, -0.26453064, -0.84608185, -0.66227816,
-0.83770833, -0.2140971 , -0.19340603, -0.14727185, 1.59734051,
-0.19412034, -0.08274392, -0.51575763, -0.02726553, -0.09620978,
-0.21851165, -0.08396038, -0.09756185, -0.05216966, -0.10900638,
-0.13139418, -0.03339953, -0.0726975 ],
[-1.43966733, 0.56481203, 2.05180016, -0.80555641, -0.49566581,
-0.2961413 , -0.27386969, -0.20083646, -0.30534824, -0.22524545,
-0.18825963, 2.5951506 , -0.1388486 , -0.18621612, -0.33838987,
-0.05537615, -0.10670196, -0.16153035, -0.14868325, -0.24336363,
-0.12529891, -0.22657272, -0.1519627 , -0.2198004 , -0.28316115,
-0.11100179, -0.12739959, -0.26453064, 1.18191875, -0.66227816,
-0.83770833, -0.2140971 , -0.19340603, -0.14727185, -0.62604059,
-0.19412034, -0.08274392, 1.93889522, -0.02726553, -0.09620978,
-0.21851165, -0.08396038, -0.09756185, -0.05216966, -0.10900638,
-0.13139418, -0.03339953, -0.0726975 ],
[-0.6681178 , 0.56481203, -0.4873769 , 1.24137799, -0.49566581,
-0.2961413 , -0.27386969, -0.20083646, -0.30534824, -0.22524545,
-0.18825963, -0.38533409, -0.1388486 , -0.18621612, -0.33838987,
-0.05537615, -0.10670196, -0.16153035, 6.7257073 , -0.24336363,
-0.12529891, -0.22657272, -0.1519627 , -0.2198004 , -0.28316115,
-0.11100179, -0.12739959, -0.26453064, 1.18191875, -0.66227816,
1.19373291, -0.2140971 , -0.19340603, -0.14727185, 1.59734051,
-0.19412034, -0.08274392, -0.51575763, -0.02726553, -0.09620978,
-0.21851165, -0.08396038, -0.09756185, -0.05216966, -0.10900638,
-0.13139418, -0.03339953, -0.0726975 ]])
In [20]:
y_train[:3]
Out[20]:
array([-0.12211265, -1.46218147, -0.78499224])
After scaling, all features in x_train and y_train lie within a similar range.
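The effect can be verified on a toy array (a sketch, not the dataset): a standardized column has mean ≈ 0 and standard deviation ≈ 1, and the fitted scaler's `inverse_transform` maps scaled values back to the original purchase scale, which is how scaled predictions from `sc_y` would be converted back later.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

y = np.array([8370.0, 15200.0, 1422.0, 1057.0]).reshape(-1, 1)
sc = StandardScaler()
y_scaled = sc.fit_transform(y)

print(np.isclose(y_scaled.mean(), 0.0))  # True: centered
print(np.isclose(y_scaled.std(), 1.0))   # True: unit variance
print(np.allclose(sc.inverse_transform(y_scaled), y))  # True: round-trips
```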
Conclusion: Data preprocessing follows a fixed set of methods, and Python provides rich libraries that make the work convenient. Preprocessing turned the raw data into x_train, y_train, x_test, and y_test. In the next chapter, the first two variables will train the model and the last two will evaluate it.
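The same encode-and-scale steps can also be packaged so they apply consistently to any new data; a minimal sketch using scikit-learn's ColumnTransformer (the two column names come from this dataset, but the toy rows are made up, and this is an alternative packaging, not what the notebook above does):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows shaped like the cleaned dataset
df = pd.DataFrame({
    'Gender': ['F', 'M', 'F'],
    'Stay_In_Current_City_Years': [2, 4, 1],
})

# One-hot the categorical column (dropping the first level, like drop_first=True)
# and standardize the numeric column, in one fitted object
ct = ColumnTransformer([
    ('cat', OneHotEncoder(drop='first'), ['Gender']),
    ('num', StandardScaler(), ['Stay_In_Current_City_Years']),
])
x = ct.fit_transform(df)
print(x.shape)  # (3, 2): one dummy column + one scaled column
```

Fitting the transformer once and reusing it on test data avoids the leakage that refitting the scaler on the test set would introduce.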