Machine Learning: Programming Fundamentals 1
By 阿新 • Published: 2022-03-14
Preprocessing Black Friday Data
Main workflow:
- 1. Import packages and the dataset
- 2. Handle missing data
- 3. Feature engineering
- 4. Handle categorical columns
- 5. Get the independent and dependent variables
- 6. Split into training and test sets
- 7. Feature scaling
1. Import Packages and the Dataset
In [2]:
# Import packages
import numpy as np
import pandas as pd
In [3]:
# Import the dataset
data = pd.read_csv('BlackFriday.csv')
data.head(5)
Out[3]:
|   | User_ID | Product_ID | Gender | Age | Occupation | City_Category | Stay_In_Current_City_Years | Marital_Status | Product_Category_1 | Product_Category_2 | Product_Category_3 | Purchase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000001 | P00069042 | F | 0-17 | 10 | A | 2 | 0.0 | 3 | NaN | NaN | 8370 |
| 1 | 1000001 | P00248942 | F | 0-17 | 10 | A | 2 | 0.0 | 1 | 6.0 | 14.0 | 15200 |
| 2 | 1000001 | P00087842 | F | 0-17 | 10 | A | 2 | NaN | 12 | NaN | NaN | 1422 |
| 3 | 1000001 | P00085442 | F | 0-17 | 10 | A | 2 | 0.0 | 12 | 14.0 | NaN | 1057 |
| 4 | 1000002 | P00285442 | M | 55+ | 16 | C | 4+ | 0.0 | 8 | NaN | NaN | 7969 |
2. Handle Missing Data
In [4]:
# Handle missing data
# Detect missing values
null_df = data.isnull().sum()
null_df
Out[4]:
User_ID 0
Product_ID 0
Gender 0
Age 0
Occupation 0
City_Category 0
Stay_In_Current_City_Years 0
Marital_Status 3
Product_Category_1 0
Product_Category_2 15721
Product_Category_3 34817
Purchase 0
dtype: int64
Marital_Status has 3 missing values; given the business context, a missing value defaults to unmarried, so fill it with 0. Product_Category_2 has 15,721 missing values; given the business context, the column is not important, so drop it. Product_Category_3 has 34,817 missing values; for the same reason, drop it.
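The fill-versus-drop decision can be sketched on a tiny made-up frame (the column names here are illustrative, not from the dataset): a column with a few NaNs and a sensible business default gets filled, while a sparse, low-value column gets dropped.

```python
import numpy as np
import pandas as pd

# Toy frame: 'status' has few NaNs, 'extra' is mostly NaN
df = pd.DataFrame({
    'status': [0.0, 1.0, np.nan, 0.0],
    'extra':  [np.nan, np.nan, np.nan, 5.0],
})

df = df.drop(['extra'], axis=1)           # sparse and unimportant: drop
df['status'] = df['status'].fillna(0)     # business default (e.g. "unmarried"): fill

print(df['status'].isnull().sum())  # 0
```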
In [5]:
# Drop the 2 columns with missing values
data = data.drop(['Product_Category_2', 'Product_Category_3'], axis = 1)
In [6]:
# Fill the column with missing values
data['Marital_Status'] = data['Marital_Status'].fillna(0)
In [7]:
# Check for missing values again
null_df = data.isnull().sum()
null_df
Out[7]:
User_ID 0
Product_ID 0
Gender 0
Age 0
Occupation 0
City_Category 0
Stay_In_Current_City_Years 0
Marital_Status 0
Product_Category_1 0
Purchase 0
dtype: int64
3. Feature Engineering
Feature engineering is the process of transforming raw data into features that better express the essence of the problem.
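As a minimal illustration of the idea (the `age_range` column and the lower-bound mapping below are made up for demonstration, not what this notebook does with its Age column), a raw range string can be reshaped into an ordinal number a model can compare:

```python
import pandas as pd

df = pd.DataFrame({'age_range': ['0-17', '18-25', '55+']})
# Strip the trailing '+', split on '-', and keep the lower bound as an integer
df['age_lower'] = (df['age_range']
                   .str.rstrip('+')
                   .str.split('-')
                   .str[0]
                   .astype('int64'))
print(df['age_lower'].tolist())  # [0, 18, 55]
```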
In [8]:
# Feature engineering
# Drop columns that are not useful for modeling
data = data.drop(['User_ID', 'Product_ID'], axis = 1)
In [9]:
# Clean the Stay_In_Current_City_Years column
data['Stay_In_Current_City_Years'] = data['Stay_In_Current_City_Years'].replace('4+', 4)
data['Stay_In_Current_City_Years'] = data['Stay_In_Current_City_Years'].astype('int64')
4. Handle Categorical Columns
In [10]:
# Handle categorical columns
# Check variable types
print(data.dtypes)
Gender object
Age object
Occupation int64
City_Category object
Stay_In_Current_City_Years int64
Marital_Status float64
Product_Category_1 int64
Purchase int64
dtype: object
Given the business context, Occupation, Marital_Status, and Product_Category_1 should be categorical columns, so they need to be converted.
In [11]:
# Convert variable types
data['Product_Category_1'] = data['Product_Category_1'].astype('object')
data['Occupation'] = data['Occupation'].astype('object')
data['Marital_Status'] = data['Marital_Status'].astype('object')
In [12]:
# Check variable types again
print(data.dtypes)
Gender object
Age object
Occupation object
City_Category object
Stay_In_Current_City_Years int64
Marital_Status object
Product_Category_1 object
Purchase int64
dtype: object
In [13]:
# One-hot encode the categorical columns
data = pd.get_dummies(data, drop_first = True)
data.head(5)
Out[13]:
|   | Stay_In_Current_City_Years | Purchase | Gender_M | Age_18-25 | Age_26-35 | Age_36-45 | Age_46-50 | Age_51-55 | Age_55+ | Occupation_1 | ... | Product_Category_1_9 | Product_Category_1_10 | Product_Category_1_11 | Product_Category_1_12 | Product_Category_1_13 | Product_Category_1_14 | Product_Category_1_15 | Product_Category_1_16 | Product_Category_1_17 | Product_Category_1_18 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 8370 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | 15200 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | 1422 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 2 | 1057 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 4 | 7969 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 49 columns
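`drop_first=True` drops one dummy per categorical column so the remaining dummies are not perfectly collinear (the dummy-variable trap); the dropped level becomes the baseline. A sketch on a toy column (the `city` column is made up):

```python
import pandas as pd

df = pd.DataFrame({'city': ['A', 'B', 'C', 'A']})

full = pd.get_dummies(df)                      # one dummy per level
reduced = pd.get_dummies(df, drop_first=True)  # level 'A' becomes the baseline

print(list(full.columns))     # ['city_A', 'city_B', 'city_C']
print(list(reduced.columns))  # ['city_B', 'city_C']
```

A row of all zeros in `reduced` therefore means "city A", so no information is lost.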
5. Get the Independent and Dependent Variables
In [14]:
# Get the independent and dependent variables
y = data['Purchase'].values
print(y.shape)
data = data.drop(['Purchase'], axis = 1)
x = data.values
print(x.shape)
(50000,)
(50000, 48)
6. Split into Training and Test Sets
In [15]:
# Split into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 205)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
(35000, 48)
(15000, 48)
(35000,)
(15000,)
The independent variables are stored in x_train and x_test, and the dependent variable in y_train and y_test.
7. Feature Scaling
In [16]:
y_train.shape
Out[16]:
(35000,)
In [17]:
a = y_train.reshape(-1, 1)
a.shape
Out[17]:
(35000, 1)
In [18]:
# Feature scaling
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
sc_y = StandardScaler()
y_train = np.ravel(sc_y.fit_transform(y_train.reshape(-1, 1)))
In [19]:
x_train[:3,:]
Out[19]:
array([[ 0.10343173, -1.77050054, -0.4873769 , 1.24137799, -0.49566581,
-0.2961413 , -0.27386969, -0.20083646, -0.30534824, -0.22524545,
-0.18825963, -0.38533409, -0.1388486 , -0.18621612, -0.33838987,
-0.05537615, -0.10670196, -0.16153035, -0.14868325, -0.24336363,
-0.12529891, -0.22657272, -0.1519627 , -0.2198004 , -0.28316115,
-0.11100179, -0.12739959, -0.26453064, -0.84608185, -0.66227816,
-0.83770833, -0.2140971 , -0.19340603, -0.14727185, 1.59734051,
-0.19412034, -0.08274392, -0.51575763, -0.02726553, -0.09620978,
-0.21851165, -0.08396038, -0.09756185, -0.05216966, -0.10900638,
-0.13139418, -0.03339953, -0.0726975 ],
[-1.43966733, 0.56481203, 2.05180016, -0.80555641, -0.49566581,
-0.2961413 , -0.27386969, -0.20083646, -0.30534824, -0.22524545,
-0.18825963, 2.5951506 , -0.1388486 , -0.18621612, -0.33838987,
-0.05537615, -0.10670196, -0.16153035, -0.14868325, -0.24336363,
-0.12529891, -0.22657272, -0.1519627 , -0.2198004 , -0.28316115,
-0.11100179, -0.12739959, -0.26453064, 1.18191875, -0.66227816,
-0.83770833, -0.2140971 , -0.19340603, -0.14727185, -0.62604059,
-0.19412034, -0.08274392, 1.93889522, -0.02726553, -0.09620978,
-0.21851165, -0.08396038, -0.09756185, -0.05216966, -0.10900638,
-0.13139418, -0.03339953, -0.0726975 ],
[-0.6681178 , 0.56481203, -0.4873769 , 1.24137799, -0.49566581,
-0.2961413 , -0.27386969, -0.20083646, -0.30534824, -0.22524545,
-0.18825963, -0.38533409, -0.1388486 , -0.18621612, -0.33838987,
-0.05537615, -0.10670196, -0.16153035, 6.7257073 , -0.24336363,
-0.12529891, -0.22657272, -0.1519627 , -0.2198004 , -0.28316115,
-0.11100179, -0.12739959, -0.26453064, 1.18191875, -0.66227816,
1.19373291, -0.2140971 , -0.19340603, -0.14727185, 1.59734051,
-0.19412034, -0.08274392, -0.51575763, -0.02726553, -0.09620978,
-0.21851165, -0.08396038, -0.09756185, -0.05216966, -0.10900638,
-0.13139418, -0.03339953, -0.0726975 ]])
In [20]:
y_train[:3]
Out[20]:
array([-0.12211265, -1.46218147, -0.78499224])
After scaling, all features in x_train and y_train lie within a similar range.
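The effect can be verified on a toy array (a sketch, not the dataset): a standardized column has mean ≈ 0 and standard deviation ≈ 1, and the fitted scaler's `inverse_transform` maps scaled values back to the original purchase scale, which is how scaled predictions from `sc_y` would be converted back later.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

y = np.array([8370.0, 15200.0, 1422.0, 1057.0]).reshape(-1, 1)
sc = StandardScaler()
y_scaled = sc.fit_transform(y)

print(np.isclose(y_scaled.mean(), 0.0))  # True: centered
print(np.isclose(y_scaled.std(), 1.0))   # True: unit variance
print(np.allclose(sc.inverse_transform(y_scaled), y))  # True: round-trips
```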
Conclusion: Data preprocessing follows a fixed set of methods, and Python provides rich libraries that make the work convenient. Preprocessing turned the raw data into x_train, y_train, x_test, and y_test. In the next chapter, the first two variables will train the model and the last two will evaluate it.
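The same encode-and-scale steps can also be packaged so they apply consistently to any new data; a minimal sketch using scikit-learn's ColumnTransformer (the two column names come from this dataset, but the toy rows are made up, and this is an alternative packaging, not what the notebook above does):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows shaped like the cleaned dataset
df = pd.DataFrame({
    'Gender': ['F', 'M', 'F'],
    'Stay_In_Current_City_Years': [2, 4, 1],
})

# One-hot the categorical column (dropping the first level, like drop_first=True)
# and standardize the numeric column, in one fitted object
ct = ColumnTransformer([
    ('cat', OneHotEncoder(drop='first'), ['Gender']),
    ('num', StandardScaler(), ['Stay_In_Current_City_Years']),
])
x = ct.fit_transform(df)
print(x.shape)  # (3, 2): one dummy column + one scaled column
```

Fitting the transformer once and reusing it on test data avoids the leakage that refitting the scaler on the test set would introduce.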