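Machine Learning: Data Preprocessing

阿新 • Published: 2018-12-13

Basics

There are two main kinds of machine learning: supervised and unsupervised. In supervised learning you tell the computer explicitly what the target is, so every training sample comes with the answer it should learn to predict. In unsupervised learning the computer "teaches itself": no target is set, and after training it reports what structure it found in the data.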

# encoding=utf-8

from sklearn import linear_model
import matplotlib.pyplot as plt
import numpy as np

# Historical data: [house area, price] pairs (inlined here rather than read from a CSV file)
data = np.array([[150, 6450], [200, 7450], [250, 8450], [300, 9450], [350, 11450], [400, 15450], [600, 18450]])

# scikit-learn expects a 2-D feature matrix, so reshape the area column to (n_samples, 1)
X = data[:, 0].reshape(-1, 1)
y = data[:, 1]

# Linear model
regr = linear_model.LinearRegression()
# Fit
regr.fit(X, y)
# Slope and intercept of the fitted line
a, b = regr.coef_, regr.intercept_
print(a, b)

# Predict the price for areas of 175 and 800 (predict also needs 2-D input)
print(regr.predict([[175]]))
print(regr.predict([[800]]))

# Plot the raw data (blue) and the fitted line (red)
plt.scatter(X, y, color='blue')
plt.plot(X, regr.predict(X), color='red', linewidth=4)
plt.show()
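
The regression listing above is a supervised example, since each area comes with a known price. For contrast, here is a minimal sketch that runs both kinds of learning on the same numbers. KMeans, n_clusters=2 and random_state=0 are my own choices for illustration and are not part of the original post.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Same [area, price] pairs as in the regression listing
data = np.array([[150, 6450], [200, 7450], [250, 8450], [300, 9450],
                 [350, 11450], [400, 15450], [600, 18450]])
X, y = data[:, [0]], data[:, 1]

# Supervised: the target y (price) is supplied for every sample
regr = LinearRegression().fit(X, y)
predicted = regr.predict(np.array([[175]]))

# Unsupervised: no target is passed; KMeans groups the rows on its own
# (n_clusters=2 and random_state=0 are arbitrary choices for this sketch)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
labels = kmeans.labels_

print(predicted)  # one predicted price
print(labels)     # one cluster id per row
```

The only difference in the calls is whether a target is handed to fit: the regression needs y, the clustering does not.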

Data Preprocessing

Importing Libraries

from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import jieba
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler, MinMaxScaler

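Data Processing

Dictionary Feature Extraction

Code

def dictvec():
    '''
    Dictionary feature extraction: DictVectorizer
    sparse: when False, fit_transform returns a dense NumPy array instead of a SciPy sparse matrix
    fit_transform: learns the feature names from the data and returns the encoded matrix
    get_feature_names_out: returns the feature names, i.e. the column headers
        (this was get_feature_names before scikit-learn 1.0; the old name was removed in 1.2)
    inverse_transform: maps each row back to a dict of its non-zero features (1.0 means the category is present)
    :return: None
    '''
    # Named vec rather than dict so the built-in dict type is not shadowed
    vec = DictVectorizer(sparse=False)
    data = vec.fit_transform(
        [{'city': '北京', 'pos': '北方', 'temperature': 100},
         {'city': '上海', 'pos': '南方', 'temperature': 60},
         {'city': '深圳', 'pos': '南方', 'temperature': 30},
         {'city': '重慶', 'pos': '南方', 'temperature': 70},
         {'city': '北京', 'pos': '北方', 'temperature': 100}])

    print(vec.get_feature_names_out().tolist())
    print(vec.inverse_transform(data))
    print(data)
    return None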

Output

'''
['city=上海', 'city=北京', 'city=深圳', 'city=重慶', 'pos=北方', 'pos=南方', 'temperature']
[{'city=北京': 1.0, 'pos=北方': 1.0, 'temperature': 100.0}, {'city=上海': 1.0, 'pos=南方': 1.0, 'temperature': 60.0}, {'city=深圳': 1.0, 'pos=南方': 1.0, 'temperature': 30.0}, {'city=重慶': 1.0, 'pos=南方': 1.0, 'temperature': 70.0}, {'city=北京': 1.0, 'pos=北方': 1.0, 'temperature': 100.0}]
[[  0.   1.   0.   0.   1.   0. 100.]
 [  1.   0.   0.   0.   0.   1.  60.]
 [  0.   0.   1.   0.   0.   1.  30.]
 [  0.   0.   0.   1.   0.   1.  70.]
 [  0.   1.   0.   0.   1.   0. 100.]]
'''
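
To see what the sparse flag changes, here is a small sketch on a two-record subset of the data above (the subset and the variable names are mine). With the default sparse=True, DictVectorizer returns a SciPy sparse matrix that stores only the non-zero entries; converting it with toarray() gives the same values as the dense version.

```python
from sklearn.feature_extraction import DictVectorizer

records = [
    {'city': '北京', 'temperature': 100},
    {'city': '上海', 'temperature': 60},
]

# Default (sparse=True): a SciPy sparse matrix holding only non-zero entries
sparse_vec = DictVectorizer()
sparse_matrix = sparse_vec.fit_transform(records)

# sparse=False: a plain dense NumPy array
dense_vec = DictVectorizer(sparse=False)
dense_matrix = dense_vec.fit_transform(records)

print(sparse_vec.feature_names_)   # column names, in column order
print(sparse_matrix.nnz)           # number of stored (non-zero) entries
print(sparse_matrix.toarray())     # same values as dense_matrix
print(dense_matrix)
```

String values such as the city are one-hot encoded into one column per category, while numeric values such as the temperature are passed through unchanged. On real data with many categories most entries are zero, which is why the sparse form is the default.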

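Vectorizing English Text

Code

def countvec():
    '''
    Text feature extraction: CountVectorizer counts how often each word occurs in each document
    Ordering: the feature columns are sorted alphabetically, not by how common a word is in English
    Dropped words: 'a' is missing from the output because the default token pattern only keeps
        tokens of two or more characters; no stop-word list is applied unless stop_words is set
    :return: None
    '''
    cv = CountVectorizer()
    data = cv.fit_transform(['this is a test test', 'we have a test'])

    # get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2
    print(cv.get_feature_names_out().tolist())
    print(data.toarray())
    return None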

Output

'''
['have', 'is', 'test', 'this', 'we']
[[0 1 2 1 0]
 [1 0 1 0 1]]
'''
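
The missing 'a' in the output above comes from the tokenizer rather than from a stop-word list. The sketch below compares three configurations on the same two sentences. The alternative token_pattern and the use of stop_words='english' are my own illustrations, not something the original post does.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['this is a test test', 'we have a test']


def column_names(cv):
    """Return the learned terms in column order."""
    return sorted(cv.vocabulary_, key=cv.vocabulary_.get)


# Default settings: tokens need at least two word characters, so 'a' is dropped
default_cv = CountVectorizer().fit(docs)
default_vocab = column_names(default_cv)

# A pattern that also accepts one-character tokens keeps 'a'
keep_cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b").fit(docs)
keep_vocab = column_names(keep_cv)

# Only when stop_words is set does a stop-word list remove common words
stop_cv = CountVectorizer(stop_words='english').fit(docs)
stop_vocab = column_names(stop_cv)

print(default_vocab)
print(keep_vocab)
print(stop_vocab)
```

In all three cases the columns come out in alphabetical order.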

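Vectorizing Chinese Text

Code

def cutword():
    # Segment each sentence into words (jieba.cut returns a generator)
    con1 = jieba.cut('天空灰得像哭過')
    con2 = jieba.cut('離開你以後')
    con3 = jieba.cut('並沒有很自由')

    # Convert the generators to lists
    content1 = list(con1)
    content2 = list(con2)
    content3 = list(con3)

    # Join each list into a space-separated string, the form CountVectorizer can tokenize
    c1 = ' '.join(content1)
    c2 = ' '.join(content2)
    c3 = ' '.join(content3)
    return c1, c2, c3

def hanzivec():
    '''
    Text feature extraction: CountVectorizer counts how often each word occurs in each document
    Chinese has no spaces between words, so the text is segmented with jieba first
    Single-character words are dropped by the default token pattern
    :return: None
    '''
    c1, c2, c3 = cutword()
    cv = CountVectorizer()
    print(c1, c2, c3)
    data = cv.fit_transform([c1, c2, c3])

    # get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2
    print(cv.get_feature_names_out().tolist())
    print(data.toarray())
    return None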

Output

'''
天空 灰得 像 哭 過 離開 你 以後 並 沒有 很 自由
['以後', '天空', '沒有', '灰得', '離開', '自由']
[[0 1 0 1 0 0]
 [1 0 0 0 1 0]
 [0 0 1 0 0 1]]
'''
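
Notice that the segmented text has 12 words but only 6 features survive: the one-character words (像, 哭, 過, 你, 並, 很) are discarded by the default token pattern. If those words matter, one option is a pattern that accepts single characters. This is a sketch under that assumption; the strings are the already-segmented sentences from the output above, so it runs without jieba.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Already segmented, copied from the printed output above
segmented = ['天空 灰得 像 哭 過', '離開 你 以後', '並 沒有 很 自由']

# Accept tokens of one or more word characters (the default requires two or more)
cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
counts = cv.fit_transform(segmented)
vocab = sorted(cv.vocabulary_, key=cv.vocabulary_.get)

print(vocab)
print(counts.toarray())
```

Whether to keep one-character words depends on the task: many of them are function words, but some carry real meaning.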

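Term Frequency (TF-IDF)

Code

def tfidfvec():
    '''
    TF-IDF weighting of the segmented Chinese text
    TF (term frequency): how often a term occurs in a document relative to the document's
        total number of terms (occurrences / total terms)
    IDF (inverse document frequency): log(total number of documents / number of documents containing the term)
    The larger TF * IDF is, the more distinctive the term is for that document
    Note: TfidfVectorizer uses the raw count as TF and a smoothed IDF, ln((1 + n) / (1 + df)) + 1,
        then L2-normalizes each row, so its values differ from the textbook formulas above
    :return: None
    '''
    c1, c2, c3 = cutword()
    print(c1, c2, c3)
    tf = TfidfVectorizer()
    data = tf.fit_transform([c1, c2, c3])
    # get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2
    print(tf.get_feature_names_out().tolist())
    print(data.toarray())
    return None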

Output

'''
天空 灰得 像 哭 過 離開 你 以後 並 沒有 很 自由
['以後', '天空', '沒有', '灰得', '離開', '自由']
[[0.         0.70710678 0.         0.70710678 0.         0.        ]
 [0.70710678 0.         0.         0.         0.70710678 0.        ]
 [0.         0.         0.70710678 0.         0.         0.70710678]]
'''
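
Where does 0.70710678 come from? With its default settings, TfidfVectorizer multiplies the raw count by a smoothed IDF, ln((1 + n) / (1 + df)) + 1, and then scales each row to unit length. Each sentence here keeps two words that occur once and in one document only, so both get the same weight and the normalized row is (1/√2, 1/√2). The sketch below redoes that arithmetic by hand. The variable names are mine, and the input is the segmented text from the output above so jieba is not needed.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ['天空 灰得 像 哭 過', '離開 你 以後', '並 沒有 很 自由']

# Raw term counts per document
counts = CountVectorizer().fit_transform(docs).toarray()

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)               # documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1   # smoothed IDF (TfidfVectorizer default)

weighted = counts * idf
# L2-normalize each row so every document vector has length 1
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(docs).toarray()
print(manual)
print(np.allclose(manual, reference))
```

Because of the normalization, the absolute values say little on their own; what matters is how the weights compare within a row.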

Standardization

Code

def stand():
    '''
    Standardization: rescales each feature column to mean 0 and standard deviation 1
    Useful when features differ greatly in magnitude but vary in similar proportion,
    e.g. one axis runs 1000, 2000, 3000 while the other runs 1, 2, 3
    :return: None
    '''
    scaler = StandardScaler()
    # data = scaler.fit_transform([[1., -1., 3.], [2., 4., 2.], [4., 6., -1.]])
    data = scaler.fit_transform([[1., 2., 3.], [100., 200., 300.], [1000., 2000., 3000.]])
    print(data)
    return None

Output

'''
[[-0.81438366 -0.81438366 -0.81438366]
 [-0.59409956 -0.59409956 -0.59409956]
 [ 1.40848322  1.40848322  1.40848322]]
'''
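
All three columns in the output above are identical because each column is the first one multiplied by a constant, and standardization removes both offset and scale. The sketch below reproduces the result with plain NumPy as (x - mean) / std per column, using the population standard deviation (ddof=0), which is what StandardScaler uses. The variable names are mine.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1., 2., 3.], [100., 200., 300.], [1000., 2000., 3000.]])

# Column-wise z-score; np.std defaults to the population standard deviation (ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

scaled = StandardScaler().fit_transform(X)

print(manual)
print(np.allclose(manual, scaled))
```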

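Normalization (Min-Max Scaling)

Code

def mm():
    '''
    Min-max normalization: linearly rescales each feature column so that its minimum and maximum
    map to the ends of feature_range (default (0, 1); (2, 3) is used here)
    :return: None
    '''
    # Named scaler so the local variable does not shadow the function name mm
    scaler = MinMaxScaler(feature_range=(2, 3))
    data = scaler.fit_transform([[90, 2, 10, 40], [60, 4, 15, 45], [75, 3, 13, 46]])
    print(data)
    return None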

Output

'''
[[3.         2.         2.         2.        ]
 [2.         3.         3.         2.83333333]
 [2.5        2.5        2.6        3.        ]]
'''
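
The numbers above follow from a two-step formula: first map each column to [0, 1] with (x - min) / (max - min), then stretch and shift into the requested range. For example, 45 in the last column becomes (45 - 40) / (46 - 40) = 0.8333, and then 0.8333 * (3 - 2) + 2 = 2.8333. Here is a sketch of the same computation in NumPy, with variable names of my own choosing:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[90, 2, 10, 40], [60, 4, 15, 45], [75, 3, 13, 46]], dtype=float)
low, high = 2, 3  # the feature_range used above

# Step 1: scale each column to [0, 1]
unit = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
# Step 2: stretch and shift into [low, high]
manual = unit * (high - low) + low

scaled = MinMaxScaler(feature_range=(low, high)).fit_transform(X)

print(manual)
print(np.allclose(manual, scaled))
```

Since the formula depends only on each column's minimum and maximum, a single outlier can squeeze all other values into a narrow band.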

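Feature Selection

Code

def var():
    '''
    Feature selection: remove low-variance features
    threshold: feature columns whose variance is below this value are dropped
    Note: a low-variance feature barely changes from sample to sample, so it carries little information
    :return: None
    '''
    # Named selector so the local variable does not shadow the function name var
    selector = VarianceThreshold(threshold=1.0)
    data = selector.fit_transform([[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]])

    print(data)
    return None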

Output

'''
[[0]
 [4]
 [1]]
'''
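
To see why only one column survives, it helps to print the per-column variances. Columns 0 and 3 are constant (variance 0), column 1 has variance of about 0.22, and only column 2 (values 0, 4, 1, variance of about 2.89) clears the threshold of 1.0. A short sketch that checks this, with variable names of my own:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]])

# Population variance of each column
variances = X.var(axis=0)

selector = VarianceThreshold(threshold=1.0)
selected = selector.fit_transform(X)
kept = selector.get_support(indices=True)  # indices of the columns that were kept

print(variances)
print(kept)
print(selected)
```

Variance depends on the scale of a feature, so a threshold like 1.0 only makes sense when the columns are on comparable scales.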
