Feature Selection Study Notes

阿新 • Published: 2021-01-02

Tags: machine learning, python

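Learning objectives

Understand the basic principles and methods of feature selection, and implement feature selection in practice.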

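1. Principles of feature selection

For a given learning task, some features are essential while others may contribute nothing to the final classification. The former are called relevant features and the latter irrelevant features. Feature selection is the process of keeping the relevant features and removing the irrelevant ones.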

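Why do feature selection?

1. Dimensionality reduction. The goal is the same as with PCA, although feature selection keeps a subset of the original features rather than constructing new ones.
2. An easier learning task. With fewer distracting features, the task becomes less difficult.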

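Irrelevant features are not the same as redundant features. An irrelevant feature has nothing to do with the learning task, whereas a redundant feature is one that can be derived from the other features.

Removing redundant features usually makes the task considerably easier. However, when a redundant feature acts as a useful intermediate variable, keeping it is beneficial. For example, when estimating the volume of a box, a "base area" feature is redundant given the length and width, yet it makes the volume easier to learn.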

Common feature selection methods include:

1. Filter selection 2. Embedded selection 3. Wrapper selection

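1.1 Filter selection

Filter selection performs feature selection first and trains the learner afterwards. The two steps happen in sequence, and the selection step does not depend on the learner that follows.

A well-known filter method is Relief, which defines a "relevance statistic" to measure the importance of each feature.

In the end, either specify a threshold and keep the features whose relevance statistic exceeds it, or directly keep the k features with the largest relevance statistics.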
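The two selection rules above can be sketched in a few lines. This is only an illustration: the function names `select_by_threshold` and `select_top_k` and the scores in the dictionary are hypothetical, not values from these notes.

```python
def select_by_threshold(scores, threshold):
    """Return the names of features whose relevance statistic exceeds threshold."""
    return [name for name, s in scores.items() if s > threshold]


def select_top_k(scores, k):
    """Return the k feature names with the largest relevance statistics."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


# Hypothetical relevance statistics, for illustration only
scores = {"f1": 0.8, "f2": -0.1, "f3": 2.5, "f4": 1.9}

print(select_by_threshold(scores, 1.0))  # ['f3', 'f4']
print(select_top_k(scores, 2))           # ['f3', 'f4']
```

A threshold is convenient when the scale of the statistic is understood, while top-k is easier when only a fixed feature budget is known.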

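Computing the relevance statistic

For each sample $x_i$:

1. Among the samples of the same class, find the nearest one, $x_{nh}$, and call it the "near-hit".
2. Among the samples of a different class, find the nearest one, $x_{nm}$, and call it the "near-miss".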

According to the formula

$$-\mathrm{diff}(x_i^j, x_{nh}^j)^2 + \mathrm{diff}(x_i^j, x_{nm}^j)^2$$

we obtain the relevance statistic of sample $x_i$ on feature $j$. The value is large when, on feature $j$, the sample is closer to its near-hit than to its near-miss.

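If feature $j$ is discrete:

when $x_i^j = x_{nh}^j$, $\mathrm{diff}=0$
when $x_i^j \neq x_{nh}^j$, $\mathrm{diff}=1$

If feature $j$ is continuous:

$$\mathrm{diff}=|x_i^j-x_{nh}^j|$$

The same definitions apply with $x_{nm}^j$ in place of $x_{nh}^j$. For continuous features the values are normally scaled to $[0,1]$ first, so that features with different units are comparable.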
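The definition of diff and the binary-class statistic can be written directly as two small functions. This is a minimal sketch; the names `diff` and `relevance_term` and the sample values are hypothetical, and continuous values are assumed to be already scaled to [0, 1].

```python
def diff(a, b, discrete=False):
    """Relief difference between two values of one feature.

    Discrete feature: 0 if the values are equal, 1 otherwise.
    Continuous feature: absolute difference (values assumed scaled to [0, 1]).
    """
    if discrete:
        return 0 if a == b else 1
    return abs(a - b)


def relevance_term(x, near_hit, near_miss, discrete=False):
    """Contribution of one sample to one feature's statistic (binary case)."""
    return -diff(x, near_hit, discrete) ** 2 + diff(x, near_miss, discrete) ** 2


# Discrete feature: same value as the near-hit, different from the near-miss
print(relevance_term("red", "red", "blue", discrete=True))  # 1

# Continuous feature: close to the near-hit, far from the near-miss, so positive
print(relevance_term(0.2, 0.25, 0.9) > 0)  # True
```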

For a multi-class problem, the relevance statistic becomes

$$-\mathrm{diff}(x_i^j, x_{nh}^j)^2 + \sum_{k \neq p} \frac{L_k}{L}\,\mathrm{diff}(x_i^j, x_{nm}^j)^2$$

Here $x_{nm}^j$ is the near-miss found in class $k$, where $k$ ranges over every class other than $p$, the label of sample $x_i$, and $L_k/L$ is the proportion of samples that belong to class $k$.
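A small worked example makes the weighting concrete. The sketch below evaluates the multi-class formula for a single continuous feature; the function name `multiclass_term` and all numbers are hypothetical, chosen so the arithmetic is easy to check by hand.

```python
def multiclass_term(x, near_hit, near_misses):
    """Relevance statistic of one sample on one continuous feature.

    near_misses is a list of (class_proportion, near_miss_value) pairs,
    one pair for every class other than the sample's own class.
    """
    hit_part = -abs(x - near_hit) ** 2
    miss_part = sum(w * abs(x - m) ** 2 for w, m in near_misses)
    return hit_part + miss_part


# Sample value 1.0, near-hit 1.5, and two other classes:
# one holding 25% of the samples (near-miss 3.0), one holding 50% (near-miss 2.0)
value = multiclass_term(1.0, 1.5, [(0.25, 3.0), (0.5, 2.0)])

# -(0.5)^2 + 0.25*(2.0)^2 + 0.5*(1.0)^2 = -0.25 + 1.0 + 0.5
print(value)  # 1.25
```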

Hands-on

1. Find the near-hit (same class) and the near-misses (other classes)

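import numpy as np

#Find the sample in grap that is nearest to x (used for both the near-hit and the near-miss)
def caculate_distance(x,grap):
    distance={}
    x=np.array(x)
    for i in range(0,len(grap)):
        xj=np.array(list(grap.iloc[i]))
        if any(xj!=x):#skip x itself
            #Euclidean distance over the feature columns; the last column is the class label
            dis=np.sqrt(np.sum(np.square(xj[:-1]-x[:-1])))
            distance[dis]=xj
    distance=sorted(distance.items(),key=lambda x:x[0])
    return distance[0][1]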

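#Find the near-hit of x and one near-miss for every other class
#data is the full DataFrame; its last column '花瓣種類' holds the class label
def caculate_hit_and_miss(x):
    miss_all_categoral=[]
    hit=np.array([])
    for col,grap in data.groupby(by='花瓣種類'):
        if col==np.array(x)[-1]:    #same class: the nearest sample is the near-hit
            hit=caculate_distance(x,grap)
        else:
            mis=caculate_distance(x,grap)#other class: the nearest sample is the near-miss
            miss_all_categoral.append((len(grap)/len(data),mis))#weight by the class proportion
    return hit,miss_all_categoral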

2. Compute the relevance statistic

def caculate_statistical(x,hit,miss):
    #miss has the form [(0.3333333333333333, array([5.1, 2.5, 3. , 1.1, 2. ]))]
    x=np.array(x)
    statisitical=[]
    for i in range(0,len(hit)-1):#for the i-th feature (the last element is the label)
        a=-np.square(abs(x[i]-hit[i]))#near-hit term
        b=0
        for j in range(0,len(miss)):   #near-miss terms, weighted by class proportion
            b+=np.square(abs(miss[j][1][i]-x[i]))*miss[j][0]
        statisitical.append(a+b)
    return statisitical

3. Sum each feature's relevance statistic over all samples

def filter_chioce():
    all_statistical=[]
    for i in range(0,len(data)):
        hit,miss=caculate_hit_and_miss(data.iloc[i])#near-hit and near-misses of sample i
        all_statistical.append(caculate_statistical(data.iloc[i],hit,miss))
    result=np.array(all_statistical).sum(axis=0)
    return result

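4. Get the result

import pandas as pd

result=filter_chioce()
#one row holding the four summed statistics; the columns are sepal length, sepal width, petal length, petal width
result=pd.DataFrame([result],columns=['花萼長度', '花萼寬度', '花瓣長度', '花瓣寬度'])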

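The result shows that petal length and petal width have the larger relevance statistics, so these two features contribute more to the classification.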
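The same procedure can be checked without the iris file on a toy dataset where the answer is known in advance. The sketch below is a compact pure-Python version of the steps above, not the code used to produce the result; `relief_scores` and the six samples are hypothetical. Feature 0 separates the two classes and feature 1 is noise, so feature 0 should receive the larger statistic.

```python
def relief_scores(X, y):
    """Sum each feature's relevance statistic over all samples.

    Near-miss terms are weighted by class proportion, as in the
    multi-class formula. Features are assumed continuous.
    """
    n_features = len(X[0])
    scores = [0.0] * n_features
    classes = sorted(set(y))

    def dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5

    for i, xi in enumerate(X):
        # nearest neighbour of xi inside each class, excluding xi itself
        nearest = {}
        for c in classes:
            candidates = [X[k] for k in range(len(X)) if y[k] == c and k != i]
            nearest[c] = min(candidates, key=lambda xk: dist(xi, xk))
        for j in range(n_features):
            scores[j] -= (xi[j] - nearest[y[i]][j]) ** 2  # near-hit term
            for c in classes:
                if c != y[i]:
                    weight = y.count(c) / len(y)
                    scores[j] += weight * (xi[j] - nearest[c][j]) ** 2
    return scores


# Toy data: feature 0 separates the classes, feature 1 is noise
X = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.5], [0.8, 0.4], [0.9, 0.8], [1.0, 0.2]]
y = [0, 0, 0, 1, 1, 1]

scores = relief_scores(X, y)
print(scores)  # the first score is larger than the second
```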

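1.2 Embedded selection

Embedded selection merges feature selection with the training of the learner, so that both happen in one optimization process. The representative example used in these notes is the error backpropagation of a neural network, where the weights attached to each input are learned together with the rest of the model.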

Hands-on

import tensorflow as tf

#net, optimizer, train_db, test_db and iter (the number of epochs) are assumed to be defined earlier
for epoch in range(iter):
    print(".......")
    for step,(x,y) in enumerate(train_db):
        #x:[b,28,28]
        x=tf.reshape(x,[-1,784])
        with tf.GradientTape() as tape:
            out=net(x)
            #out:[b,10]
            #y:[b]
            y=tf.one_hot(y,depth=10)
            loss_mse=tf.reduce_mean(tf.losses.MSE(y,out))#kept for reference; not used in the update
            loss_ce=tf.reduce_mean(tf.losses.categorical_crossentropy(y,out,from_logits=True))
        grad=tape.gradient(loss_ce,net.trainable_variables)#gradients of the cross-entropy loss
        optimizer.apply_gradients(zip(grad,net.trainable_variables))#let the optimizer update the weights
        if step % 20==0:
            print("epoch={0},loss={1}".format(epoch,float(loss_ce)))

    test_accurancy=0
    print("testing.......")
    for step,(x_test,y_test) in enumerate(test_db):
        x_test= tf.reshape(x_test, [-1, 784])
        y_test=tf.cast(y_test,tf.int64)
        test_out=net(x_test)
        #test_out:[b,10]
        #y_test:[b]
        test_out=tf.nn.softmax(test_out,axis=1)
        test_out=tf.argmax(test_out,axis=1)
        # test_out:[b]
        acc=tf.reduce_mean(tf.cast(tf.equal(test_out,y_test),dtype=tf.float32))
        test_accurancy+=float(acc)
    test_accurancy=test_accurancy/(step+1)#step is zero-based, so there are step+1 batches
    print("epoch={0},test_accurancy={1}".format(epoch, test_accurancy))
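The training loop above does not show any feature actually being dropped. An embedded method where this is directly visible is L1 regularization, which drives the weights of weak features to exactly zero during training. L1 regularization is not covered in these notes; the sketch below is only an illustration under that assumption, and `soft_threshold`, `lasso_ista`, the data, and the step size are all hypothetical choices for this toy problem.

```python
def soft_threshold(v, t):
    """Shrink v toward zero by t; values within [-t, t] become exactly 0."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def lasso_ista(X, y, lam, lr=1.0, n_iter=200):
    """Minimise (1/2n)*||Xw - y||^2 + lam*||w||_1 by proximal gradient descent.

    lr=1.0 is a step size that suits this toy data, not a general default.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_iter):
        residual = [sum(wj * xj for wj, xj in zip(w, row)) - t
                    for row, t in zip(X, y)]
        grad = [sum(row[j] * r for row, r in zip(X, residual)) / n
                for j in range(d)]
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad)]
    return w


# Toy data: the target depends strongly on feature 0 and only weakly on feature 1
X = [[-1.0, 0.5], [-0.5, -1.0], [0.0, 0.0], [0.5, 1.0], [1.0, -0.5]]
y = [2 * a + 0.1 * b for a, b in X]

w = lasso_ista(X, y, lam=0.1)
print(w)  # the second weight is exactly 0.0, so feature 1 is dropped
```

Selection and training happen in the same loop here: no separate scoring step is needed, which is the defining property of an embedded method.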

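1.3 Wrapper selection

Wrapper selection chooses features according to the performance of the trained learner, so it can pick a feature subset that is "tailor-made" for that learner.

Wrapper selection generally gives better results than filter selection, but it has to train the learner many times while searching for features, so its computational cost is much higher.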
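The idea of scoring feature subsets by the learner's own performance can be sketched as a greedy forward search. This is a minimal illustration rather than the method used in these notes: the learner is a 1-nearest-neighbour classifier evaluated by leave-one-out accuracy, and `loo_accuracy`, `forward_select`, and the toy data are hypothetical.

```python
def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier
    that only looks at the given feature columns."""
    correct = 0
    for i in range(len(X)):
        best_k, best_d = None, float("inf")
        for k in range(len(X)):
            if k == i:
                continue
            d = sum((X[i][f] - X[k][f]) ** 2 for f in features)
            if d < best_d:
                best_k, best_d = k, d
        correct += y[best_k] == y[i]
    return correct / len(X)


def forward_select(X, y):
    """Greedily add the feature that most improves accuracy; stop when none helps."""
    remaining = list(range(len(X[0])))
    chosen, best_acc = [], 0.0
    while remaining:
        # every candidate subset requires evaluating the learner again
        acc, f = max((loo_accuracy(X, y, chosen + [f]), f) for f in remaining)
        if acc <= best_acc:
            break
        chosen.append(f)
        remaining.remove(f)
        best_acc = acc
    return chosen, best_acc


# Toy data: feature 0 separates the classes, feature 1 is noise
X = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.5], [0.8, 0.4], [0.9, 0.8], [1.0, 0.2]]
y = [0, 0, 0, 1, 1, 1]

print(forward_select(X, y))  # ([0], 1.0)
```

The inner call to `loo_accuracy` inside the search loop is where the extra cost of wrapper selection comes from.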

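Hands-on

import numpy as np

#weakclassifier is not shown in these notes; it is assumed to return an (m,1) column of predicted labels
data1=np.genfromtxt('iris.data.txt',delimiter=',')
data=data1[:,:4]
labels=data1[:,-1]
m,n=data.shape
numstep=10  #number of threshold steps per feature
best_state={}
best_predict_labels=np.ones((m,1))
min_error=np.inf
data=np.asmatrix(data)#feature matrix (np.mat is no longer available in NumPy 2.0)
labels=np.asmatrix(labels)
D=np.asmatrix(np.ones((m,1))/m)#assumption: uniform sample weights, since D is not defined in these notes

for i in range(0,n):#loop over all features
    max_num=data[:,i].max()
    min_num=data[:,i].min()
    stepsize=(max_num-min_num)/numstep  #distance between candidate thresholds for this column
    for j in range(-1,int(numstep+1)):
        thresh=(min_num+j*stepsize)#candidate threshold; j=-1 places one threshold below the minimum
        for flag in ['less','great']:
            predict_labels=weakclassifier(data,i,flag,thresh)#predictions for this feature, direction and threshold
            error=np.asmatrix(np.ones((m,1)))#error indicator: 1 means misclassified
            error[labels.T==predict_labels]=0#set to 0 where the prediction is correct
            weighted_error=(D.T*error)[0,0]  #weighted error rate
            if weighted_error<min_error:#keep the setting with the lowest error so far
                min_error=weighted_error
                best_state['dim']=i
                best_state['thresh']=thresh
                best_state['flag']=flag
                best_predict_labels=predict_labels.copy()
print('best character:',best_state['dim'])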
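The search above calls `weakclassifier`, which these notes do not define. One plausible definition is a decision stump that labels one side of the threshold -1 and the other side +1. The version below is a hypothetical sketch that matches the call signature used above. It assumes the class labels have been recoded to -1 and +1, which the raw iris labels are not.

```python
import numpy as np


def weakclassifier(data, dim, flag, thresh):
    """Decision stump on column dim: predict -1 on one side of thresh, +1 on the other.

    flag='less'  -> samples with value <= thresh are labelled -1
    flag='great' -> samples with value >  thresh are labelled -1
    Returns an (m, 1) array of predictions.
    """
    column = np.asarray(data)[:, dim]
    predictions = np.ones((column.shape[0], 1))
    if flag == 'less':
        predictions[column <= thresh] = -1.0
    else:
        predictions[column > thresh] = -1.0
    return predictions


# Three samples, thresholding the first column at 2.0
demo = np.array([[1.0, 5.0], [2.0, 6.0], [3.0, 7.0]])
print(weakclassifier(demo, 0, 'less', 2.0).ravel())   # [-1. -1.  1.]
print(weakclassifier(demo, 0, 'great', 2.0).ravel())  # [ 1.  1. -1.]
```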