Feature Selection with IV (Information Value) and a Python Implementation

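IV characterizes a feature's predictive power: below 0.02, almost none; 0.02 to 0.1, weak; 0.1 to 0.3, medium; 0.3 to 0.5, strong; above 0.5, too good to be true and in need of further verification.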
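
The thresholds above can be captured in a small helper. This is a minimal sketch: the function name `iv_strength` and the label strings are hypothetical, and assigning an exact boundary value (for example, exactly 0.5) to the higher band is an assumption, since the text does not say which side it belongs to.

```python
def iv_strength(iv):
    """Map an information value to the rule-of-thumb label listed above.

    Assumption: a value sitting exactly on a boundary falls into the higher band.
    """
    if iv < 0.02:
        return 'almost none'
    if iv < 0.1:
        return 'weak'
    if iv < 0.3:
        return 'medium'
    if iv < 0.5:
        return 'strong'
    return 'suspicious, verify further'


print(iv_strength(0.25))  # medium
```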

WOE describes the relationship between a predictive variable and a binary target variable.
IV measures the strength of that relationship.

Formula: omitted for now……
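
Reading the arithmetic off the code further down, the calculation appears to amount to the following. This is a sketch, not the author's own statement of the formula. The symbols are introduced here for illustration: $G_i$ and $B_i$ are the counts of good (target = 1) and bad (target = 0) samples that take the $i$-th distinct value of the feature, and $G$ and $B$ are the totals over all values.

```latex
\mathrm{WOE}_i = \ln\!\left(\frac{G_i / G}{B_i / B}\right),
\qquad
\mathrm{IV} = \sum_i \left(\frac{G_i}{G} - \frac{B_i}{B}\right)\mathrm{WOE}_i,
\qquad
G = \sum_i G_i,\quad B = \sum_i B_i .
```

When $G_i = 0$ or $B_i = 0$ the logarithm is infinite; the code handles this by setting that value's WOE to 0, so it contributes nothing to the IV.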

The code implementation is as follows:

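import numpy as np
import pandas as pd

# Dictionary recording the information value (IV) of each feature
iv_dict = dict()


def cal_iv(df, feature, target='target'):
    '''
    Compute the information value for a binary target; return the IV and the per-value details.
    :df pd.DataFrame
    :feature name of the feature to evaluate
    :target name of the target column (values 0/1)
    '''
    ls = []
    for val in df[feature].unique():
        al = df[df[feature] == val][feature].count()  # all rows with this value
        good = df[(df[feature] == val) & (df[target] == 1)][feature].count()
        bad = df[(df[feature] == val) & (df[target] == 0)][feature].count()
        ls.append([val, al, good, bad])
    data = pd.DataFrame(ls, columns=[feature, 'all', 'good', 'bad'])
    good_rate = data['good'] / data['good'].sum()  # share of all good samples with this value
    bad_rate = data['bad'] / data['bad'].sum()  # share of all bad samples with this value
    data['woe'] = np.log(good_rate / bad_rate)  # WOE: weight of evidence
    # A value with no good or no bad samples yields an infinite WOE; set it to 0
    data = data.replace({'woe': {np.inf: 0, -np.inf: 0}})
    data['iv'] = data['woe'] * (good_rate - bad_rate)
    iv = data.iv.sum()
    # Add to the dictionary (an existing entry is not overwritten)
    if feature not in iv_dict.keys():
        iv_dict[feature] = iv
    print('IV for %s is %f' % (feature, iv))
    return iv, data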
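
The loop above filters the DataFrame once per distinct value, which can become slow for features with many values. As a possible alternative, the same WOE and IV arithmetic can be sketched on top of a contingency table. The function name `iv_crosstab` and the toy data are hypothetical and only illustrate the idea. The sketch assumes the target column holds the values 0 and 1 and that both classes are present.

```python
import numpy as np
import pandas as pd


def iv_crosstab(df, feature, target='target'):
    """Hypothetical vectorized variant: same WOE/IV arithmetic from a contingency table."""
    # Rows: distinct feature values; columns: target classes 0 (bad) and 1 (good)
    counts = pd.crosstab(df[feature], df[target])
    counts = counts.reindex(columns=[0, 1], fill_value=0)
    good_rate = counts[1] / counts[1].sum()
    bad_rate = counts[0] / counts[0].sum()
    with np.errstate(divide='ignore'):  # log(0) is handled on the next line
        woe = np.log(good_rate / bad_rate)
    woe = woe.replace([np.inf, -np.inf], 0)  # same convention as above: infinite WOE becomes 0
    iv = ((good_rate - bad_rate) * woe).sum()
    return iv, woe


# Toy data: value 'a' has 3 good / 1 bad, value 'b' has 1 good / 3 bad
toy = pd.DataFrame({
    'color': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'target': [1, 1, 1, 0, 1, 0, 0, 0],
})
iv, woe = iv_crosstab(toy, 'color')
print(iv)  # ln(3), about 1.0986
```

For the toy data the two WOE values are ln(3) and -ln(3), each weighted by a rate difference of 0.5 and -0.5 respectively, so the IV is ln(3). With only eight rows this says nothing about a real feature. It only makes the arithmetic easy to check by hand.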