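CCF Enterprise Illegal Fundraising Risk Prediction Competition: Takeaways on String Feature Processing
阿新 • Published: 2020-12-08
I'm calling this post "takeaways" rather than a "summary" mainly because I dropped out of the competition for a long stretch, and by the time I came back, 水哥's baseline was everywhere /(ㄒoㄒ)/~~. Now that the competition has wrapped up, I'd like to share one feature I built that worked noticeably well. It comes mainly from processing the string field opscope.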
- Before processing
- During processing: clean the opscope feature, turning each string into a list
import re

# Normalize full-width parentheses to ASCII, then drop parenthesized remarks
train_base_df['opscope'] = train_base_df['opscope'].apply(lambda x: x.replace("(", "(").replace(")", ")"))
train_base_df['opscope'] = train_base_df['opscope'].apply(lambda x: re.sub(r"\(.*?\)", "", x))
# Split on ASCII punctuation, full-width punctuation and spaces
# (the full-width forms are presumably what the duplicated entries originally were)
pattern = r'[,./;\'`\[\]<>?:"{}~!@#$%^&()\-=_+,。、;‘’【】·! …()]'
train_base_df['opscope'] = train_base_df['opscope'].apply(lambda x: re.split(pattern, x))
# Drop empty strings and '***' tokens
train_base_df['opscope'] = train_base_df['opscope'].apply(lambda x: [i for i in x if i != '' and i != '***'])
- After processing
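To make the effect of the cleaning step concrete, here is a minimal sketch that wraps it in a helper and runs it on two made-up strings. The helper name `split_opscope`, the constant `SEPARATORS`, and the sample rows are assumptions for illustration; they are not taken from the competition data.

```python
import re

import pandas as pd

# Separators as a character class: ASCII punctuation, full-width punctuation, space.
SEPARATORS = r'[,./;\'`\[\]<>?:"{}~!@#$%^&()\-=_+,。、;‘’【】·! …()]'


def split_opscope(text):
    """Turn a raw opscope string into a list of business items."""
    # Normalize full-width parentheses, then drop parenthesized remarks.
    text = text.replace("(", "(").replace(")", ")")
    text = re.sub(r"\(.*?\)", "", text)
    # Split on punctuation/spaces and discard empty or '***' tokens.
    return [tok for tok in re.split(SEPARATORS, text) if tok not in ("", "***")]


# Hypothetical sample rows, invented for this example.
demo_df = pd.DataFrame({
    "opscope": [
        "投資諮詢、投資管理(除金融、證券);實業投資。",
        "服裝銷售,***,企業管理諮詢(依法須經批准的項目)",
    ]
})
demo_df["opscope"] = demo_df["opscope"].apply(split_opscope)
print(demo_df["opscope"].tolist())
# [['投資諮詢', '投資管理', '實業投資'], ['服裝銷售', '企業管理諮詢']]
```

Each cell now holds a plain Python list, which is what the later features rely on.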
- Building features afterwards
import collections

# Find the most frequent opscope items among companies with label == 1: the "risky" businesses
label1_sample_opscope = train_base_df.loc[train_base_df['label'] == 1, ['opscope']]
opscope_risk_list = []
for _, i in label1_sample_opscope.iterrows():
    opscope_risk_list.extend(i['opscope'])
count = collections.Counter(opscope_risk_list)
risk_num = 8
# Keep only the item names of the risk_num most common entries
opscope_mostrisk_list = [item for item, _ in count.most_common(risk_num)]
opscope_mostrisk_list
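Result: ['投資諮詢', '投資管理', '實業投資', '股權投資', '資產管理', '創業投資', '企業投資', '企業管理諮詢']
(In English: investment consulting, investment management, industrial investment, equity investment, asset management, venture capital investment, enterprise investment, business management consulting.)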
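The counting step can also be written without an explicit row loop. The following is a minimal sketch on toy data, assuming the opscope column already holds lists; the function name `top_risky_items` and the toy rows are hypothetical.

```python
import collections
import itertools

import pandas as pd

# Hypothetical toy data; 'opscope' is already a list of business items.
toy_df = pd.DataFrame({
    "label": [1, 1, 0, 1],
    "opscope": [
        ["投資諮詢", "投資管理"],
        ["投資諮詢", "資產管理"],
        ["服裝銷售", "投資諮詢"],
        ["投資管理", "投資諮詢"],
    ],
})


def top_risky_items(df, k):
    """Return the k most frequent opscope items among rows with label == 1."""
    risky_lists = df.loc[df["label"] == 1, "opscope"]
    # Flatten the per-company lists into one stream of items and count them.
    counts = collections.Counter(itertools.chain.from_iterable(risky_lists))
    return [item for item, _ in counts.most_common(k)]


top2 = top_risky_items(toy_df, 2)
print(top2)  # ['投資諮詢', '投資管理']
```

`most_common(k)` simply returns fewer entries when there are fewer than `k` distinct items, so the list comprehension does not fail on small samples.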
# Feature: number of risky business items each company has
def opscope_mostrisk_count(x, opscope_mostrisk_list):
    count = 0
    for i in x:
        if i in opscope_mostrisk_list:
            count += 1
    return count

train_base_df['base_opscope_mostrisk_num'] = train_base_df['opscope'].apply(lambda x: opscope_mostrisk_count(x, opscope_mostrisk_list))

# Feature: whether the company has a given risky business item
for f in opscope_mostrisk_list:
    train_base_df['base_opscope_' + f] = train_base_df['opscope'].apply(lambda x: f in x)

# Feature: share of risky items (number of risky items / total number of business items)
train_base_df['opscope_count'] = train_base_df['opscope'].apply(len)
train_base_df['opscope_rate'] = train_base_df['base_opscope_mostrisk_num'] / train_base_df['opscope_count']
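Since the same three features usually have to be built for the test set as well, using the risky-item list derived from the training labels, it can help to wrap them in one function. This is a minimal sketch under that assumption; `add_opscope_features` and the toy rows are hypothetical, and the column names follow the ones used above.

```python
import numpy as np
import pandas as pd


def add_opscope_features(df, risk_items):
    """Add count, per-item flag and ratio features; df['opscope'] holds lists."""
    out = df.copy()
    risk_set = set(risk_items)
    # Number of risky items per company (duplicates are counted, as in the loop version).
    out["base_opscope_mostrisk_num"] = out["opscope"].apply(
        lambda items: sum(item in risk_set for item in items)
    )
    for name in risk_items:
        # Bind name as a default argument so each lambda keeps its own item.
        out["base_opscope_" + name] = out["opscope"].apply(
            lambda items, n=name: n in items
        )
    out["opscope_count"] = out["opscope"].apply(len)
    # Guard the ratio for companies with no items: a zero denominator becomes NaN.
    denom = out["opscope_count"].replace(0, np.nan)
    out["opscope_rate"] = out["base_opscope_mostrisk_num"] / denom
    return out


# Hypothetical toy rows, including one company with an empty list.
toy_df = pd.DataFrame({
    "opscope": [["投資諮詢", "投資管理", "服裝銷售", "餐飲服務"], ["服裝銷售"], []]
})
feats = add_opscope_features(toy_df, ["投資諮詢", "投資管理"])
print(feats[["base_opscope_mostrisk_num", "opscope_count", "opscope_rate"]])
```

Calling the function once on the training frame and once on the test frame with the same `risk_items` keeps the two feature sets consistent.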
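This is only meant to get the ball rolling, so feel free to take the idea further! Also, if you're interested in data competitions and still looking for teammates, leave a comment or send me a message /(ㄒoㄒ)/~~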