
Training a fastText model on THUCNews


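# -*- coding: utf-8 -*-
import random

import fasttext  # fasttext.supervised belongs to the older fasttext Python API
import jieba
from sklearn import metrics


def read_file(filename, out_path):
    """Segment each "label<TAB>text" line and write it in fastText's label format."""
    sentences = []
    with open(filename, encoding="utf-8") as ft:
        for line in ft:
            label, content = line.strip().split("\t", 1)
            # Keep only tokens longer than one character.
            segs = filter(lambda x: len(x) > 1, jieba.cut(content))
            sentences.append("__label__" + str(label) + "\t" + " ".join(segs))
    random.shuffle(sentences)
    # Open with "w" so that reruns do not append duplicate lines.
    with open(out_path, "w", encoding="utf-8") as out:
        for sentence in sentences:
            out.write(sentence + "\n")


# Write the training split to fast_train.txt, the file fasttext.supervised reads.
read_file("data/cnews/cnews.train.txt", "data/cnews/fast_train.txt")
classifier = fasttext.supervised("data/cnews/fast_train.txt", "new_fasttext.model")
classifier = fasttext.load_model("new_fasttext.model.bin")

# These names must match the label column of the cnews files exactly.
categories = ["體育", "財經", "房產", "家居", "教育", "科技", "時尚", "時政", "遊戲", "娛樂"]

read_file("data/cnews/cnews.test.txt", "data/cnews/fast_test.txt")
result = classifier.test("data/cnews/fast_test.txt")
print("Precision: %f" % result.precision)
print("Recall: %f" % result.recall)

contents, labels = [], []
with open("data/cnews/cnews.test.txt", encoding="utf-8") as fw:
    for line in fw:
        label, content = line.strip().split("\t", 1)
        segs = filter(lambda x: len(x) > 1, jieba.cut(content))
        contents.append(" ".join(segs))
        labels.append("__label__" + label)

label_predict = [e[0] for e in classifier.predict(contents)]
print("Precision, Recall and F1-Score....")
# Pass labels= explicitly so that each report row lines up with target_names.
report_labels = ["__label__" + c for c in categories]
print(metrics.classification_report(labels, label_predict,
                                    labels=report_labels, target_names=categories))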
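The script prints a precision and a recall value from classifier.test. As a rough illustration of what those two numbers mean, here is a minimal sketch in plain Python, assuming they are measured at k=1 (one predicted label per example) and that every example carries exactly one true label. The helper name precision_recall_at_1 is hypothetical and is not part of fastText.

```python
def precision_recall_at_1(true_labels, predicted_labels):
    """Return (precision, recall) when each example has one true and one predicted label."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("need two non-empty lists of equal length")
    # A prediction counts as correct when it equals the true label.
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    n_predictions = len(predicted_labels)  # one prediction per example
    n_true = len(true_labels)              # one true label per example
    return correct / n_predictions, correct / n_true


precision, recall = precision_recall_at_1(
    ["sports", "finance", "tech", "sports"],
    ["sports", "finance", "sports", "sports"],
)
print(precision, recall)
```

Under these assumptions the two denominators are equal, so precision and recall come out identical. The per-class figures in the classification report are more informative for that reason.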

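Some open questions about using fastText: the label_prefix parameter of fasttext.supervised keeps raising an error telling me I am using it incorrectly... I searched for a long time and still could not work out what is wrong with it.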
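One possible way to sidestep label_prefix, sketched here as an assumption rather than a diagnosis of the error: leave the parameter out and strip the prefix from the predicted labels in plain Python. The helper strip_label_prefix and the sample labels below are hypothetical.

```python
LABEL_PREFIX = "__label__"


def strip_label_prefix(label, prefix=LABEL_PREFIX):
    """Remove the prefix from a predicted label; leave other strings unchanged."""
    return label[len(prefix):] if label.startswith(prefix) else label


# Made-up predictions, standing in for the output of classifier.predict.
predicted = ["__label__sports", "__label__finance", "tech"]
cleaned = [strip_label_prefix(p) for p in predicted]
print(cleaned)  # ['sports', 'finance', 'tech']
```

This keeps the prefix handling in one place, so the training file and the evaluation code agree on the same string.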

One more thing to watch out for: fastText requires every label to be written with the "__label__" prefix in front of it.
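To make the expected line layout concrete, here is a minimal sketch that builds one training line from a label and a list of already segmented tokens. The function name to_fasttext_line and the sample tokens are illustrative assumptions. In the script the tokens would come from jieba.cut.

```python
def to_fasttext_line(label, tokens, prefix="__label__"):
    """Build one line of the form "<prefix><label> tok1 tok2 ..."."""
    # Drop empty or whitespace-only tokens so the line has single separators.
    kept = [t for t in tokens if t.strip()]
    return prefix + str(label) + " " + " ".join(kept)


line = to_fasttext_line("sports", ["team", "wins", " ", "final"])
print(line)  # __label__sports team wins final
```

A space is used as the separator here. The script uses a tab after the label, which is also whitespace.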

I will follow up with a post on how fastText works.
