
Text Classification with an LDA Topic Model and SVM

Using an LDA topic model to extract text features and a linear SVM as the classifier turns out to work poorly, with F1 = 0.654:

Precision: 0.680, Recall: 0.649, F1: 0.654

A RandomForestClassifier does not fare much better:

Precision: 0.680, Recall: 0.668, F1: 0.670

By contrast, practically any deep learning model (textCNN, LSTM+Attention) reaches 0.95+ F1 on this task, with no manual feature engineering and no word segmentation.
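For reference, here is a minimal textCNN sketch. The framework (Keras) and the vocabulary/sequence sizes are my assumptions for illustration; the post does not specify them. The point is how little preprocessing such a model needs: characters are mapped to integer ids and everything else is learned end to end.

import numpy as np
from tensorflow.keras import layers, models

vocab_size, seq_len, n_classes = 6000, 400, 5  # assumed sizes for the Sohu data

inp = layers.Input(shape=(seq_len,))
emb = layers.Embedding(vocab_size, 128)(inp)
# Convolutions with several window sizes, max-pooled and concatenated
convs = [layers.GlobalMaxPooling1D()(layers.Conv1D(128, k, activation='relu')(emb))
         for k in (2, 3, 4)]
out = layers.Dense(n_classes, activation='softmax')(layers.Concatenate()(convs))

model = models.Model(inp, out)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()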

Now for the LDA pipeline in detail. Extracting LDA features requires vectorizing the texts with CountVectorizer first, which in turn requires word segmentation. Since the corpus is fairly large (Sohu news dataset, 5 categories × 3000 articles each), jieba segmentation is parallelized across processes, implemented here with a ProcessPoolExecutor process pool.
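One detail worth knowing before reading the code: CountVectorizer's default token_pattern, r"(?u)\b\w\w+\b", only keeps tokens of two or more word characters, so single-character Chinese words in the comma-joined jieba output are silently dropped. A quick toy check (the strings below are made up):

from sklearn.feature_extraction.text import CountVectorizer

docs = ['我,喜歡,機器,學習', '機器,學習,很,有趣']  # toy comma-joined jieba output
ct = CountVectorizer()  # default token_pattern r"(?u)\b\w\w+\b"
bow = ct.fit_transform(docs)
print(ct.get_feature_names_out())  # ['喜歡' '學習' '有趣' '機器']: '我' and '很' are gone
print(bow.toarray())               # [[1 1 0 1], [0 1 1 1]]

The full script: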

import pandas as pd
import jieba
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import multiprocessing
from concurrent.futures import ProcessPoolExecutor,as_completed
from utils import log
from tqdm import tqdm
import time
import pickle as pk
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC, SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, f1_score


def transform_text(text, stopwords):
    # Segment an article with jieba, dropping stopwords and blank tokens
    words = [w for w in jieba.cut(text) if w.strip() and (w not in stopwords)]
    return ','.join(words)


def cut_texts(lock, texts, stopwords, processName, doc_list=[]):
    # Process + lock variant of multiprocess segmentation (kept for reference; unused below).
    # Note the mutable default argument: doc_list persists across calls.
    log('Process {} is cutting texts...'.format(processName))
    docs = []
    for text in tqdm(texts):
        doc = transform_text(text, stopwords)
        # log(doc)
        docs.append(doc)
    lock.acquire()
    doc_list.extend(docs)
    lock.release()


def cut_texts_pool(texts, stopwords, processName):
    # Segmentation worker; run in parallel below via a process pool
    log('Process {} is cutting texts...'.format(processName))
    docs = []
    for text in tqdm(texts):
        doc = transform_text(text, stopwords)
        # log(doc)
        docs.append(doc)
    log('Process {} finished cutting.'.format(processName))
    return docs


def hard_work(processName):
    # Test helper that simulates a time-consuming task
    log('Process {} is running...'.format(processName))
    time.sleep(2)
    log('Process {} finished.'.format(processName))
    return processName


def mp_pool_test(texts=None, res=None):
    # Smoke test for the process pool
    n_process = multiprocessing.cpu_count()
    pool = ProcessPoolExecutor()
    fs = []
    for i in range(n_process):
        f = pool.submit(hard_work, i)
        fs.append(f)
    names = []
    for f in as_completed(fs):
        name = f.result()
        names.append(name)
    log(names)


def partition(iterable_, n_partition):
    # Split the texts into n_partition roughly equal chunks
    assert isinstance(n_partition, int) and n_partition > 0, 'Invalid value for "n_partition"'
    temp = list(iterable_)
    total = len(temp)
    assert total > n_partition, 'Size of iterable is less than "n_partition"'
    partition_size = total // n_partition
    res = []
    for i in range(n_partition - 1):
        res.append(temp[partition_size * i:partition_size * (i + 1)])
    res.append(temp[partition_size * (n_partition - 1):])
    return res


def mp_cut_pool(texts):
    # Create one worker process per CPU
    n_process = multiprocessing.cpu_count()
    chunks = partition(texts, n_process)
    # Segment in parallel via the process pool
    pool = ProcessPoolExecutor(max_workers=n_process)
    fs = []
    docs = []
    for i in range(n_process):
        # submit schedules a task: first argument is the target function, the rest are its arguments
        f = pool.submit(cut_texts_pool, chunks[i], [], i)  # f is a Future
        fs.append(f)
    # as_completed yields each Future as soon as it finishes
    for f in as_completed(fs):
        # f.result() returns the worker's return value
        docs.extend(f.result())
    return docs


class LDA_Transformer:
    def __init__(self, n_features):
        self.n_features = n_features

    def fit(self, texts):
        log('Building CountVectorizer with texts...')
        ct = CountVectorizer()
        self.count_vectorizer = ct
        log(type(texts))
        if isinstance(texts, list):
            log('Len of texts:{}'.format(len(texts)))
            # log(texts)
        else:
            log('Shape of texts:{}'.format(texts.shape))
        print('texts[0]', texts[0])
        ctv = ct.fit_transform(texts)
        log('Building LDA model with CountVectorizer..')
        # n_components is the number of LDA topics, playing a role similar to
        # the dimensionality of a word embedding
        lda = LatentDirichletAllocation(n_components=self.n_features)
        lda.fit(ctv)
        log('Done building LDA model.')
        self.lda_model = lda

    def transform(self, texts):
        count_vec = self.count_vectorizer.transform(texts)
        return self.lda_model.transform(count_vec)


def build_data():
    df = pd.read_excel('data/souhu_news_400_500.xlsx')
    texts = list(df['content'])  # text column
    log(df.columns)
    docs = mp_cut_pool(texts)
    lda_transformer = LDA_Transformer(64)
    lda_transformer.fit(docs)
    # Persist the fitted LDA transformer
    with open('output/lda_transformer.pkl', 'wb') as f:
        pk.dump(lda_transformer, f)
    indices = list(range(df.shape[0]))
    np.random.shuffle(indices)
    df = df.iloc[indices]
    # Keep the segmented docs aligned with the shuffled rows; transform must see
    # the same comma-joined representation the CountVectorizer was fit on
    docs = [docs[i] for i in indices]
    dic = {topic: i for i, topic in enumerate(list(df['topic'].unique()))}
    y = [dic[topic] for topic in list(df['topic'])]
    with open('data/y_lda.pkl', 'wb') as f:
        pk.dump(y, f)
    X = lda_transformer.transform(docs)
    with open('data/X_lda.pkl', 'wb') as f:
        pk.dump(X, f)
    log('Training data is saved.')


def load_train_data():
    with open('data/X_lda.pkl', 'rb') as f:
        X = pk.load(f)
    with open('data/y_lda.pkl', 'rb') as f:
        y = pk.load(f)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    return X_train, X_test, y_train, y_test


def main():
    log('Building training data...')
    build_data()
    log('Loading training data with LDA features...')
    X_train, X_test, y_train, y_test = load_train_data()
    log('Training classifier...')
    # Swap between the two classifiers reported above
    # model = LinearSVC()
    model = RandomForestClassifier()
    model.fit(X_train, y_train)
    log('Evaluating model...')
    acc = model.score(X_test, y_test)
    log('Accuracy:{}'.format(acc))
    y_pred = model.predict(X_test)
    p = precision_score(y_test, y_pred, average='macro')
    r = recall_score(y_test, y_pred, average='macro')
    f1 = f1_score(y_test, y_pred, average='macro')
    log('Precision:{:.3f},Recall:{:.3f},F1:{:.3f}'.format(p, r, f1))


if __name__ == '__main__':
    main()
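Note that the script only pickles the LDA transformer, not the classifier. A hypothetical inference sketch (file names follow the script above; unpickling requires LDA_Transformer to be importable, and the classifier is refit from the saved features since it was never persisted):

import pickle as pk
import jieba
from sklearn.svm import LinearSVC

# Load the fitted LDA feature extractor saved by build_data()
with open('output/lda_transformer.pkl', 'rb') as f:
    lda_transformer = pk.load(f)

# Refit a classifier from the saved training features
with open('data/X_lda.pkl', 'rb') as f:
    X = pk.load(f)
with open('data/y_lda.pkl', 'rb') as f:
    y = pk.load(f)
clf = LinearSVC().fit(X, y)

# Classify a new article: same comma-joined jieba representation as in training
text = '...'  # some new article
doc = ','.join(w for w in jieba.cut(text) if w.strip())
features = lda_transformer.transform([doc])  # shape (1, 64): a topic distribution
print(clf.predict(features))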