An NLP-Based Automatic Document Classification System (Sohu Entertainment): The word2vec Model

阿新 • Published: 2018-12-11

Development environment: Jupyter Notebook

1 Loading the data

import numpy as np
import pandas as pd

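# Inspect the training data: tab-separated, one row per article,
# with columns 頻道 (channel, the class label) and 文章 (article text)
train_data = pd.read_csv('data/sohu_train.txt', sep='\t', header=None,
                         dtype=np.str_, encoding='utf8', names=[u'頻道', u'文章'])
train_data.head()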

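# Load the stopword list
# Opened in text mode so each line is a str (UTF-8 is assumed, as for the training data)
stopwords = set()
with open('data/stopwords.txt', 'r', encoding='utf8') as infile:
    for line in infile:
        line = line.rstrip('\n')
        if line:
            stopwords.add(line.lower())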
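
The stopword set is loaded here but is not applied in the code that follows. If one wanted to use it, a common approach is to drop stopwords from the token stream before averaging word vectors. A minimal sketch, assuming the text is already segmented into tokens; the token list and the tiny stopword set are made up for illustration, and `remove_stopwords` is a hypothetical helper that is not part of the original notebook:

```python
def remove_stopwords(tokens, stopwords):
    """Return tokens whose stripped, lowercased form is non-empty and not a stopword."""
    kept = []
    for tok in tokens:
        norm = tok.strip().lower()
        if norm and norm not in stopwords:
            kept.append(tok)
    return kept

# Illustrative data only; real tokens would come from jieba.cut(article)
demo_stopwords = {"the", "of", "a"}
demo_tokens = ["The", "history", "of", "a", "film", " "]
filtered = remove_stopwords(demo_tokens, demo_stopwords)
print(filtered)  # ['history', 'film']
```

Lowercasing the token before the lookup mirrors the `line.lower()` call used when the set is built.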

2 Computing a vector for each article

# Load the trained Word2Vec model
# Requires the output of 4.0_訓練word2vec模型.ipynb
from gensim.models import Word2Vec
w2v = Word2Vec.load('output_word2vec/model.w2v') 

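# Use the average of the word vectors of all in-vocabulary words as the article vector

import jieba
def compute_doc_vec_single(article):
    # w2v.wv holds the trained word vectors (KeyedVectors)
    vec = np.zeros((w2v.wv.vector_size,), dtype=np.float32)
    n = 0
    for word in jieba.cut(article):
        if word in w2v.wv:
            vec += w2v.wv[word]
            n += 1
    # Guard against articles with no in-vocabulary words (avoids dividing by zero)
    return vec / n if n > 0 else vec

def compute_doc_vec(articles):
    return np.vstack([compute_doc_vec_single(x) for x in articles])

x = compute_doc_vec(train_data[u'文章'])
print(x.shape)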
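
To make the averaging step concrete, here is a toy sketch in plain Python. The three-dimensional embeddings in `toy_vectors` are invented for illustration and stand in for the trained Word2Vec vectors, and `average_vector` is a hypothetical helper. Out-of-vocabulary words are skipped, and a document with no known words gets a zero vector.

```python
def average_vector(tokens, vectors, dim):
    """Average the vectors of the tokens found in `vectors`; zeros if none are found."""
    total = [0.0] * dim
    n = 0
    for tok in tokens:
        vec = vectors.get(tok)
        if vec is None:
            continue  # out-of-vocabulary word: skip it
        total = [t + v for t, v in zip(total, vec)]
        n += 1
    if n == 0:
        return total  # no known words: return the zero vector
    return [t / n for t in total]

# Made-up embeddings, for illustration only
toy_vectors = {
    "film": [1.0, 0.0, 2.0],
    "actor": [3.0, 2.0, 0.0],
}
doc_vec = average_vector(["film", "actor", "unknownword"], toy_vectors, dim=3)
print(doc_vec)  # [2.0, 1.0, 1.0]
```

Averaging discards word order, so two articles containing the same words in a different order map to the same vector.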

3 Training the classifier

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split 

# Encode the target variable
y_encoder = LabelEncoder()
y = y_encoder.fit_transform(train_data[u'頻道']) 

# Split into training and test sets, stratified by y, with 20% held out for testing
train_idx, test_idx = train_test_split(range(len(y)), test_size=0.2, stratify=y)
train_x = x[train_idx, :]
train_y = y[train_idx]
test_x = x[test_idx, :]
test_y = y[test_idx]
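
Stratifying by `y` keeps each channel's share roughly the same in both subsets. The pure-Python sketch below illustrates the idea on made-up labels; `stratified_split` is a simplified, hypothetical stand-in written for illustration, not how scikit-learn implements `train_test_split`.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size, seed=0):
    """Split indices so each class sends about `test_size` of its items to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for label, idxs in by_class.items():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_size)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx

# Illustrative labels: 10 items of class "a" and 20 of class "b"
demo_labels = ["a"] * 10 + ["b"] * 20
train_idx, test_idx = stratified_split(demo_labels, test_size=0.2)
test_counts = Counter(demo_labels[i] for i in test_idx)
print(dict(test_counts))
```

With a 20% test share, each class contributes a fifth of its own items to the test set, so the 1:2 class ratio is preserved in both subsets.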

4 Training and evaluation

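# Train a logistic regression model
from sklearn.linear_model import LogisticRegression
"""
	Common parameters:
	penalty: type of regularization term, l1 or l2
	C:       inverse of the regularization strength; larger values mean a weaker penalty
	fit_intercept: whether to fit an intercept term
	max_iter:      maximum number of iterations
	multi_class:   how the multiclass model is trained
	     ovr =     train one binary classifier per label
	     multinomial = train a single multiclass model directly; supported only when solver is one of {newton-cg, sag, lbfgs}
	 solver: optimization method; options include {liblinear, newton-cg, sag, lbfgs}
	         liblinear works well on small datasets; sag is faster on large ones
	         for multiclass problems liblinear supports only ovr, while the others support both ovr and multinomial
	         of these, only liblinear supports l1 regularization; the others support l2 only
"""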

model = LogisticRegression(multi_class='multinomial', solver='lbfgs')
model.fit(train_x, train_y)
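
The difference between the two `multi_class` modes shows up in how class scores become probabilities. As a rough sketch (the scores are invented and both helper functions are hypothetical): the multinomial mode applies a softmax over all class scores jointly, while the one-vs-rest mode applies a sigmoid to each class score independently and then normalizes the results.

```python
import math

def softmax(scores):
    """Joint normalization used by the multinomial formulation."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def ovr_probabilities(scores):
    """Per-class sigmoid followed by normalization, as in a one-vs-rest scheme."""
    sig = [1.0 / (1.0 + math.exp(-s)) for s in scores]
    total = sum(sig)
    return [p / total for p in sig]

demo_scores = [2.0, 0.5, -1.0]  # illustrative decision scores for three classes
p_multi = softmax(demo_scores)
p_ovr = ovr_probabilities(demo_scores)
print([round(p, 3) for p in p_multi])
print([round(p, 3) for p in p_ovr])
```

Both schemes rank the classes in the same order for these scores, but the softmax concentrates more probability on the top class.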

from sklearn.metrics import confusion_matrix, precision_recall_fscore_support 

# Evaluate the model on the test set
test_y_pred = model.predict(test_x)

# Compute the confusion matrix
pd.DataFrame(confusion_matrix(test_y, test_y_pred), 
             columns=y_encoder.classes_, 
             index=y_encoder.classes_)

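Rows are the true channels and columns are the predicted channels. Channel labels: 體育 (sports), 健康 (health), 女人 (women), 娛樂 (entertainment), 房地產 (real estate), 教育 (education), 文化 (culture), 新聞 (news), 旅遊 (travel), 汽車 (automobiles), 科技 (technology), 財經 (finance).

|      | 體育 | 健康 | 女人 | 娛樂 | 房地產 | 教育 | 文化 | 新聞 | 旅遊 | 汽車 | 科技 | 財經 |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| 體育 | 385 | 0 | 4 | 3 | 2 | 1 | 1 | 0 | 2 | 0 | 1 | 1 |
| 健康 | 0 | 314 | 22 | 0 | 0 | 5 | 6 | 27 | 2 | 0 | 6 | 18 |
| 女人 | 7 | 13 | 323 | 19 | 2 | 2 | 17 | 4 | 3 | 3 | 7 | 0 |
| 娛樂 | 1 | 1 | 15 | 308 | 0 | 3 | 60 | 3 | 2 | 1 | 6 | 0 |
| 房地產 | 1 | 3 | 4 | 3 | 357 | 0 | 0 | 13 | 4 | 1 | 0 | 14 |
| 教育 | 0 | 5 | 4 | 5 | 0 | 335 | 6 | 32 | 3 | 1 | 5 | 4 |
| 文化 | 1 | 3 | 25 | 67 | 2 | 7 | 232 | 30 | 15 | 2 | 13 | 3 |
| 新聞 | 8 | 14 | 13 | 7 | 22 | 26 | 25 | 227 | 12 | 4 | 16 | 26 |
| 旅遊 | 1 | 5 | 15 | 1 | 5 | 5 | 15 | 15 | 312 | 6 | 11 | 9 |
| 汽車 | 0 | 3 | 5 | 0 | 1 | 0 | 0 | 7 | 5 | 365 | 0 | 14 |
| 科技 | 2 | 9 | 4 | 3 | 2 | 3 | 12 | 20 | 8 | 3 | 319 | 15 |
| 財經 | 3 | 4 | 2 | 2 | 29 | 1 | 1 | 40 | 2 | 12 | 25 | 279 |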

# Compute the evaluation metrics
def eval_model(y_true, y_pred, labels):

    # Per-class precision, recall, F1, and support
    p, r, f1, s = precision_recall_fscore_support(y_true, y_pred)
    # Overall precision, recall, and F1 as support-weighted averages, plus total support
    tot_p = np.average(p, weights=s)
    tot_r = np.average(r, weights=s)
    tot_f1 = np.average(f1, weights=s)
    tot_s = np.sum(s)
    res1 = pd.DataFrame({
        u'Label': labels,
        u'Precision': p,
        u'Recall': r,
        u'F1': f1,
        u'Support': s
    })
    res2 = pd.DataFrame({
        u'Label': [u'總體'],  # 總體 = "overall"
        u'Precision': [tot_p],
        u'Recall': [tot_r],
        u'F1': [tot_f1],
        u'Support': [tot_s]
    })
    res2.index = [999]
    res = pd.concat([res1, res2])
    return res[[u'Label', u'Precision', u'Recall', u'F1', u'Support']] 

eval_model(test_y, test_y_pred, y_encoder.classes_)  # view the evaluation results

First five rows:

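|      | Label | Precision | Recall | F1 | Support |
| ---- | ----- | --------- | ------ | -- | ------- |
| 0    | 體育 | 0.941320 | 0.9625 | 0.951792 | 400 |
| 1    | 健康 | 0.839572 | 0.7850 | 0.811370 | 400 |
| 2    | 女人 | 0.740826 | 0.8075 | 0.772727 | 400 |
| 3    | 娛樂 | 0.736842 | 0.7700 | 0.753056 | 400 |
| 4    | 房地產 | 0.845972 | 0.8925 | 0.868613 | 400 |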
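
The 體育 (sports) row of this table can be reproduced by hand from the confusion matrix shown earlier: recall divides the diagonal count by the row sum (all articles whose true channel is 體育), and precision divides it by the column sum (all articles predicted as 體育). The numbers below are copied from that matrix; the variable names are only illustrative.

```python
# Row for true label 體育 and column for predicted label 體育,
# copied from the confusion matrix above
sports_row = [385, 0, 4, 3, 2, 1, 1, 0, 2, 0, 1, 1]
sports_col = [385, 0, 7, 1, 1, 0, 1, 8, 1, 0, 2, 3]

true_positive = sports_row[0]
recall = true_positive / sum(sports_row)     # share of true 體育 articles that were found
precision = true_positive / sum(sports_col)  # share of 體育 predictions that were right
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 6), round(recall, 4), round(f1, 6))
```

The row sums to 400 and the column to 409, which gives the precision 0.941320, recall 0.9625, and F1 0.951792 reported for 體育.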

5 Saving the model

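# Save the model to a file
import os
import pickle

# output_dir is not defined in the original notebook; 'output' is a placeholder
output_dir = 'output'
os.makedirs(output_dir, exist_ok=True)
model_file = os.path.join(output_dir, u'model.pkl')
with open(model_file, 'wb') as outfile:
    pickle.dump({
        'y_encoder': y_encoder,
        'lr': model
    }, outfile)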
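
Bundling the label encoder and the classifier into one dictionary means a single file restores everything needed for prediction. A minimal round-trip sketch using only the standard library; the dictionary contents are placeholder values standing in for the fitted `y_encoder` and `model`, and a temporary directory is used so nothing is written to the project folder:

```python
import os
import pickle
import tempfile

# Placeholder objects standing in for the fitted encoder and classifier
bundle = {
    'y_encoder': ['class_a', 'class_b'],
    'lr': {'coef': [0.1, -0.2], 'intercept': 0.05},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model.pkl')
    with open(path, 'wb') as outfile:
        pickle.dump(bundle, outfile)
    with open(path, 'rb') as infile:
        restored = pickle.load(infile)

print(restored == bundle)  # True
```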

6 Loading the model and classifying new documents

import numpy as np
from gensim.models import Word2Vec
import pickle
import jieba

# Wrap the prediction logic in a class; an instance of it classifies new documents
class Predictor(object):
    
    def __init__(self, w2v_model_file, lr_model_file):
        self.w2v = Word2Vec.load(w2v_model_file)
        with open(lr_model_file, 'rb') as infile:
            self.model = pickle.load(infile)
    
    def predict(self, articles):
        x = self._compute_doc_vec(articles)
        y = self.model['lr'].predict(x)
        y_label = self.model['y_encoder'].inverse_transform(y)
        return y_label
    
    def _compute_doc_vec(self, articles):
        # Use the instance's own method rather than the notebook-level function
        return np.vstack([self._compute_doc_vec_single(x) for x in articles])

    def _compute_doc_vec_single(self, article):
        # Use the model loaded by this instance, not the global w2v
        wv = self.w2v.wv
        vec = np.zeros((wv.vector_size,), dtype=np.float32)
        n = 0
        for word in jieba.cut(article):
            if word in wv:
                vec += wv[word]
                n += 1
        # Articles with no in-vocabulary words fall back to the zero vector
        return vec / n if n > 0 else vec

# Load the new documents
new_data = pd.read_csv('data/sohu_test.txt', sep='\t', header=None, 
                       dtype=np.str_, encoding='utf8', names=[u'頻道', u'文章'])
new_data.head() 

# Load the models
predictor = Predictor('output_word2vec/model.w2v', model_file)

# Predict the channel of the first 10 articles
new_y_pred = predictor.predict(new_data[u'文章'][:10])

# Compare the predictions with the actual channels
pd.DataFrame({u'預測頻道': new_y_pred, u'實際頻道': new_data[u'頻道'][:10]})

Output (實際頻道 = actual channel, 預測頻道 = predicted channel):
|      | 實際頻道 | 預測頻道 |
| ---- | --------| -------- |
| 0    | 娛樂     | 娛樂     |
| 1    | 娛樂     | 體育     |
| 2    | 娛樂     | 娛樂     |
| 3    | 娛樂     | 文化     |
| 4    | 娛樂     | 女人     |
| 5    | 娛樂     | 新聞     |
| 6    | 娛樂     | 娛樂     |
| 7    | 娛樂     | 娛樂     |
| 8    | 娛樂     | 娛樂     |
| 9    | 娛樂     | 娛樂     |
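
As a quick check on the ten rows shown, the share of matching rows can be counted directly. The two lists below are copied from the table; this is only a ten-article sample, so it says little about accuracy on the full test file.

```python
# Actual and predicted channels for the ten rows in the table above
actual = ["娛樂"] * 10
predicted = ["娛樂", "體育", "娛樂", "文化", "女人", "新聞", "娛樂", "娛樂", "娛樂", "娛樂"]

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)
print(correct, accuracy)  # 6 0.6
```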
