Chinese Short-Text Classification

阿新 • Published: 2018-11-30


Feature extraction + naive Bayes model:

import random
import jieba
import pandas as pd
# Load the stopword list (one stopword per line)
stopwords=pd.read_csv('D://input_py//day06//stopwords.txt',index_col=False,quoting=3,sep="\t",names=['stopword'], encoding='utf-8')
stopwords=set(stopwords['stopword'].values)  # a set makes the membership test below fast
# Load the four corpora (one CSV file per category)
laogong_df = pd.read_csv('D://input_py//day06//beilaogongda.csv', encoding='utf-8', sep=',')
laopo_df = pd.read_csv('D://input_py//day06//beilaopoda.csv', encoding='utf-8', sep=',')
erzi_df = pd.read_csv('D://input_py//day06//beierzida.csv', encoding='utf-8', sep=',')
nver_df = pd.read_csv('D://input_py//day06//beinverda.csv', encoding='utf-8', sep=',')
# Drop rows that contain NaN
laogong_df.dropna(inplace=True)
laopo_df.dropna(inplace=True)
erzi_df.dropna(inplace=True)
nver_df.dropna(inplace=True)
# Convert the 'segment' column of each DataFrame to a Python list
laogong = laogong_df.segment.values.tolist()
laopo = laopo_df.segment.values.tolist()
erzi = erzi_df.segment.values.tolist()
nver = nver_df.segment.values.tolist()

# preprocess_text segments each line into words and attaches a label.
# content_lines: one of the lists built above
# sentences: a list (initially empty) that collects the labelled samples
# category: the class label for every line in content_lines
def preprocess_text(content_lines, sentences, category):
    for line in content_lines:
        try:
            segs=jieba.lcut(line)
            segs = [v for v in segs if not str(v).isdigit()]  # drop purely numeric tokens
            segs = list(filter(lambda x:x.strip(), segs))   # drop whitespace-only tokens
            segs = list(filter(lambda x:len(x)>1, segs))  # drop single-character tokens
            segs = list(filter(lambda x:x not in stopwords, segs))  # drop stopwords
            sentences.append((" ".join(segs), category))  # store (space-joined tokens, label)
        except Exception:
            print(line)
            continue
sentences = []
preprocess_text(laogong, sentences, 0)
preprocess_text(laopo, sentences, 1)
preprocess_text(erzi, sentences, 2)
preprocess_text(nver, sentences, 3)
random.shuffle(sentences)
# Print the first 10 samples
# for sentence in sentences[:10]:
#         print(sentence[0], sentence[1])  # index 0 is the space-joined token string, index 1 is the label
# Define the bag-of-words feature extractor
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(
    analyzer='word', # split into words (the text is already segmented and space-joined)
    max_features=4000,  # keep the 4000 most frequent terms
)
# Split the corpus into training and test sets
from sklearn.model_selection import train_test_split
x, y = zip(*sentences)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1256)
# Fit the bag-of-words vocabulary on the training data only
vec.fit(x_train)
# Build and train the model
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(vec.transform(x_train), y_train)
# Mean accuracy on the test set (score() returns accuracy, not AUC)
print(classifier.score(vec.transform(x_test), y_test))

Resulting score (mean accuracy on the test set): 0.6587
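The value printed by `classifier.score` is a single mean accuracy over all four classes, so it does not show which categories are being confused. As a rough illustration, the standard-library sketch below computes per-class precision and recall from true and predicted labels. The function names and the label lists are hypothetical and only for demonstration; with the real model, the predictions would presumably come from `classifier.predict(vec.transform(x_test))`.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count how often each (true_label, predicted_label) pair occurs."""
    return Counter(zip(y_true, y_pred))


def per_class_report(y_true, y_pred):
    """Return ({label: (precision, recall)}, overall_accuracy)."""
    counts = confusion_counts(y_true, y_pred)
    labels = sorted(set(y_true) | set(y_pred))
    report = {}
    for label in labels:
        true_positive = counts[(label, label)]
        # Number of samples predicted as this label / actually having this label.
        predicted = sum(n for (t, p), n in counts.items() if p == label)
        actual = sum(n for (t, p), n in counts.items() if t == label)
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / actual if actual else 0.0
        report[label] = (precision, recall)
    accuracy = sum(counts[(label, label)] for label in labels) / len(y_true)
    return report, accuracy


# Hypothetical labels for illustration only (0..3, as in the script above).
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 1, 1, 1, 2, 0, 3, 3]
report, accuracy = per_class_report(y_true, y_pred)
for label, (precision, recall) in report.items():
    print(label, round(precision, 3), round(recall, 3))
print("accuracy:", accuracy)
```

scikit-learn offers equivalent helpers in `sklearn.metrics` (`confusion_matrix` and `classification_report`), which are likely the more convenient choice in practice.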

Feature extraction + SVM model:

import random
import jieba
import pandas as pd
# Load the stopword list (one stopword per line)
stopwords=pd.read_csv('D://input_py//day06//stopwords.txt',index_col=False,quoting=3,sep="\t",names=['stopword'], encoding='utf-8')
stopwords=set(stopwords['stopword'].values)  # a set makes the membership test below fast
# Load the four corpora (one CSV file per category)
laogong_df = pd.read_csv('D://input_py//day06//beilaogongda.csv', encoding='utf-8', sep=',')
laopo_df = pd.read_csv('D://input_py//day06//beilaopoda.csv', encoding='utf-8', sep=',')
erzi_df = pd.read_csv('D://input_py//day06//beierzida.csv', encoding='utf-8', sep=',')
nver_df = pd.read_csv('D://input_py//day06//beinverda.csv', encoding='utf-8', sep=',')
# Drop rows that contain NaN
laogong_df.dropna(inplace=True)
laopo_df.dropna(inplace=True)
erzi_df.dropna(inplace=True)
nver_df.dropna(inplace=True)
# Convert the 'segment' column of each DataFrame to a Python list
laogong = laogong_df.segment.values.tolist()
laopo = laopo_df.segment.values.tolist()
erzi = erzi_df.segment.values.tolist()
nver = nver_df.segment.values.tolist()

# preprocess_text segments each line into words and attaches a label.
# content_lines: one of the lists built above
# sentences: a list (initially empty) that collects the labelled samples
# category: the class label for every line in content_lines
def preprocess_text(content_lines, sentences, category):
    for line in content_lines:
        try:
            segs=jieba.lcut(line)
            segs = [v for v in segs if not str(v).isdigit()]  # drop purely numeric tokens
            segs = list(filter(lambda x:x.strip(), segs))   # drop whitespace-only tokens
            segs = list(filter(lambda x:len(x)>1, segs))  # drop single-character tokens
            segs = list(filter(lambda x:x not in stopwords, segs))  # drop stopwords
            sentences.append((" ".join(segs), category))  # store (space-joined tokens, label)
        except Exception:
            print(line)
            continue
sentences = []
preprocess_text(laogong, sentences, 0)
preprocess_text(laopo, sentences, 1)
preprocess_text(erzi, sentences, 2)
preprocess_text(nver, sentences, 3)
random.shuffle(sentences)
# Split the corpus into training and test sets
from sklearn.model_selection import train_test_split
x, y = zip(*sentences)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1256)
# Change the feature vector model: add word n-grams and a larger vocabulary
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(
    analyzer='word', # split into words (the text is already segmented and space-joined)
    ngram_range=(1,4),  # use word n-grams of 1 to 4 tokens
    max_features=20000,  # keep the 20000 most frequent n-grams
)
vec.fit(x_train)
# Train a linear-kernel SVM
from sklearn.svm import SVC
svm = SVC(kernel='linear')
svm.fit(vec.transform(x_train), y_train)
# Mean accuracy on the test set
print(svm.score(vec.transform(x_test), y_test))

Resulting score (mean accuracy on the test set): 0.9976
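Besides swapping the classifier, the SVM script also changes the features: `ngram_range=(1,4)` asks `CountVectorizer` for every run of 1 to 4 consecutive words rather than single words only. The sketch below imitates that expansion in plain Python to show how quickly the candidate feature set grows. It is a simplification: it splits on whitespace only, whereas `CountVectorizer` also lowercases and applies its own token pattern. The helper name `word_ngrams` and the sample sentence are made up for illustration.

```python
def word_ngrams(text, min_n=1, max_n=4):
    """Return all word n-grams of length min_n..max_n from a space-joined string."""
    tokens = text.split()
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the sentence.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


# Hypothetical pre-segmented sentence, in the same space-joined form
# that preprocess_text produces.
sample = "今天 天氣 很好 適合 出門"

unigrams = word_ngrams(sample, 1, 1)
up_to_four = word_ngrams(sample, 1, 4)
print(len(unigrams), unigrams)
print(len(up_to_four), up_to_four)
```

For a five-token sentence this yields 5 unigrams but 14 n-grams in total (5 + 4 + 3 + 2), which is presumably why `max_features` was raised from 4000 to 20000 in the second script.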
