機器學習之新聞文字分類。
阿新 • 發佈:2021-06-20
新聞文字分類首先需要通過大量的訓練之後獲得一個存放關鍵字的表,
之後再輸入一個新聞內容,通過程式碼就可以自動判斷出這個新聞的類別,
我這裡是在已經有了新聞文字的關鍵詞表後的處理。
# encoding=utf-8
"""News text classification.

Pipeline: store an article to disk, segment it with jieba (dropping stop
words), build CountVectorizer + TF-IDF features over the training corpus,
and classify the article with a 3-nearest-neighbour model.

NOTE(review): the original imported ``reload`` from the removed-in-3.12
``imp`` module and called ``reload(sys)`` — a Python-2 encoding hack with no
effect on Python 3 — so both have been dropped.
"""
import os

import jieba
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.neighbors import KNeighborsClassifier

VECTOR_DIR = 'vectors.bin'
MAX_SEQUENCE_LENGTH = 100
EMBEDDING_DIM = 200
TEST_SPLIT = 0.2

# Predicted label -> user-facing category message.  Keys are the string
# labels stored one-per-line in dataset_train/y_train.txt.
_CATEGORY_BY_LABEL = {
    '1': "此新聞為娛樂類新聞",
    '2': "此新聞為汽車類新聞",
    '3': "此新聞為遊戲類新聞",
    '4': "此新聞為科技類新聞",
    '5': "此新聞為綜合體育最新類新聞",
    '6': "此新聞為財經類新聞",
    '7': "此新聞為房產類新聞",
    '8': "此新聞為教育類新聞",
    '9': "此新聞為軍事類新聞",
}


def deposit_txt(title, content):
    """Write *title* + *content* to news/news.txt for later processing."""
    textpath = "news/news.txt"
    with open(textpath, 'w+', encoding='utf-8') as f:
        f.write(title + content)


def EnumPathFiles(path, callback, stop_words_list):
    """Walk *path* and call ``callback(root, filename, stop_words_list)``
    for every file in the tree.

    Bug fix: the original iterated ``os.walk`` (which already descends into
    every subdirectory) AND recursed manually on each directory entry, so
    files below the top level were processed more than once.  A single walk
    covers the whole tree exactly once.
    """
    if not os.path.isdir(path):
        print('Error:"', path, '" is not a directory or does not exist.')
        return
    for root, dirs, files in os.walk(path):
        for d in dirs:
            print(d)  # progress output only; os.walk handles the descent
        for f in files:
            callback(root, f, stop_words_list)


def ProsessofWords(textpath, stop_words_list):
    """Segment the file at *textpath* with jieba, drop stop words and tabs,
    and overwrite the file with the space-separated tokens."""
    with open(textpath, 'r', encoding='utf-8') as f:
        text = f.read()
    seg_list = jieba.cut(text, cut_all=False)
    # Keep tokens that are neither stop words nor tab characters.  Joining
    # once is O(n), unlike the original quadratic ``outstr += word`` loop;
    # the trailing space matches the original output byte-for-byte.
    kept = [word for word in seg_list
            if word not in stop_words_list and word != '\t']
    outstr = " ".join(kept) + " " if kept else ""
    with open(textpath, 'w+', encoding='utf-8') as f:
        f.write(outstr)


def callback1(path, filename, stop_words_list):
    """Per-file callback for EnumPathFiles: segment one file in place."""
    # os.path.join is portable; the original hard-coded the Windows '\\'
    # separator, which breaks on POSIX systems.
    textpath = os.path.join(path, filename)
    print(textpath)
    ProsessofWords(textpath, stop_words_list)


def fenci():
    """Load the stop-word list and segment every file under news/."""
    stopwords_file = "stopword/stopword.txt"
    stop_words = []
    with open(stopwords_file, "r", encoding='utf-8') as stop_f:
        for line in stop_f:  # iterate lazily instead of readlines()
            line = line.strip()
            if line:  # skip blank lines
                stop_words.append(line)
    print(len(stop_words))
    EnumPathFiles(r'news', callback1, stop_words)


def CV_Tfidf():
    """Train a KNN classifier on TF-IDF features and classify news/news.txt.

    Returns the category message for the first recognised prediction, or
    ``None`` when no prediction matches a known label (same fall-through
    as the original if/elif chain).
    """
    # --- load texts ---
    print('(1) load texts...')
    with open('dataset_train/x_train.txt', encoding='utf-8') as f:
        train_texts = f.read().split('\n')
    with open('dataset_train/y_train.txt', encoding='utf-8') as f:
        train_labels = f.read().split('\n')
    with open('news/news.txt', encoding='utf-8') as f:
        test_texts = f.read().split('\n')
    all_text = train_texts + test_texts

    # --- feature extraction ---
    print('(2) doc to var...')
    # Learn one shared vocabulary over train + test so both count matrices
    # have identical columns (the original's counts_all result was unused).
    count_v0 = CountVectorizer()
    count_v0.fit(all_text)
    count_v1 = CountVectorizer(vocabulary=count_v0.vocabulary_)
    counts_train = count_v1.fit_transform(train_texts)
    print("the shape of train is " + repr(counts_train.shape))
    count_v2 = CountVectorizer(vocabulary=count_v0.vocabulary_)
    counts_test = count_v2.fit_transform(test_texts)
    print("the shape of test is " + repr(counts_test.shape))
    # Bug fix: fit the TF-IDF transformer on the TRAINING counts only and
    # reuse it on the test counts.  The original re-fit it on the test set,
    # giving the two matrices inconsistent IDF weights (test-set leakage).
    tfidftransformer = TfidfTransformer()
    x_train = tfidftransformer.fit_transform(counts_train)
    x_test = tfidftransformer.transform(counts_test)
    y_train = train_labels

    # --- KNN classification ---
    print('(3) KNN...')
    knnclf = KNeighborsClassifier(n_neighbors=3)
    knnclf.fit(x_train, y_train)
    preds = knnclf.predict(x_test).tolist()
    for pred in preds:
        print(pred)
        # Dict lookup replaces the nine-branch if/elif chain; like the
        # original, return on the first prediction with a known label.
        message = _CATEGORY_BY_LABEL.get(pred)
        if message is not None:
            return message


def news(title, content):
    """End-to-end entry point: store the article, segment it, classify it."""
    deposit_txt(title, content)
    fenci()
    return CV_Tfidf()