Python machine learning: loading a sample set and classifying data

阿新 • Published: 2021-06-25

import os

import langid
import nltk
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# preprocess tokenizes one text file and returns the tokens as a single space-separated string.
# nltk.word_tokenize needs NLTK tokenizer data (for example the "punkt" package) to be installed.
def preprocess(path_name):
    with open(path_name, "r", encoding="utf-8") as f:
        textfile = f.read()
    textcut = nltk.word_tokenize(textfile)
    return " ".join(textcut)


# loadtrainset tokenizes every text file in a folder and loads the results as training data.
# It returns the processed texts and a parallel list holding the class label of each text.
def loadtrainset(path, classtag):
    allfiles = os.listdir(path)
    processed_textset = []
    allclasstags = []
    for thisfile in allfiles:
        path_name = os.path.join(path, thisfile)
        processed_textset.append(preprocess(path_name))
        allclasstags.append(classtag)
    return processed_textset, allclasstags


def train():
    processed_textdata1, class1 = loadtrainset("data/CS", "CS")
    processed_textdata2, class2 = loadtrainset("data/CL", "CL")
    integrated_train_data = processed_textdata1 + processed_textdata2
    classtags_list = class1 + class2

    # CountVectorizer turns the texts into a term-count matrix;
    # element a[i][j] is the number of times word j occurs in document i.
    count_vector = CountVectorizer()
    vector_matrix = count_vector.fit_transform(integrated_train_data)

    # With use_idf=False no IDF weighting is applied: each row is just the
    # term counts scaled to unit length (the default norm is "l2").
    train_tfidf = TfidfTransformer(use_idf=False).fit_transform(vector_matrix)

    # Train a MultinomialNB classifier on the weighted matrix.
    clf = MultinomialNB().fit(train_tfidf, classtags_list)

    return count_vector, clf


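def isCyber(content):
    # Returns True when the text is classified as "CS" with probability >= 0.7.
    content_lang = langid.classify(content)[0]
    if content_lang == 'en':
        text_with_spaces = " ".join(nltk.word_tokenize(content))
        testset = [text_with_spaces]

        # Note: the model is retrained on every call.
        count_vector, clf = train()
        new_count_vector = count_vector.transform(testset)
        # With use_idf=False the transformer learns nothing from the data,
        # so fitting it on the test vector only normalizes the counts.
        new_tfidf = TfidfTransformer(use_idf=False).fit_transform(new_count_vector)
        predict_result = clf.predict(new_tfidf)  # predicted label
        probabilities = clf.predict_proba(new_tfidf)
        # clf.classes_ is sorted, here ['CL', 'CS']; look up the "CS" column by label.
        cs_index = list(clf.classes_).index('CS')
        cs_probability = probabilities[0][cs_index]
        print(predict_result)
        print(probabilities)
        print(cs_probability)
        return bool(predict_result[0] == 'CS' and cs_probability >= 0.7)
    if content_lang == 'zh':
        # Chinese text is not handled yet.
        print()
    return False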

if __name__=='__main__':
    content = '''These pandemic days flow by in waves of exhilaration and stillness. Who knew a trip to the grocery store could be so exciting? Bread-and-milk runs have become surgical raids: Sterilize the grocery cart with a disinfectant wipe, scout out the TP aisle, exchange sideways glances with the could-be infected, grab the essentials, and get the hell out of there. Later, as another news alert interrupts the Netflix stream, the group text explodes: “This is crazy,” everyone says from their respective couches. Few hasten to add that crazy is also sort of fun.'''
    isCyber(content)
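
The weighting step in train() uses TfidfTransformer(use_idf=False). With scikit-learn's default settings this applies no IDF factor and only rescales each document's term counts to unit L2 length. The sketch below reproduces that arithmetic in plain Python to make the step concrete. The helper name l2_normalized_counts, the toy vocabulary, and the sample sentence are all made up for illustration and are not part of the original script.

```python
import math
from collections import Counter


def l2_normalized_counts(tokens, vocabulary):
    """Return the term counts of `tokens` over `vocabulary`, scaled to unit L2 length."""
    counts = Counter(tokens)
    row = [counts[word] for word in vocabulary]
    norm = math.sqrt(sum(c * c for c in row))
    if norm == 0:
        return [0.0 for _ in row]  # no known word: leave the row as zeros
    return [c / norm for c in row]


# Hypothetical toy vocabulary and document, for illustration only.
vocabulary = ["attack", "network", "syntax"]
document = "network attack on the network".split()

weights = l2_normalized_counts(document, vocabulary)
print(weights)
```

Words outside the vocabulary ("on", "the") are ignored, which mirrors how count_vector.transform drops words it did not see during training.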
