Machine Learning in Action, Part 2: Naive Bayes Article Classification in C#

from numpy import ones, log

def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory)/float(numTrainDocs)
    p0Num = ones(numWords); p1Num = ones(numWords)      #change to ones() 
    p0Denom = 2.0; p1Denom = 2.0                        #change to 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = log(p1Num/p1Denom)          #change to log()
    p0Vect = log(p0Num/p0Denom)          #change to log()
    return p0Vect,p1Vect,pAbusive

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)    #element-wise mult   *see Note 1
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else: 
        return 0
    


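*Note 1

p(Ci|w) = p(w|Ci) p(Ci) / p(w). The denominator p(w) is the same for every class, so only the numerator needs to be compared. Taking the natural logarithm of that product gives ln(p(w|Ci) p(Ci)) = ln(p(w|Ci)) + ln(p(Ci)).

In the example below, every class makes up the same share of the training samples, so leaving out the log(p(Ci)) term does not change the final classification.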
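
To make the point about balanced classes concrete, here is a minimal Python sketch. The word probabilities and the document vector are made-up values for illustration only, and `log_scores` is a hypothetical helper that does not appear in the book's code. With equal priors, every class score shifts by the same constant ln(1/3), so the winning class stays the same.

```python
import math

def log_scores(word_log_probs, doc_vector, priors=None):
    """One log-score per class: sum of ln p(w_j|Ci) over the words present,
    plus ln p(Ci) when priors are supplied."""
    scores = []
    for c, log_probs in enumerate(word_log_probs):
        s = sum(lp * x for lp, x in zip(log_probs, doc_vector))
        if priors is not None:
            s += math.log(priors[c])
        scores.append(s)
    return scores

# Hypothetical per-class word probabilities for a 3-word vocabulary.
word_probs = [[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]]
word_log_probs = [[math.log(p) for p in row] for row in word_probs]
doc = [1, 0, 1]  # words 0 and 2 occur in the document

without_prior = log_scores(word_log_probs, doc)
with_prior = log_scores(word_log_probs, doc, priors=[1 / 3, 1 / 3, 1 / 3])

best_without = without_prior.index(max(without_prior))
best_with = with_prior.index(max(with_prior))
print(best_without, best_with)  # both pick class index 2
```

If the classes were not balanced, the priors would differ and the extra term could change the winner, so it should be kept in that case.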


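Here is a quick C# example that classifies articles by type. Randomly chosen words work less well than targeted ones, so every word in the vocabulary below was picked from articles in the three categories.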

1. Build the vocabulary vector (the words stay in Chinese because they are matched against Chinese article text): 中超/亞冠/國足/足協/英超/西甲/歐冠/意甲/德甲/籃球/NBA/CBA/高爾夫/乒乓/排球/網球/羽毛球/跑步/賽車/棋牌/臺球/遊泳/馬術/拳擊/田徑/功夫/撲克/體育/球隊/球員/訓練/國家隊/聯賽/俱樂部/場地/翻盤/絕殺/熱身/隊友/冠軍/亞軍/季軍/犯規/賽季/加時/反超/半場/爭奪/戰術/陣容/比賽/德比/恢復/進球/失球/奧斯卡/娛樂/影迷/電影/電視/音樂/戲劇/視頻/演員/導演/明星/經紀人/歌手/連續劇/展映/粉絲/寫真/演技/作秀/節目/藝人/超模/女星/模特/男星/性感/主創/院線/影業/拍攝/編劇/情節/影像/劇情/主演/上映/票房/開機/劇集/表演/收視/預告片/主持人/艾美獎/角色/劇院/樂迷/影迷/演出/專輯/樂壇/劇場/文藝/芭蕾/戲曲/舞蹈/軍事/軍隊/軍機/炸彈/軍方/坦克/軍艦/炸死/軍演/戰備/部隊/軍區/國防/士兵/艦船/潛艇/飛機/直升機/艦隊/保衛/演習/武器/反擊/打擊/閱兵/對抗/防衛/海軍/空軍/陸軍/武裝/戰略/空襲/沖突/裝甲/步兵/作戰/導彈/邊防/偵察/戰鬥機/雷達/轟炸/防禦/據點/火力/航空母艦/進攻/彈藥/軍營/包圍/攻占/俘虜/參戰/戰友/戰鬥/入侵


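2. Download 10 articles from each of the three categories on Sohu to form the training set, compute the document vector for each article, and label each article with its category.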

Sample file name format: number_categoryLabel.txt
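
The two preparation steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the vocabulary is a small excerpt of the full list, the sentences and file names are invented, and `text_to_vector` and `label_from_filename` are hypothetical helper names. Like the C# code later in this post, it uses plain substring matching rather than real word segmentation.

```python
def text_to_vector(vocab, text):
    """Set-of-words vector as a string: '1' if the word occurs anywhere in text."""
    return "".join("1" if word in text else "0" for word in vocab)

def label_from_filename(name):
    """Return the single character after the last underscore, e.g. '7_2.txt' -> '2'."""
    pos = name.rfind("_")
    if pos == -1 or pos + 1 >= len(name):
        raise ValueError("expected a name like 7_2.txt, got " + name)
    return name[pos + 1]

# Excerpt of the slash-separated vocabulary (assumption: 8 words only).
vocab_line = "中超/亞冠/國足/足協/電影/導演/軍隊/坦克"
vocab = [w for w in vocab_line.split("/") if w]

# Invented file names and one-sentence "articles" standing in for real files.
samples = {
    "1_1.txt": "國足在熱身賽中絕殺對手",
    "2_2.txt": "這位導演的新電影即將上映",
    "3_3.txt": "軍隊出動坦克參加演習",
}

train_matrix = [text_to_vector(vocab, text) for text in samples.values()]
train_category = "".join(label_from_filename(name) for name in samples)

print(train_matrix)    # ['00100000', '00001100', '00000011']
print(train_category)  # 123
```

One difference from the C# version is that this sketch raises an error for a file name without an underscore instead of silently taking the first character of the name.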

Document matrix (one row per article, one column per vocabulary word):

000000000000000000100000000000000000001100010001001010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000010000000000000100000000000000011110001010000000000000000011000000110000000000100000000000000000010000000000000000000000000000000000000000000000
000000000000000000000000000011000000000000000000000001001000001001000000001000000000000000000000010000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000001001000000000000000001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000010000000000000000000000000000010010000100000000000000010010000001000000000100000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000010000000010000010000010100000000111111111110000000100000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000110000000000000011010000001000010000000000001100001110000000000000000000000000000000000000000000000000000000000000000000000000000
000000000100000000000000000000000000000000000000000000001010000110000000000000000100000001101000000100000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000010000010000000000000001000000001100000100000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000001010000110000000000000000000001011000010000110000000000000000000000000000000000000000000000000000000000000000000
000000010000000000000000000011100000001000010110001001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000
000000000000000000000000000000000000000000000000000000001001000100000000000000000000000010000100000100000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000011110000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000110000000111111111111100000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001100000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000000000000000000100000000000000000000000000000000000
000000000000000000000000000000000000000000000001000000000001000000000000000000000000100000000000000000000000000100100000010010000000000000000100000000000100000000000010
000000000000000000000000000000100000000000000000000000000001000000000000000000000000000000000000000000000000000100010000010000000000000000000100000100000000000000000000
000000000000000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000110010000000001001010000000010000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000000010100000000000100000000010000000000000001000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000010000000000100000000100010000000000001000000000
000000010000000000000000000111001100000000010000001001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000000000100000000000000100000000110000010000000000000
110000000000000000000000000100001000100000010000000001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000
110000000000000000000000000111001100100100010001111011000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000
000000000001000000000000000001000000101000100110001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000
000000000000010000000000000000000000001101000001001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000010000000000000000010000000000000010010001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000001000000000000000111100000101000110100001000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000
000000000000000000000000000100000000000110010100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000


Category label vector (one digit per row of the matrix):

122222222212333333333131111111
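
With the data in this shape, training is a three-class version of trainNB0. The sketch below stays in Python to match the listing at the top of the post. It runs on a tiny invented data set (4 words, 6 documents) rather than the real matrix, and `train_nb` and `classify` are hypothetical names. It copies the smoothing from trainNB0 (counts start at 1, denominators at 2.0) and leaves out the class prior because the classes are balanced.

```python
import math

def train_nb(train_matrix, train_category, labels="123"):
    """Return {label: [ln p(word_j | label), ...]} using trainNB0-style smoothing."""
    num_words = len(train_matrix[0])
    num = {c: [1.0] * num_words for c in labels}  # word counts start at 1
    denom = {c: 2.0 for c in labels}              # denominators start at 2.0
    for row, c in zip(train_matrix, train_category):
        if c not in num:
            continue  # unknown label: skip the sample
        bits = [int(ch) for ch in row]
        for j, b in enumerate(bits):
            num[c][j] += b
        denom[c] += sum(bits)
    return {c: [math.log(n / denom[c]) for n in num[c]] for c in labels}

def classify(model, vector):
    """Return the label with the strictly highest log-score, or None on a tie."""
    scores = {c: sum(lp for lp, ch in zip(log_probs, vector) if ch == "1")
              for c, log_probs in model.items()}
    best = max(scores.values())
    winners = [c for c, s in scores.items() if s == best]
    return winners[0] if len(winners) == 1 else None

# Invented toy data: two documents per class over a 4-word vocabulary.
toy_matrix = ["1100", "1000", "0010", "0110", "0001", "0011"]
toy_labels = "112233"
model = train_nb(toy_matrix, toy_labels)

print(classify(model, "1000"))  # 1
print(classify(model, "0001"))  # 3
print(classify(model, "0000"))  # None: no vocabulary word present, all scores tie
```

Returning None on a tie plays the same role as the "cannot determine" branch in the C# classifier below.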

using System;
using System.Text;
using System.Windows.Forms;
using System.IO;

namespace NaiveBayes
{
    public partial class Form1 : Form
    {
        private string[] vocabArray;
        private double[] p0Num, p1Num, p2Num;

        public Form1()
        {
            InitializeComponent();
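            label2.Text = "Sports = 1, Entertainment = 2, Military = 3\r\n10 training samples per category\r\nAll articles come from Sohu News\r\nVocabulary obtained by word-segmenting articles from each category";
            string line, all = "";
            // Read the slash-separated vocabulary; the using block closes the file when done
            using (StreamReader sr = new StreamReader("vocabList.txt", Encoding.Default))
            {
                while ((line = sr.ReadLine()) != null)
                {
                    all += line;
                }
            }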
            vocabArray = all.Split(new string[] { "/" }, StringSplitOptions.RemoveEmptyEntries);
        }

        private void Form1_Resize(object sender, EventArgs e)
        {
            this.Width = 800;
            this.Height = 600;
        }

        private void button1_Click(object sender, EventArgs e)
        {
            // Build the document matrix and the category label vector
            DirectoryInfo di = new DirectoryInfo("train");
            FileInfo[] fi = di.GetFiles("*.txt");
            string[] trainMatrix = new string[fi.Length];
            p0Num = new double[vocabArray.Length];
            p1Num = new double[vocabArray.Length];
            p2Num = new double[vocabArray.Length];
            double p0Denom = 2.0;
            double p1Denom = 2.0;
            double p2Denom = 2.0;
            for (int i = 0; i < vocabArray.Length; i++)
            {
                p0Num[i] = p1Num[i] = p2Num[i] = 1.0;
            }
            string trainCategory = "";
            int m = 0;
            foreach (FileInfo i in fi)
            {
                string line, all = "";
                // Read the whole article; the using block closes each training file after reading
                using (StreamReader sr = new StreamReader(i.FullName, Encoding.Default))
                {
                    while ((line = sr.ReadLine()) != null)
                    {
                        all += line;
                    }
                }
                string strVec = "";
                foreach (string j in vocabArray)
                {
                    if (all.Contains(j))
                        strVec += "1";
                    else
                        strVec += "0";
                }
                trainMatrix[m] = strVec;
                m++;
                trainCategory += i.Name.Substring(i.Name.LastIndexOf("_") + 1, 1);
            }
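            // Save the matrix and labels for inspection. Overwrite (append: false) so that
            // clicking the button again does not duplicate rows in the files.
            Directory.CreateDirectory("trainV");
            using (StreamWriter sw = new StreamWriter(".\\trainV\\trainMatrix.txt", false))
            {
                foreach (string i in trainMatrix)
                {
                    sw.WriteLine(i);
                }
            }
            using (StreamWriter sw = new StreamWriter(".\\trainV\\trainCategory.txt", false))
            {
                sw.WriteLine(trainCategory);
            }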
            for (int i = 0; i < trainMatrix.Length; i++)
            {
                if (trainCategory.Substring(i, 1) == "1")
                {
                    double tmp = 0;
                    for (int j = 0; j < vocabArray.Length; j++)
                    {
                        p0Num[j] += double.Parse(trainMatrix[i].Substring(j, 1));
                        tmp += double.Parse(trainMatrix[i].Substring(j, 1));
                    }
                    p0Denom += tmp;
                }
                else if (trainCategory.Substring(i, 1) == "2")
                {
                    double tmp = 0;
                    for (int j = 0; j < vocabArray.Length; j++)
                    {
                        p1Num[j] += double.Parse(trainMatrix[i].Substring(j, 1));
                        tmp += double.Parse(trainMatrix[i].Substring(j, 1));
                    }
                    p1Denom += tmp;
                }
                else if (trainCategory.Substring(i, 1) == "3")
                {
                    double tmp = 0;
                    for (int j = 0; j < vocabArray.Length; j++)
                    {
                        p2Num[j] += double.Parse(trainMatrix[i].Substring(j, 1));
                        tmp += double.Parse(trainMatrix[i].Substring(j, 1));
                    }
                    p2Denom += tmp;
                }
                else
                {
                    // Unknown category label: skip this sample
                }
            }
            for (int j = 0; j < vocabArray.Length; j++)
            {
                p0Num[j] = Math.Log(p0Num[j] / p0Denom);
                p1Num[j] = Math.Log(p1Num[j] / p1Denom);
                p2Num[j] = Math.Log(p2Num[j] / p2Denom);
            }
            label4.Text = "處理樣本數據完畢";
        }

        private void button2_Click(object sender, EventArgs e)
        {
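            // The log-probability vectors only exist after the training button has been clicked
            if (p0Num == null)
            {
                label1.Text = "Please process the training samples first";
                return;
            }
            if (textBox1.Text.Trim() != "")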
            {
                string strVec = "";
                foreach (string i in vocabArray)
                {
                    if (textBox1.Text.Contains(i))
                        strVec += "1";
                    else
                        strVec += "0";
                }
                double p0 = 0;
                double p1 = 0;
                double p2 = 0;
                for (int j = 0; j < vocabArray.Length; j++)
                {
                    p0 += p0Num[j] * double.Parse(strVec.Substring(j, 1));
                    p1 += p1Num[j] * double.Parse(strVec.Substring(j, 1));
                    p2 += p2Num[j] * double.Parse(strVec.Substring(j, 1));
                }
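                // Pick the class with the strictly highest log-score; a tie is reported as undetermined
                string category = "";
                if (p0 > p1 && p0 > p2)
                    category = "Sports";
                else if (p1 > p0 && p1 > p2)
                    category = "Entertainment";
                else if (p2 > p0 && p2 > p1)
                    category = "Military";
                else
                    category = "Undetermined";
                label3.Text = "Sports: " + p0.ToString() + "\r\nEntertainment: " + p1.ToString() + "\r\nMilitary: " + p2.ToString();
                label1.Text = "Category: " + category;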
            }
        }
    }
}
