Text Mining: Implementing a TF-IDF-Based KNN Classification Algorithm

阿新 • Published: 2019-01-11

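I. Project Background

This project applies text mining to big data from the infrastructure construction industry. Our crawler engineers had already collected a large volume of text from public websites: bidding and contract-award announcements for infrastructure projects. Some of the wording in each announcement indicates, explicitly or implicitly, which engineering category the project belongs to, such as highway engineering, municipal engineering, building construction, or rail transit engineering.

Given this text, we need to classify each announcement by engineering sector so that it can be delivered to customers in the relevant industries. The figure below shows a sample of the collected data; the task is to classify each record based on its project name and project details.

Figure 1: Sample of the text data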

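II. Project Implementation

We use the KNN machine-learning algorithm for training and classification. KNN itself is covered in detail in a separate blog post, "Introduction to the KNN Classification Algorithm". KNN operates on numeric vectors, so the text must first be converted into text vectors before the algorithm can be applied. The overall approach is outlined below.

First, segment the text into words and extract the informative ones to form an attribute dictionary. Second, prepare the training and test samples (training samples have known categories; test samples are the ones to be classified) and compute each text's TF vector against the attribute dictionary, which gives a vector representation of every document. Finally, implement KNN and measure the classification accuracy.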

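1. Building the Attribute Dictionary

We selected 50,000 collected bidding and award web snippets, stripped HTML tags and special characters, and wrote them to a TXT file. The text was then processed with Hadoop MapReduce: read the text, segment it into words with the IK segmenter backed by the Sogou dictionary, count term frequencies, sort by frequency, and drop the words with very high or very low frequency. Such words contribute little to the classifier and instead add noise and computational cost. The Java code for the Hadoop processing follows:
(1) Counting term frequencies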

import com.rednum.hadoopwork.tools.OperHDFS;
import java.io.IOException;
import java.io.StringReader;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.wltea.analyzer.IKSegmentation;
import org.wltea.analyzer.Lexeme;


// Read the TXT file, segment it into words, and count term frequencies


public class WordCountJob extends Configured implements Tool {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IKTokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {

            StringReader reader = new StringReader(value.toString());
            IKSegmentation ik = new IKSegmentation(reader, true); // when true, the segmenter uses maximum-word-length segmentation
            Lexeme lexeme = null;

            while ((lexeme = ik.next()) != null) {
                // The trailing ":" is the delimiter that SortDscWordCountMRJob later splits on
                word.set(lexeme.getLexemeText() + ":");
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    @Override
    public int run(String[] strings) throws Exception {
        OperHDFS hdfs = new OperHDFS();

        hdfs.deleteFile("hdfs://192.168.1.108:9001/user/hadoop/hotwords/", "hdfs://192.168.1.108:9001/user/hadoop/hotwords/output");
        hdfs.deleteFile("hdfs://192.168.1.108:9001/user/hadoop/hotwords/", "hdfs://192.168.1.108:9001/user/hadoop/hotwords/sort");
        Job job = Job.getInstance(getConf());
        job.setJarByClass(WordCountJob.class);
        job.setMapperClass(IKTokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("hdfs://192.168.1.108:9001/user/hadoop/hotwords/train.txt"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://192.168.1.108:9001/user/hadoop/hotwords/output"));
        // Report failure to ToolRunner instead of always returning 0
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new WordCountJob(), args);
        ToolRunner.run(new SortDscWordCountMRJob(), args);
    }
}

(2) Sorting by term frequency

import java.io.IOException;

import org.apache.commons.io.output.NullWriter;
import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Mapper.Context;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;


public class SortDscWordCountMRJob extends Configured implements Tool {

    public static class SortDscWordCountMapper extends Mapper<LongWritable, Text, INTDoWritable, Text> {
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
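            // Each input line looks like "word:<TAB>count"; the ":" was appended by WordCountJob
            String[] contents = value.toString().split(":");
            if (contents.length < 2) {
                return; // skip malformed lines
            }
            String wc = contents[1].trim();
            String wd = contents[0].trim();
            INTDoWritable iw = new INTDoWritable();
            try {
                iw.num = new IntWritable(Integer.parseInt(wc));
                context.write(iw, new Text(wd));
            } catch (NumberFormatException e) {
                System.err.println("Skipping line with non-numeric count: " + value);
            }
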
        }
    }

    public static class SortDscWordCountReducer extends Reducer<INTDoWritable, Text, NullWritable, Text> {
        public void reduce(INTDoWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text text : values) {
                text = new Text(text.toString() + ": " + key.num.get());
                context.write(NullWritable.get(), new Text(text));
            }
        }

    }


    @Override
    public int run(String[] allArgs) throws Exception {
        Job job = Job.getInstance(getConf());
        job.setJarByClass(SortDscWordCountMRJob.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        job.setMapOutputKeyClass(INTDoWritable.class);
        job.setMapOutputValueClass(Text.class);
        // The reducer emits <NullWritable, Text>; NullWriter (commons-io) is not a Hadoop key type
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        job.setMapperClass(SortDscWordCountMapper.class);
        job.setReducerClass(SortDscWordCountReducer.class);


        String[] args = new GenericOptionsParser(getConf(), allArgs).getRemainingArgs();
        FileInputFormat.setInputPaths(job, new Path("hdfs://192.168.1.108:9001/user/hadoop/hotwords/output"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://192.168.1.108:9001/user/hadoop/hotwords/sort"));
        job.waitForCompletion(true);

        return 0;
    }

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new SortDscWordCountMRJob(), args);
    }
}
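
The INTDoWritable key class used by the sort job is not shown in the post. Hadoop sorts map output by key, so sorting by count in descending order comes down to a key whose compareTo reverses the natural integer order. A minimal sketch of that idea in plain Java follows. The class name DescendingCount and its int field are assumptions for illustration; in the actual job the key would also have to implement Hadoop's WritableComparable, whose write and readFields methods have the same shape as the two shown here.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the post's INTDoWritable: a count that sorts in descending order.
final class DescendingCount implements Comparable<DescendingCount> {
    int num;

    DescendingCount(int num) {
        this.num = num;
    }

    // Serialization hooks with the same signatures as Hadoop's Writable methods.
    void write(DataOutput out) throws IOException {
        out.writeInt(num);
    }

    void readFields(DataInput in) throws IOException {
        num = in.readInt();
    }

    // Reversed comparison: larger counts sort first.
    @Override
    public int compareTo(DescendingCount other) {
        return Integer.compare(other.num, this.num);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof DescendingCount && ((DescendingCount) o).num == num;
    }

    @Override
    public int hashCode() {
        return Integer.hashCode(num);
    }

    public static void main(String[] args) {
        List<DescendingCount> keys = new ArrayList<>();
        keys.add(new DescendingCount(3));
        keys.add(new DescendingCount(10));
        keys.add(new DescendingCount(7));
        Collections.sort(keys); // uses compareTo, so the largest count ends up first
        for (DescendingCount k : keys) {
            System.out.println(k.num);
        }
    }
}
```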

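The term-frequency counts and the sorted results are shown in the figure below:

Figure: Term-frequency statistics

Based on the resulting frequency file, we removed hot words, stop words, and low-frequency words to form the final attribute dictionary. For this project I selected a little over 3,000 words for the dictionary.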
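
The pruning step described above can be sketched as a simple filter over the word counts. The thresholds minCount and maxCount, the stop-word set, the English sample words, and the class name DictionaryBuilder are all assumptions for illustration; the post does not state the cutoffs that were actually used.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class DictionaryBuilder {

    // Keeps words whose count lies in [minCount, maxCount] and that are not stop words.
    static Set<String> build(Map<String, Integer> counts, Set<String> stopWords,
                             int minCount, int maxCount) {
        Set<String> dictionary = new TreeSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int c = e.getValue();
            if (c >= minCount && c <= maxCount && !stopWords.contains(e.getKey())) {
                dictionary.add(e.getKey());
            }
        }
        return dictionary;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("project", 48000); // appears almost everywhere, so it cannot discriminate
        counts.put("highway", 5200);
        counts.put("of", 9000);       // inside the count range but a stop word
        counts.put("tunnel", 1800);
        counts.put("xyz", 2);         // too rare to be useful
        Set<String> stopWords = new HashSet<>(Arrays.asList("of"));
        System.out.println(build(counts, stopWords, 5, 20000));
    }
}
```

In practice the cutoffs would be chosen by inspecting the sorted frequency file rather than fixed in advance.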

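2. Computing the TF-IDF Text Vectors

Implementing KNN first requires a vector representation of each document.
We compute TF*IDF for each feature word; a document's vector consists of the TF*IDF values of all feature words, with one dimension per feature word.
TF and IDF are, respectively, the term frequency and the inverse document frequency of a feature word.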
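
The formula image from the original post is not available in this copy. A reconstruction that matches the Java code in this section (TF as a word's count divided by the number of dictionary words in the document, IDF as a base-10 logarithm) would look as follows. The symbols n_ij (count of word i in document j), N (total number of documents), and df_i (number of documents containing word i) are notation assumed for this sketch.

```latex
\mathrm{TF}_{ij} = \frac{n_{ij}}{\sum_{k} n_{kj}}, \qquad
\mathrm{IDF}_{i} = \log_{10}\frac{N}{\mathrm{df}_{i}}, \qquad
W_{ij} = \mathrm{TF}_{ij} \times \mathrm{IDF}_{i}
```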

To obtain Wij we need both IDF and TF. The Java code for computing the IDF values is shown below:

// Compute IDF
Map<String, Double> IDFPerWordMap = computeIDF(text, wordMap);

Note: computeIDF takes the text data and the attribute dictionary as arguments; text and wordMap are obtained with the following code:

            // Load the feature-word dictionary wordMap
            Map<String, Double> wordMap = new TreeMap<>();
            String path = "D:\\DataMining\\Title\\labeldict.txt";
            wordMap = countWords(path, wordMap);

            // File to read
            File readFile = new File("D:\\DataMining\\Title\\train.txt");
            List<HashMap<String, Object>> text = new ArrayList<>();
            // Read the file as UTF-8; otherwise the Chinese text comes out garbled.
            // try-with-resources closes the reader even if an exception is thrown.
            try (BufferedReader br = new BufferedReader(
                    new InputStreamReader(new FileInputStream(readFile), "utf-8"))) {
                String line;
                // Read line by line; each line is split on "@&" into a label and the text
                while ((line = br.readLine()) != null) {
                    String[] words = line.split("@&");
                    if (words.length == 2) {
                        String pro = words[0];
                        String info = words[1];
                        StringReader reader = new StringReader(info);
                        IKSegmentation ik = new IKSegmentation(reader, true); // when true, the segmenter uses maximum-word-length segmentation
                        Lexeme lexeme = null;
                        List<String> word = new ArrayList<>();
                        while ((lexeme = ik.next()) != null) {
                            word.add(lexeme.getLexemeText());
                        }
                        HashMap<String, Object> map = new HashMap<>();
                        map.put(pro, word);
                        text.add(map);
                    }
                }
            }

            // Compute IDF
            Map<String, Double> IDFPerWordMap = computeIDF(text, wordMap);

The implementation of computeIDF is as follows:

    /**
     * Computes the IDF of every word in the attribute dictionary, based on the
     * number of documents in which the word appears.
     *
     * @param words all samples; each map holds one label mapped to its token list
     * @param wordMap attribute dictionary
     * @return map from dictionary word to its IDF value
     * @throws IOException
     */
    public static SortedMap<String, Double> computeIDF(List<HashMap<String, Object>> words, Map<String, Double> wordMap) throws IOException {
        SortedMap<String, Double> IDFPerWordMap = new TreeMap<String, Double>();
        for (String dicWord : wordMap.keySet()) {
            double coutDoc = 0.0; // number of documents that contain dicWord
            for (HashMap<String, Object> word : words) {
                for (Object keyword : word.values()) {
                    @SuppressWarnings("unchecked")
                    List<String> list = (List<String>) keyword;
                    // Checked per document, so each document counts at most once.
                    // (The original kept a flag that was never reset, so every document
                    // after the first match was counted as well.)
                    if (list.contains(dicWord)) {
                        coutDoc++;
                    }
                }
            }
            if (coutDoc == 0.0) {
                coutDoc = 1.0; // avoid division by zero for words that never occur
            }
            // IDF = log10(N / df). The original hard-coded N = 20000;
            // words.size() keeps N in step with the samples actually loaded.
            double IDF = Math.log10(words.size() / coutDoc);
            IDFPerWordMap.put(dicWord, IDF);
        }
        return IDFPerWordMap;
    }
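
As a small self-contained illustration of the document-frequency idea behind computeIDF, the sketch below counts each dictionary word at most once per document by first turning every document into a set. The class name IdfSketch and the toy documents are assumptions for illustration, and the total document count is taken from the input rather than fixed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class IdfSketch {

    // Returns log10(N / df) per dictionary word, where df is the number of documents containing it.
    static Map<String, Double> idf(List<List<String>> docs, Set<String> dictionary) {
        Map<String, Integer> df = new TreeMap<>();
        for (List<String> doc : docs) {
            Set<String> unique = new HashSet<>(doc); // each word counts once per document
            for (String w : unique) {
                if (dictionary.contains(w)) {
                    df.merge(w, 1, Integer::sum);
                }
            }
        }
        Map<String, Double> result = new TreeMap<>();
        for (String w : dictionary) {
            int n = Math.max(df.getOrDefault(w, 0), 1); // avoid division by zero, as computeIDF does
            result.put(w, Math.log10((double) docs.size() / n));
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> docs = new ArrayList<>();
        docs.add(Arrays.asList("road", "road", "bridge"));
        docs.add(Arrays.asList("road"));
        Set<String> dictionary = new HashSet<>(Arrays.asList("road", "bridge", "tunnel"));
        System.out.println(idf(docs, dictionary));
    }
}
```

A single pass over the documents like this does work proportional to the total number of tokens, whereas looping over the dictionary on the outside rescans every document once per dictionary word.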

With the IDF values in hand, we compute TF and obtain the text vectors.
The function call is:

computeTFMultiIDF(text, 0.9, IDFPerWordMap, wordMap);

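Note: text is the list of all samples after word segmentation, and 0.9 is the share of samples used for training, that is, 90% of the samples are used for training and 10% for testing. The computed TF-IDF values are written to the training and test files, Train.txt and Test.txt.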

    /**
     * Computes the TF-IDF attribute vector of every document and writes it to the
     * training or test file. Plain nested iteration is enough; no recursion is needed.
     *
     * @param words all samples; each map holds one label mapped to its token list
     * @param trainSamplePercent fraction of the samples (in input order) written to the training set
     * @param iDFPerWordMap IDF value of each dictionary word
     * @param wordMap attribute dictionary
     * @throws IOException
     */
    public static void computeTFMultiIDF(List<HashMap<String, Object>> words, double trainSamplePercent, Map<String, Double> iDFPerWordMap, Map<String, Double> wordMap) throws IOException {
        // Two writers: one for training samples, one for test samples
        String trainFileDir = "D:\\DataMining\\Title\\Train.txt";
        String testFileDir = "D:\\DataMining\\Title\\Test.txt";
        FileWriter tsTrainWriter = new FileWriter(new File(trainFileDir));
        FileWriter tsTestWriter = new FileWriter(new File(testFileDir));
        int index = 0;
        for (HashMap<String, Object> word : words) {
            index++;
            // Fresh map per document; a shared map would carry counts over from earlier documents
            SortedMap<String, Double> TFPerDocMap = new TreeMap<String, Double>();
            double wordSumPerDoc = 0.0; // number of dictionary words in this document
            for (Object keyword : word.values()) {
                @SuppressWarnings("unchecked")
                List<String> list = (List<String>) keyword;
                for (String key : list) {
                    // Only words in the attribute dictionary count; removed words are ignored
                    if (!key.isEmpty() && wordMap.containsKey(key)) {
                        wordSumPerDoc++;
                        TFPerDocMap.merge(key, 1.0, Double::sum);
                    }
                }
            }

            // The first trainSamplePercent of the samples go to the training file, the rest to the test file
            FileWriter tsWriter = (index <= trainSamplePercent * words.size()) ? tsTrainWriter : tsTestWriter;

            // Line format: label word1 weight1 word2 weight2 ...
            for (String label : word.keySet()) {
                tsWriter.append(label + " ");
            }
            for (Map.Entry<String, Double> me : TFPerDocMap.entrySet()) {
                // weight = TF * IDF, with TF = count / number of dictionary words in the document
                double wordWeight = (me.getValue() / wordSumPerDoc) * iDFPerWordMap.get(me.getKey());
                tsWriter.append(me.getKey() + " " + wordWeight + " ");
            }
            tsWriter.append("\n");
            tsWriter.flush();
        }

        tsTrainWriter.close();
        tsTestWriter.close();
    }
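
To make the output line format concrete, the sketch below builds the TF-IDF weights for a single tokenized document and formats them the way the lines of Train.txt and Test.txt are laid out: the label first, then word and weight pairs separated by spaces. The class name TfIdfLine, the toy IDF values, and the label are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class TfIdfLine {

    // TF-IDF weights of one document; tokens outside the IDF map (the dictionary) are ignored.
    static Map<String, Double> weights(List<String> tokens, Map<String, Double> idf) {
        Map<String, Double> counts = new TreeMap<>();
        double total = 0; // number of dictionary words in the document
        for (String t : tokens) {
            if (idf.containsKey(t)) {
                counts.merge(t, 1.0, Double::sum);
                total++;
            }
        }
        Map<String, Double> result = new TreeMap<>();
        for (Map.Entry<String, Double> e : counts.entrySet()) {
            result.put(e.getKey(), e.getValue() / total * idf.get(e.getKey()));
        }
        return result;
    }

    // Formats "label word1 weight1 word2 weight2 ...", the layout that doProcess splits on spaces.
    static String toLine(String label, Map<String, Double> weights) {
        StringBuilder sb = new StringBuilder(label);
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            sb.append(' ').append(e.getKey()).append(' ').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Double> idf = new TreeMap<>();
        idf.put("road", 0.5);
        idf.put("bridge", 2.0);
        List<String> tokens = Arrays.asList("road", "road", "bridge", "unknown");
        System.out.println(toLine("highway", weights(tokens, idf)));
    }
}
```

Because labels and words are separated by single spaces, a label or token that itself contains a space would break this format; the same holds for the files written by computeTFMultiIDF.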

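At this point the training and test samples are ready and have been converted into text vectors, so more than half of the work is done. Next we can run KNN on them.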

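3. Training the Classifier, Predicting Test-Text Categories, and Computing Accuracy

The KNN procedure:

Step 1: From the training set, select the K texts most similar to the new text. Similarity is measured by the cosine of the angle between the two vectors.

K is usually chosen by starting from an initial value and adjusting it according to experimental results; in this project K = 15.

Step 2: Among the K neighbors of the new text, compute the weight of each class in turn. A class's weight is the sum of the similarities between the test sample and those of the K neighbors that belong to that class.

Step 3: Compare the class weights and assign the text to the class with the largest weight.

The Java code for the KNN procedure is shown below: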
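
The similarity in step 1 and the class weight in step 2 can be written out as follows. The formula image from the original post is not available in this copy, so this is a reconstruction based on the standard cosine definition and on what computeSim and KNNComputeCate compute; the symbols w_ik (weight of word k in document i), KNN(x) (the K nearest training samples of x), and score are notation assumed here.

```latex
\mathrm{sim}(d_i, d_j) =
  \frac{\sum_{k} w_{ik}\, w_{jk}}
       {\sqrt{\sum_{k} w_{ik}^{2}}\;\sqrt{\sum_{k} w_{jk}^{2}}},
\qquad
\mathrm{score}(x, c) =
  \sum_{\substack{d \in \mathrm{KNN}(x) \\ \mathrm{class}(d) = c}} \mathrm{sim}(x, d)
```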

        String train = "D:\\DataMining\\Title\\Train.txt";
        String test = "D:\\DataMining\\Title\\Test.txt";
        String result = "D:\\DataMining\\Title\\result.txt";
        double classify = doProcess(train, test, result);
        System.out.print(classify);

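Note: the data is read from the training and test files, the classifier is run, and the final classification results are saved to result.txt. The doProcess function is implemented as follows: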

    public static double doProcess(String trainFiles, String testFiles,
            String kNNResultFile) throws IOException {
        // Read the training and test samples into Map<String, Map<word, weight>>; the category
        // of each sample is kept as part of its key.
        // For every test sample, compute its similarity to all training samples, take the K most
        // similar ones, sum the similarities per category, and assign the category with the
        // highest total. K can be tuned by repeated experiments to maximize accuracy.
        // Note: "category_index" is used as each sample's key so that samples sharing a category do not overwrite each other.
        // Note: increase the JVM heap size (for example with -Xmx) if a Java heap overflow error occurs.
        // Note: similarity is the cosine of the angle between the vectors.
        System.out.println("Start training the model:");
        BufferedReader trainSamplesBR = new BufferedReader(new FileReader(new File(trainFiles)));
        String line;
        String[] lineSplitBlock;
        Map<String, TreeMap<String, Double>> trainFileNameWordTFMap = new TreeMap<String, TreeMap<String, Double>>();
        int index1 = 0;
        while ((line = trainSamplesBR.readLine()) != null) {
            index1++;
            lineSplitBlock = line.split(" ");
            TreeMap<String, Double> tempMap = new TreeMap<String, Double>();
            // Tokens after the label come in (word, weight) pairs; the bound guards against a dangling token
            for (int i = 1; i + 1 < lineSplitBlock.length; i = i + 2) {
                tempMap.put(lineSplitBlock[i], Double.valueOf(lineSplitBlock[i + 1]));
            }
            trainFileNameWordTFMap.put(lineSplitBlock[0] + "_" + index1, tempMap);
        }
        trainSamplesBR.close();

        BufferedReader testSamplesBR = new BufferedReader(new FileReader(new File(testFiles)));
        Map<String, Map<String, Double>> testFileNameWordTFMap = new TreeMap<String, Map<String, Double>>();
        Map<String, String> testClassifyCateMap = new TreeMap<String, String>(); // <sample key, predicted category> pairs
        int index = 0;
        while ((line = testSamplesBR.readLine()) != null) {
            index++;
            lineSplitBlock = line.split(" ");
            TreeMap<String, Double> tempMap = new TreeMap<String, Double>();
            for (int i = 1; i + 1 < lineSplitBlock.length; i = i + 2) {
                tempMap.put(lineSplitBlock[i], Double.valueOf(lineSplitBlock[i + 1]));
            }
            testFileNameWordTFMap.put(lineSplitBlock[0] + "_" + index, tempMap);
        }
        testSamplesBR.close();

        // For each test sample, compute the similarity to every training sample and classify it
        String classifyResult;
        FileWriter testYangliuWriter = new FileWriter(new File("D:\\DataMining\\Title\\yangliuTest.txt"));
        FileWriter KNNClassifyResWriter = new FileWriter(kNNResultFile);
        for (Map.Entry<String, Map<String, Double>> me : testFileNameWordTFMap.entrySet()) {
            classifyResult = KNNComputeCate(me.getKey(), me.getValue(), trainFileNameWordTFMap, testYangliuWriter);
            System.out.println("Predicted: " + classifyResult + "; actual: " + me.getKey());
            KNNClassifyResWriter.append(me.getKey() + " " + classifyResult + "\n");
            KNNClassifyResWriter.flush();
            testClassifyCateMap.put(me.getKey(), classifyResult);
        }
        KNNClassifyResWriter.close();

        // Compute the classification accuracy
        double rightCount = 0;
        for (Map.Entry<String, String> me : testClassifyCateMap.entrySet()) {
            String rightCate = me.getKey().split("_")[0];
            // Compare from the known-non-null side in case no category could be predicted
            if (rightCate.equals(me.getValue())) {
                rightCount++;
            }
        }
        testYangliuWriter.close();
        return rightCount / testClassifyCateMap.size();
    }

    /**
     * For one test sample, computes its cosine similarity to every training sample, stores the
     * similarities in a map sorted by value, takes the top K samples, and sums the similarities
     * per category. The category with the highest total is the prediction. K can be tuned by
     * repeated experiments to find the value that gives the best accuracy.
     *
     * @param testFileName key of the current test sample
     * @param testWordTFMap <word, weight> vector of the current test sample
     * @param trainFileNameWordTFMap training samples as a <category_index, vector> map
     * @param testYangliuWriter
     * @return the category with the highest weight among the K neighbors
     * @throws IOException
     */
    public static String KNNComputeCate(
            String testFileName,
            Map<String, Double> testWordTFMap,
            Map<String, TreeMap<String, Double>> trainFileNameWordTFMap, FileWriter testYangliuWriter) throws IOException {
        // <category_index, similarity>; sorted by value below
        HashMap<String, Double> simMap = new HashMap<String, Double>();
        for (Map.Entry<String, TreeMap<String, Double>> me : trainFileNameWordTFMap.entrySet()) {
            double similarity = computeSim(testWordTFMap, me.getValue());
            simMap.put(me.getKey(), similarity);
        }
        // Sort simMap by value. ByValueComparator is not shown in this post; it has to order
        // keys by descending similarity for the top-K loop below to pick the nearest samples.
        ByValueComparator bvc = new ByValueComparator(simMap);
        TreeMap<String, Double> sortedSimMap = new TreeMap<String, Double>(bvc);
        sortedSimMap.putAll(simMap);

        // Take the K most similar training samples and sum their similarities per category.
        // K was chosen through repeated experiments.
        Map<String, Double> cateSimMap = new TreeMap<String, Double>();
        int K = 15;
        int count = 0;

        for (Map.Entry<String, Double> me : sortedSimMap.entrySet()) {
            if (count >= K) {
                break; // stop after exactly K neighbors (the original check let K + 1 through)
            }
            count++;
            String categoryName = me.getKey().split("_")[0];
            cateSimMap.merge(categoryName, me.getValue(), Double::sum);
        }

        // Find the category with the largest summed similarity
        double maxSim = Double.NEGATIVE_INFINITY; // so a category is still chosen when every similarity is 0
        String bestCate = null;
        for (Map.Entry<String, Double> me : cateSimMap.entrySet()) {
            if (me.getValue() > maxSim) {
                bestCate = me.getKey();
                maxSim = me.getValue();
            }
        }
        return bestCate;
    }
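
ByValueComparator is not shown in the post. One thing to watch with a TreeMap ordered by value is that two keys for which the comparator returns 0 are treated as the same key, so training samples with equal similarity can be silently dropped unless the comparator breaks ties. A sketch of an alternative that sidesteps this by sorting the entries in a list is given below; the class name TopKVote and the toy data are assumptions for illustration, not the author's code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TopKVote {

    // simMap keys look like "category_index". Returns the category with the largest
    // summed similarity among the k most similar samples, or null if simMap is empty.
    static String vote(Map<String, Double> simMap, int k) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(simMap.entrySet());
        // Descending by similarity; a list keeps entries with equal similarity
        entries.sort(Map.Entry.<String, Double>comparingByValue().reversed());

        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Double> e : entries.subList(0, Math.min(k, entries.size()))) {
            String category = e.getKey().split("_")[0];
            scores.merge(category, e.getValue(), Double::sum);
        }

        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (e.getValue() > bestScore) {
                best = e.getKey();
                bestScore = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Double> sim = new HashMap<>();
        sim.put("road_1", 0.9);
        sim.put("road_2", 0.1);
        sim.put("bridge_3", 0.5);
        sim.put("bridge_4", 0.5); // same similarity as bridge_3; both must be kept
        sim.put("rail_5", 0.05);
        System.out.println(vote(sim, 3));
    }
}
```

Sorting the full list costs O(n log n) per test sample; for large training sets a bounded priority queue of size K would be a cheaper way to get the same neighbors.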

    /**
     * Computes the similarity between a test sample vector and a training sample vector.
     *
     * @param testWordTFMap <word, weight> vector of the current test sample
     * @param trainWordTFMap <word, weight> vector of the current training sample
     * @return cosine of the angle between the two vectors, or 0 if either vector has zero length
     */
    public static double computeSim(Map<String, Double> testWordTFMap,
            Map<String, Double> trainWordTFMap) {
        double mul = 0, testAbs = 0, trainAbs = 0;
        for (Map.Entry<String, Double> me : testWordTFMap.entrySet()) {
            // Only words present in both vectors contribute to the dot product
            Double trainValue = trainWordTFMap.get(me.getKey());
            if (trainValue != null) {
                mul += me.getValue() * trainValue;
            }
            testAbs += me.getValue() * me.getValue();
        }
        testAbs = Math.sqrt(testAbs);

        for (Map.Entry<String, Double> me : trainWordTFMap.entrySet()) {
            trainAbs += me.getValue() * me.getValue();
        }
        trainAbs = Math.sqrt(trainAbs);

        double denominator = testAbs * trainAbs;
        if (denominator == 0) {
            return 0; // an empty or all-zero vector would otherwise produce NaN
        }
        return mul / denominator;
    }

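III. Project Summary

With 13 categories in total, the large number of classes hurts the classification accuracy of KNN. In addition, the algorithm does not use distributed processing and takes too long to run, so the mining step should later be moved onto Hadoop; this is only a first attempt. Some of the classification results are shown below; the final accuracy is around 70%, and the algorithm still needs improvement. This is my first step as a beginner in text mining, and corrections and discussion from interested readers are welcome.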
