Comparing Article Similarity

阿新 • Published: 2018-11-23

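This article compares the similarity of the text in two files (treated as plain text).
Five file types are covered: Word, Excel, PPT, PDF, and TXT. All text is extracted from each file, the two texts are compared, and a similarity score is computed.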
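Since each format needs its own extractor, a natural first step is to classify the input by its extension. The following is a minimal sketch of that step only; the class `FileKindSketch`, the enum `Kind`, and the method `detect` are hypothetical names introduced here, not part of the original code.

```java
import java.util.Locale;

public class FileKindSketch {
    // The five document families named in the article, plus a fallback.
    public enum Kind { WORD, EXCEL, PPT, PDF, TXT, UNKNOWN }

    // Classifies a path by its extension, ignoring case.
    public static Kind detect(String path) {
        String p = path.toLowerCase(Locale.ROOT);
        if (p.endsWith(".doc") || p.endsWith(".docx")) return Kind.WORD;
        if (p.endsWith(".xls") || p.endsWith(".xlsx")) return Kind.EXCEL;
        if (p.endsWith(".ppt") || p.endsWith(".pptx")) return Kind.PPT;
        if (p.endsWith(".pdf")) return Kind.PDF;
        if (p.endsWith(".txt")) return Kind.TXT;
        return Kind.UNKNOWN;
    }
}
```

A caller could switch on the returned `Kind` to pick the matching reader, and treat `UNKNOWN` as an unsupported file rather than guessing.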
1. Reading files

1). Reading Word files

    // Read a Word file; the path parameter is the file's absolute path

    // Handles both Word 2003 (.doc) and Word 2007+ (.docx)

    public String readWord(String path) {
        String buffer = "";
        try {
            if (path.endsWith(".doc")) {
                // Word 97-2003 binary format
                InputStream is = new FileInputStream(new File(path));
                WordExtractor ex = new WordExtractor(is);
                buffer = ex.getText();
                ex.close();
            } else if (path.endsWith(".docx")) {
                // Word 2007+ OOXML format
                OPCPackage opcPackage = POIXMLDocument.openPackage(path);
                POIXMLTextExtractor extractor = new XWPFWordExtractor(opcPackage);
                buffer = extractor.getText();
                extractor.close();
            } else {
                System.out.println("This file is not a Word file!");
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return buffer;
    }

2). Reading PDF files

    // Read a PDF; file is the path of the PDF on the local file system
    public String readPdf(String file) {
        // The original tried the path as a URL first, but both branches loaded the
        // same local path, so the file is loaded directly here.
        // PDDocument.load(File) is the PDFBox 2.x API; PDFBox 3.x uses Loader.loadPDF(File).
        // try-with-resources closes the in-memory PDDocument automatically.
        try (PDDocument document = PDDocument.load(new File(file))) {
            // PDFTextStripper extracts the text
            PDFTextStripper stripper = new PDFTextStripper();
            // Whether to sort the text by its position on the page
            stripper.setSortByPosition(false);
            // First page to extract (1-based)
            stripper.setStartPage(1);
            // Last page to extract
            stripper.setEndPage(Integer.MAX_VALUE);
            // getText returns the text of the selected pages
            return stripper.getText(document);
        } catch (IOException e) {
            e.printStackTrace();
            return "";
        }
    }

3). Reading TXT files

    // Read a .txt file
    public static String readTxt(String path) {
        File file = new File(path);
        StringBuilder result = new StringBuilder();
        // BufferedReader reads the file line by line; try-with-resources closes it.
        // FileReader decodes with the platform default charset.
        try (BufferedReader br = new BufferedReader(new FileReader(file))) {
            String s;
            while ((s = br.readLine()) != null) {
                // Append the separator after each line so the text has no leading blank line
                result.append(s).append(System.lineSeparator());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return result.toString();
    }
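Because `FileReader` decodes with the platform default charset, the same file can produce different text on different machines, which in turn changes the similarity score. The following is a minimal sketch of a charset-explicit variant. It assumes the input is UTF-8 encoded (the original does not state the encoding), and the names `TxtReaderSketch` and `readTxtUtf8` are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class TxtReaderSketch {
    // Reads all lines as UTF-8 (an assumption) and joins them with the
    // platform line separator; no leading or trailing separator is added.
    public static String readTxtUtf8(String path) {
        try {
            List<String> lines = Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8);
            return String.join(System.lineSeparator(), lines);
        } catch (IOException e) {
            // Surface the failure instead of returning partial text
            throw new UncheckedIOException(e);
        }
    }
}
```

Throwing on failure is a choice of this sketch; the original prints the stack trace and returns whatever was read so far.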

4). Reading PPT files

    // Read all text from a PowerPoint 97-2003 file (.ppt)
    private static String getppt(byte[] file) {
        String text = "";
        try {
            // Images are not read
            InputStream fis = new ByteArrayInputStream(file);
            PowerPointExtractor ex = new PowerPointExtractor(fis);
            text = ex.getText();
            ex.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return text;
    }

    // Extract all text from a PowerPoint 2007+ file (.pptx)
    private static String getTextFromPPT2007(byte[] file) {
        String text = "";
        try {
            InputStream is = new ByteArrayInputStream(file);
            XMLSlideShow slide = new XMLSlideShow(is);
            XSLFPowerPointExtractor extractor = new XSLFPowerPointExtractor(slide);
            text = extractor.getText();
            extractor.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return text;
    }

5). Reading Excel files

    // Read all text from an Excel 2007+ file (.xlsx)
    private static String getTextFromExcel2007(byte[] file) {
        String text = "";
        try {
            InputStream is = new ByteArrayInputStream(file);
            XSSFWorkbook workBook = new XSSFWorkbook(is);
            XSSFExcelExtractor extractor = new XSSFExcelExtractor(workBook);
            // Leave sheet names out of the extracted text
            extractor.setIncludeSheetNames(false);
            text = extractor.getText();
            extractor.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return text;
    }

A method for reading a file into a byte array:

    // Read a file into a byte array
    public static byte[] File2byte(String filePath) {
        byte[] buffer = null;
        // try-with-resources closes both streams even if reading fails
        try (FileInputStream fis = new FileInputStream(new File(filePath));
             ByteArrayOutputStream bos = new ByteArrayOutputStream()) {
            byte[] b = new byte[1024];
            int n;
            // Copy the file in 1 KB chunks until the end of the stream
            while ((n = fis.read(b)) != -1) {
                bos.write(b, 0, n);
            }
            buffer = bos.toByteArray();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return buffer;
    }
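On Java 7 and later, the JDK's `Files.readAllBytes` does the same job as the manual copy loop. The following is a minimal sketch with the hypothetical names `FileBytesSketch` and `fileToBytes`. Unlike the method above, it throws an unchecked exception instead of returning null on failure, which is a design choice of this sketch rather than of the original.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class FileBytesSketch {
    // Reads the entire file into memory; suitable for documents that fit in RAM.
    public static byte[] fileToBytes(String filePath) {
        try {
            return Files.readAllBytes(Paths.get(filePath));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The resulting array can be passed to the byte-array readers for PPT and Excel in the same way as the output of `File2byte`.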

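6. Preparation

6.1. Getting the file contents

        // tp is an instance of the class that defines the reader methods above
        String content1 = tp.readPdf("F://test//test.pdf");
        String content2 = tp.readWord("F://test//test.docx");
        getSimilarity(content1, content2); // compute the similarity
        // 0.992801059146564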

7. Computing the similarity

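        /**
         * Computes the similarity of two texts as the cosine of their term-frequency vectors.
         *
         * @param sentence1 the first text
         * @param sentence2 the second text
         * @return the cosine similarity; NaN if either text yields no words
         */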
        public static double getSimilarity(String sentence1, String sentence2) {
            List<String> sent1Words = getSplitWords(sentence1);
            System.out.println(sent1Words);
            List<String> sent2Words = getSplitWords(sentence2);
            System.out.println(sent2Words);
            List<String> allWords = mergeList(sent1Words, sent2Words);

            int[] statistic1 = statistic(allWords, sent1Words);
            int[] statistic2 = statistic(allWords, sent2Words);

            double dividend = 0;
            double divisor1 = 0;
            double divisor2 = 0;
            for (int i = 0; i < statistic1.length; i++) {
                dividend += statistic1[i] * statistic2[i];
                divisor1 += Math.pow(statistic1[i], 2);
                divisor2 += Math.pow(statistic2[i], 2);
            }

            return dividend / (Math.sqrt(divisor1) * Math.sqrt(divisor2));
        }

        private static int[] statistic(List<String> allWords, List<String> sentWords) {
            int[] result = new int[allWords.size()];
            for (int i = 0; i < allWords.size(); i++) {
                result[i] = Collections.frequency(sentWords, allWords.get(i));
            }
            return result;
        }

        private static List<String> mergeList(List<String> list1, List<String> list2) {
            List<String> result = new ArrayList<>();
            result.addAll(list1);
            result.addAll(list2);
            return result.stream().distinct().collect(Collectors.toList());
        }

        private static List<String> getSplitWords(String sentence) {
            // Strip HTML tags
            sentence = Jsoup.parse(sentence.replace("&nbsp;","")).body().text();
            // Punctuation marks come out of segmentation as separate terms; filter them out
            String punctuation = "`~!@#$^&*()=|{}':;',\\[\\].<>/?~！@#￥……&*（）——|{}【】‘；：”“'。，、？ ";
            return HanLP.segment(sentence).stream()
                    .map(a -> a.word)
                    .filter(s -> !punctuation.contains(s))
                    .collect(Collectors.toList());
        }
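The value returned by `getSimilarity` is the cosine of the angle between the two term-frequency vectors. Writing $a_i$ and $b_i$ for the entries of `statistic1` and `statistic2` over the $n$ distinct words in `allWords` (these symbols are introduced here for illustration), the computation is:

```latex
\[
\text{similarity} = \cos\theta
  = \frac{\sum_{i=1}^{n} a_i b_i}
         {\sqrt{\sum_{i=1}^{n} a_i^{2}} \; \sqrt{\sum_{i=1}^{n} b_i^{2}}}
\]
```

Because word counts are never negative, the result lies between 0 and 1. If either text yields no words, the denominator is zero and the Java code returns NaN.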

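Note: the comparison relies mainly on the HanLP word segmenter to split the text into words; the words are then deduplicated and counted.
The result is the similarity between two articles stored in different file formats.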
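To experiment with the calculation without HanLP or Jsoup on the classpath, the following is a minimal, self-contained sketch of the same cosine computation. It swaps HanLP segmentation for plain whitespace splitting, so it only suits space-delimited text, not unsegmented Chinese. The names `CosineSketch`, `termFrequencies`, and `similarity` are hypothetical, and returning 0.0 for empty input is a choice of this sketch.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class CosineSketch {
    // Counts how often each whitespace-separated token occurs.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Cosine similarity of the two term-frequency vectors; 0.0 if either text is empty.
    public static double similarity(String text1, String text2) {
        Map<String, Integer> tf1 = termFrequencies(text1);
        Map<String, Integer> tf2 = termFrequencies(text2);

        // The shared vocabulary plays the role of allWords in the original code
        Set<String> vocabulary = new HashSet<>(tf1.keySet());
        vocabulary.addAll(tf2.keySet());

        double dot = 0, norm1 = 0, norm2 = 0;
        for (String word : vocabulary) {
            int a = tf1.getOrDefault(word, 0);
            int b = tf2.getOrDefault(word, 0);
            dot += a * b;
            norm1 += a * a;
            norm2 += b * b;
        }
        if (norm1 == 0 || norm2 == 0) {
            return 0.0; // avoid 0/0 = NaN for empty input
        }
        return dot / (Math.sqrt(norm1) * Math.sqrt(norm2));
    }
}
```

For example, `similarity("a b", "a c")` is approximately 0.5: the vectors over the vocabulary {a, b, c} are (1, 1, 0) and (1, 0, 1), giving a dot product of 1 and norms of √2 each. Looking counts up in a map also avoids the repeated list scans that `Collections.frequency` performs for every word.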
