Long-Text Similarity Detection with Lucene, TF-IDF, and Cosine Similarity
阿新 • Published: 2019-02-03
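What is TF-IDF
TF-IDF stands for Term Frequency-Inverse Document Frequency.
TF is the frequency with which a term appears. If a term occurs n times in an article that contains N terms in total, then TF = n/N.
The inverse document frequency (IDF) expresses the weight of a term. It is computed as IDFi = log(D/Dw), where D is the total number of documents in the corpus and Dw is the number of documents in which term i appears. The fewer documents contain a term, the larger its IDF.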
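The two formulas above can be sketched in a few lines of plain Java. This is only an illustration of the textbook definitions TF = n/N and IDF = log(D/Dw); the class name `TfIdfExample`, the toy corpus, and the choice to return 0 when a term appears in no document are assumptions made for the example, not part of the project code.

```java
import java.util.Arrays;
import java.util.List;

class TfIdfExample {
    // TF = n / N: occurrences of the term divided by the number of terms in the document.
    static double tf(List<String> doc, String term) {
        long n = doc.stream().filter(term::equals).count();
        return doc.isEmpty() ? 0.0 : (double) n / doc.size();
    }

    // IDF = log(D / Dw): D is the corpus size, Dw the number of documents containing the term.
    // Assumption: a term that occurs in no document gets an IDF of 0 instead of dividing by zero.
    static double idf(List<List<String>> corpus, String term) {
        long dw = corpus.stream().filter(d -> d.contains(term)).count();
        return dw == 0 ? 0.0 : Math.log((double) corpus.size() / dw);
    }

    static double tfIdf(List<String> doc, List<List<String>> corpus, String term) {
        return tf(doc, term) * idf(corpus, term);
    }

    public static void main(String[] args) {
        // Hypothetical, already segmented documents.
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("music", "player", "music"),
                Arrays.asList("video", "player"),
                Arrays.asList("photo", "editor"));
        List<String> first = corpus.get(0);
        System.out.println("tf-idf(music)  = " + tfIdf(first, corpus, "music"));
        System.out.println("tf-idf(player) = " + tfIdf(first, corpus, "player"));
    }
}
```

In this toy corpus "music" appears in only one document while "player" appears in two, so "music" receives the higher weight in the first document.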
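What is cosine similarity
Cosine similarity uses the cosine of the angle between two vectors in a vector space to measure how much two items differ. The closer the cosine is to 1, the closer the angle is to 0 degrees and the more similar the two vectors are. This is what "cosine similarity" means.
For two-dimensional space, the dot product formula a · b = |a| |b| cos θ gives directly:
cos θ = (a · b) / (|a| |b|)
Suppose the vectors a and b have coordinates (x1, y1) and (x2, y2). Then:
cos θ = (x1 x2 + y1 y2) / (sqrt(x1^2 + y1^2) * sqrt(x2^2 + y2^2))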
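As a quick check of the two-dimensional case, the formula can be written out directly. This is a minimal sketch; the class name `Cosine2D` and the convention of returning 0 for a zero-length vector are assumptions for the example.

```java
class Cosine2D {
    // cos(theta) = (x1*x2 + y1*y2) / (|a| * |b|)
    static double cosine(double x1, double y1, double x2, double y2) {
        double dot = x1 * x2 + y1 * y2;
        double norms = Math.sqrt(x1 * x1 + y1 * y1) * Math.sqrt(x2 * x2 + y2 * y2);
        // Assumption: treat a zero-length vector as having similarity 0.
        return norms == 0 ? 0.0 : dot / norms;
    }

    public static void main(String[] args) {
        // Perpendicular vectors: the dot product is 0, so the cosine is 0.
        System.out.println(cosine(1, 0, 0, 1));
        // Parallel vectors: the cosine is 1, up to floating-point rounding.
        System.out.println(cosine(1, 2, 2, 4));
    }
}
```

The same idea carries over to TF-IDF vectors: each distinct term is one dimension, and the sums run over all terms instead of just x and y.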
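Applying TF-IDF and cosine similarity
Two articles explain this very clearly, so I will not repeat it here and simply link to them.
The rest of this post walks through the implementation.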
Adding the Gradle dependencies
The project uses the WebMagic crawler framework, the Java port of the Jieba segmenter, Lucene, and a few Apache libraries.
compile group: 'us.codecraft', name: 'webmagic-core', version: '0.7.3'
// https://mvnrepository.com/artifact/us.codecraft/webmagic-extension
compile group: 'us.codecraft', name: 'webmagic-extension', version: '0.7.3'
// https://mvnrepository.com/artifact/com.huaban/jieba-analysis
compile group: 'com.huaban', name: 'jieba-analysis', version: '1.0.2'
compile group: 'commons-io', name: 'commons-io', version: '2.6'
compile group: 'org.apache.lucene', name: 'lucene-core', version: '3.6.0'
compile group: 'org.apache.lucene', name: 'lucene-queryparser', version: '3.6.0'
Crawling the sample corpus and segmenting it
Testing the algorithm requires a large amount of text, so I use the WebMagic crawler framework to crawl app descriptions from the Huawei app market and use them as the sample corpus.
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Selectable;
/**
 * @author wzj
 * @create 2018-07-17 22:06
 **/
public class AppStoreProcessor implements PageProcessor
{
    // Part 1: crawl settings for the site, such as encoding, crawl interval, and retry count
    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);
    public void process(Page page)
    {
        // Extract the app name
        String name = page.getHtml().xpath("//p/span[@class='title']/text()").toString();
        page.putField("appName", name);
        // Extract the app description
        String desc = page.getHtml().xpath("//div[@id='app_strdesc']/text()").toString();
        page.putField("desc", desc);
        if (page.getResultItems().get("appName") == null)
        {
            // Not an app detail page: skip it
            page.setSkip(true);
        }
        // Queue the other app detail links found on this page
        Selectable links = page.getHtml().links();
        page.addTargetRequests(links.regex("(http://app.hicloud.com/app/C\\d+)").all());
    }
    public Site getSite()
    {
        return site;
    }
    public static void main(String[] args)
    {
        Spider.create(new AppStoreProcessor())
                .addUrl("http://app.hicloud.com")
                .addPipeline(new MyPipeline())
                .thread(20)
                .run();
    }
}
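A custom Pipeline stores the crawled app data. Because the descriptions are going to be segmented, the data has to be preprocessed first. The preprocessing mainly consists of:
- Removing Chinese special characters and punctuation with a regex: desc.replaceAll("[\\p{P}+~$`^=|<>~`$^+=|<>¥×]", "")
- Removing carriage returns, line feeds, and tabs with a regex: desc.replaceAll("\\t|\\r|\\n","");
- Removing spaces with a regex: desc.replaceAll(" ","");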
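The three steps can be tried in isolation before wiring them into the Pipeline. The sketch below applies them in the same order to a made-up string; the class name `CleanDescExample` and the sample input are assumptions, and the character class is the one from the list with its repeated symbols removed and `¥` and `×` written as Unicode escapes so the source stays ASCII.

```java
class CleanDescExample {
    static String clean(String desc) {
        // 1. Strip punctuation (\p{P} also covers full-width Chinese punctuation) and a few symbols.
        desc = desc.replaceAll("[\\p{P}+~$`^=|<>\u00A5\u00D7]", "");
        // 2. Strip tabs, carriage returns, and line feeds.
        desc = desc.replaceAll("\\t|\\r|\\n", "");
        // 3. Strip spaces.
        desc = desc.replaceAll(" ", "");
        return desc;
    }

    public static void main(String[] args) {
        // \uFF01 is a full-width exclamation mark, \u3002 an ideographic full stop.
        String raw = "Hello, world\uFF01\r\n\tTF-IDF  demo\u3002";
        System.out.println(clean(raw));
    }
}
```

Note that the first step also removes hyphens, so a token such as "TF-IDF" becomes "TFIDF" before it reaches the segmenter.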
Next, the data is segmented with the Java port of jieba (jieba-analysis).
import com.huaban.analysis.jieba.JiebaSegmenter;
import org.apache.commons.io.IOUtils;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
/**
 * @author wzj
 * @create 2018-07-17 22:16
 **/
public class MyPipeline implements Pipeline
{
    /**
     * Directory where the files are saved
     */
    private static final String saveDir = "D:\\cache\\";
    /**
     * Java port of the jieba segmenter
     */
    private JiebaSegmenter segmenter = new JiebaSegmenter();
    /**
     * Running count of saved apps; atomic because the spider runs several threads
     */
    private final AtomicInteger count = new AtomicInteger(1);
    /**
     * Process extracted results.
     *
     * @param resultItems resultItems
     * @param task task
     */
    public void process(ResultItems resultItems, Task task)
    {
        String appName = resultItems.get("appName");
        String desc = resultItems.get("desc");
        // Nothing to segment if the page had no name or description
        if (appName == null || desc == null)
        {
            return;
        }
        // Remove punctuation and special symbols
        desc = desc.replaceAll("[\\p{P}+~$`^=|<>~`$^+=|<>¥×]", "");
        // Remove tabs, carriage returns, and line feeds
        desc = desc.replaceAll("\\t|\\r|\\n", "");
        // Remove spaces
        desc = desc.replaceAll(" ", "");
        // Segment the description and join the tokens with single spaces
        List<String> vecList = segmenter.sentenceProcess(desc);
        StringBuilder stringBuilder = new StringBuilder();
        for (String s : vecList)
        {
            stringBuilder.append(s).append(" ");
        }
        // Drop the trailing space
        String writeContent = stringBuilder.toString();
        if (writeContent.length() > 0)
        {
            writeContent = writeContent.substring(0, writeContent.length() - 1);
        }
        // One text file per app, named after the app
        String appSavePath = Paths.get(saveDir, appName + ".txt").toString();
        FileWriter fileWriter = null;
        try
        {
            fileWriter = new FileWriter(appSavePath);
            fileWriter.write(writeContent);
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
        finally
        {
            IOUtils.closeQuietly(fileWriter);
        }
        System.out.println(count.getAndIncrement() + " " + appName);
    }
}
Building a Lucene index from the crawled text
The path of the text files and the path where the index is saved both have to be specified.
/**
 * Add all documents to the Lucene index.
 * docPath, saveIndexPath, docNumbers and fileReader belong to the enclosing class.
 * @throws IOException
 */
public void indexDocs() throws IOException
{
    System.out.println("Number of files : " + docNumbers);
    File[] listOfFiles = Paths.get(docPath).toFile().listFiles();
    if (listOfFiles == null)
    {
        throw new IOException("Cannot list files in " + docPath);
    }
    NIOFSDirectory dir = new NIOFSDirectory(new File(saveIndexPath));
    IndexWriter indexWriter = new IndexWriter(dir,
            new IndexWriterConfig(Version.LUCENE_36, new WhitespaceAnalyzer(Version.LUCENE_36)));
    for (File file : listOfFiles)
    {
        // Read the file content and strip numbers, including separators such as "1,000.5"
        String fileContent = fileReader(file);
        fileContent = fileContent.replaceAll("\\d+(?:[.,]\\d+)*\\s*", "");
        String docName = file.getName();
        Document doc = new Document();
        // Store term vectors so term frequencies can be read back per document
        doc.add(new Field("docContent", new StringReader(fileContent), Field.TermVector.YES));
        doc.add(new Field("docName", new StringReader(docName), Field.TermVector.YES));
        indexWriter.addDocument(doc);
    }
    indexWriter.close();
    System.out.println("Add document successful.");
}
Implementing the TF-IDF algorithm
First, compute the TF-IDF of the documents already in the index. Note that the code relies on Lucene's DefaultSimilarity for the tf and idf values, so the numbers are Lucene's variants of the formulas rather than the plain n/N and log(D/Dw) given earlier.
/**
 * Get the tf-idf values of all documents.
 * @return map from document name to its term weights
 * @throws IOException IOException
 * @throws ParseException ParseException
 */
public HashMap<String, Map<String, Float>> getAllTFIDF() throws IOException, ParseException
{
    HashMap<String, Map<String, Float>> scoreMap = new HashMap<String, Map<String, Float>>();
    IndexReader re = IndexReader.open(NIOFSDirectory.open(new File(saveIndexPath)), true);
    try
    {
        DefaultSimilarity simi = new DefaultSimilarity();
        for (int k = 0; k < docNumbers; k++)
        {
            // tf-idf of every term in the current document
            Map<String, Float> wordMap = new HashMap<String, Float>();
            // Term vectors of the current document's content and name
            TermFreqVector termsFreq = re.getTermFreqVector(k, "docContent");
            TermFreqVector termsFreqDocId = re.getTermFreqVector(k, "docName");
            String docName = termsFreqDocId.getTerms()[0];
            int[] freq = termsFreq.getTermFrequencies();
            String[] terms = termsFreq.getTerms();
            int noOfTerms = terms.length;
            for (int i = 0; i < noOfTerms; i++)
            {
                // Number of documents that contain this term
                int noOfDocsContainTerm = re.docFreq(new Term("docContent", terms[i]));
                float tf = simi.tf(freq[i]);
                float idf = simi.idf(noOfDocsContainTerm, docNumbers);
                wordMap.put(terms[i], (tf * idf));
            }
            scoreMap.put(docName, wordMap);
        }
    }
    finally
    {
        // Release the index reader
        re.close();
    }
    return scoreMap;
}
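Next, take a piece of test text and query it against the existing corpus. The TF-IDF of the query text is computed in the same way as above, so that code is not reproduced here.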
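Since that code is omitted, here is one way it might look, written without Lucene so it runs on its own. It is a sketch under assumptions, not the author's implementation: the class name `QueryTfIdfSketch`, the `docFreq` map, and the sample numbers are hypothetical. The two formulas mirror what Lucene 3.x's DefaultSimilarity computes, tf = sqrt(freq) and idf = 1 + ln(numDocs / (docFreq + 1)). In the real project the document frequencies would come from the index, for example via `IndexReader.docFreq`.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class QueryTfIdfSketch {
    // tokens: the segmented query text; docFreq: number of indexed documents containing each term.
    static Map<String, Float> queryTfIdf(List<String> tokens, Map<String, Integer> docFreq, int numDocs) {
        // Count how often each term occurs in the query text.
        Map<String, Integer> freq = new HashMap<>();
        for (String t : tokens) {
            freq.merge(t, 1, Integer::sum);
        }
        Map<String, Float> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 0);
            // Same shape as DefaultSimilarity.tf and DefaultSimilarity.idf.
            float tf = (float) Math.sqrt(e.getValue());
            float idf = (float) (Math.log(numDocs / (double) (df + 1)) + 1.0);
            weights.put(e.getKey(), tf * idf);
        }
        return weights;
    }

    public static void main(String[] args) {
        // Hypothetical query tokens and document frequencies for a 10-document corpus.
        List<String> tokens = Arrays.asList("music", "music", "music", "music", "player");
        Map<String, Integer> docFreq = new HashMap<>();
        docFreq.put("music", 9);
        System.out.println(queryTfIdf(tokens, docFreq, 10));
    }
}
```

Using the same weighting for the query as for the indexed documents matters: the two vectors are only comparable if both were built with the same tf and idf definitions.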
Finally, cosine similarity is used to find the most similar texts.
/**
 * Compute cosine similarity.
 * @param searchTextTfIdfMap vector of the query text
 * @param allTfIdfMap vectors of all texts
 * @return similarity between the query text and every text
 */
private static Map<String,Double> cosineSimilarity(Map<String, Float> searchTextTfIdfMap,HashMap<String, Map<String, Float>> allTfIdfMap)
{
    // Key: document name, value: its similarity to the query text
    Map<String,Double> similarityMap = new HashMap<String,Double>();
    // Squared norm of the query vector
    double searchValue = 0;
    for (Map.Entry<String, Float> entry : searchTextTfIdfMap.entrySet())
    {
        searchValue += entry.getValue() * entry.getValue();
    }
    for (Map.Entry<String, Map<String, Float>> docEntry : allTfIdfMap.entrySet())
    {
        String docName = docEntry.getKey();
        Map<String, Float> docScoreMap = docEntry.getValue();
        // Squared norm of the document vector
        double termValue = 0;
        // Dot product of the document vector and the query vector
        double acrossValue = 0;
        for (Map.Entry<String, Float> termEntry : docScoreMap.entrySet())
        {
            Float searchWeight = searchTextTfIdfMap.get(termEntry.getKey());
            if (searchWeight != null)
            {
                acrossValue += termEntry.getValue() * searchWeight;
            }
            termValue += termEntry.getValue() * termEntry.getValue();
        }
        // cos = dot / (|doc| * |query|); the norms are the square roots of the sums of squares
        double denominator = Math.sqrt(termValue) * Math.sqrt(searchValue);
        similarityMap.put(docName, denominator == 0 ? 0.0 : acrossValue / denominator);
    }
    return similarityMap;
}
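The method returns a similarity for every document, so the caller still has to pick the best matches. A minimal sketch of that last step, assuming a map shaped like the one returned above; the class name `TopSimilar`, the method `topN`, and the sample file names are hypothetical:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class TopSimilar {
    // Return the n entries with the highest similarity, best match first.
    static List<Map.Entry<String, Double>> topN(Map<String, Double> similarityMap, int n) {
        return similarityMap.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical similarities keyed by document name.
        Map<String, Double> similarityMap = new HashMap<>();
        similarityMap.put("a.txt", 0.2);
        similarityMap.put("b.txt", 0.9);
        similarityMap.put("c.txt", 0.5);
        for (Map.Entry<String, Double> e : topN(similarityMap, 2)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

Sorting the whole map is fine for a corpus of this size; for a much larger one a bounded priority queue would avoid sorting every entry.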
In testing, the results were quite good: the method finds the closest texts.