Preprocessing doc, pdf, and html files before Lucene indexing

阿新 • Published: 2019-02-04

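I am new to Lucene, so here are a few small notes:

The Lucene API makes it easy to index plain-text documents, but non-text documents such as .doc and .pdf files must first be converted to plain text before they can be indexed.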

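Handling .pdf files

lib: PDFBox

PDFBox is an open-source library for parsing and processing PDF files in Java, and it also offers a rich class library for manipulating them. For developers who use Lucene, PDFBox provides a dedicated LucenePDFDocument class whose static method getDocument (note: it has three overloads) directly returns a Lucene Document. So to index a PDF file (pdfFile in this example, an instance of File), a single statement is enough: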

document = LucenePDFDocument.getDocument(pdfFile);

This getDocument form is convenient, so the handling of .doc and .html files below follows the same shape.
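Because every converter shares the same getDocument(File) shape, picking the right one by file extension becomes a simple lookup. The following is only a minimal sketch of that idea: DocumentDispatcher, register, extensionOf, and convert are hypothetical names invented for illustration, not part of Lucene or PDFBox, and the converters in main are placeholders that return strings so the example runs with the JDK alone. In real use the type parameter would be Lucene's Document and each converter would call the matching getDocument method.

```java
import java.io.File;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper (not part of Lucene or PDFBox): picks a converter by file extension.
public class DocumentDispatcher<D> {

    private final Map<String, Function<File, D>> converters = new HashMap<>();

    // Registers a converter for one extension (without the dot), ignoring case.
    public void register(String extension, Function<File, D> converter) {
        converters.put(extension.toLowerCase(Locale.ROOT), converter);
    }

    // Returns the lower-cased extension of a file name, or "" if there is none.
    public static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return "";
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // Converts the file with the matching converter; rejects unknown file types.
    public D convert(File file) {
        Function<File, D> converter = converters.get(extensionOf(file.getName()));
        if (converter == null) {
            throw new IllegalArgumentException("No converter registered for " + file.getName());
        }
        return converter.apply(file);
    }

    public static void main(String[] args) {
        DocumentDispatcher<String> dispatcher = new DocumentDispatcher<>();
        // Placeholder converters; real ones would call e.g. LucenePDFDocument.getDocument(file).
        dispatcher.register("pdf", f -> "pdf:" + f.getName());
        dispatcher.register("doc", f -> "doc:" + f.getName());
        dispatcher.register("html", f -> "html:" + f.getName());
        System.out.println(dispatcher.convert(new File("report.PDF"))); // prints "pdf:report.PDF"
    }
}
```

Keeping the lookup separate from the converters means a new file type only needs one more register call, and unsupported files fail loudly instead of being indexed as garbage.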

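Handling .doc files

lib: tm-extractors-0.4

This was originally part of Apache's POI library, namely HWPF. POI provides a set of methods for working with MS Word, Excel, and similar files. In recent releases HWPF was moved out, so the standalone tm-extractors-0.4 has to be downloaded. The code below implements a static method getDocument(File) that returns a Lucene Document. It relies mainly on the extractText method of the WordExtractor class, which returns a String containing the text of the parsed .doc file.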

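import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.textmining.text.extraction.WordExtractor;

public class LuceneDOCDocument {

    public static Document getDocument(File doc) {
        String docPath = doc.getAbsolutePath();
        String title = doc.getName();
        String text = "";

        // Extract the plain text; if anything fails, index empty contents
        // instead of passing a null Reader to Field.
        InputStream inputStream = null;
        try {
            inputStream = new FileInputStream(doc);
            text = new WordExtractor().extractText(inputStream);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (inputStream != null) {
                try {
                    inputStream.close();
                } catch (IOException ignored) {
                    // nothing useful to do if closing fails
                }
            }
        }

        Reader contents = new StringReader(text);
        Document document = new Document();
        // title: stored and tokenized; contents: tokenized only; path: stored only
        document.add(new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED));
        document.add(new Field("contents", contents));
        document.add(new Field("path", docPath, Field.Store.YES, Field.Index.NO));
        return document;
    }
}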

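Although HTML files are text, Lucene does not recognize the markup they contain, so the tags end up in the index along with the content. Users do not want their searches to match these tags, so the tags must be stripped from HTML files before indexing.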
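To make the problem concrete, here is a small sketch of what ends up as searchable terms when raw HTML is tokenized versus when the tags are removed first. TagNoiseDemo, tokenize, and stripTags are hypothetical names; the tokenizer is a simplified stand-in for a Lucene analyzer, and the regex-based tag removal is deliberately crude and shown only for illustration, not as a substitute for a real HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration only: shows how markup pollutes the term list of an index.
public class TagNoiseDemo {

    // Splits text into lower-cased letter/digit tokens, roughly like a simple analyzer.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Crude tag removal (assumption: well-formed tags, no scripts or comments).
    // A real parser also handles entities, scripts, and malformed markup.
    public static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ");
    }

    public static void main(String[] args) {
        String html = "<html><body><p class=\"intro\">Lucene indexing</p></body></html>";
        // prints [html, body, p, class, intro, lucene, indexing, p, body, html]
        System.out.println(tokenize(html));
        // prints [lucene, indexing]
        System.out.println(tokenize(stripTags(html)));
    }
}
```

In the first list, a search for "body" or "class" would match every page, which is exactly the noise that stripping tags avoids.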

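Handling HTML (stripping tags)

lib: htmlparser

The Lucene demo also ships with an HtmlParser, but that parser is fairly limited. In addition, and this may be a mistake in my own usage, when inspecting the index with the index browser Luke I found that the character stream returned by the demo HtmlParser's getReader method did not contain the full text of the HTML file, only the title.

The example below uses the more capable htmlparser library and likewise defines a static method getDocument(File) that returns a Document.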

import java.io.File;
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.visitors.HtmlPage;

public class LuceneHTMLDocument {

    public static Document getDocument(File html) {
        String htmlPath = html.getAbsolutePath();
        StringBuilder text = new StringBuilder();
        String title = "";

        try {
            Parser parser = new Parser(htmlPath);
            parser.setEncoding("UTF-8"); // assumes the pages are UTF-8 encoded
            HtmlPage visitor = new HtmlPage(parser);
            parser.visitAllNodesWith(visitor);

            // Concatenate the plain text of every node in the <body>.
            NodeList nodes = visitor.getBody();
            for (int i = 0; i < nodes.size(); i++) {
                text.append(nodes.elementAt(i).toPlainTextString());
            }
            title = visitor.getTitle();
        } catch (ParserException e) {
            // On a parse failure, fall back to an empty title and contents.
            e.printStackTrace();
        }

        Reader contents = new StringReader(text.toString());
        Document document = new Document();
        // title: stored and tokenized; contents: tokenized only; path: stored only
        document.add(new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED));
        document.add(new Field("contents", contents));
        document.add(new Field("path", htmlPath, Field.Store.YES, Field.Index.NO));
        return document;
    }
}
