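Example Code: Full-Text Search of Disk Files with Spring Boot and Elasticsearch
A friend recently asked me how to search a very large volume of files on disk by directory, file name, and file content. The solution had to be simple, efficient, easy to maintain, and cheap. Elasticsearch (ES) seemed a good fit for indexing and searching the documents, so I wrote some code to try it out. This article describes the design and the implementation.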
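Overall architecture
Because the files are spread across disks on different machines, the system uses a scanning-agent model: a scan service is deployed as an agent on each server that hosts a target disk and runs there as a scheduled task, and every agent writes to the same ES index. ES itself is deployed as a distributed, highly available cluster. The search service is deployed together with the scan agent, which keeps the architecture simple while still letting the system scale out.
Figure: architecture for fast disk file retrieval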
Deploying ES
ES (Elasticsearch) is the only third-party software this project depends on. It can be deployed with Docker as follows:
docker pull docker.elastic.co/elasticsearch/elasticsearch:6.3.2
docker run -e ES_JAVA_OPTS="-Xms256m -Xmx256m" -d -p 9200:9200 -p 9300:9300 --name es01 docker.elastic.co/elasticsearch/elasticsearch:6.3.2
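Once the container is running, open http://localhost:9200 in a browser. If the page loads and looks like the one below, ES has been deployed successfully.
Figure: ES response page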
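Project structure
Figure: project structure
Dependencies
Besides the basic Spring Boot starters, the project needs the ES-related packages, plus jMimeMagic for file type detection: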
<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-data-elasticsearch</artifactId>
    </dependency>
    <dependency>
        <groupId>io.searchbox</groupId>
        <artifactId>jest</artifactId>
        <version>5.3.3</version>
    </dependency>
    <dependency>
        <groupId>net.sf.jmimemagic</groupId>
        <artifactId>jmimemagic</artifactId>
        <version>0.1.4</version>
    </dependency>
</dependencies>
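Configuration file
The ES address goes into application.yml. To keep the program simple, the root directory to scan (index-root) is configured there as well; the scan task recursively traverses every indexable file under it.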
server:
  port: @elasticsearch.port@
spring:
  application:
    name: @project.artifactId@
  profiles:
    active: dev
  elasticsearch:
    jest:
      uris: http://127.0.0.1:9200
index-root: /Users/crazyicelee/mywokerspace
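Index document structure
Because the directory, the file name, and the file content must all be searchable, each is defined as a field of the indexed document. The id field carries the @JestId annotation that the ES client (Jest) requires.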
package com.crazyice.lee.accumulation.search.data;

import io.searchbox.annotations.JestId;
import lombok.Data;

@Data
public class Article {
    @JestId
    private Integer id;
    private String author;
    private String title;
    private String path;
    private String content;
    private String fileFingerprint;
}
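Scanning the disk and building the index
Every file under the configured directory has to be scanned, so the directory is traversed recursively, and files that have already been processed are recognized and skipped to save work. The file type can be detected in one of two ways: precisely, from the file content (Magic), or roughly, from the file extension. This class is the core component of the whole system.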
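The recursive traversal and the extension check described here can also be written with java.nio. The sketch below is only an illustration, not the article's implementation: the class name IndexableFileScanner and both method names are hypothetical, and the extension list simply mirrors the file types the indexer handles. Unlike splitting the name on ".", it treats a file without an extension as having no type instead of using the whole file name.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration; not part of the original project.
final class IndexableFileScanner {

    // Extensions assumed to be indexable, mirroring the types the indexer reads.
    private static final Set<String> INDEXABLE = Collections.unmodifiableSet(
            new HashSet<>(Arrays.asList("txt", "java", "c", "cpp", "doc", "docx")));

    // Returns the lower-cased extension, or "" when the name has none
    // (no dot, a leading dot only, or a trailing dot).
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot <= 0 || dot == fileName.length() - 1) {
            return "";
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // Walks root recursively and collects regular files with an indexable
    // extension, skipping temporary files whose names start with "~$".
    static List<Path> findIndexable(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> !p.getFileName().toString().startsWith("~$"))
                    .filter(p -> INDEXABLE.contains(extensionOf(p.getFileName().toString())))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }
}
```

Calling IndexableFileScanner.findIndexable(Paths.get(indexRoot)) would return the candidate files in one pass; each could then be handed to the indexing step.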
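A small trick
Compute the MD5 of each file's content and store it in the ES document as a file fingerprint. Before indexing a file, check whether that MD5 is already in the index; if it is, skip the file. This prevents duplicate index entries and also avoids re-indexing every file after the system restarts.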
package com.crazyice.lee.accumulation.search.service;

import com.alibaba.fastjson.JSONObject;
import com.crazyice.lee.accumulation.search.data.Article;
import com.crazyice.lee.accumulation.search.utils.Md5CaculateUtil;
import io.searchbox.client.JestClient;
import io.searchbox.core.Index;
import io.searchbox.core.Search;
import io.searchbox.core.SearchResult;
import lombok.extern.slf4j.Slf4j;
import net.sf.jmimemagic.*;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

@Component
@Slf4j
public class DirectoryRecurse {

    @Autowired
    private JestClient jestClient;

    // Read the file content into a string
    private String readToString(File file, String fileType) {
        StringBuilder result = new StringBuilder();
        switch (fileType) {
            case "text/plain":
            case "java":
            case "c":
            case "cpp":
            case "txt":
                try {
                    // readAllBytes reads the whole file; a single InputStream.read call may return early
                    result.append(new String(Files.readAllBytes(file.toPath()), StandardCharsets.UTF_8));
                } catch (IOException e) {
                    log.error("{}", e.getLocalizedMessage());
                }
                break;
            case "doc":
                // Use WordExtractor from the HWPF component to extract the text of a Word document
                try (FileInputStream in = new FileInputStream(file)) {
                    WordExtractor extractor = new WordExtractor(in);
                    result.append(extractor.getText());
                } catch (Exception e) {
                    log.error("{}", e.getLocalizedMessage());
                }
                break;
            case "docx":
                try (FileInputStream in = new FileInputStream(file);
                     XWPFDocument doc = new XWPFDocument(in)) {
                    XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
                    result.append(extractor.getText());
                } catch (Exception e) {
                    log.error("{}", e.getLocalizedMessage());
                }
                break;
        }
        return result.toString();
    }

    // Check whether the file has already been indexed
    private JSONObject isIndex(File file) {
        JSONObject result = new JSONObject();
        // Use the MD5 as a file fingerprint and search the index for it
        String fileFingerprint = Md5CaculateUtil.getMD5(file);
        result.put("fileFingerprint", fileFingerprint);
        // Default to "not indexed" so callers never read a missing key when the search fails
        result.put("isIndex", false);
        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
        searchSourceBuilder.query(QueryBuilders.termQuery("fileFingerprint", fileFingerprint));
        Search search = new Search.Builder(searchSourceBuilder.toString())
                .addIndex("diskfile")
                .addType("files")
                .build();
        try {
            // Execute the search; the total is null when the index does not exist yet
            SearchResult searchResult = jestClient.execute(search);
            Number total = searchResult.getTotal();
            result.put("isIndex", total != null && total.longValue() > 0);
        } catch (IOException e) {
            log.error("{}", e.getLocalizedMessage());
        }
        return result;
    }

    // Index the file's directory, name, and content
    private void createIndex(File file, String method) {
        // Skip temporary files whose names start with ~$
        if (file.getName().startsWith("~$")) return;

        String fileType = null;
        switch (method) {
            case "magic":
                Magic parser = new Magic();
                try {
                    MagicMatch match = parser.getMagicMatch(file, false);
                    fileType = match.getMimeType();
                } catch (MagicParseException e) {
                    log.debug("{}", e.getLocalizedMessage());
                } catch (MagicMatchNotFoundException e) {
                    log.debug("{}", e.getLocalizedMessage());
                } catch (MagicException e) {
                    log.debug("{}", e.getLocalizedMessage());
                }
                break;
            case "ext":
                String filename = file.getName();
                String[] strArray = filename.split("\\.");
                int suffixIndex = strArray.length - 1;
                fileType = strArray[suffixIndex];
                break;
        }
        // Type detection failed or the method is unknown: nothing to index
        if (fileType == null) return;

        switch (fileType) {
            case "text/plain":
            case "java":
            case "c":
            case "cpp":
            case "txt":
            case "doc":
            case "docx":
                JSONObject isIndexResult = isIndex(file);
                log.info("File: {}, type: {}, MD5: {}, already indexed: {}", file.getPath(), fileType,
                        isIndexResult.getString("fileFingerprint"), isIndexResult.getBoolean("isIndex"));
                if (isIndexResult.getBoolean("isIndex")) break;
                // 1. Build the document to index (store) in ES
                Article article = new Article();
                article.setTitle(file.getName());
                article.setAuthor(file.getParent());
                article.setPath(file.getPath());
                article.setContent(readToString(file, fileType));
                article.setFileFingerprint(isIndexResult.getString("fileFingerprint"));
                // 2. Build the index request
                Index index = new Index.Builder(article).index("diskfile").type("files").build();
                try {
                    // 3. Execute it
                    if (jestClient.execute(index).isSucceeded()) {
                        log.info("Document indexed successfully");
                    }
                } catch (IOException e) {
                    log.error("{}", e.getLocalizedMessage());
                }
                break;
        }
    }

    public void find(String pathName) throws IOException {
        // Get the File object for pathName
        File dirFile = new File(pathName);
        // If the file or directory does not exist, log it and return
        if (!dirFile.exists()) {
            log.info("{} does not exist", pathName);
            return;
        }
        // If it is not a directory but a regular file, index it directly
        if (!dirFile.isDirectory()) {
            if (dirFile.isFile()) {
                createIndex(dirFile, "ext");
            }
            return;
        }
        // List the names of all files and directories in this directory
        String[] fileList = dirFile.list();
        // list() returns null when the directory cannot be read
        if (fileList == null) return;
        for (String name : fileList) {
            File file = new File(dirFile.getPath(), name);
            if (file.isDirectory()) {
                // Recurse into subdirectories
                find(file.getCanonicalPath());
            } else {
                createIndex(file, "ext");
            }
        }
    }
}
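The listing calls Md5CaculateUtil.getMD5(file), but the article does not show that utility. As a stand-in, here is a minimal sketch of how such a fingerprint could be computed with the JDK's MessageDigest. The class name Md5Fingerprint and the method md5Hex are hypothetical, and the lower-case hex output format is an assumption; the author's utility may differ.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical stand-in for the Md5CaculateUtil class, which the article does not list.
final class Md5Fingerprint {

    // Streams the file through MD5 and returns the digest as 32 lower-case hex characters.
    static String md5Hex(Path file) throws IOException {
        MessageDigest md5;
        try {
            md5 = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5, so this is not expected.
            throw new IllegalStateException(e);
        }
        // Read in chunks so large files are never held in memory as a whole.
        byte[] buffer = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                md5.update(buffer, 0, read);
            }
        }
        StringBuilder hex = new StringBuilder(32);
        for (byte b : md5.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

With a java.io.File in hand, the call would be Md5Fingerprint.md5Hex(file.toPath()). Because the fingerprint depends only on the content, two copies of the same file in different directories share one fingerprint, so only the first copy found would be indexed.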
Scan task
A scheduled task scans the configured directory periodically, so new files are indexed incrementally.
package com.crazyice.lee.accumulation.search.service;

import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

import java.io.IOException;

@Configuration
@Component
@Slf4j
public class CreateIndexTask {

    @Autowired
    private DirectoryRecurse directoryRecurse;

    @Value("${index-root}")
    private String indexRoot;

    // Run at second 0 of every fifth minute; "* 0/5 * * * ?" would fire every second of those minutes
    @Scheduled(cron = "0 0/5 * * * ?")
    private void addIndex() {
        try {
            directoryRecurse.find(indexRoot);
            // writeIndexStatus() is not included in the DirectoryRecurse listing in this article
            directoryRecurse.writeIndexStatus();
        } catch (IOException e) {
            log.error("{}", e.getLocalizedMessage());
        }
    }
}
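For readers who want to try the periodic scan outside Spring, the same idea can be sketched with the JDK's ScheduledExecutorService. This is a swapped-in technique, not what the article uses: the class FixedDelayScanner and its start method are hypothetical. It schedules with a fixed delay measured from the end of one run to the start of the next, so a slow scan never overlaps the following one.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical plain-JDK scheduler; the article itself relies on Spring's @Scheduled.
final class FixedDelayScanner {

    // Runs scan once immediately, then again delayMillis after each run finishes.
    // The caller is responsible for shutting the returned executor down.
    static ScheduledExecutorService start(Runnable scan, long delayMillis) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        executor.scheduleWithFixedDelay(() -> {
            try {
                scan.run();
            } catch (RuntimeException e) {
                // An exception escaping the task would cancel all later runs, so report and continue.
                System.err.println("scan failed: " + e.getMessage());
            }
        }, 0, delayMillis, TimeUnit.MILLISECONDS);
        return executor;
    }
}
```

A five-minute interval would be FixedDelayScanner.start(scanTask, TimeUnit.MINUTES.toMillis(5)), where scanTask is whatever Runnable wraps the directory scan.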
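Search service
The search service is exposed as a RESTful endpoint. Highlighting is requested for the path, title, and content fields, and the browser renders the page from the returned JSON.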
package com.crazyice.lee.accumulation.search.web;

import com.crazyice.lee.accumulation.search.data.Article;
import io.searchbox.client.JestClient;
import io.searchbox.core.Search;
import io.searchbox.core.SearchResult;
import io.swagger.annotations.ApiImplicitParam;
import io.swagger.annotations.ApiImplicitParams;
import io.swagger.annotations.ApiOperation;
import lombok.extern.slf4j.Slf4j;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.fetch.subphase.highlight.HighlightBuilder;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import org.springframework.web.bind.annotation.RestController;

import java.io.IOException;
import java.util.Collections;
import java.util.List;

@RestController
@Slf4j
public class Controller {

    @Autowired
    private JestClient jestClient;

    @RequestMapping(value = "/search/{keyword}", method = RequestMethod.GET)
    @ApiOperation(value = "Search all fields for a keyword", notes = "ES verification")
    @ApiImplicitParams(
            @ApiImplicitParam(name = "keyword", value = "Full-text search keyword", required = true, paramType = "path", dataType = "String")
    )
    public List search(@PathVariable String keyword) {
        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
        searchSourceBuilder.query(QueryBuilders.queryStringQuery(keyword));

        HighlightBuilder highlightBuilder = new HighlightBuilder();
        // Highlight the path field
        HighlightBuilder.Field highlightPath = new HighlightBuilder.Field("path");
        highlightPath.highlighterType("unified");
        highlightBuilder.field(highlightPath);
        // Highlight the title field
        HighlightBuilder.Field highlightTitle = new HighlightBuilder.Field("title");
        highlightTitle.highlighterType("unified");
        highlightBuilder.field(highlightTitle);
        // Highlight the content field
        HighlightBuilder.Field highlightContent = new HighlightBuilder.Field("content");
        highlightContent.highlighterType("unified");
        highlightBuilder.field(highlightContent);
        // Apply the highlight configuration
        searchSourceBuilder.highlighter(highlightBuilder);
        log.info("Search query: {}", searchSourceBuilder.toString());

        // Build the search against the same index and type that the scanner writes to
        Search search = new Search.Builder(searchSourceBuilder.toString())
                .addIndex("diskfile")
                .addType("files")
                .build();
        try {
            // Execute the search
            SearchResult result = jestClient.execute(search);
            return result.getHits(Article.class);
        } catch (IOException e) {
            log.error("{}", e.getLocalizedMessage());
        }
        // Return an empty list rather than null when the search fails
        return Collections.emptyList();
    }
}
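Testing the RESTful search endpoint
The API is tested through Swagger. The keyword parameter is the term to search for in the full-text index.
Figure: search results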
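Generating the UI with Thymeleaf
The Thymeleaf template engine renders the search results directly as web pages. There are two templates, the main search page and the results page, served by a class annotated with @Controller that passes the data in a Model object.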
<body>
<div class="container">
    <div class="header">
        <form action="./search" class="parent">
            <input type="text" name="keyword" th:value="${keyword}">
            <input type="submit" value="Search">
        </form>
    </div>
    <div class="content" th:each="article,memberStat:${articles}">
        <div class="c_left">
            <p class="con-title" th:text="${article.title}"/>
            <p class="con-path" th:text="${article.path}"/>
            <p class="con-preview" th:utext="${article.highlightContent}"/>
            <a class="con-more">More</a>
        </div>
        <div class="c_right">
            <p class="con-all" th:utext="${article.content}"/>
        </div>
    </div>
    <script>
        // Clicking "More" hides the link and expands the preview to its full height
        document.querySelectorAll('.con-more').forEach(item => {
            item.onclick = () => {
                item.style.cssText = 'display: none';
                item.parentNode.querySelector('.con-preview').style.cssText = 'max-height: none;';
            };
        });
    </script>
</div>
</body>