
Indexing attachments (Word, PDF, TXT, and other files) with Solr

The official ContentStreamUpdateRequest example:
package javaapplicationsolrcell;

import java.io.File;
import java.io.IOException;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.client.solrj.response.QueryResponse;

/**
 * @author EDaniel
 */
public class SolrExampleTests {

  public static void main(String[] args) {
    try {
      // Solr Cell can also index MS Office file types (2003 and 2007 versions).
      String fileName = "c:/Sample.pdf";
      // This will be the unique id used by Solr to index the file contents.
      String solrId = "Sample.pdf";

      indexFilesSolrCell(fileName, solrId);
    } catch (Exception ex) {
      System.out.println(ex.toString());
    }
  }

  /**
   * Method to index all types of files into Solr.
   * @param fileName
   * @param solrId
   * @throws IOException
   * @throws SolrServerException
   */
  public static void indexFilesSolrCell(String fileName, String solrId)
      throws IOException, SolrServerException {

    String urlString = "http://localhost:8983/solr";
    SolrServer solr = new CommonsHttpSolrServer(urlString);

    ContentStreamUpdateRequest up
        = new ContentStreamUpdateRequest("/update/extract");

    up.addFile(new File(fileName));

    up.setParam("literal.id", solrId);
    up.setParam("uprefix", "attr_");
    up.setParam("fmap.content", "attr_content");

    up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

    solr.request(up);

    QueryResponse rsp = solr.query(new SolrQuery("*:*"));

    System.out.println(rsp);
  }
}
As the example shows, uploading a file to Solr wraps the request in a ContentStreamUpdateRequest object, and solr.request(up) performs the POST. The id set via literal.id is the key defined in your schema.xml, with solrId as its value. Configure the field in schema.xml as:

<field name="id" type="string" indexed="true" stored="true" required="false" multiValued="false" />
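Because the example sets uprefix=attr_, the schema also needs a dynamic field that catches those prefixed names; a minimal sketch (the field type name text_general is an assumption based on the stock example schema):

```xml
<!-- Catch-all for Tika-generated fields renamed by uprefix=attr_ (type name assumed) -->
<dynamicField name="attr_*" type="text_general" indexed="true" stored="true" multiValued="true"/>
```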
The official documentation describes the input parameters as follows:
Input Parameters
  • fmap.<source_field>=<target_field> - Maps (moves) one field name to another. Example: fmap.content=text will cause the content field normally generated by Tika to be moved to the "text" field.

  • boost.<fieldname>=<float> - Boost the specified field.

  • literal.<fieldname>=<value> - Create a field with the specified value. May be multivalued if the Field is multivalued.

  • uprefix=<prefix> - Prefix all fields that are not defined in the schema with the given prefix. This is very useful when combined with dynamic field definitions. Example: uprefix=ignored_ would effectively ignore all unknown fields generated by Tika given the example schema contains <dynamicField name="ignored_*" type="ignored"/>

  • defaultField=<Field Name> - If uprefix is not specified and a Field cannot be determined, the default field will be used.

  • extractOnly=true|false - Default is false. If true, return the extracted content from Tika without indexing the document. This literally includes the extracted XHTML as a string in the response. When viewing manually, it may be useful to use a response format other than XML to aid in viewing the embedded XHTML tags. See TikaExtractOnlyExampleOutput.

  • resource.name=<File Name> - The optional name of the file. Tika can use it as a hint for detecting mime type.

  • capture=<Tika XHTML NAME> - Capture XHTML elements with the name separately for adding to the Solr document. This can be useful for grabbing chunks of the XHTML into a separate field. For instance, it could be used to grab paragraphs (<p>) and index them into a separate field. Note that content is also still captured into the overall "content" field.

  • captureAttr=true|false - Index attributes of the Tika XHTML elements into separate fields, named after the element. For example, when extracting from HTML, Tika can return the href attributes in <a> tags as fields named "a". See the examples below.

  • xpath=<XPath expression> - When extracting, only return Tika XHTML content that satisfies the XPath expression. See http://tika.apache.org/1.2/parser.html for details on the format of Tika XHTML. See also TikaExtractOnlyExampleOutput.

  • lowernames=true|false - Map all field names to lowercase with underscores. For example, Content-Type would be mapped to content_type.

  • literalsOverride=true|false - (Solr 4.0) When true, literal field values will override other values with the same field name, such as metadata and content. If false, literal field values will be appended to any data extracted from Tika, and the resulting field needs to be multivalued. Default: true

  • resource.password=<password> - (Solr 4.0) The optional password for a password-protected PDF or OOXML file. File format support depends on Tika.

  • passwordsFile=<file name> - (Solr 4.0) The optional name of a file containing file-name-pattern-to-password mappings. See the "Encrypted Files" chapter below.

If extractOnly is true, additional input parameters:

  • extractFormat=xml|text - Default is xml. Controls the serialization format of the extract content. xml format is actually XHTML, like passing the -x command to the tika command line application, while text is like the -t command.
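As a sketch of how these parameters travel with the request, the following stdlib-only snippet builds the query string for an extract-only call. The endpoint and parameter names come from the documentation above; the class and helper names are illustrative, not Solr API:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class ExtractUrlSketch {

    // Builds the URL you would POST the file to; the helper itself is illustrative.
    public static String buildExtractUrl(String baseUrl, Map<String, String> params) {
        String qs = params.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
        return baseUrl + "/update/extract?" + qs;
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("literal.id", "Sample.pdf"); // unique id, as in the Java example above
        params.put("extractOnly", "true");      // return Tika output without indexing
        params.put("extractFormat", "text");    // plain text instead of XHTML
        System.out.println(buildExtractUrl("http://localhost:8983/solr", params));
        // -> http://localhost:8983/solr/update/extract?literal.id=Sample.pdf&extractOnly=true&extractFormat=text
    }
}
```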

Order of field operations

  1. fields are generated by Tika or passed in as literals via literal.fieldname=value. Before Solr 4.0, or if literalsOverride=false, literals are appended as multi-values to the Tika-generated field.

  2. if lowernames==true, fields are mapped to lower case
  3. mapping rules fmap.source=target are applied

  4. if uprefix is specified, any unknown field names are prefixed with that value; otherwise, if defaultField is specified, unknown fields are copied to the default field.
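The order above can be traced with a small stdlib-only simulation. The class and helper names are illustrative (not Solr API); the rules are the lowernames, fmap, and uprefix behaviors described in the parameter list:

```java
import java.util.Map;
import java.util.Set;

public class FieldMappingSketch {

    // Illustrative simulation of the field-operation order:
    // 1) lowernames, 2) fmap rules, 3) uprefix for fields unknown to the schema.
    public static String mapField(String name, boolean lowernames,
                                  Map<String, String> fmap,
                                  Set<String> schemaFields, String uprefix) {
        if (lowernames) {
            name = name.toLowerCase().replace('-', '_'); // Content-Type -> content_type
        }
        name = fmap.getOrDefault(name, name);            // apply fmap.source=target
        if (!schemaFields.contains(name) && uprefix != null) {
            name = uprefix + name;                       // prefix unknown field names
        }
        return name;
    }

    public static void main(String[] args) {
        Map<String, String> fmap = Map.of("content", "text");
        Set<String> schema = Set.of("id", "text", "links");

        // Tika metadata field, unknown to the schema after mapping:
        System.out.println(mapField("Content-Type", true, fmap, schema, "attr_")); // attr_content_type
        // Tika body content, remapped onto a known schema field:
        System.out.println(mapField("content", true, fmap, schema, "attr_"));      // text
    }
}
```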

The code above is not enough on its own: the /update/extract request handler it posts to must have a corresponding configuration in solrconfig.xml:
<lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-cell-\d.*\.jar" />

<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>

    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>
In addition, the jars from contrib/extraction/lib and the solr-cell-* jar must be present under Solr's lib path, matching the <lib> directives above.