Solr Advanced, Part 4: Building an Index from Files

阿新 • Published: 2019-01-13

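Index data does not always come from structured sources such as databases, XML, JSON, or CSV. It often comes from unstructured files such as PDF, Word, HTML, and MP3. Solr supports building indexes from such files well, and it does so through Apache Tika.

Let's look at how to build an index from a PDF file in Solr 4.10.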

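First, configure the file index

Create a new core to hold the file-based index. The steps for creating a core are not repeated here.

Import the jars

In the working directory, create an extract folder to hold the jars for the Solr extension:

\solr_tomcat\solr\pdf_core\extract

Copy solr-cell-4.10.2.jar from \solr-4.10.2\dist into the extract folder, then copy all the jars under \solr-4.10.2\contrib\extraction\lib into the extract folder as well.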

Configure solrconfig.xml

Add the request handler that parses uploaded files:

<requestHandler name="/extract" class="solr.extraction.ExtractingRequestHandler" >  
       <lst name="defaults">  
        <str name="fmap.content">text</str>  
        <str name="lowernames">true</str>  
        <str name="uprefix">attr_</str>  
        <str name="captureAttr">true</str>  
       </lst>  
</requestHandler>

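Specify where the dependency jars live:

<lib dir="extract" regex=".*\.jar" />

Note that this relative path is resolved against the core's home directory, not against the folder that contains the config file. For example, my config file is in \solr_tomcat\solr\pdf_core\conf and my jars are in \solr_tomcat\solr\pdf_core\extract, so the relative path is extract, not ../extract.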
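The path rule above can be illustrated with plain Java path arithmetic. This is only a sketch of the resolution logic described in the text, not Solr's own loader code; the class name LibDirDemo and the Unix-style core path are assumptions made for the example.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class LibDirDemo {

    // Resolve a <lib dir="..."> value against the core's home directory,
    // which is the base the article says Solr uses.
    static Path resolveLibDir(String coreHome, String libDir) {
        return Paths.get(coreHome).resolve(libDir).normalize();
    }

    public static void main(String[] args) {
        // Assumed Unix-style equivalent of \solr_tomcat\solr\pdf_core
        String coreHome = "/solr_tomcat/solr/pdf_core";

        // "extract" stays inside the core directory.
        System.out.println(resolveLibDir(coreHome, "extract"));

        // "../extract" escapes the core directory and points at a folder
        // that does not contain the jars.
        System.out.println(resolveLibDir(coreHome, "../extract"));
    }
}
```

With dir="../extract" the resolved folder sits next to pdf_core rather than inside it, which is why the jars would not be found.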

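Configure schema.xml: define the types of the index fields, that is, the fieldType entries.

The text_general type uses two txt files (stopwords.txt and synonyms.txt). Both ship with the example core in the distribution, at \solr_tomcat\solr\collection1\conf. Copy them into the conf directory of the new core, next to schema.xml.

Note: if you created the new core by copying an existing one, some fields and types are already defined in the original config file. Make sure each duplicated definition is removed so that only one remains!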

<fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>  
  <fieldType name="string"  class="solr.StrField" sortMissingLast="true" omitNorms="true"/>  
  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">  
     <analyzer type="index">  
       <tokenizer class="solr.StandardTokenizerFactory"/>  
       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />  
       <filter class="solr.LowerCaseFilterFactory"/>  
     </analyzer>  
     <analyzer type="query">  
       <tokenizer class="solr.StandardTokenizerFactory"/>  
       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />  
       <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>  
       <filter class="solr.LowerCaseFilterFactory"/>  
     </analyzer>  
   </fieldType>

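Configure the index fields, that is, the field entries

One of them is a dynamic field, attr_*. What is it for? A file carries many properties of its own, and which properties a given file has is not known in advance. When Solr parses the file it extracts all of them, and for each one the field name becomes the attr_ prefix followed by the property name.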

<field name="id"        type="string"       indexed="true"  stored="true"  multiValued="false" required="true"/>  
 <field name="text"      type="text_general" indexed="true"  stored="true"/>  
 <field name="_version_" type="long"         indexed="true"  stored="true"/>  
   
 <dynamicField name="attr_*" type="text_general" indexed="true" stored="true" multiValued="true"/>
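To make the naming rule concrete, here is a simplified model of how the lowernames, fmap.* and uprefix settings from the /extract handler combine into a final field name. It is a sketch for illustration, not Solr's actual code; the class and method names are hypothetical, and the order of the steps (lowercase first, then fmap, then prefix unknown names) is how I understand the handler to behave.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class FieldNameMappingDemo {

    // Model of lowernames=true: letters and digits are lowercased,
    // every other character becomes an underscore.
    static String lowerName(String name) {
        StringBuilder sb = new StringBuilder();
        for (char ch : name.toCharArray()) {
            sb.append(Character.isLetterOrDigit(ch) ? Character.toLowerCase(ch) : '_');
        }
        return sb.toString();
    }

    // Model of the full mapping: lowercase, apply fmap.*, then prefix
    // any name that is not an explicit schema field with uprefix.
    static String mapFieldName(String tikaName, Map<String, String> fmap,
                               Set<String> schemaFields, String uprefix) {
        String name = lowerName(tikaName);
        if (fmap.containsKey(name)) {
            name = fmap.get(name);
        }
        if (!schemaFields.contains(name)) {
            name = uprefix + name;
        }
        return name;
    }

    public static void main(String[] args) {
        Map<String, String> fmap = new HashMap<>();
        fmap.put("content", "text"); // fmap.content=text from solrconfig.xml
        Set<String> schemaFields = new HashSet<>(Arrays.asList("id", "text", "_version_"));

        for (String tikaName : new String[] {"Content-Type", "Author", "content"}) {
            System.out.println(tikaName + " -> "
                    + mapFieldName(tikaName, fmap, schemaFields, "attr_"));
        }
    }
}
```

Under this model a Tika property such as Content-Type ends up in attr_content_type, which the attr_* dynamic field accepts, while the extracted body goes to the explicit text field.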

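At this point the server-side Solr configuration is complete.

Test class CreateIndexFromPDF.java

In SolrJ 4.10, the addFile method of ContentStreamUpdateRequest takes an additional contentType parameter that states the content type (MIME type) of the file, for example application/pdf.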

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.client.solrj.response.QueryResponse;

import java.io.File;
import java.io.IOException;
/**
 * Created by Lhx on 14-12-4.
 */
public class CreateIndexFromPDF {

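    public static void indexFilesSolr(String fileName, String solrId) throws IOException, SolrServerException {
        // The URL must include the core name (pdf_core), not just the Solr root.
        String urlString = "http://localhost:8080/solr/pdf_core";
        SolrServer solr = new HttpSolrServer(urlString);
        try {
            // Send the file to the /extract handler configured in solrconfig.xml.
            ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/extract");
            String contentType = "application/pdf";
            up.addFile(new File(fileName), contentType);
            // literal.id supplies the uniqueKey value for the extracted document.
            up.setParam("literal.id", solrId);
            // Properties that are not schema fields get the attr_ prefix (attr_* dynamic field).
            up.setParam("uprefix", "attr_");
            // Overrides the handler default (fmap.content=text): the body goes to attr_content.
            up.setParam("fmap.content", "attr_content");
            // Commit right away so the document is visible to the query below.
            up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

            solr.request(up);

            // Query everything back to check the result.
            QueryResponse rsp = solr.query(new SolrQuery("*:*"));
            System.out.println(rsp);
        } finally {
            solr.shutdown(); // release the underlying HTTP client
        }
    }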

    public static void main(String[] args) {
        String fileName = "F:\\Sencha_Touch_2.0使用者指南(中文版).pdf";
        String solrId = "Sencha_Touch_2.0使用者指南(中文版).pdf";
        try {
            indexFilesSolr(fileName,solrId);
        } catch (IOException e) {
            e.printStackTrace();
        } catch (SolrServerException e) {
            e.printStackTrace();
        }
    }
}

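Running the code above uploads the PDF file to the Solr server, which parses it and builds the index.

The solr.query call at the end runs a query to show the indexed result. After parsing, the PDF becomes plain text, and the console also shows a lot of other information about the document.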
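The test class hard-codes application/pdf as the content type. When indexing other kinds of files, the value passed to addFile could be chosen from the file extension. The helper below is a hypothetical sketch of that idea (the class name and the small extension table are my own, not part of SolrJ), falling back to application/octet-stream for unknown extensions.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class ContentTypeGuesser {

    private static final String FALLBACK = "application/octet-stream";
    private static final Map<String, String> TYPES = new HashMap<>();

    static {
        // A deliberately small table covering the formats named in this article.
        TYPES.put("pdf", "application/pdf");
        TYPES.put("doc", "application/msword");
        TYPES.put("html", "text/html");
        TYPES.put("txt", "text/plain");
        TYPES.put("mp3", "audio/mpeg");
    }

    // Pick a MIME type from the file extension, case-insensitively.
    static String guess(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return FALLBACK; // no extension at all
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        String type = TYPES.get(ext);
        return type != null ? type : FALLBACK;
    }

    public static void main(String[] args) {
        System.out.println(guess("F:\\Sencha_Touch_2.0使用者指南(中文版).pdf"));
        System.out.println(guess("notes.TXT"));
        System.out.println(guess("README"));
    }
}
```

The result of guess(fileName) would then replace the literal "application/pdf" in the call to up.addFile.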

After Solr has parsed the PDF and built the index, we can also view the result in the Solr admin UI: select "Query" and click the "Execute Query" button.

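Postscript:

After restarting Tomcat, Solr reported a duplicate field definition error. I had already hit this error in earlier exercises, so I quickly found the duplicated definitions in schema.xml (the id field and types such as long) and deleted them.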
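Duplicate definitions like these can be spotted before restarting Tomcat. The following is a minimal sketch that parses a schema with the JDK's DOM parser and reports any field or fieldType name that appears more than once. The class name and the inline sample schema are assumptions for the example; a real check would read the core's schema.xml from disk.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class DuplicateSchemaEntries {

    // Returns "tag:name" for every field/fieldType name seen a second time.
    // Tag names are compared case-insensitively because schemas mix
    // <fieldType> and <fieldtype>.
    static List<String> findDuplicates(String schemaXml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(schemaXml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("schema is not well-formed XML", e);
        }
        Set<String> seen = new HashSet<>();
        List<String> duplicates = new ArrayList<>();
        NodeList all = doc.getElementsByTagName("*"); // every element, in document order
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            String tag = e.getTagName().toLowerCase(Locale.ROOT);
            if (!tag.equals("field") && !tag.equals("fieldtype")) {
                continue;
            }
            String key = tag + ":" + e.getAttribute("name");
            if (!seen.add(key)) {
                duplicates.add(key);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // Hypothetical schema fragment with the same mistakes described above.
        String schema = "<schema>"
                + "<fieldType name=\"long\" class=\"solr.TrieLongField\"/>"
                + "<fieldtype name=\"long\" class=\"solr.TrieLongField\"/>"
                + "<field name=\"id\" type=\"string\"/>"
                + "<field name=\"id\" type=\"string\"/>"
                + "<field name=\"text\" type=\"text_general\"/>"
                + "</schema>";
        System.out.println(findDuplicates(schema));
    }
}
```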

Starting Tomcat again then failed with an error saying a certain jar could not be loaded. Only later did I find that in

<lib dir="extract" regex=".*\.jar" />

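the directory given in dir was wrong, which caused the Tomcat error.

After that, Tomcat started without any errors, but running the Java code failed. It turned out I had written the urlString address incorrectly, as: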

http://localhost:8080/solr

This does not say which core the file should be uploaded to. After correcting it, the PDF document was submitted successfully.
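One way to avoid this mistake is to build the URL from the Solr root and the core name separately, and to refuse an empty core name. This is a small hypothetical helper (CoreUrlDemo and coreUrl are names I made up), shown only to illustrate the fix described above.

```java
public class CoreUrlDemo {

    // Join the Solr root URL and a core name, rejecting a missing core.
    static String coreUrl(String solrRootUrl, String coreName) {
        if (coreName == null || coreName.trim().isEmpty()) {
            throw new IllegalArgumentException("a core name is required, e.g. pdf_core");
        }
        // Drop one trailing slash so the result never contains "//" before the core.
        String root = solrRootUrl.endsWith("/")
                ? solrRootUrl.substring(0, solrRootUrl.length() - 1)
                : solrRootUrl;
        return root + "/" + coreName.trim();
    }

    public static void main(String[] args) {
        // Produces the URL used in CreateIndexFromPDF.
        System.out.println(coreUrl("http://localhost:8080/solr", "pdf_core"));
    }
}
```

The returned string is what would be passed to the HttpSolrServer constructor.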

Appendix:

solrconfig.xml

<?xml version="1.0" encoding="UTF-8" ?>
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->

<!--
 This is a stripped down config file used for a simple example...  
 It is *not* a good example to work from. 
-->
<config>
    <luceneMatchVersion>4.10.2</luceneMatchVersion>
    <!--  The DirectoryFactory to use for indexes.
          solr.StandardDirectoryFactory, the default, is filesystem based.
          solr.RAMDirectoryFactory is memory based, not persistent, and doesn't work with replication. -->
    <directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.StandardDirectoryFactory}"/>

    <dataDir>${solr.core0.data.dir:}</dataDir>

    <!-- To enable dynamic schema REST APIs, use the following for <schemaFactory>:
    
         <schemaFactory class="ManagedIndexSchemaFactory">
           <bool name="mutable">true</bool>
           <str name="managedSchemaResourceName">managed-schema</str>
         </schemaFactory>
         
         When ManagedIndexSchemaFactory is specified, Solr will load the schema from
         the resource named in 'managedSchemaResourceName', rather than from schema.xml.
         Note that the managed schema resource CANNOT be named schema.xml.  If the managed
         schema does not exist, Solr will create it after reading schema.xml, then rename
         'schema.xml' to 'schema.xml.bak'. 
         
         Do NOT hand edit the managed schema - external modifications will be ignored and
         overwritten as a result of schema modification REST API calls.
  
         When ManagedIndexSchemaFactory is specified with mutable = true, schema
         modification REST API calls will be allowed; otherwise, error responses will be
         sent back for these requests. 
    -->
    <schemaFactory class="ClassicIndexSchemaFactory"/>

    <updateHandler class="solr.DirectUpdateHandler2">
        <updateLog>
            <str name="dir">${solr.core0.data.dir:}</str>
        </updateLog>
    </updateHandler>

    <!-- realtime get handler, guaranteed to return the latest stored fields 
      of any document, without the need to commit or open a new searcher. The current 
      implementation relies on the updateLog feature being enabled. -->
    <requestHandler name="/get" class="solr.RealTimeGetHandler">
        <lst name="defaults">
            <str name="omitHeader">true</str>
        </lst>
    </requestHandler>

    <requestHandler name="/replication" class="solr.ReplicationHandler" startup="lazy"/>

    <requestDispatcher handleSelect="true">
        <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="2048" formdataUploadLimitInKB="2048"/>
    </requestDispatcher>

    <requestHandler name="standard" class="solr.StandardRequestHandler" default="true"/>
    <requestHandler name="/analysis/field" startup="lazy" class="solr.FieldAnalysisRequestHandler"/>
    <requestHandler name="/update" class="solr.UpdateRequestHandler"/>
    <requestHandler name="/admin/" class="org.apache.solr.handler.admin.AdminHandlers"/>

    <requestHandler name="/admin/ping" class="solr.PingRequestHandler">
        <lst name="invariants">
            <str name="q">solrpingquery</str>
        </lst>
        <lst name="defaults">
            <str name="echoParams">all</str>
        </lst>
    </requestHandler>

    <!-- Newly added content -->
    <requestHandler name="/extract" class="solr.extraction.ExtractingRequestHandler">
        <lst name="defaults">
            <str name="fmap.content">text</str>
            <str name="lowernames">true</str>
            <str name="uprefix">attr_</str>
            <str name="captureAttr">true</str>
        </lst>
    </requestHandler>

    <lib dir="extract" regex=".*\.jar"/>


    <!-- config for the admin interface -->
    <admin>
        <defaultQuery>solr</defaultQuery>
    </admin>

</config>

schema.xml

<?xml version="1.0" ?>
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->

<schema name="example core zero" version="1.1">

    <!-- general -->

    <field name="type" type="string" indexed="true" stored="true" multiValued="false"/>
    <field name="name" type="string" indexed="true" stored="true" multiValued="false"/>
    <field name="core0" type="string" indexed="true" stored="true" multiValued="false"/>

    <!-- field to use to determine and enforce document uniqueness. -->
    <uniqueKey>id</uniqueKey>

    <!-- field for the QueryParser to use when an explicit fieldname is absent -->
    <defaultSearchField>name</defaultSearchField>

    <!-- SolrQueryParser configuration: defaultOperator="AND|OR" -->
    <solrQueryParser defaultOperator="OR"/>

    <!-- Newly added. Types such as long and string already exist in the original config file; remember to delete the duplicates. -->
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
            <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
    </fieldType>

    <field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true"/>
    <field name="text" type="text_general" indexed="true" stored="true"/>
    <field name="_version_" type="long" indexed="true" stored="true"/>

    <dynamicField name="attr_*" type="text_general" indexed="true" stored="true" multiValued="true"/>

</schema>
