Building a Large-Scale Search Service Architecture, Part 1: A Spring-Based Search Service

阿新 • Published: 2019-01-05

The goal is something like a Baidu-style search system.

Some terminology

Introduction to Lucene
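- 1. What is Lucene?
  Lucene is a full-text search framework, not an end-user product. Unlike http://www.baidu.com/ or Google Desktop, you cannot use it out of the box; it only provides the tools for building such products.
- 2. What can Lucene do?
  To answer that, you first need to understand what Lucene really is. Its function is narrow: you give it a number of strings, and it gives you a full-text search service that tells you where the keywords you search for occur. Once you understand this, you can use it for anything that fits that model. You can index all the news on a site to build a searchable archive; you can index several columns of a database table so that you no longer have to worry about “%like%” queries locking the table; or you can write your own search engine.
- 3. Should you choose Lucene?
  Here are some test figures; if you find them acceptable, Lucene is a reasonable choice.
  Test 1: 2.5 million records, about 300 MB of text, producing an index of about 380 MB; average processing time of 300 ms with 800 threads.
  Test 2: 37,000 records, indexing two varchar columns of a database table; index file of 2.6 MB; average processing time of 1.5 ms with 800 threads.
- 4. Why is Lucene so fast?
  - Inverted index
  - Compression algorithms
  - Binary search
- 4.1 Inverted index
  Records are looked up by attribute value. Each entry in such an index holds one attribute value together with the addresses of all records that have that value. Because the attribute value determines the location of the records, rather than the record determining the attribute value, it is called an inverted index.
- 5. How Lucene works
  Lucene's service has two halves: one in, one out. "In" means writing: adding the source you provide (essentially a string) to the index, or deleting it from the index. "Out" means reading: providing full-text search so that users can locate a source by keyword.
  - Write path
    The source string first goes through an analyzer, which tokenizes it into individual words and optionally removes stopwords. The required information from the source is added to the Fields of a Document; Fields that need to be searchable are indexed, and Fields that need to be retrievable are stored. The index is then written to storage, which can be memory or disk.
  - Read path
    The user supplies search keywords, which are processed by the analyzer. The index is searched for the processed keywords to find the matching Documents. The user then extracts the needed Fields from the Documents found.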
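To make the inverted index and the write/read paths described above concrete, here is a minimal sketch in plain Java. It is a toy model under stated assumptions, not Lucene's actual implementation: the class name InvertedIndexSketch, the tiny stopword list, and the rule that a multi-word query must match all terms are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy illustration only; this is not how Lucene is implemented internally.
class InvertedIndexSketch {
    // Hypothetical stopword list chosen for this example
    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of"));

    // term -> sorted ids of the documents that contain the term
    private final TreeMap<String, TreeSet<Integer>> postings = new TreeMap<>();

    // "Analyzer" step: lowercase, split on non-letter/non-digit characters, drop stopwords
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !STOPWORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    // Write path: analyze the source string and record docId under each term
    void add(int docId, String text) {
        for (String term : analyze(text)) {
            TreeSet<Integer> docs = postings.get(term);
            if (docs == null) {
                docs = new TreeSet<>();
                postings.put(term, docs);
            }
            docs.add(docId);
        }
    }

    // Read path: analyze the keywords the same way, then intersect the posting lists
    List<Integer> search(String keywords) {
        TreeSet<Integer> result = null;
        for (String term : analyze(keywords)) {
            TreeSet<Integer> docs = postings.get(term);
            if (docs == null) {
                return new ArrayList<>(); // one term is missing, so nothing matches
            }
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new ArrayList<Integer>() : new ArrayList<>(result);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Lucene is a full-text search framework");
        index.add(2, "The search form of a web page");
        index.add(3, "An inverted index maps terms to documents");
        System.out.println(index.search("search"));      // [1, 2]
        System.out.println(index.search("search form")); // [2]
        System.out.println(index.search("the"));         // [] (stopword only)
    }
}
```

The point of the sketch is that a lookup goes from term to documents, so a query never has to scan the documents themselves. The same analysis step runs on both the write path and the read path, which is why the analyzer used for searching should match the one used for indexing.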
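Step 1: Create and query the index
- Create a webapp project in IDEA and import the corresponding Lucene jars
- Create two directories: one for the article resources and one for the index files
- The article resource directory can be filled by crawling a site directly: wget -o /tmp/wget.log -P /root/data --no-parent --no-verbose -m -D www.bjsxt.com -N --convert-links --random-wait -A html,HTML http://www.bjsxt.com
- Two functions: CreateIndex builds the index and SearchIndex queries it
- Test results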

package com.sxt.lucene;

import org.apache.commons.io.FileUtils;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.junit.Test;

import java.io.File;
import java.io.IOException;
/**
 * @author: ZouTai
 * @date: 2018/3/28
 * @description: Create the index
 */

public class CreateIndex {

    // Static fields: locations of the source documents and the index
    static String dataDir = "E:/JavaEE_IJ_WorkSpace/lucene/Data/data";
    static String indexDir = "E:/JavaEE_IJ_WorkSpace/lucene/Data/index";

    @Test
    public void createIndex() {
        try {
            // Index directory and analyzer
            Directory dir = FSDirectory.open(new File(indexDir));
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_4_9);

            // Index writer configuration
            IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_4_9, analyzer);
            indexWriterConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);

            // List the source files; listFiles() returns null if dataDir is not a readable directory
            File[] files = new File(dataDir).listFiles();
            if (files == null) {
                throw new IOException("Cannot list files in " + dataDir);
            }

            // try-with-resources closes the writer and releases the index lock even if indexing fails
            try (IndexWriter indexWriter = new IndexWriter(dir, indexWriterConfig)) {
                for (File f : files) {
                    if (!f.isFile()) {
                        continue; // skip subdirectories
                    }
                    Document document = new Document();
                    // File name, content, and last-modified time
                    document.add(new StringField("filename", f.getName(), Field.Store.YES));
                    document.add(new TextField("content", FileUtils.readFileToString(f), Field.Store.YES));
                    document.add(new LongField("lastModify", f.lastModified(), Field.Store.YES));
                    indexWriter.addDocument(document);
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}

package com.sxt.lucene;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.junit.Test;

import java.io.File;

/**
 * @author: ZouTai
 * @date: 2018/3/29
 * @description: Query the index
 */
public class SearchIndex {

    @Test
    public void searchIndex() {
        // try-with-resources closes the reader and the directory once the search is done
        try (Directory directory = FSDirectory.open(new File(CreateIndex.indexDir));
             IndexReader indexReader = DirectoryReader.open(directory)) {
            IndexSearcher indexSearcher = new IndexSearcher(indexReader);

            // Parse the query with the same analyzer that was used at index time
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_4_9);
            QueryParser queryParser = new QueryParser(Version.LUCENE_4_9, "content", analyzer);
            Query query = queryParser.parse("form");

            // Fetch the top 10 hits and print the stored file name of each
            TopDocs topDocs = indexSearcher.search(query, 10);
            ScoreDoc[] scoreDocs = topDocs.scoreDocs;
            for (ScoreDoc sd : scoreDocs) {
                int docId = sd.doc;
                Document document = indexReader.document(docId);
                System.out.println(document.get("filename"));
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

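Chinese word segmentation support
- There are currently 11 major Chinese word segmenters; this article uses IKAnalyzer.
- Unlike English, where words stand alone, Chinese text has to be segmented, and ambiguous character sequences have to be resolved, which calls for statistical methods. Take “請把手拿開” ("please take your hand away"): “把手” can be read as a single noun ("handle") or as two separate words, “把” followed by “手” ("hand"), and here the second reading is the correct one. So context, or probabilities drawn from a large corpus, is sometimes needed to decide which reading is most likely.
- Importing jars: the current state of the jar dependencies
- Build the Spring MVC project structure:
  - Import the 8 jars for Spring MVC: 7 Spring jars plus JSTL
  - Configure the XML files: web.xml and spring-servlet.xml
- Write the corresponding MVC Java files
- A few issues to watch for:
  - Converting the webapp module to a web module requires configuring a Tomcat container (see reference)
  - Startup error in IntelliJ IDEA: "Error:java: invalid source release: 8". Spring 3.2 corresponds to JDK 1.7
  - "(Access is denied.)": the resource file is in use
  - Workflow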
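Returning to the segmentation ambiguity in “請把手拿開”: the sketch below shows why a purely dictionary-driven segmenter gets this sentence wrong. It uses forward maximum matching, a simple textbook technique chosen here only for illustration; it is not a description of how IKAnalyzer works, and the class name and the five-word dictionary are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy dictionary-based segmenter (forward maximum matching); not IKAnalyzer's algorithm.
class ForwardMaxMatchSketch {
    private final Set<String> dictionary;
    private final int maxWordLength;

    ForwardMaxMatchSketch(Set<String> dictionary) {
        this.dictionary = dictionary;
        int max = 1;
        for (String word : dictionary) {
            max = Math.max(max, word.length());
        }
        this.maxWordLength = max;
    }

    // At each position take the longest dictionary word; fall back to a single character
    List<String> segment(String text) {
        List<String> words = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLength);
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            words.add(text.substring(pos, end));
            pos = end;
        }
        return words;
    }

    public static void main(String[] args) {
        // Hypothetical mini-dictionary containing both readings of the ambiguous span
        Set<String> dict = new HashSet<>(Arrays.asList("請", "把", "手", "把手", "拿開"));
        System.out.println(new ForwardMaxMatchSketch(dict).segment("請把手拿開"));
        // prints [請, 把手, 拿開]: the greedy match picks the noun "把手" (handle)
    }
}
```

Because the greedy rule always prefers the longest dictionary entry, it chooses “把手” even though the sentence needs “把” and “手” as separate words. Resolving cases like this is what the statistical or context-based methods mentioned above are for.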
