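Elasticsearch plugin development: a custom tokenization method
Implementing a custom Elasticsearch plugin
1 Plugin project structure
This is a conventional Maven project layout, plus a few directories and files that the plugin needs.
plugin.xml and plugin-descriptor.properties hold the plugin's main packaging configuration and its descriptor.
pom.xml also contains some plugin-related configuration.
The pom.xml file: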
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<name>analysis-gridsum</name>
<groupId>org.elasticsearch</groupId>
<artifactId>gridsum-plugin</artifactId>
<version>0.0.1</version>
<description>Gridsum Elasticsearch plugin: custom tokenizer for Elasticsearch</description>
<properties>
<elasticsearch.version>6.4.1</elasticsearch.version>
<lucene.version>7.5.0</lucene.version>
<maven.compiler.target>1.8</maven.compiler.target>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<dependencies>
<dependency>
<groupId>org.elasticsearch</groupId>
<artifactId>elasticsearch</artifactId>
<version>${elasticsearch.version}</version>
<scope>provided</scope>
</dependency>
<!-- Testing -->
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-api</artifactId>
<version>2.7</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-core</artifactId>
<version>2.7</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.elasticsearch.test</groupId>
<artifactId>framework</artifactId>
<version>${elasticsearch.version}</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-test-framework</artifactId>
<version>${lucene.version}</version>
<scope>test</scope>
</dependency>
</dependencies>
<build>
<resources>
<resource>
<directory>src/main/resources</directory>
<filtering>false</filtering>
<excludes>
<exclude>*.properties</exclude>
</excludes>
</resource>
</resources>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<version>2.6</version>
<configuration>
<appendAssemblyId>false</appendAssemblyId>
<outputDirectory>${project.build.directory}/releases/</outputDirectory>
<descriptors>
<descriptor>${basedir}/src/main/assemblies/plugin.xml</descriptor>
</descriptors>
</configuration>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.5.1</version>
<configuration>
<source>${maven.compiler.target}</source>
<target>${maven.compiler.target}</target>
</configuration>
</plugin>
</plugins>
</build>
</project>
The plugin.xml file:
<?xml version="1.0"?>
<assembly>
<id>analysis-gridsum</id>
<formats>
<format>zip</format>
</formats>
<includeBaseDirectory>false</includeBaseDirectory>
<fileSets>
<fileSet>
<directory>${project.basedir}/config</directory>
<outputDirectory>config</outputDirectory>
</fileSet>
</fileSets>
<files>
<file>
<source>${project.basedir}/src/main/resources/plugin-descriptor.properties</source>
<outputDirectory/>
<filtered>true</filtered>
</file>
</files>
<dependencySets>
<dependencySet>
<outputDirectory/>
<useProjectArtifact>true</useProjectArtifact>
<useTransitiveFiltering>true</useTransitiveFiltering>
<excludes>
<exclude>org.elasticsearch:elasticsearch</exclude>
</excludes>
</dependencySet>
</dependencySets>
</assembly>
The plugin-descriptor.properties file:
description=${project.description}
version=${project.version}
name=${project.name}
classname=org.elasticsearch.gridsum.plugin.GridsumPlugin
java.version=${maven.compiler.target}
elasticsearch.version=${elasticsearch.version}
Once the project structure and these files are in place, you can start writing the plugin.
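As a quick sanity check, the descriptor that ends up in the zip after Maven filtering should contain concrete values rather than ${...} placeholders. The sketch below assumes the resolved values implied by the pom.xml above (the description is shortened for brevity). The class DescriptorCheck and its missingKeys helper are hypothetical names used only for illustration; they are not part of the plugin or of Elasticsearch.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper (not part of Elasticsearch): checks a resolved plugin descriptor.
class DescriptorCheck {
    // The keys used by the plugin-descriptor.properties shown above.
    static final String[] REQUIRED = {
        "description", "version", "name", "classname", "java.version", "elasticsearch.version"
    };

    // Assumed result of filtering the descriptor with the pom.xml values above.
    static final String RESOLVED = String.join("\n",
        "description=gridsum elasticsearch plugin",
        "version=0.0.1",
        "name=analysis-gridsum",
        "classname=org.elasticsearch.gridsum.plugin.GridsumPlugin",
        "java.version=1.8",
        "elasticsearch.version=6.4.1");

    // Returns the required keys that are absent, empty, or still an unresolved ${...} placeholder.
    static List<String> missingKeys(String descriptorText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(descriptorText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> missing = new ArrayList<>();
        for (String key : REQUIRED) {
            String value = props.getProperty(key);
            if (value == null || value.trim().isEmpty() || value.contains("${")) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        System.out.println("Unresolved or missing keys: " + missingKeys(RESOLVED));
    }
}
```

If a key is reported here, the usual suspect is that the descriptor was copied without filtering, so it is worth checking the `<filtered>true</filtered>` setting in plugin.xml.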
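2 Main plugin classes and methods
2.1 A plugin only needs to extend Plugin and implement AnalysisPlugin
GridsumTokenizer is the tokenizer. It extends Tokenizer and overrides incrementToken to implement the custom tokenization logic.
GridsumAnalyzer is the analyzer. It extends Analyzer and wraps a tokenizer.
GridsumAnalyzerProvider is the analyzer provider. It extends AbstractIndexAnalyzerProvider and overrides get to return the custom analyzer.
GridsumTokenizerFactory is the tokenizer factory. It extends AbstractTokenizerFactory and overrides create to return the custom tokenizer.
GridsumPlugin is the plugin's entry point. It extends Plugin and implements AnalysisPlugin, overriding getTokenizers to put the tokenizer factory into a map and getAnalyzers to put the analyzer provider into a map (the keys used here are needed later).
[Figure: class structure of the plugin]
Start with the custom Tokenizer; its most important method is incrementToken.
Next comes the custom Tokenizer factory; its main method is create, which returns the custom Tokenizer.
Then the custom Analyzer: its createComponents method returns a TokenStreamComponents that wraps our custom Tokenizer.
Then the Analyzer provider, whose main job is to return the custom Analyzer.
Finally, the custom Plugin ties these together.
That completes the structure of the plugin.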
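As a rough illustration of the classes described in 2.1, the following minimal sketch shows how they might be wired together against the Elasticsearch 6.4.1 and Lucene 7.5.0 APIs. It is not the author's original code. It assumes the GridsumTokenizer from section 2.2 is on the classpath, and the registration key "gridsum" is a hypothetical value chosen for this example. Package declarations are omitted and the supporting classes are shown in one file for brevity; in the project each class would live in its own file.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.gridsum.plugin.extend.GridsumTokenizer;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractIndexAnalyzerProvider;
import org.elasticsearch.index.analysis.AbstractTokenizerFactory;
import org.elasticsearch.index.analysis.AnalyzerProvider;
import org.elasticsearch.index.analysis.TokenizerFactory;
import org.elasticsearch.indices.analysis.AnalysisModule.AnalysisProvider;
import org.elasticsearch.plugins.AnalysisPlugin;
import org.elasticsearch.plugins.Plugin;

// Analyzer: wraps the custom tokenizer, with no additional token filters.
class GridsumAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        return new TokenStreamComponents(new GridsumTokenizer());
    }
}

// Tokenizer factory: create() hands out a new tokenizer instance.
class GridsumTokenizerFactory extends AbstractTokenizerFactory {
    GridsumTokenizerFactory(IndexSettings indexSettings, Environment env,
                            String name, Settings settings) {
        super(indexSettings, name, settings);
    }

    @Override
    public Tokenizer create() {
        return new GridsumTokenizer();
    }
}

// Analyzer provider: get() returns the custom analyzer.
class GridsumAnalyzerProvider extends AbstractIndexAnalyzerProvider<GridsumAnalyzer> {
    private final GridsumAnalyzer analyzer = new GridsumAnalyzer();

    GridsumAnalyzerProvider(IndexSettings indexSettings, Environment env,
                            String name, Settings settings) {
        super(indexSettings, name, settings);
    }

    @Override
    public GridsumAnalyzer get() {
        return analyzer;
    }
}

// Plugin entry point: registers the factory and the provider under a key.
public class GridsumPlugin extends Plugin implements AnalysisPlugin {
    // Hypothetical key; this is the name later used in index settings and _analyze requests.
    private static final String KEY = "gridsum";

    @Override
    public Map<String, AnalysisProvider<TokenizerFactory>> getTokenizers() {
        Map<String, AnalysisProvider<TokenizerFactory>> tokenizers = new HashMap<>();
        tokenizers.put(KEY, GridsumTokenizerFactory::new);
        return tokenizers;
    }

    @Override
    public Map<String, AnalysisProvider<AnalyzerProvider<? extends Analyzer>>> getAnalyzers() {
        Map<String, AnalysisProvider<AnalyzerProvider<? extends Analyzer>>> analyzers = new HashMap<>();
        analyzers.put(KEY, GridsumAnalyzerProvider::new);
        return analyzers;
    }
}
```

The constructor references work because AnalysisProvider is a single-method interface whose get method takes the same four arguments (IndexSettings, Environment, String, Settings) as the constructors above.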
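2.2 Implementing your own tokenization logic
The key to custom tokenization is the incrementToken method of the custom tokenizer, GridsumTokenizer; overriding it is how the custom behavior is implemented.
The implementation below is based on one found online. It splits on spaces and a few punctuation characters (hyphen, parentheses, slash) and lowercases ASCII letters: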
package org.elasticsearch.gridsum.plugin.extend;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

import java.io.IOException;

/** Splits the input on the characters in DELIMITERS and lowercases ASCII letters. */
public class GridsumTokenizer extends Tokenizer {
    // Characters that separate tokens: space, hyphen, parentheses and slash.
    private static final String DELIMITERS = " -()/";

    private final StringBuilder buffer = new StringBuilder();
    // Number of characters consumed from the input so far.
    private int position = 0;

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

    @Override
    public final boolean incrementToken() throws IOException {
        clearAttributes();
        buffer.setLength(0);
        int tokenStart = position;
        int ci;
        while ((ci = input.read()) != -1) {
            position++;
            char ch = (char) ci;
            if (DELIMITERS.indexOf(ch) != -1) {
                if (buffer.length() > 0) {
                    // The delimiter ends the current token but is not part of it.
                    return emit(tokenStart);
                }
                // Skip leading or repeated delimiters.
                tokenStart = position;
            } else {
                // Lowercase ASCII A-Z; all other characters are kept as they are.
                if (ch >= 'A' && ch <= 'Z') {
                    ch = (char) (ch + 32);
                }
                buffer.append(ch);
            }
        }
        // End of input: emit the last token if there is one.
        return buffer.length() > 0 && emit(tokenStart);
    }

    // Publishes the buffered term; the end offset excludes the trailing delimiter.
    private boolean emit(int tokenStart) {
        termAtt.setEmpty().append(buffer);
        offsetAtt.setOffset(correctOffset(tokenStart), correctOffset(tokenStart + buffer.length()));
        return true;
    }

    @Override
    public final void end() throws IOException {
        super.end();
        // The final offset is the total number of characters read.
        final int finalOffset = correctOffset(position);
        offsetAtt.setOffset(finalOffset, finalOffset);
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        position = 0;
    }
}
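To see which tokens this splitting rule yields without starting Elasticsearch, the same rule (split on the characters in " -()/", lowercase ASCII A-Z) can be replayed over a plain java.io.Reader. This is only an illustrative sketch: SplitRuleDemo and its split method are hypothetical names, no Lucene classes are involved, and offsets are not tracked.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone replay of the tokenizer's splitting rule (no Lucene involved).
class SplitRuleDemo {
    // Same delimiter set as the tokenizer: space, hyphen, parentheses and slash.
    private static final String DELIMITERS = " -()/";

    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        try (Reader input = new StringReader(text)) {
            int ci;
            while ((ci = input.read()) != -1) {
                char ch = (char) ci;
                if (DELIMITERS.indexOf(ch) != -1) {
                    // A delimiter closes the current token; empty tokens are skipped.
                    if (buffer.length() > 0) {
                        tokens.add(buffer.toString());
                        buffer.setLength(0);
                    }
                } else {
                    // Lowercase ASCII A-Z only, as the tokenizer does.
                    if (ch >= 'A' && ch <= 'Z') {
                        ch = (char) (ch + 32);
                    }
                    buffer.append(ch);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (buffer.length() > 0) {
            tokens.add(buffer.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("Elastic-Search (Plugin)/Demo")); // [elastic, search, plugin, demo]
        System.out.println(split("我愛北京 天安門"));               // [我愛北京, 天安門]
    }
}
```

Note that Chinese text is only split where a delimiter appears, so this rule does no real word segmentation for Chinese.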
3 Verification and installation
A local test looks like this:
package org.elasticsearch.gridsum.plugin;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.elasticsearch.gridsum.plugin.extend.GridsumAnalyzer;
import org.junit.Test;
public class GridsumAnalyzerTest {
@Test
public void testAnalyzer() throws Exception {
GridsumAnalyzer analyzer = new GridsumAnalyzer();
TokenStream ts = analyzer.tokenStream("text", "我愛北京 天安門");
CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
ts.reset();
while (ts.incrementToken()) {
System.out.println(term.toString());
}
ts.end();
ts.close();
}
}
Output:
我愛北京
天安門
Once the code is written, build the package:
mvn clean package
The build produces a local zip file, which is used to install the plugin into Elasticsearch.
Install command (Windows):
elasticsearch\bin> elasticsearch-plugin.bat install file:D:/elastic-gridsum-plugin/target/releases/gridsum-plugin-0.0.1.zip
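If the installer reports that the plugin already exists, remove it first. Make sure the plugin name does not clash with any other installed plugin. The name comes from the name property in plugin-descriptor.properties (here ${project.name}, that is, the <name> element in pom.xml), which in this project has the same value as the <assembly><id> in plugin.xml: analysis-gridsum.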
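To remove the plugin:
elasticsearch\bin> elasticsearch-plugin.bat remove analysis-gridsum
After a successful install, start Elasticsearch:
elasticsearch\bin> elasticsearch.bat
Once it is running, the analyzer can be verified from Postman.
Note that the analyzer name used in the request is the key placed in the map when overriding getAnalyzers in section 2.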
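For readers without Postman, the same check can be made from Java with a request to the _analyze API. The sketch below is written under assumptions: the analyzer key "gridsum" is hypothetical (use whatever key your getAnalyzers map registers), and the URL passed on the command line is expected to point at your own node, for example http://localhost:9200/_analyze on a default local install. AnalyzeRequestDemo and its helper methods are illustrative names only.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

// Hypothetical helper: sends an _analyze request to a running Elasticsearch node.
class AnalyzeRequestDemo {
    // Builds the JSON body for the _analyze API.
    static String buildBody(String analyzer, String text) {
        return "{\"analyzer\":\"" + escape(analyzer) + "\",\"text\":\"" + escape(text) + "\"}";
    }

    // Minimal JSON escaping: handles backslashes and double quotes only.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // POSTs the body as JSON and returns the raw response text.
    static String post(String url, String body) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json; charset=UTF-8");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        try (InputStream in = conn.getInputStream();
             Scanner scanner = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    public static void main(String[] args) throws IOException {
        // "gridsum" is an assumed key; replace it with the key registered in getAnalyzers.
        String body = buildBody("gridsum", "我愛北京 天安門");
        System.out.println("Request body: " + body);
        if (args.length > 0) {
            // Only contact the node when its _analyze URL is given explicitly.
            System.out.println(post(args[0], body));
        }
    }
}
```

If the key does not match one registered by the plugin, Elasticsearch rejects the request instead of returning tokens, which makes this a convenient check that the plugin was actually loaded.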