Using HanLP word segmentation with Lucene

阿新 • Published: 2018-12-03

Maven dependencies

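<project>
  <modelVersion>4.0.0</modelVersion>

  <groupId>ff</groupId>
  <artifactId>dd</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>

  <!-- ${lucene.version} must be defined under <properties>; its value is not shown in the source. -->
  <dependencies>
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-core</artifactId>
      <version>${lucene.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-queryparser</artifactId>
      <version>${lucene.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-analyzers-smartcn</artifactId>
      <version>${lucene.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-analyzers-common</artifactId>
      <version>${lucene.version}</version>
    </dependency>
    <dependency>
      <groupId>com.hankcs.nlp</groupId>
      <artifactId>hanlp-lucene-plugin</artifactId>
      <version>1.1.2</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
</project>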

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

import com.hankcs.lucene.HanLPAnalyzer;
import com.hankcs.lucene.HanLPIndexAnalyzer;

// The original post shows only main(); the class name here is illustrative.
public class HanLPLuceneDemo {

    public static void main(String[] args) throws ParseException, IOException {
        long time = System.currentTimeMillis();

        String text = "以前釋出過HanLP的Lucene外掛，後來很多人跟我說山東人比武漢人聽說過吃一線，長一智更好其實Solr更流行（反正我是覺得既然Solr是Lucene的子專案，那麼稍武漢輕工大學微改改配置就能紅安以及黃石路支援Solr），於是就抽空做了個Solr外掛出來，開源在Github上，歡迎改進來自王寶強";

        // Standard analyzer: long words are kept whole.
        Analyzer analyzer = new HanLPAnalyzer();
        printTokens(analyzer, text);
        System.out.println(11111111); // marker between the two analyzers' output

        // Index analyzer: long words are fully split into their sub-words.
        Analyzer indexAnalyzer = new HanLPIndexAnalyzer();
        printTokens(indexAnalyzer, text);

        // Inspect the segmentation result through a parsed query.
        QueryParser queryParser = new QueryParser("txt", analyzer);
        Query query = queryParser.parse(text);
        // System.out.println(query.toString("txt"));
        queryParser = new QueryParser("txt", indexAnalyzer);
        query = queryParser.parse(text);
        System.out.println(query.toString("txt"));

        // Total elapsed time in milliseconds.
        System.out.println(System.currentTimeMillis() - time);
    }

    /** Prints each token with its start offset, end offset, and position increment. */
    private static void printTokens(Analyzer analyzer, String text) throws IOException {
        // try-with-resources closes the stream even if tokenization fails.
        try (TokenStream ts = analyzer.tokenStream("field", text)) {
            // Attribute instances are reused for every token, so fetch them once.
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);
            PositionIncrementAttribute posInc = ts.addAttribute(PositionIncrementAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term + "  "
                        + offset.startOffset() + "  " + offset.endOffset() + "  "
                        + posInc.getPositionIncrement());
            }
            ts.end(); // the TokenStream contract requires end() before close()
        }
    }
}

Beginning of the output (columns: token, start offset, end offset, position increment):

以前 0 2 1
釋出 2 4 1
過 4 5 1
HanLP 5 10 1
的 10 11 1
Lucene 11 17 1
外掛 17 19 1
， 19 20 1
後來 20 22 1
很多 22 24 1
人 24 25 1
跟 25 26 1
我 26 27 1
說 27 28 1
山東 28 30 1
人 30 31 1
比 31 32 1
武漢 32 34 1
人 34 35 1
聽說 35 37 1
過 37 38 1
吃 38 39 1
一線 39 41 1
， 41 42 1
長一智 42 45 1
更好 45 47 1
其實 47 49 1
Solr 49 53 1
更 53 54 1
流行 54 56 1
（ 56 57 1
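
The code comments above distinguish a standard analyzer (long words are not split) from an index analyzer (long words are fully split). The following is a minimal sketch of that difference using only the Java standard library. It is a toy dictionary segmenter based on forward maximum matching, not HanLP's actual algorithm; the class name `ToySegmenter`, its methods, and the four-word dictionary are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Toy dictionary segmenter (hypothetical). Illustrates the two modes only; this is NOT how HanLP works internally. */
final class ToySegmenter {
    private ToySegmenter() {}

    /** Standard mode: forward maximum matching, so the longest dictionary word wins and stays whole. */
    static List<String> standard(String text, Set<String> dict) {
        int maxLen = 1;
        for (String w : dict) {
            maxLen = Math.max(maxLen, w.length());
        }
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = i + 1; // fall back to a single character when nothing matches
            for (int len = Math.min(maxLen, text.length() - i); len > 1; len--) {
                if (dict.contains(text.substring(i, i + len))) {
                    end = i + len;
                    break;
                }
            }
            tokens.add(text.substring(i, end));
            i = end;
        }
        return tokens;
    }

    /** Index mode: every standard token, followed by each shorter dictionary word (2+ chars) found inside it. */
    static List<String> index(String text, Set<String> dict) {
        List<String> out = new ArrayList<>();
        for (String token : standard(text, dict)) {
            out.add(token);
            for (int s = 0; s < token.length(); s++) {
                for (int e = s + 2; e <= token.length(); e++) {
                    String sub = token.substring(s, e);
                    if (sub.length() < token.length() && dict.contains(sub)) {
                        out.add(sub);
                    }
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed toy dictionary; a real segmenter ships a far larger one.
        Set<String> dict = Set.of("武漢", "輕工", "大學", "武漢輕工大學");
        String text = "武漢輕工大學";
        System.out.println(standard(text, dict)); // [武漢輕工大學]
        System.out.println(index(text, dict));    // [武漢輕工大學, 武漢, 輕工, 大學]
    }
}
```

Emitting the sub-words in index mode lets a search for a short term such as 大學 match a document that contains only the long word, which is the usual reason to index with the full-split analyzer.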

Source: the blog of the official account 火炎一笑傾城
