LingPipe: a text chunking and entity recognition example

1) What is LingPipe?

In short, it is a Natural Language Processing (NLP) toolkit written in Java; see the references below for details.

LingPipe's main modules include:
Topic Classification
Named Entity Recognition (NER): identifying person, place, organization and other names in text
Part-of-Speech Tagging
Sentence Detection
Query Spell Checking
Interesting Phrase Detection
Clustering
Character Language Modeling
MEDLINE Download, Parsing and Indexing
Database Text Mining
Chinese Word Segmentation
Sentiment Analysis
Language Identification

References

Official LingPipe documentation: http://alias-i.com/lingpipe/demos/tutorial/cluster/read-me.html

Stanford CoreNLP, a comparable NLP toolkit: http://nlp.stanford.edu/software/corenlp.shtml

2) Below is a chunking example I wrote for reference; it applies named entity recognition and sentence detection, and is built against lingpipe-4.1.0.jar:

import java.util.ArrayList;
import java.util.List;

import com.aliasi.chunk.Chunk;
import com.aliasi.chunk.Chunking;
import com.aliasi.dict.DictionaryEntry;
import com.aliasi.dict.ExactDictionaryChunker;
import com.aliasi.dict.MapDictionary;
import com.aliasi.sentences.IndoEuropeanSentenceModel;
import com.aliasi.sentences.SentenceModel;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.Tokenizer;
import com.aliasi.tokenizer.TokenizerFactory;

public class TextAnalyzer {

    static final double CHUNK_SCORE = 1.0;
    static final TokenizerFactory TOKENIZER_FACTORY = IndoEuropeanTokenizerFactory.INSTANCE;
    static final SentenceModel SENTENCE_MODEL = new IndoEuropeanSentenceModel();

    public static void main(String[] args) {
    	testChunkSentences();
    	testChunkDictionary();
    }
    
    // Sentence detection - splitting text into sentences
    private static void testChunkSentences() {
    	String text = "50 Cent XYZ120 DVD Player 50 Cent lawyer. Person is john, he is a lawyer.";
    	List<String> result = new ArrayList<String>();
    	
    	List<String> tokenList = new ArrayList<String>();
		List<String> whiteList = new ArrayList<String>();
		Tokenizer tokenizer = TOKENIZER_FACTORY.tokenizer(text.toCharArray(),
				0, text.length());
		tokenizer.tokenize(tokenList, whiteList);
		String[] tokens = new String[tokenList.size()];
		String[] whites = new String[whiteList.size()];
		tokenList.toArray(tokens);
		whiteList.toArray(whites);
		int[] sentenceBoundaries = SENTENCE_MODEL.boundaryIndices(tokens,
				whites);
		int sentStartTok = 0;
		int sentEndTok = 0;
		for (int i = 0; i < sentenceBoundaries.length; ++i) {
			System.out.println("Sentense " + (i + 1) + ", sentense's length(from 0):" + (sentenceBoundaries[i]));
			StringBuilder sb = new StringBuilder();
			sentEndTok = sentenceBoundaries[i];
			for (int j = sentStartTok; j <= sentEndTok; j++) {
				sb.append(tokens[j]).append(whites[j + 1]);
			}
			sentStartTok = sentEndTok + 1;
			result.add(sb.toString());
		}
		System.out.println("Final result:" + result);
    }
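    
    // Alternative: LingPipe also provides com.aliasi.sentences.SentenceChunker,
    // which wraps the token/boundary bookkeeping done by hand above. This is a
    // minimal sketch, not part of the original example; it reuses this class's
    // TOKENIZER_FACTORY and SENTENCE_MODEL.
    private static void testSentenceChunker() {
    	String text = "50 Cent XYZ120 DVD Player 50 Cent lawyer. Person is john, he is a lawyer.";
    	com.aliasi.sentences.SentenceChunker sentenceChunker =
    			new com.aliasi.sentences.SentenceChunker(TOKENIZER_FACTORY, SENTENCE_MODEL);
    	Chunking chunking = sentenceChunker.chunk(text);
    	for (Chunk sentence : chunking.chunkSet()) {
    		// Each chunk spans one sentence, addressed by character offsets.
    		System.out.println("Sentence: " + text.substring(sentence.start(), sentence.end()));
    	}
    }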
    
    // NER (named entity recognition) - exact dictionary-based chunking
    private static void testChunkDictionary() {
    	String[] args1 = {"50 Cent XYZ120 DVD Player 50 Cent lawyer.", "person is john, he is a lawyer."};

        MapDictionary<String> dictionary = new MapDictionary<String>();
        dictionary.addEntry(new DictionaryEntry<String>("50 Cent","PERSON",CHUNK_SCORE));
        dictionary.addEntry(new DictionaryEntry<String>("XYZ120 DVD Player","DB_ID_1232",CHUNK_SCORE));
        dictionary.addEntry(new DictionaryEntry<String>("cent","MONETARY_UNIT",CHUNK_SCORE));
        dictionary.addEntry(new DictionaryEntry<String>("dvd player","PRODUCT",CHUNK_SCORE));


        // returnAllMatches=true, caseSensitive=true
        ExactDictionaryChunker dictionaryChunkerTT
            = new ExactDictionaryChunker(dictionary,
                                         IndoEuropeanTokenizerFactory.INSTANCE,
                                         true,true);

        // returnAllMatches=true, caseSensitive=false
        ExactDictionaryChunker dictionaryChunkerTF
            = new ExactDictionaryChunker(dictionary,
                                         IndoEuropeanTokenizerFactory.INSTANCE,
                                         true,false);

        // returnAllMatches=false restricts the output to the longest
        // non-overlapping matches instead of every overlapping match
        ExactDictionaryChunker dictionaryChunkerFT
            = new ExactDictionaryChunker(dictionary,
                                         IndoEuropeanTokenizerFactory.INSTANCE,
                                         false,true);

        // returnAllMatches=false, caseSensitive=false
        ExactDictionaryChunker dictionaryChunkerFF
            = new ExactDictionaryChunker(dictionary,
                                         IndoEuropeanTokenizerFactory.INSTANCE,
                                         false,false);



        System.out.println("\nDICTIONARY\n" + dictionary);

        for (int i = 0; i < args1.length; ++i) {
            String text = args1[i];
            System.out.println("\n\nTEXT=" + text);

            chunk(dictionaryChunkerTT,text);
            chunk(dictionaryChunkerTF,text);
            chunk(dictionaryChunkerFT,text);
            chunk(dictionaryChunkerFF,text);
        }
    }

    static void chunk(ExactDictionaryChunker chunker, String text) {
        System.out.println("\nChunker."
                           + " All matches=" + chunker.returnAllMatches()
                           + " Case sensitive=" + chunker.caseSensitive());
        Chunking chunking = chunker.chunk(text);
        for (Chunk chunk : chunking.chunkSet()) {
            int start = chunk.start();
            int end = chunk.end();
            String type = chunk.type();
            double score = chunk.score();
            String phrase = text.substring(start,end);
            System.out.println("     phrase=|" + phrase + "|"
                               + " start=" + start
                               + " end=" + end
                               + " type=" + type
                               + " score=" + score);
        }
    }
}
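
The example above uses exact dictionary matching, which can only find entities that were listed in advance. LingPipe can also learn NER statistically with an HMM chunker (CharLmHmmChunker backed by an HmmCharLmEstimator), trained from labeled chunkings and compiled to disk with AbstractExternalizable. The following is only a minimal sketch of that API, not part of the original example: the training sentence, the chunk offsets, and the model path are invented for demonstration, and a real model needs a full labeled corpus (for instance via the corpus parsers shipped with the LingPipe demos).

import java.io.File;
import java.io.IOException;

import com.aliasi.chunk.CharLmHmmChunker;
import com.aliasi.chunk.ChunkFactory;
import com.aliasi.chunk.Chunker;
import com.aliasi.chunk.ChunkingImpl;
import com.aliasi.hmm.HmmCharLmEstimator;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.util.AbstractExternalizable;

public class TrainHmmChunker {

    public static void main(String[] args) throws IOException, ClassNotFoundException {
        // Character-LM HMM estimator: 8-gram character models over an
        // alphabet of at most 256 characters, interpolation ratio 8.0
        // (the constants used in the LingPipe named-entity tutorial).
        HmmCharLmEstimator estimator = new HmmCharLmEstimator(8, 256, 8.0);
        CharLmHmmChunker trainer =
            new CharLmHmmChunker(IndoEuropeanTokenizerFactory.INSTANCE, estimator);

        // Feed one hand-labeled training chunking; a real system would
        // stream thousands of these from a labeled corpus instead.
        String sentence = "Person is john, he is a lawyer.";
        ChunkingImpl chunking = new ChunkingImpl(sentence);
        chunking.add(ChunkFactory.createChunk(10, 14, "PERSON")); // "john"
        trainer.handle(chunking);

        // Compile the trained model to disk, then read it back as a plain
        // Chunker ready to tag new text ("ne-chunker.model" is a made-up path).
        File modelFile = new File("ne-chunker.model");
        AbstractExternalizable.compileTo(trainer, modelFile);
        Chunker chunker = (Chunker) AbstractExternalizable.readObject(modelFile);
        System.out.println(chunker.chunk("Person is john.").chunkSet());
    }
}

With enough training data, the compiled chunker returns PERSON chunks just like the dictionary chunker above, but it can also generalize to names it has never seen.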