Stanford Chinese Word Segmentation

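It looks like what I actually ended up using did not come from these articles. Of them, item 2, the one on GitHub, covers how to use the segmenter from Maven, with the configuration taking effect through a properties file. I tried it without success, though: the settings in the properties file did not take effect.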

First, download Stanford Word Segmenter version 3.5.2 from [1], take the data folder from it, and put it in src/main/resources of the Maven project.
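Since Maven copies everything under src/main/resources to the root of the classpath, a quick way to confirm that the data folder landed in the right place is to look the files up as classpath resources. The following is only a minimal sketch: the class name SegmenterDataCheck and the method missingResources are hypothetical helpers invented for this illustration, and the two file names are the ones the segmenter code in this post refers to.

```java
import java.util.ArrayList;
import java.util.List;

public class SegmenterDataCheck {

    // Hypothetical helper: returns the names that cannot be found on the classpath.
    public static List<String> missingResources(String... names) {
        ClassLoader loader = SegmenterDataCheck.class.getClassLoader();
        List<String> missing = new ArrayList<>();
        for (String name : names) {
            if (loader.getResource(name) == null) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // File names taken from the segmenter code in this post.
        List<String> missing = missingResources("data/ctb.gz", "data/dict-chris6.ser.gz");
        if (missing.isEmpty()) {
            System.out.println("All segmenter data files were found on the classpath.");
        } else {
            System.out.println("Missing from the classpath: " + missing);
        }
    }
}
```

If a file is reported as missing, the data folder is probably nested one level too deep or was not copied into src/main/resources at all.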

Then add the Maven dependencies:

<properties>
    <java.version>1.8</java.version>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <corenlp.version>3.6.0</corenlp.version>
</properties>

<dependencies>
    <dependency>
        <groupId>edu.stanford.nlp</groupId>
        <artifactId>stanford-corenlp</artifactId>
        <version>${corenlp.version}</version>
    </dependency>
    <dependency>
        <groupId>edu.stanford.nlp</groupId>
        <artifactId>stanford-corenlp</artifactId>
        <version>${corenlp.version}</version>
        <classifier>models</classifier>
    </dependency>
    <dependency>
        <groupId>edu.stanford.nlp</groupId>
        <artifactId>stanford-corenlp</artifactId>
        <version>${corenlp.version}</version>
        <classifier>models-chinese</classifier>
    </dependency>
</dependencies>

Finally, the code is as follows:

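import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class CoreNLPSegment {

    private static CoreNLPSegment    instance;
    private CRFClassifier<CoreLabel> classifier;

    private CoreNLPSegment() {
        // "data" is the folder from the segmenter download, placed under src/main/resources.
        Properties props = new Properties();
        props.setProperty("sighanCorporaDict", "data");
        props.setProperty("serDictionary", "data/dict-chris6.ser.gz");
        props.setProperty("inputEncoding", "UTF-8");
        props.setProperty("sighanPostProcessing", "true");
        classifier = new CRFClassifier<>(props);
        // Load the segmentation model file shipped in the data folder.
        classifier.loadClassifierNoExceptions("data/ctb.gz", props);
        classifier.flags.setProperties(props);
    }

    // Lazily creates the single shared instance; synchronized so that
    // concurrent callers do not load the model twice.
    public static synchronized CoreNLPSegment getInstance() {
        if (instance == null) {
            instance = new CoreNLPSegment();
        }

        return instance;
    }

    public String[] doSegment(String data) {
        // toArray() without an argument is declared to return Object[], so
        // casting its result to String[] is unsafe; pass a typed array instead.
        List<String> words = classifier.segmentString(data);
        return words.toArray(new String[0]);
    }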

    public static void main(String[] args) {

        String sentence = "他和我在學校裡常打桌球。";
        String[] ret = CoreNLPSegment.getInstance().doSegment(sentence);
        for (String str : ret) {
            System.out.println(str);
        }

    }

}
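
One detail to watch when turning the List<String> returned by segmentString into an array: the no-argument toArray() is only declared to return Object[], so casting its result to String[] can throw a ClassCastException, depending on the list implementation and the JDK version. The typed overload toArray(new String[0]) avoids the problem. Below is a minimal, standard-library-only sketch of the difference. The class ToArrayDemo and its methods are hypothetical names used for illustration, and the tokens are made up by hand rather than taken from real segmenter output.

```java
import java.util.ArrayList;
import java.util.List;

public class ToArrayDemo {

    // Converts a token list to String[] using the typed toArray overload.
    public static String[] toStringArray(List<String> tokens) {
        return tokens.toArray(new String[0]);
    }

    // Reports whether the result of the no-argument toArray() is really a
    // String[] at runtime, i.e. whether a (String[]) cast would succeed.
    public static boolean untypedCastIsSafe(List<String> tokens) {
        Object[] untyped = tokens.toArray();
        return untyped instanceof String[];
    }

    public static void main(String[] args) {
        // Hand-written tokens for illustration only; not real segmenter output.
        List<String> tokens = new ArrayList<>();
        tokens.add("他");
        tokens.add("和");
        tokens.add("我");

        System.out.println("Cast of toArray() safe: " + untypedCastIsSafe(tokens));
        System.out.println("Typed conversion: " + String.join(" ", toStringArray(tokens)));
    }
}
```

For an ArrayList the first check reports false, which is why the typed overload is the safer choice whenever the concrete list type is not under your control.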