English Lemmatization in Java with stanford-corenlp

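Author: Qian Yang (錢洋), School of Management, Hefei University of Technology. Email: [email protected]. This post may contain oversights; comments and discussion are welcome.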

Introduction

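When analyzing English text data, the first step is to reduce each word to its base form (its lemma). For example, if a text contains both 'options' and 'option', the two words mean the same thing, so preprocessing should map both of them to 'option'. Take the following text as an example: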

jhend925  https://blog.csdn.net/timo1160139211/article/details/77603141. All 2015 GTIs have heated seats, including the S trim level 
with no options. I used to own a 2013 335i with the news system and now own 
a 2011 M3 with the stalk system you talked about. The new system on the 2013 was much better. 
Ok, yeah, you have to turn it on every time, but it's much easier to adjust speed

The goal is to split this text into words and punctuation marks, and then lemmatize each token.

stanford-corenlp

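stanford-corenlp is a powerful natural language processing toolkit with high accuracy, and it can perform both of the steps above. The jar files it requires include: [Image: list of required jar files]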

Java program

The following Java program performs these steps:

package com.lda.datadeal;

import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class Test {

    public static void main(String[] args) {
        String aString = "jhend925 https://blog.csdn.net/timo1160139211/article/details/77603141. All 2015 GTIs have heated seats, including the S trim level with no options. I used to own a 2013 335i with the news system and now own a 2011 M3 with the stalk system you talked about. The new system on the 2013 was much better. Ok, yeah, you have to turn it on every time, but it's much easier to adjust speed";
        List<String> word = getlema(aString);
        // Print one lemma per line
        for (int i = 0; i < word.size(); i++) {
            System.out.println(word.get(i));
        }
    }

    /**
     * Lemmatization.
     * @param text the input string
     * @return List<String> the tokens after tokenization and lemmatization
     */
    public static List<String> getlema(String text) {
        // Collects the lemma of every token
        List<String> wordslist = new ArrayList<>();
        // Set up the StanfordCoreNLP pipeline properties:
        // tokenization, sentence splitting, POS tagging, and lemmatization
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        Annotation document = new Annotation(text);
        pipeline.annotate(document);
        // Iterate over the sentences, then over the tokens of each sentence
        List<CoreMap> words = document.get(CoreAnnotations.SentencesAnnotation.class);
        for (CoreMap word_temp : words) {
            for (CoreLabel token : word_temp.get(CoreAnnotations.TokensAnnotation.class)) {
                // The lemma of the token, i.e. the word reduced to its base form
                String lema = token.get(CoreAnnotations.LemmaAnnotation.class);
                wordslist.add(lema);
            }
        }
        return wordslist;
    }
}

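The getlema method defined here takes a piece of text as input and returns the tokens after tokenization and lemmatization. In practice, the list returned by getlema can be processed further, for example by removing punctuation, stop words, and URL-like tokens. To make the result easier to follow, the console output is shown below: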

jhend925
https://blog.csdn.net/timo1160139211/article/details/77603141
.
all
2015
gti
have
heat
seat
,
include
the
s
trim
level
with
no
option
.
I
use
to
own
a
2013
335us
with
the
news
system
and
now
own
a
2011
m3
with
the
stalk
system
you
talk
about
.
the
new
system
on
the
2013
be
much
better
.
Ok
,
yeah
,
you
have
to
turn
it
on
every
time
,
but
it
be
much
easier
to
adjust
speed
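
The further processing mentioned earlier (removing punctuation, stop words, and URL-like tokens) could be sketched roughly as follows. This is only an illustration and not part of the original program: the class name LemmaFilter, the helper methods, and the very small stop-word list are all hypothetical, and a real project would likely use a fuller stop-word list and stricter URL detection. It uses only the Java standard library and takes a list of lemmas such as the one returned by getlema.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class LemmaFilter {

    // Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "the", "to", "with", "and", "on", "it", "be", "have", "you", "i", "no", "but"));

    // A token counts as punctuation here if it contains no letter or digit.
    static boolean isPunctuation(String token) {
        for (int i = 0; i < token.length(); i++) {
            if (Character.isLetterOrDigit(token.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // Simple prefix check; an assumption, not a full URL parser.
    static boolean isUrl(String token) {
        String t = token.toLowerCase(Locale.ROOT);
        return t.startsWith("http://") || t.startsWith("https://") || t.startsWith("www.");
    }

    // Drops punctuation, URL-like tokens, and stop words; lowercases what remains.
    static List<String> clean(List<String> lemmas) {
        List<String> kept = new ArrayList<>();
        for (String lemma : lemmas) {
            if (lemma == null || lemma.isEmpty()) {
                continue;
            }
            if (isPunctuation(lemma) || isUrl(lemma)) {
                continue;
            }
            String lower = lemma.toLowerCase(Locale.ROOT);
            if (STOP_WORDS.contains(lower)) {
                continue;
            }
            kept.add(lower);
        }
        return kept;
    }

    public static void main(String[] args) {
        // The first few lemmas from the console output above, minus the user name.
        List<String> lemmas = Arrays.asList(
                "https://blog.csdn.net/timo1160139211/article/details/77603141",
                ".", "all", "2015", "gti", "have", "heat", "seat", ",");
        System.out.println(clean(lemmas)); // prints [all, 2015, gti, heat, seat]
    }
}
```

Keeping the filter separate from getlema leaves the CoreNLP call unchanged, so each rule (punctuation, URLs, stop words) can be adjusted or switched off independently.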