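Stanford Parser Usage Notes
Preface: For a project I have been busy with lately, I wanted to try the Stanford parser: parse sentences into syntactic parse trees, analyze the subtrees, combine them with a tree kernel, and train a model. I got stanford parser downloaded, but actually using it was a pain. There is a pile of documentation, yet no quick, convenient overall introduction.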
1. Sharpen your tools first
Additional tools: Java, Python, and so on, depending on what each project needs.
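2. Usage (stanford parser)
After downloading and unpacking, follow README.txt. I was on Ubuntu 15.04 with Java 7, which is not enough, so I installed Java 8 using the four commands from my previous post: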
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer
$ java -version
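Only with Java 8 in place can you go on to build and use stanford parser on Ubuntu. Per the instructions, run the lexparser.sh script with a file name as its argument. testsent.txt contains 5 English sentences.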
The relevant passage from README.txt:

On a Unix system you should be able to parse the English test file with the following command:
./lexparser.sh data/testsent.txt
This uses the PCFG parser, which is quick to load and run, and quite accurate.
[Notes: it takes a few seconds to load the parser data before parsing begins; continued parsing is quicker. To use the lexicalized parser, replace englishPCFG.ser.gz with englishFactored.ser.gz in the lexparser.sh script and use the flag -mx600m to give more memory to java.]

In a terminal, inside the folder that contains lexparser.sh, run

./lexparser.sh data/testsent.txt

The result looks like this (partial):

Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [0.5 sec].
Parsing file: data/testsent.txt
Parsing [sent. 1 len. 21]: Scores of properties are under extreme fire threat as a huge blaze continues to advance through Sydney 's north-western suburbs .
(ROOT
  (S
    (NP
      (NP (NNS Scores))
      (PP (IN of)
        (NP (NNS properties))))
    (VP (VBP are)
      (PP (IN under)
        (NP (JJ extreme) (NN fire) (NN threat)))
      (SBAR (IN as)
        (S
          (NP (DT a) (JJ huge) (NN blaze))
          (VP (VBZ continues)
            (S
              (VP (TO to)
                (VP (VB advance)
                  (PP (IN through)
                    (NP
                      (NP (NNP Sydney) (POS 's))
                      (JJ north-western) (NNS suburbs))))))))))
    (. .)))

nsubj(threat-8, Scores-1)
case(properties-3, of-2)
nmod:of(Scores-1, properties-3)
cop(threat-8, are-4)
case(threat-8, under-5)
amod(threat-8, extreme-6)
compound(threat-8, fire-7)
root(ROOT-0, threat-8)
mark(continues-13, as-9)
det(blaze-12, a-10)
amod(blaze-12, huge-11)
nsubj(continues-13, blaze-12)
nsubj(advance-15, blaze-12)
advcl(threat-8, continues-13)
mark(advance-15, to-14)
xcomp(continues-13, advance-15)
case(suburbs-20, through-16)
nmod:poss(suburbs-20, Sydney-17)
case(Sydney-17, 's-18)
amod(suburbs-20, north-western-19)
nmod:through(advance-15, suburbs-20)

As you can see, stanford parser handles English well and produces two kinds of analysis: a phrase-structure tree and a list of typed dependencies. Other English data parses just as well. But did you think that was the end of it? Too young, too simple.
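Since the goal is to analyze subtrees, the bracketed tree in the output above can be read into nested Python structures. The following is only a minimal sketch in plain Python, not part of stanford parser; the helper names parse_bracketed, subtrees, and leaves are made up for illustration, and the sample string is the first noun phrase from the output above.

```python
import re


def parse_bracketed(text):
    """Parse a Penn-style bracketed tree into nested (label, children) tuples.

    Leaves (words) are kept as plain strings.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "(", "a node must start with '('"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return read_node()


def subtrees(node):
    """Yield the node itself and every subtree below it, depth first."""
    yield node
    for child in node[1]:
        if isinstance(child, tuple):
            yield from subtrees(child)


def leaves(node):
    """Return the words covered by a node, left to right."""
    words = []
    for child in node[1]:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words


# First noun phrase of the parse shown above.
sample = "(NP (NP (NNS Scores)) (PP (IN of) (NP (NNS properties))))"
tree = parse_bracketed(sample)
noun_phrases = [" ".join(leaves(t)) for t in subtrees(tree) if t[0] == "NP"]
print(noun_phrases)
```

The same functions should work on the full multi-line tree, because the tokenizer ignores whitespace and line breaks.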
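But what I work with is Chinese. I took the same approach: in lexparser.sh I changed "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz" to "edu/stanford/nlp/models/lexparser/chineseFactored.ser.gz" and switched the data to Chinese. I assumed it would parse just the same, but it was painfully slow. And whatever I tried, it parsed the whole input as a single sentence, probably because the text had not been word-segmented, or possibly because the parameters were not set properly. I looked through other blogs and found nothing suitable.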
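One thing that can be tried on the input side is to put each sentence on its own line before handing the file to the parser. The sketch below covers only sentence splitting on already word-segmented text; it does not do word segmentation, which still needs a separate segmenter. The function name split_sentences and the set of sentence-final marks are assumptions for illustration, and how the parser itself detects sentence boundaries should be checked in its documentation.

```python
import re

# Sentence-final marks to split on (an assumption; extend as needed).
# The ASCII full stop is left out on purpose so decimals are not split.
SENTENCE_PATTERN = re.compile(r"[^。!?!?]+[。!?!?]?")


def split_sentences(text):
    """Split text into sentences, keeping the final punctuation mark."""
    pieces = (m.strip() for m in SENTENCE_PATTERN.findall(text))
    return [p for p in pieces if p]


# Two pre-segmented sample sentences (tokens separated by spaces).
text = "俄國 希望 伊朗 沒有 製造 核武器 計劃 。 我 的 名字 叫 小明 ?"
sentences = split_sentences(text)

# One sentence per line, ready to be written to an input file.
print("\n".join(sentences))
```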
To be continued...
3. Usage 2 (nltk + stanford-parser.jar)
A colleague saw me struggling with stanford parser and mentioned that NLTK already includes it, then immediately showed me how to use it from nltk. Good grief: the tool was right beside me and I did not know how to use it, or even that nltk had this feature. The result only comes back in list form, though:

In [8]: from nltk.parse import stanford

In [9]: stanford.StanfordParser?
Type: type
String form: <class 'nltk.parse.stanford.StanfordParser'>
File: /home/shifeng/anaconda/lib/python2.7/site-packages/nltk/parse/stanford.py
Init definition: stanford.StanfordParser(self, path_to_jar=None, path_to_models_jar=None, model_path=u'edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz', encoding=u'UTF-8', verbose=False, java_options=u'-mx1000m')
Docstring:
Interface to the Stanford Parser

>>> parser=StanfordParser(
...     model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz"
... )
>>> parser.raw_parse_sents((
...     "the quick brown fox jumps over the lazy dog",
...     "the quick grey wolf jumps over the lazy fox"
... ))
[Tree('ROOT', [Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['quick']), Tree('JJ', ['brown']), Tree('NN', ['fox'])]), Tree('NP', [Tree('NP', [Tree('NNS', ['jumps'])]), Tree('PP', [Tree('IN', ['over']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['lazy']), Tree('NN', ['dog'])])])])])]),
 Tree('ROOT', [Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['quick']), Tree('JJ', ['grey']), Tree('NN', ['wolf'])]), Tree('NP', [Tree('NP', [Tree('NNS', ['jumps'])]), Tree('PP', [Tree('IN', ['over']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['lazy']), Tree('NN', ['fox'])])])])])])]

I tried doing exactly that and it just would not work. My colleague said the jar had not been downloaded and planned to fetch it through nltk.download, but that download did not go through; I, watching dumbfounded from the side, said I had already downloaded it from the web. Following a blog post found online, here is nltk combined with stanford-parser.jar parsing sentences:
In [12]: import os

In [13]: os.environ["STANFORD_PARSER"] = "stanford-parser.jar"

In [14]: os.environ["STANFORD_MODELS"] = "stanford-parser-3.5.2-models.jar"

In [15]: parser = stanford.StanfordParser(model_path=u'edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz')

In [16]: sentences = parser.raw_parse_sents(("the quick brown fox jumps over the lazy dog","the quick grey wolf jumps over the lazy fox"))

In [17]: sentences
Out[17]:
[Tree('ROOT', [Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['quick']), Tree('JJ', ['brown']), Tree('NN', ['fox'])]), Tree('NP', [Tree('NP', [Tree('NNS', ['jumps'])]), Tree('PP', [Tree('IN', ['over']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['lazy']), Tree('NN', ['dog'])])])])])]),
 Tree('ROOT', [Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['quick']), Tree('JJ', ['grey']), Tree('NN', ['wolf'])]), Tree('NP', [Tree('NP', [Tree('NNS', ['jumps'])]), Tree('PP', [Tree('IN', ['over']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['lazy']), Tree('NN', ['fox'])])])])])])]

In [18]: sentences = parser.raw_parse_sents(("Hello, My name is Melroy.", "What is your name?"))

In [19]: sentences
Out[19]:
[Tree('ROOT', [Tree('S', [Tree('INTJ', [Tree('UH', ['Hello'])]), Tree(',', [',']), Tree('NP', [Tree('PRP$', ['My']), Tree('NN', ['name'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('ADJP', [Tree('JJ', ['Melroy'])])]), Tree('.', ['.'])])]),
 Tree('ROOT', [Tree('SBARQ', [Tree('WHNP', [Tree('WP', ['What'])]), Tree('SQ', [Tree('VBZ', ['is']), Tree('NP', [Tree('PRP$', ['your']), Tree('NN', ['name'])])]), Tree('.', ['?'])])])]
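4. Usage 3 (eclipse + java)
I did not really want to use Java, nor Eclipse on Ubuntu, but after seeing a senior labmate parse sentences in Eclipse, I figured I would give it a try. It works, although the result is only a tree structure, probably because the object being initialized is a Tree; the array form should be interconvertible with it as well.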
import java.io.IOException;

import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.trees.Tree;

public class Parser {

    public static void main(String[] args) throws IOException {
        // String grammar = "edu/stanford/nlp/models/lexparser/chineseFactored.ser.gz";
        String grammar = "edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz";
        String[] options = {};
        LexicalizedParser lp = LexicalizedParser.loadModel(grammar, options);

        // Parse one sentence whose words are already separated by spaces,
        // then print the tree in Penn Treebank bracket format.
        String line = "我 的 名字 叫 小明 ?";
        Tree parse = lp.parse(line);
        parse.pennPrint();

        // Run the command-line entry point on a file, asking for both the
        // tree and the collapsed typed dependencies.
        String[] arg2 = {"-encoding", "utf-8",
                "-outputFormat", "penn,typedDependenciesCollapsed",
                "edu/stanford/nlp/models/lexparser/chineseFactored.ser.gz",
                "/home/shifeng/shifengworld/study/tool/stanford_parser/stanford-parser-full-2015-04-20/data/chinese-onesent-utf8.txt"};
        LexicalizedParser.main(arg2);
    }
}

Output:
Picked up JAVA_TOOL_OPTIONS: -javaagent:/usr/share/java/jayatanaag.jar
Loading parser from serialized file edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz ... done [0.8 sec].
(ROOT
Loading parser from serialized file edu/stanford/nlp/models/lexparser/chineseFactored.ser.gz ...
  (IP
    (NP
      (DNP
        (NP (PN 我))
        (DEG 的))
      (NP (NN 名字)))
    (VP (VV 叫)
      (NP (NN 小明)))
    (PU ?)))
done [4.1 sec].
Parsing file: /home/shifeng/shifengworld/study/tool/stanford_parser/stanford-parser-full-2015-04-20/data/chinese-onesent-utf8.txt
Parsing [sent. 1 len. 8]: 俄國 希望 伊朗 沒有 製造 核武器 計劃 。
(ROOT
  (IP
    (NP (NR 俄國))
    (VP (VV 希望)
      (IP
        (NP (NR 伊朗))
        (VP
          (ADVP (AD 沒有))
          (VP (VV 製造)
            (NP (NN 核武器) (NN 計劃))))))
    (PU 。)))

nsubj(希望-2, 俄國-1)
root(ROOT-0, 希望-2)
nsubj(製造-5, 伊朗-3)
neg(製造-5, 沒有-4)
ccomp(希望-2, 製造-5)
nn(計劃-7, 核武器-6)
dobj(製造-5, 計劃-7)

Parsed file: /home/shifeng/shifengworld/study/tool/stanford_parser/stanford-parser-full-2015-04-20/data/chinese-onesent-utf8.txt [1 sentences].
Parsed 8 words in 1 sentences (30.42 wds/sec; 3.80 sents/sec).
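For anyone who would rather post-process this output in Python than stay in Java, the typed dependency lines are regular enough to read with a regular expression. This is a rough sketch, assuming each dependency sits on its own line in the form relation(governor-index, dependent-index) as in the output above; the names DEP_LINE and parse_dependencies are made up for illustration.

```python
import re

# relation(governor-index, dependent-index), e.g. nsubj(希望-2, 俄國-1)
DEP_LINE = re.compile(
    r"^(?P<rel>[^\s(]+)\((?P<gov>.+)-(?P<gov_idx>\d+), (?P<dep>.+)-(?P<dep_idx>\d+)\)$"
)


def parse_dependencies(lines):
    """Return (relation, governor, gov_index, dependent, dep_index) tuples.

    Lines that do not look like a dependency are skipped.
    """
    result = []
    for line in lines:
        match = DEP_LINE.match(line.strip())
        if match:
            result.append((
                match.group("rel"),
                match.group("gov"),
                int(match.group("gov_idx")),
                match.group("dep"),
                int(match.group("dep_idx")),
            ))
    return result


# A few lines copied from the output above.
sample = """nsubj(希望-2, 俄國-1)
root(ROOT-0, 希望-2)
neg(製造-5, 沒有-4)
dobj(製造-5, 計劃-7)"""

deps = parse_dependencies(sample.splitlines())
root_word = next(dep for rel, gov, gi, dep, di in deps if rel == "root")
print(root_word)
for rel, gov, gi, dep, di in deps:
    print(rel, gov, gi, dep, di)
```

The pattern also copes with hyphenated words and relation names that contain a colon, such as the north-western and nmod:of entries in the English output earlier.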
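Java has never been my strong suit, so I will keep looking for other routes...
5. Lessons learned
Read more documentation, and push through the English material too.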