Python Text Mining: Chinese Word Segmentation and Part-of-Speech Tagging with jieba
阿新 • Published: 2019-01-09
#! /usr/bin/env python2.7
#coding=utf-8

import jieba
import jieba.posseg  # the part-of-speech tagging module has to be imported separately

string = '其实大家买手机就是看个心情,没必要比来比去的。'
seg = jieba.posseg.cut(string)

# Collect each token as a (word, part-of-speech flag) pair
l = []
for i in seg:
    l.append((i.word, i.flag))
print l

>> Building Trie..., from E:\Python27\lib\site-packages\jieba\dict.txt
loading model from cache c:\docume~1\admini~1\locals~1\temp\jieba.cache
loading model cost 2.71900010109 seconds.
Trie has been built succesfully.
[(u'\u5176\u5b9e', 'd'), (u'\u5927\u5bb6', 'n'), (u'\u4e70\u624b\u673a', 'n'), (u'\u5c31\u662f', 'd'), (u'\u770b', 'v'), (u'\u4e2a', 'q'), (u'\u5fc3\u60c5', 'n'), (u'\uff0c', 'x'), (u'\u6ca1', 'v'), (u'\u5fc5\u8981', 'd'), (u'\u6bd4\u6765\u6bd4\u53bb', 'l'), (u'\u7684', 'uj'), (u'\u3002', 'x')]
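The `\uXXXX` escapes in the printed list are just how Python 2 displays unicode strings inside a list; the words themselves are intact. As a rough sketch, assuming Python 3 and reusing the pairs from the output verbatim, the result can be rendered in the familiar `word/flag` form. The helper name `format_tagged` is made up for this illustration and is not part of jieba.

```python
# Pairs copied from the printed output; in Python 3 the u'' prefix is optional.
tagged = [
    (u'\u5176\u5b9e', 'd'), (u'\u5927\u5bb6', 'n'), (u'\u4e70\u624b\u673a', 'n'),
    (u'\u5c31\u662f', 'd'), (u'\u770b', 'v'), (u'\u4e2a', 'q'),
    (u'\u5fc3\u60c5', 'n'), (u'\uff0c', 'x'), (u'\u6ca1', 'v'),
    (u'\u5fc5\u8981', 'd'), (u'\u6bd4\u6765\u6bd4\u53bb', 'l'),
    (u'\u7684', 'uj'), (u'\u3002', 'x'),
]


def format_tagged(pairs):
    """Join (word, flag) pairs into one 'word/flag word/flag ...' string."""
    return ' '.join('{}/{}'.format(word, flag) for word, flag in pairs)


# Prints the tokens as readable Chinese text, each followed by its flag.
print(format_tagged(tagged))
```

The same helper should also work directly on the pairs produced by `jieba.posseg.cut`, since each item exposes `word` and `flag` as used in the script above.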
jieba uses the same part-of-speech tag set as ICTCLAS. The tags are described in detail in this article: ICTCLAS Chinese Part-of-Speech Tag Set
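To make the flags in the example output easier to read, here is a small lookup covering only the tags that appear there. The English glosses reflect my understanding of the ICTCLAS-style tag set (with `uj` and `x` as jieba uses them) and should be checked against the referenced article; `TAG_GLOSS` and `describe` are hypothetical names, not part of jieba.

```python
# Glosses for the flags seen in the example output only (an assumption-laden
# summary, not the full ICTCLAS tag set).
TAG_GLOSS = {
    'd': 'adverb',
    'n': 'noun',
    'v': 'verb',
    'q': 'measure word',
    'l': 'fixed expression',
    'uj': 'particle (subtype of u)',
    'x': 'punctuation / non-word symbol',
}


def describe(flag):
    """Return the gloss for a flag, or 'unknown tag' if it is not listed."""
    return TAG_GLOSS.get(flag, 'unknown tag')


for flag in ['d', 'uj', 'x']:
    print('{}: {}'.format(flag, describe(flag)))
```

A plain dictionary is enough here because the lookup is exact; any flag outside the table is reported as unknown rather than guessed.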