
Python Text Mining: Chinese Word Segmentation and Part-of-Speech Tagging with jieba

#! /usr/bin/env python2.7
#coding=utf-8

import jieba
import jieba.posseg  # part-of-speech tagging lives in a separate module that must be imported explicitly

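# Sample sentence, roughly: "People really just buy a phone on a whim; there is no need to keep comparing."
string = '其实大家买手机就是看个心情,没必要比来比去的。'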
seg = jieba.posseg.cut(string)

l = []
for i in seg:
    l.append((i.word, i.flag))
print l



>> Building Trie..., from E:\Python27\lib\site-packages\jieba\dict.txt

loading model from cache c:\docume~1\admini~1\locals~1\temp\jieba.cache
loading model cost  2.71900010109 seconds.
Trie has been built succesfully.
[(u'\u5176\u5b9e', 'd'), (u'\u5927\u5bb6', 'n'), (u'\u4e70\u624b\u673a', 'n'), (u'\u5c31\u662f', 'd'), (u'\u770b', 'v'), (u'\u4e2a', 'q'), (u'\u5fc3\u60c5', 'n'), (u'\uff0c', 'x'), (u'\u6ca1', 'v'), (u'\u5fc5\u8981', 'd'), (u'\u6bd4\u6765\u6bd4\u53bb', 'l'), (u'\u7684', 'uj'), (u'\u3002', 'x')]
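Python 2 prints the list with escaped code points, which is hard to read. The sketch below takes the pairs from the output above, hard-coded so that it runs without jieba, and joins them into a `word/flag` line. The helper name `format_tagged` is made up for illustration and is not part of jieba.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

# The (word, flag) pairs copied from the output above.
pairs = [
    (u'\u5176\u5b9e', 'd'), (u'\u5927\u5bb6', 'n'), (u'\u4e70\u624b\u673a', 'n'),
    (u'\u5c31\u662f', 'd'), (u'\u770b', 'v'), (u'\u4e2a', 'q'),
    (u'\u5fc3\u60c5', 'n'), (u'\uff0c', 'x'), (u'\u6ca1', 'v'),
    (u'\u5fc5\u8981', 'd'), (u'\u6bd4\u6765\u6bd4\u53bb', 'l'),
    (u'\u7684', 'uj'), (u'\u3002', 'x'),
]


def format_tagged(pairs):
    """Join (word, flag) pairs into a single 'word/flag word/flag ...' string."""
    return u' '.join(u'{}/{}'.format(word, flag) for word, flag in pairs)


print(format_tagged(pairs))
# 其实/d 大家/n 买手机/n 就是/d 看/v 个/q 心情/n ,/x 没/v 必要/d 比来比去/l 的/uj 。/x
```

On a console that cannot display Chinese characters, the `print` call may raise an encoding error under Python 2; writing the string to a UTF-8 file is one way around that.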

jieba uses the same part-of-speech tagging scheme as ICTCLAS. The tags are described in this article: ICTCLAS Chinese Part-of-Speech Tag Set
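As a quick reference for the tags in the sample output, here is a minimal lookup sketch. The English glosses are my own reading of the ICTCLAS-style tag set, not an official table, and `uj` appears to be a jieba-specific subtag for the particle "de". The names `TAG_GLOSSES` and `describe` are made up for this example, which uses Python 3 `print` syntax.

```python
# Assumed glosses for the tags seen in the sample output; not an official list.
TAG_GLOSSES = {
    'd': 'adverb',
    'n': 'noun',
    'v': 'verb',
    'q': 'measure word',
    'l': 'idiom or fixed expression',
    'uj': 'structural particle "de"',
    'x': 'punctuation or non-word token',
}


def describe(flag):
    """Return a short gloss for a tag, or 'unknown tag' if it is not listed."""
    return TAG_GLOSSES.get(flag, 'unknown tag')


for flag in ['d', 'n', 'uj', 'zz']:
    print(flag, '->', describe(flag))
```

A plain dictionary keeps the mapping easy to extend as new tags turn up in your own corpus; consult the full tag set before relying on any gloss.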