Chinese Sentiment Analysis with WordNet
阿新 · Published 2018-12-30
1. Analysis
Chinese sentiment analysis could be done with Tongyici Cilin (同義詞詞林), whose top-level category G covers mental activity, but Cilin is far coarser than WordNet. So this post uses an nltk + WordNet pipeline instead:
1) Chinese word segmentation: jieba
2) Chinese-to-synset mapping: Chinese Open Wordnet (the cow-not-full.txt file loaded in the code below)
3) Sentiment scoring: NLTK's sentiwordnet corpus (a minimal sketch of this call follows the list)
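As a quick check of step 3 (not part of the original post; the synset name 'good.a.01' is just an illustrative choice), SentiWordNet can be queried directly through NLTK:

# Assumes nltk.download('wordnet') and nltk.download('sentiwordnet') have been run.
from nltk.corpus import sentiwordnet as swn

ss = swn.senti_synset('good.a.01')   # first adjective sense of "good"
print(ss.pos_score())                # positive score in [0, 1]
print(ss.neg_score())                # negative score in [0, 1]
print(ss.obj_score())                # objectivity = 1 - pos - neg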
2. Code
# encoding=utf-8
# Python 2 script: reload(sys)/setdefaultencoding let utf-8 strings be printed
# and compared without explicit decoding everywhere.
import jieba
import sys
import codecs
reload(sys)
import nltk
from nltk.corpus import wordnet as wn
from nltk.corpus import sentiwordnet as swn
sys.setdefaultencoding('utf8')


# Segment the input file with jieba and drop stop words and whitespace tokens.
def doSeg(filename):
    f = open(filename, 'r+')
    file_list = f.read()
    f.close()
    seg_list = jieba.cut(file_list)
    stopwords = []
    for word in open("./stop_words.txt", "r"):
        stopwords.append(word.strip())
    ll = []
    for seg in seg_list:
        if (seg.encode("utf-8") not in stopwords
                and seg != ' ' and seg != '' and seg != "\n" and seg != "\n\n"):
            ll.append(seg)
    return ll


# Load the Chinese Open Wordnet mapping file; each line is
# "<synset-id>\t<lemma>[\t<status>]" and only 'Y'/'O' entries are kept.
def loadWordNet():
    f = codecs.open("./cow-not-full.txt", "rb", "utf-8")
    known = set()
    for l in f:
        if l.startswith('#') or not l.strip():
            continue
        row = l.strip().split("\t")
        if len(row) == 3:
            (synset, lemma, status) = row
        elif len(row) == 2:
            (synset, lemma) = row
            status = 'Y'
        else:
            print "illformed line: ", l.strip()
            continue  # skip ill-formed lines instead of reusing a stale status
        if status in ['Y', 'O']:
            if not (synset.strip(), lemma.strip()) in known:
                known.add((synset.strip(), lemma.strip()))
    return known


# Return all synset IDs whose lemma equals the given Chinese word.
def findWordNet(known, key):
    ll = []
    for kk in known:
        if kk[1] == key:
            ll.append(kk[0])
    return ll


# Convert an ID of the form "<8-digit offset>-<pos>" into an NLTK Synset.
def id2ss(ID):
    return wn._synset_from_pos_and_offset(str(ID[-1:]), int(ID[:8]))


# Look up the SentiWordNet scores for a WordNet synset.
def getSenti(word):
    return swn.senti_synset(word.name())


if __name__ == '__main__':
    known = loadWordNet()
    words = doSeg(sys.argv[1])
    n = 0
    p = 0
    for word in words:
        ll = findWordNet(known, word)
        if len(ll) != 0:
            n1 = 0.0
            p1 = 0.0
            # Average the positive/negative scores over all senses of the word.
            for wid in ll:
                desc = id2ss(wid)
                swninfo = getSenti(desc)
                p1 = p1 + swninfo.pos_score()
                n1 = n1 + swninfo.neg_score()
            if p1 != 0.0 or n1 != 0.0:
                print word, '-> n ', (n1 / len(ll)), ", p ", (p1 / len(ll))
            p = p + p1 / len(ll)
            n = n + n1 / len(ll)
    print "n", n, ", p", p
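To run it (the file names here are illustrative, the post does not fix them), pass the article to analyse as the first argument, with stop_words.txt and cow-not-full.txt in the working directory:

python sentiment_wordnet.py article.txt

Each segmented word that maps to at least one synset is printed with its averaged negative/positive scores, followed by the accumulated totals n and p for the whole text. Note that wn._synset_from_pos_and_offset is a private NLTK helper; newer NLTK releases expose a public synset_from_pos_and_offset with the same arguments, which is preferable where available.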
3. Open Problems
1) jieba's output does not map one-to-one onto the Chinese WordNet vocabulary
Although jieba can load a custom dictionary, some words it produces still have no sense entry in the Chinese WordNet, e.g. "太后" (empress dowager) and "童子" (young boy), as well as compounds like "很早已前" and "黃山" (Mount Huang). Most of these are nouns and would need further "learning".
The interim workaround is to treat them as "proper nouns" (a minimal sketch follows).
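One way to do this, using jieba's standard user-dictionary API (the words and tags below are just the examples above; 'ns'/'nz' are jieba's place-name and proper-noun tags):

# encoding=utf-8
import jieba

# Register unmatched words so jieba keeps them whole and tags them as proper nouns.
jieba.add_word(u"黃山", tag="ns")   # place name
jieba.add_word(u"太后", tag="nz")   # other proper noun
# Or keep them in a file (one "word [freq] [tag]" per line) and load it:
# jieba.load_userdict("./user_dict.txt")

print("/".join(jieba.cut(u"我們去過黃山")))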
2) Polysemy and synonymy
Whether the task is sentiment analysis or semantic analysis, in Chinese or English, the mapping between words and senses has to be resolved.
The interim workaround: look up every sense of the word and average their sentiment scores. jieba's part-of-speech tag can also serve as an additional cue (a rough sketch follows).
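A rough sketch of this workaround, reusing the findWordNet / id2ss / known helpers from section 2 and a hypothetical mapping from jieba POS prefixes to WordNet POS letters (the mapping is an assumption, not something from the post):

# encoding=utf-8
import jieba.posseg as pseg
from nltk.corpus import sentiwordnet as swn

# Hypothetical jieba-prefix -> WordNet-POS mapping: noun, verb, adjective, adverb.
JIEBA2WN = {'n': 'n', 'v': 'v', 'a': 'a', 'd': 'r'}

def avg_senti(word, flag, known):
    # All candidate synset IDs for this Chinese word (findWordNet from section 2).
    ids = findWordNet(known, word)
    wn_pos = JIEBA2WN.get(flag[:1])
    # Prefer senses whose POS matches the jieba tag; otherwise fall back to all senses.
    filtered = [i for i in ids if wn_pos and i.endswith(wn_pos)] or ids
    if not filtered:
        return 0.0, 0.0
    scores = [swn.senti_synset(id2ss(i).name()) for i in filtered]
    p = sum(s.pos_score() for s in scores) / len(scores)
    n = sum(s.neg_score() for s in scores) / len(scores)
    return p, n

# Usage: pseg.cut yields (word, flag) pairs, e.g. flag 'n' for nouns.
# for w in pseg.cut(text):
#     print w.word, avg_senti(w.word, w.flag, known)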
3) Semantics
Semantics is the most fundamental problem. It requires analysing sentence structure on the one hand, and it also depends on the content itself; long articles in particular often use devices such as "suppress first, praise later" (先抑後揚) or contrastive analysis, which makes the emotional tone hard to judge.