The Road to NLP - Viewing and Accessing Text Corpora
阿新 • Published: 2019-02-20
Continuing to study NLP in Python.
# coding=UTF-8
# The line above avoids encoding errors caused by Chinese comments
import nltk

# List the text corpora that are available
nltk.corpus.gutenberg.fileids()

# Give the book a short name, emma
emma = nltk.corpus.gutenberg.words('austen-emma.txt')
len(emma)  # 192427

# Use concordance again, as in the previous chapter
from nltk.corpus import gutenberg
emma = nltk.Text(gutenberg.words('austen-emma.txt'))
# Without the import, the statement has to be written out in full:
# emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
emma.concordance("surprize")

# Three statistics for each text: average word length, average sentence length,
# and the average number of times each word appears in the text
for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))
    num_words = len(gutenberg.words(fileid))
    num_sents = len(gutenberg.sents(fileid))
    num_vocab = len(set([w.lower() for w in gutenberg.words(fileid)]))
    print(int(num_chars/num_words), int(num_words/num_sents), int(num_words/num_vocab), fileid)
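To see what the three statistics in the loop above actually measure, it can help to run the same arithmetic on a tiny hand-made sample instead of a whole Gutenberg text. The following is only a minimal sketch for Python 3: the function name `corpus_stats` and the sample sentences are made up for illustration and are not part of NLTK. The lists stand in for what `gutenberg.raw`, `gutenberg.words` and `gutenberg.sents` would return.

```python
def corpus_stats(raw, words, sents):
    """Return (average word length, average sentence length,
    average number of occurrences per distinct word)."""
    num_chars = len(raw)      # counts every character, spaces included
    num_words = len(words)    # tokens, punctuation tokens included
    num_sents = len(sents)
    # Fold case so that "The" and "the" count as one vocabulary item.
    num_vocab = len({w.lower() for w in words})
    return (num_chars / num_words,
            num_words / num_sents,
            num_words / num_vocab)


# Hypothetical sample data, shaped like the corpus reader output.
raw = "The cat sat. The dog sat too."
sents = [["The", "cat", "sat", "."],
         ["The", "dog", "sat", "too", "."]]
words = [w for sent in sents for w in sent]

avg_word_len, avg_sent_len, avg_uses = corpus_stats(raw, words, sents)
print(int(avg_word_len), int(avg_sent_len), int(avg_uses))
```

Because the character count comes from the raw text, it includes whitespace, so the first number tends to overstate the true average word length. The third number is the inverse of a type/token ratio: the higher it is, the more often the text reuses the same words.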