Python Natural Language Processing: Part-of-Speech Tagging
1. Introduction to POS Tagging
import nltk
text1=nltk.word_tokenize("It is a pleasant day today")
print(nltk.pos_tag(text1))
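Note that word_tokenize and pos_tag require the punkt and averaged_perceptron_tagger resources (fetched via nltk.download). What matters for the rest of this article is the shape of the result: a list of (word, tag) pairs. As a toy illustration of that shape, here is a hand-made lookup tagger (the tag table below is chosen by hand for this one sentence; it is not NLTK's model):

```python
# Toy lookup tagger: mimics the (word, tag) output shape of nltk.pos_tag.
# The tag table below is hand-chosen for this sentence, not a trained model.
TOY_TAGS = {
    "It": "PRP", "is": "VBZ", "a": "DT",
    "pleasant": "JJ", "day": "NN", "today": "NN",
}

def toy_pos_tag(tokens):
    # Unknown words fall back to 'NN', the most common open-class tag.
    return [(tok, TOY_TAGS.get(tok, "NN")) for tok in tokens]

tokens = "It is a pleasant day today".split()
print(toy_pos_tag(tokens))
# [('It', 'PRP'), ('is', 'VBZ'), ('a', 'DT'), ('pleasant', 'JJ'), ('day', 'NN'), ('today', 'NN')]
```

The table below lists the Penn Treebank tags that pos_tag can return.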
Number | Tag | Description
1. | CC | Coordinating conjunction |
2. | CD | Cardinal number |
3. | DT | Determiner |
4. | EX | Existential there |
5. | FW | Foreign word |
6. | IN | Preposition or subordinating conjunction |
7. | JJ | Adjective |
8. | JJR | Adjective, comparative |
9. | JJS | Adjective, superlative |
10. | LS | List item marker |
11. | MD | Modal |
12. | NN | Noun, singular or mass |
13. | NNS | Noun, plural |
14. | NNP | Proper noun, singular |
15. | NNPS | Proper noun, plural |
16. | PDT | Predeterminer |
17. | POS | Possessive ending |
18. | PRP | Personal pronoun |
19. | PRP$ | Possessive pronoun |
20. | RB | Adverb |
21. | RBR | Adverb, comparative |
22. | RBS | Adverb, superlative |
23. | RP | Particle |
24. | SYM | Symbol |
25. | TO | to |
26. | UH | Interjection |
27. | VB | Verb, base form |
28. | VBD | Verb, past tense |
29. | VBG | Verb, gerund or present participle |
30. | VBN | Verb, past participle |
31. | VBP | Verb, non-3rd person singular present |
32. | VBZ | Verb, 3rd person singular present |
33. | WDT | Wh-determiner |
34. | WP | Wh-pronoun |
35. | WP$ | Possessive wh-pronoun |
36. | WRB | Wh-adverb |
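A handy property of the Penn tags above is that their first letters group them into coarse classes (NN* for nouns, VB* for verbs, JJ* for adjectives, RB* for adverbs). A minimal sketch of collapsing fine-grained tags into coarse ones:

```python
# Collapse fine-grained Penn Treebank tags into coarse word classes
# using their prefixes (NN* -> NOUN, VB* -> VERB, JJ* -> ADJ, RB* -> ADV).
def coarse_tag(penn_tag):
    for prefix, coarse in (("NN", "NOUN"), ("VB", "VERB"),
                           ("JJ", "ADJ"), ("RB", "ADV")):
        if penn_tag.startswith(prefix):
            return coarse
    return "OTHER"

print(coarse_tag("NNPS"))  # NOUN
print(coarse_tag("VBD"))   # VERB
print(coarse_tag("WRB"))   # OTHER (wh-adverb does not start with RB)
```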
Building tuples of (token, tag)
import nltk
taggedword=nltk.tag.str2tuple('bear/NN')
print(taggedword)
print(taggedword[0])
print(taggedword[1])

import nltk
sentence='''The/DT sacred/VBN Ganga/NNP flows/VBZ in/IN this/DT region/NN ./. This/DT is/VBZ a/DT pilgrimage/NN ./. People/NNP from/IN all/DT over/IN the/DT country/NN visit/NN this/DT place/NN ./. '''
print([nltk.tag.str2tuple(t) for t in sentence.split()])

Converting a tuple back into a string
import nltk
taggedtok = ('bear', 'NN')
from nltk.tag.util import tuple2str
print(tuple2str(taggedtok))

Counting how often each tag occurs
import nltk
from nltk.corpus import treebank
treebank_tagged = treebank.tagged_words(tagset='universal')
tag = nltk.FreqDist(tag for (word, tag) in treebank_tagged)
print(tag.most_common())

Setting a default tag and removing tags
import nltk
from nltk.tag import DefaultTagger
tag = DefaultTagger('NN')
print(tag.tag(['Beautiful', 'morning']))

import nltk
from nltk.tag import untag
print(untag([('beautiful', 'NN'), ('morning', 'NN')]))
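What DefaultTagger and untag do can be sketched in a few lines of plain Python. This is a simplified re-implementation for illustration, not NLTK's own code:

```python
# Simplified stand-ins for nltk's DefaultTagger and untag.
class SimpleDefaultTagger:
    def __init__(self, tag):
        self._tag = tag

    def tag(self, tokens):
        # Assign the same default tag to every token.
        return [(tok, self._tag) for tok in tokens]

def simple_untag(tagged):
    # Drop the tags, keeping only the tokens.
    return [tok for tok, _ in tagged]

tagger = SimpleDefaultTagger('NN')
tagged = tagger.tag(['Beautiful', 'morning'])
print(tagged)                # [('Beautiful', 'NN'), ('morning', 'NN')]
print(simple_untag(tagged))  # ['Beautiful', 'morning']
```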
There are two main ways to perform tagging with NLTK:
1. Use a pre-built tagger from NLTK (or another library) and apply it to your test data. This is sufficient for English and for tasks that are not out of the ordinary.
2. Build or train a tagger on your own data, which is what you do when facing a very specialized use case.
A typical tagger needs a large amount of training data; its job is to label every word in a sentence. A great deal of effort has already gone into annotating corpora, and training a POS tagger of your own from scratch is an advanced undertaking. Below we look at how several taggers perform.
- Sequential taggers
- Start with a tagger that assigns the tag 'NN' to everything.
import nltk
from nltk.corpus import brown
brown_tagged_sents=brown.tagged_sents(categories='news')
default_tagger=nltk.DefaultTagger('NN')
print( default_tagger.evaluate(brown_tagged_sents)) # about 0.13 -- accuracy this low shows the tagger is nearly useless (in newer NLTK versions, evaluate has been renamed accuracy)
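evaluate simply measures the fraction of tokens whose predicted tag matches the gold tag, so an always-'NN' tagger only scores on the words that really are singular nouns. A minimal sketch of that accuracy computation on a tiny hand-tagged sample:

```python
# Tagging accuracy = number of matching (word, tag) pairs / total tokens.
def accuracy(gold, predicted):
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)

gold = [('The', 'DT'), ('dog', 'NN'), ('barked', 'VBD'), ('loudly', 'RB')]
# An always-'NN' tagger only gets the genuine singular nouns right.
predicted = [(w, 'NN') for w, _ in gold]
print(accuracy(gold, predicted))  # 0.25 -- only 'dog' matches
```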
- N-gram taggers, as introduced in earlier chapters
We use the first 90% of the corpus as the training set to learn the tagger's rules, then hold out the remaining 10% as the test set to see how well the tagger performs.
import nltk
from nltk.corpus import brown
from nltk.tag import UnigramTagger
from nltk.tag import DefaultTagger
from nltk.tag import BigramTagger
from nltk.tag import TrigramTagger
brown_tagged_sents=brown.tagged_sents(categories='news')
default_tagger=nltk.DefaultTagger('NN')
train_data=brown_tagged_sents[:int(len(brown_tagged_sents)*0.9)]
test_data=brown_tagged_sents[int(len(brown_tagged_sents)*0.9):]
unigram_tagger=UnigramTagger(train_data,backoff=default_tagger)
print( unigram_tagger.evaluate(test_data) )

bigram_tagger=BigramTagger(train_data,backoff=unigram_tagger)
print( bigram_tagger.evaluate(test_data) )

trigram_tagger=TrigramTagger(train_data,backoff=bigram_tagger)
print( trigram_tagger.evaluate(test_data) )

To make the training and testing process clearer, here is the code:
import nltk
from nltk.corpus import treebank
from nltk.tag import UnigramTagger
unitag = UnigramTagger(model={'Vinken': 'NN'}) # a model with a single entry: only 'Vinken' gets a tag
print(unitag.tag(treebank.sents()[0]))

import nltk
from nltk.corpus import treebank
from nltk.tag import UnigramTagger
training= treebank.tagged_sents()[:7000]
unitagger=UnigramTagger(training) # train on the corpus
testing = treebank.tagged_sents()[2000:]
print(unitagger.evaluate(testing))

The role of the backoff mechanism
This is the key feature of sequential tagging: when the limited training data yields no tag for a word, the tagger can fall back to the next tagger in the chain to tag it.
There are many more taggers besides these; reading the source code is worthwhile.
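The backoff idea can be sketched as a chain of lookups: try the most specific tagger first, and hand unknown words down the chain. This is a simplified illustration, not NLTK's implementation:

```python
# A two-stage backoff chain: a unigram lookup table backed by a default tag.
class LookupTagger:
    def __init__(self, model, backoff=None, default='NN'):
        self.model = model      # word -> tag mapping learned from training data
        self.backoff = backoff  # next tagger to try for unknown words
        self.default = default

    def tag_word(self, word):
        if word in self.model:
            return self.model[word]
        if self.backoff is not None:
            return self.backoff.tag_word(word)
        return self.default

    def tag(self, tokens):
        return [(tok, self.tag_word(tok)) for tok in tokens]

default = LookupTagger(model={})  # empty model: always answers 'NN'
unigram = LookupTagger(model={'flows': 'VBZ', 'the': 'DT'}, backoff=default)
print(unigram.tag(['the', 'river', 'flows']))
# [('the', 'DT'), ('river', 'NN'), ('flows', 'VBZ')]
```

Here 'river' is absent from the unigram model, so the call falls through to the default tagger, exactly as backoff=default_tagger does in the NLTK code above.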
import nltk
from nltk.tag import AffixTagger
from nltk.corpus import treebank
testing = treebank.tagged_sents()[2000:]
training= treebank.tagged_sents()[:7000]
prefixtag = AffixTagger(training, affix_length=4) # use 4-character prefixes
print(prefixtag.evaluate(testing))

import nltk
from nltk.tag import AffixTagger
from nltk.corpus import treebank
testing = treebank.tagged_sents()[2000:]
training= treebank.tagged_sents()[:7000]
suffixtag = AffixTagger(training, affix_length=-3) # use 3-character suffixes
print(suffixtag.evaluate(testing))
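AffixTagger learns tags from fixed-length prefixes or suffixes of words; a negative affix_length means a suffix. The suffix idea can be sketched with a tiny hand-made suffix table (for illustration only; AffixTagger learns its table from the training data):

```python
# Tag words by their last three characters, the way AffixTagger does with
# affix_length=-3. The suffix table here is hand-made for illustration.
SUFFIX_TAGS = {
    'ing': 'VBG',  # flowing, running
    'ion': 'NN',   # region, nation
    'ful': 'JJ',   # beautiful, peaceful
}

def suffix_tag(word, default='NN'):
    return SUFFIX_TAGS.get(word[-3:].lower(), default)

for w in ['flowing', 'region', 'peaceful', 'dog']:
    print(w, suffix_tag(w))
# flowing VBG / region NN / peaceful JJ / dog NN
```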
Training models based on machine learning will be covered later.