Computing Word Similarity with NLTK

阿新 • Published: 2019-01-10

5 Similarity

>>> from nltk.corpus import wordnet as wn
>>> dog = wn.synset('dog.n.01')
>>> cat = wn.synset('cat.n.01')

synset1.path_similarity(synset2): Return a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy. The score is in the range 0 to 1, except in those cases where a path cannot be found (this will only be true for verbs, as there are many distinct verb taxonomies), in which case -1 is returned (recent NLTK releases return None instead). A score of 1 represents identity, i.e. comparing a sense with itself will return 1.

>>> dog.path_similarity(cat)
0.20000000000000001
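As a rough illustration of the path-based score, the sketch below uses a small hand-written is-a hierarchy instead of WordNet. The `HYPERNYM` table and the helper names are assumptions made for this example, and the score is taken to be 1 / (p + 1), where p is the number of edges on the shortest path; this is consistent with the value 0.2 shown above.

```python
# Hand-written toy is-a taxonomy: child -> parent (an assumption, not WordNet data).
HYPERNYM = {
    "dog": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "animal",
}


def ancestors(node):
    """Map node and each of its ancestors to its distance (in edges) from node."""
    dist = {}
    steps = 0
    while node is not None:
        dist[node] = steps
        node = HYPERNYM.get(node)
        steps += 1
    return dist


def shortest_path(a, b):
    """Length in edges of the shortest is-a path between a and b, or None."""
    da, db = ancestors(a), ancestors(b)
    common = set(da) & set(db)
    if not common:
        return None
    return min(da[n] + db[n] for n in common)


def path_similarity(a, b):
    """1 / (p + 1): identical nodes score 1, distant nodes approach 0."""
    p = shortest_path(a, b)
    return None if p is None else 1.0 / (p + 1)


print(path_similarity("dog", "cat"))  # 4 edges apart -> 0.2
print(path_similarity("dog", "dog"))  # identical -> 1.0
```

In this toy hierarchy dog and cat are four edges apart (dog, canine, carnivore, feline, cat), which is why the score comes out as 1/5.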

synset1.lch_similarity(synset2): Leacock-Chodorow Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses (as above) and the maximum depth of the taxonomy in which the senses occur. The relationship is given as -log(p / (2 * d)), where p is the shortest path length and d is the taxonomy depth.

>>> dog.lch_similarity(cat)
2.0281482472922856
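The formula can be checked by hand with a few lines of Python. This is only a sketch: the inputs p = 5 (the dog-to-cat path counted in nodes, i.e. 4 edges plus 1) and d = 19 (noun taxonomy depth) are assumptions, chosen because they reproduce a score close to the one printed above.

```python
import math


def lch_similarity(p, d):
    """Leacock-Chodorow score: -log(p / (2 * d)).

    p: shortest path length, counted here in nodes (edges + 1); an assumption
    d: maximum depth of the taxonomy
    """
    if p <= 0 or d <= 0:
        raise ValueError("p and d must be positive")
    return -math.log(p / (2.0 * d))


# Assumed inputs: 5 nodes on the dog-cat path, taxonomy depth 19.
print(lch_similarity(5, 19))  # roughly 2.028
```

Unlike the path score, this value is not bounded by 1: it grows as the path gets shorter relative to the taxonomy depth.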

synset1.wup_similarity(synset2): Wu-Palmer Similarity: Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node). Note that at this time the scores given do _not_ always agree with those given by Pedersen's Perl implementation of WordNet::Similarity.

The LCS does not necessarily feature in the shortest path connecting the two senses, as it is by definition the common ancestor deepest in the taxonomy, not closest to the two senses. Typically, however, it will so feature. Where multiple candidates for the LCS exist, that whose shortest path to the root node is the longest will be selected. Where the LCS has multiple paths to the root, the longer path is used for the purposes of the calculation.

>>> dog.wup_similarity(cat)
0.8571428571428571
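The text above does not spell out the formula. The usual statement of the Wu-Palmer score is 2 * depth(lcs) / (depth(s1) + depth(s2)), and the sketch below applies it to a hand-written toy hierarchy. The `HYPERNYM` table, the helper names, and the convention that the root has depth 1 are all assumptions of this example; because the toy tree is much shallower than WordNet, its score for dog and cat differs from the one shown above.

```python
# Hand-written toy is-a taxonomy: child -> parent (an assumption, not WordNet data).
HYPERNYM = {
    "dog": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "animal",
}


def chain(node):
    """Return [node, parent, ..., root]."""
    out = []
    while node is not None:
        out.append(node)
        node = HYPERNYM.get(node)
    return out


def depth(node):
    """Depth counted in nodes, so the root has depth 1 (a convention of this sketch)."""
    return len(chain(node))


def lcs(a, b):
    """Deepest node that is an ancestor of (or equal to) both a and b."""
    others = set(chain(b))
    for node in chain(a):  # walks upward, so the first hit is the deepest
        if node in others:
            return node
    return None


def wup_similarity(a, b):
    """2 * depth(lcs) / (depth(a) + depth(b)), or None without a common ancestor."""
    common = lcs(a, b)
    if common is None:
        return None
    return 2.0 * depth(common) / (depth(a) + depth(b))


print(lcs("dog", "cat"))             # carnivore
print(wup_similarity("dog", "cat"))  # 2 * 2 / (4 + 4) = 0.5
```

Each node in this toy tree has a single parent, so the cases described above (several LCS candidates, or several paths from the LCS to the root) do not arise here.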

wordnet_ic Information Content: Load an information content file from the wordnet_ic corpus.

>>> from nltk.corpus import wordnet_ic
>>> brown_ic = wordnet_ic.ic('ic-brown.dat')
>>> semcor_ic = wordnet_ic.ic('ic-semcor.dat')

Or you can create an information content dictionary from a corpus (or anything that has a words() method).

>>> from nltk.corpus import genesis
>>> genesis_ic = wn.ic(genesis, False, 0.0)
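To give a feel for what an information content dictionary holds, here is a minimal sketch that builds one from a list of words. It assumes the standard definition IC(c) = -log(count(c) / total), with each word occurrence also counted for every ancestor of its concept. The `HYPERNYM` table and `build_ic` are hypothetical names for this example; the sketch ignores smoothing and word-sense ambiguity and is not how NLTK stores its data.

```python
import math
from collections import Counter

# Hand-written toy is-a taxonomy: child -> parent (an assumption, not WordNet data).
HYPERNYM = {
    "dog": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "animal",
}
NODES = set(HYPERNYM) | set(HYPERNYM.values())


def build_ic(words):
    """Return {concept: IC} where IC = -log(count / total counted words)."""
    counts = Counter()
    total = 0
    for word in words:
        if word not in NODES:
            continue  # words outside the toy taxonomy are skipped
        total += 1
        node = word
        while node is not None:  # credit the concept and all of its ancestors
            counts[node] += 1
            node = HYPERNYM.get(node)
    return {node: -math.log(c / total) for node, c in counts.items()}


toy_ic = build_ic(["dog", "dog", "cat", "the", "dog"])
for concept in ("animal", "carnivore", "dog", "cat"):
    print(concept, round(toy_ic[concept], 3))
```

Frequent and general concepts end up with low information content (the root gets 0), while rare and specific ones get high values.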

synset1.res_similarity(synset2, ic): Resnik Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node). Note that for any similarity measure that uses information content, the result is dependent on the corpus used to generate the information content and the specifics of how the information content was created.

>>> dog.res_similarity(cat, brown_ic)
7.9116665090365768
>>> dog.res_similarity(cat, genesis_ic)
7.1388833044805002
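The Resnik score is simply the information content of the Least Common Subsumer. The sketch below shows this on a toy hierarchy; the `HYPERNYM` table and the values in `IC` are invented for illustration and are not taken from any corpus.

```python
# Hand-written toy is-a taxonomy: child -> parent (an assumption, not WordNet data).
HYPERNYM = {
    "dog": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "animal",
    "horse": "animal",
}

# Hypothetical information content values (more specific concepts score higher).
IC = {
    "animal": 0.0,
    "carnivore": 1.2,
    "canine": 2.0,
    "feline": 2.5,
    "dog": 2.3,
    "cat": 3.0,
    "horse": 2.8,
}


def chain(node):
    """Return [node, parent, ..., root]."""
    out = []
    while node is not None:
        out.append(node)
        node = HYPERNYM.get(node)
    return out


def lcs(a, b):
    """Most specific node that subsumes both a and b, or None."""
    others = set(chain(b))
    for node in chain(a):
        if node in others:
            return node
    return None


def res_similarity(a, b):
    """Resnik score: IC of the least common subsumer."""
    common = lcs(a, b)
    return None if common is None else IC[common]


print(res_similarity("dog", "cat"))    # IC of carnivore -> 1.2
print(res_similarity("dog", "horse"))  # IC of animal -> 0.0
```

Because only the LCS matters, every pair of senses that meets at the same ancestor receives the same Resnik score, whatever the senses themselves are.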

synset1.jcn_similarity(synset2, ic): Jiang-Conrath Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 1 / (IC(s1) + IC(s2) - 2 * IC(lcs)).

>>> dog.jcn_similarity(cat, brown_ic)
0.44977552855167391
>>> dog.jcn_similarity(cat, genesis_ic)
0.28539390848096979
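The equation can be tried out directly on numbers. In the sketch below the three IC values are hypothetical, and returning infinity when the denominator is zero (for example when a sense is compared with itself) is a choice made for this example, not a statement about NLTK's behaviour.

```python
def jcn_similarity(ic1, ic2, ic_lcs):
    """Jiang-Conrath score: 1 / (IC(s1) + IC(s2) - 2 * IC(lcs))."""
    distance = ic1 + ic2 - 2.0 * ic_lcs
    if distance <= 0:
        # Zero distance, e.g. identical senses; this sketch reports infinity.
        return float("inf")
    return 1.0 / distance


# Hypothetical values: IC(s1) = 2.3, IC(s2) = 3.0, IC(lcs) = 1.2.
print(jcn_similarity(2.3, 3.0, 1.2))  # 1 / 2.9, roughly 0.345
```

The denominator is the Jiang-Conrath distance: it shrinks as the LCS carries more of the information held by the two senses, so the similarity rises.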

synset1.lin_similarity(synset2, ic): Lin Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 2 * IC(lcs) / (IC(s1) + IC(s2)).

>>> dog.lin_similarity(cat, semcor_ic)
0.88632886280862277
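A short sketch of the Lin equation on plain numbers follows. The IC values are hypothetical, and returning 0.0 when both senses have zero information content is a choice made for this example only.

```python
def lin_similarity(ic1, ic2, ic_lcs):
    """Lin score: 2 * IC(lcs) / (IC(s1) + IC(s2))."""
    total = ic1 + ic2
    if total == 0:
        # Both senses carry no information; this sketch reports 0.0.
        return 0.0
    return 2.0 * ic_lcs / total


# Hypothetical values: IC(s1) = 2.3, IC(s2) = 3.0, IC(lcs) = 1.2.
print(lin_similarity(2.3, 3.0, 1.2))  # 2.4 / 5.3, roughly 0.453
print(lin_similarity(2.3, 2.3, 2.3))  # a sense compared with itself -> 1.0
```

Since the IC of the LCS can never exceed that of either sense, the Lin score stays between 0 and 1, which makes it easier to compare across corpora than the Resnik score.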

Original source: http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html
