
Sent2vec, Microsoft's sentence-vector toolkit

About the tool:

What is sent2vec

sent2vec maps a pair of short text strings (e.g., sentences or query-answer pairs) to a pair of feature vectors in a continuous, low-dimensional space where the semantic similarity between the text strings is computed as the cosine similarity between their vectors in that space.
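The similarity measure described above can be sketched in a few lines. This is only an illustration of cosine similarity, not the sent2vec code itself; the function name and the three-dimensional vectors are made up for the example and merely stand in for whatever vectors the model produces.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_u * norm_v)

# Hypothetical vectors standing in for the model's output (assumed values).
query_vec = [1.0, 2.0, 0.0]
answer_vec = [2.0, 4.0, 0.0]     # same direction as query_vec
unrelated_vec = [0.0, 0.0, 3.0]  # orthogonal to query_vec

print(cosine_similarity(query_vec, answer_vec))
print(cosine_similarity(query_vec, unrelated_vec))
```

Vectors that point in the same direction score close to 1, and orthogonal vectors score 0, whatever their lengths.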

sent2vec performs the mapping using the Deep Structured Semantic Model (DSSM) proposed in (Huang et al. 2013), or the DSSM with convolutional-pooling structure (CDSSM) proposed in (Shen et al. 2014; Gao et al. 2014). Please cite the papers if you use sent2vec in published research.

Toolkit download:

http://research.microsoft.com/en-us/downloads/731572aa-98e4-4c50-b99d-ae3f0c9562b9/default.aspx

Slides:

http://emnlp2014.org/material/presentation-EMNLP2014002.pdf

The Deep Semantic Similarity Model (DSSM) in the slides


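Looking at the figure in the slides, the tool turns out to be a convolutional neural network. The input to the network is the word-hashed text (after word hashing, the dimensionality of the sentence features is fixed), which then goes through convolution and pooling (for an introduction to convolution and pooling, see http://blog.csdn.net/silence1214/article/details/11809947).

The word hashing step in the slides raises a question, as the slide shows: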
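As a rough sketch of the convolution and pooling steps, the code below slides a window over a sequence of per-word feature vectors, applies one shared linear map followed by tanh to each window, and then takes the maximum over all positions. The window size of 3, the tanh activation, the tiny vector sizes, and the weight values are assumptions made for illustration; they are not taken from the toolkit.

```python
import math

def convolve(word_vectors, weights, window=3):
    """Apply one shared linear map + tanh to every window of consecutive words."""
    outputs = []
    for start in range(len(word_vectors) - window + 1):
        # Concatenate the feature vectors of the words in this window.
        concat = [x for vec in word_vectors[start:start + window] for x in vec]
        outputs.append(
            [math.tanh(sum(w * x for w, x in zip(row, concat))) for row in weights]
        )
    return outputs

def max_pool(feature_maps):
    """Keep, for each output dimension, the largest value over all positions."""
    return [max(column) for column in zip(*feature_maps)]

# Toy weights (assumed): 2 output dimensions, 3 words x 2 features per window.
weights = [
    [0.5, -0.2, 0.1, 0.3, -0.4, 0.2],
    [-0.1, 0.4, 0.2, -0.3, 0.1, 0.5],
]

# Two hypothetical sentences of different lengths, 2 features per word.
short_sentence = [[1, 0], [0, 1], [1, 1]]
long_sentence = [[1, 0], [0, 1], [1, 1], [0, 0], [1, 0]]

pooled_short = max_pool(convolve(short_sentence, weights))
pooled_long = max_pool(convolve(long_sentence, weights))
print(len(pooled_short), len(pooled_long))
```

Both sentences end up as a vector with one entry per row of `weights`, which is how max pooling gives a fixed-length representation for inputs of different lengths. Sentences shorter than the window would need padding, which this sketch leaves out.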


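To keep the dimensionality of the input under control, the authors use a letter-trigram representation: each word is turned into a set of letter trigrams. This does not look like it would carry over to Chinese. After word segmentation, most Chinese words are only two or three characters long, so would a letter-trigram representation on top of that still work well?

Source: http://weibo.com/1402400261/ChhIgASO1?type=comment#_rnd1431482545348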
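A minimal sketch of letter-trigram word hashing, following the description in the DSSM paper (Huang et al. 2013): a word is wrapped in boundary marks and split into overlapping three-character pieces. The function names are made up for this example, and treating each Chinese character as one "letter" is an assumption used only to illustrate the question raised here.

```python
from collections import Counter

def letter_trigrams(word, boundary="#"):
    """Wrap the word in boundary marks and list its overlapping 3-character pieces."""
    marked = boundary + word + boundary
    return [marked[i:i + 3] for i in range(len(marked) - 2)]

def hash_text(text):
    """Count letter trigrams over all whitespace-separated words in the text."""
    counts = Counter()
    for word in text.lower().split():
        counts.update(letter_trigrams(word))
    return counts

print(letter_trigrams("good"))  # ['#go', 'goo', 'ood', 'od#']
print(letter_trigrams("微軟"))   # a two-character word, one character per "letter"
print(hash_text("good food"))
```

For English, different words share many trigrams ("good" and "food" share `ood` and `od#`), so the trigram vocabulary stays small. Under the one-character-per-letter assumption, a two-character Chinese word yields only two trigrams, each containing the whole word, which may be why the representation looks less promising for segmented Chinese text.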