
Judging the Semantic Similarity of English Sentences


1. Requirements

  This project provides a set of English sentence pairs; the two sentences in each pair are semantically related to some degree. Each pair receives a score between 0 and 5 measuring the semantic similarity of the two sentences: the higher the score, the closer they are in meaning.

For example, each pair is given as a tab-separated line containing an ID, the two sentences, and the gold score. A hypothetical illustration:

id_1    A man is playing a guitar.    A person plays the guitar.    4.2

2. Basic Implementation

2.1 Data Preprocessing

(1) Tokenization:


(2) Stop-word removal: stop words are words that carry little or no meaning on their own, such as particles and interjections. They are high-frequency tokens like a/an/and/are/then, and because high-frequency words badly distort frequency-based scoring formulas, they need to be filtered out.


(3) Lemmatization: this (together with stemming, below) is a peculiarity of Western languages. English words have singular/plural forms and -ing/-ed inflections, but when computing relevance these should be treated as the same word; for example, apple and apples, or doing and done, are forms of one word. The purpose of lemmatization and stemming is to merge these variants.


(4) Stemming: reducing each word to its stem, a more aggressive normalization than lemmatization.

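To make the difference between (3) and (4) concrete, here is a small illustrative snippet (not part of the original pipeline; it assumes the NLTK WordNet data has been downloaded via nltk.download("wordnet")):

from nltk.stem import WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer

lemmatizer = WordNetLemmatizer()
stemmer = SnowballStemmer("english")

# Lemmatization maps an inflected form back to a dictionary word.
print(lemmatizer.lemmatize("apples"))  # apple
# Stemming strips suffixes more aggressively and may leave a non-word.
print(stemmer.stem("apples"))          # appl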

The code for the preprocessing steps above is as follows:

import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer


def data_cleaning(data):
    data["s1"] = data["s1"].str.lower()
    data["s2"] = data["s2"].str.lower()

    # Tokenization: keep alphabetic tokens only
    tokenizer = RegexpTokenizer(r'[a-zA-Z]+')
    data["s1_token"] = data["s1"].apply(tokenizer.tokenize)
    data["s2_token"] = data["s2"].apply(tokenizer.tokenize)

    # Stop-word removal
    stop_words = stopwords.words('english')

    def word_clean_stopword(word_list):
        return [word for word in word_list if word not in stop_words]

    data["s1_token"] = data["s1_token"].apply(word_clean_stopword)
    data["s2_token"] = data["s2_token"].apply(word_clean_stopword)

    # Lemmatization
    lemmatizer = WordNetLemmatizer()

    def word_reduction(word_list):
        return [lemmatizer.lemmatize(word) for word in word_list]

    data["s1_token"] = data["s1_token"].apply(word_reduction)
    data["s2_token"] = data["s2_token"].apply(word_reduction)

    # Stemming
    stemmer = nltk.stem.SnowballStemmer('english')

    def word_stemming(word_list):
        return [stemmer.stem(word) for word in word_list]

    data["s1_token"] = data["s1_token"].apply(word_stemming)
    data["s2_token"] = data["s2_token"].apply(word_stemming)
    return data
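A hedged usage sketch for this function (the file name and the tab-separated id/s1/s2/score layout are taken from the loading code later in the post):

import pandas as pd

# Load the training pairs into a DataFrame and run the cleaning pipeline.
data = pd.read_csv("train_ai-lab.txt", sep="\t", names=["id", "s1", "s2", "score"])
data = data_cleaning(data)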

2.2 Traditional Methods

(1) Bag of words: a detailed description can be found in a separate write-up on the bag-of-words model.

# bag of words
from sklearn.feature_extraction.text import CountVectorizer
def count_vector(words):
    count_vectorizer = CountVectorizer()
    emb = count_vectorizer.fit_transform(words)
    return emb, count_vectorizer

bow_data = data
bow_data["words_bow"] = bow_data["s1"] + bow_data["s2"]
bow_test = bow_data[bow_data.score.isnull()]
bow_train = bow_data[~bow_data.score.isnull()]

list_test = bow_test["words_bow"].tolist()
list_train = bow_train["words_bow"].tolist()
list_labels = bow_train["score"].tolist()

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(list_train, list_labels, test_size=0.2, random_state=42)
X_train_counts, count_vectorizer = count_vector(X_train)
X_test_counts = count_vectorizer.transform(X_test)
test_counts = count_vectorizer.transform(list_test)
# print(X_train_counts.shape, X_test_counts.shape, test_counts.shape)

(2) TF-IDF: a detailed description can be found in a separate write-up on TF-IDF.

# TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
def tfidf(data):
    tfidf_vectorizer = TfidfVectorizer()
    train = tfidf_vectorizer.fit_transform(data)

    return train, tfidf_vectorizer
tf_data = data
tf_data["words_tf"] = tf_data["s1"] + tf_data["s2"]
tf_test = tf_data[tf_data.score.isnull()]
tf_train = tf_data[~tf_data.score.isnull()]
list_tf_test = tf_test["words_tf"].tolist()
list_tf_train = tf_train["words_tf"].tolist()
list_tf_labels = tf_train["score"].tolist()

X_train, X_test, y_train, y_test = train_test_split(list_tf_train, list_tf_labels, test_size=0.2, random_state=42)
X_train_tfidf, tfidf_vectorizer = tfidf(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
test_tfidf = tfidf_vectorizer.transform(list_tf_test)

These features can then be fed into some basic regression algorithms for training and prediction, for example as sketched below.
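A minimal sketch of that step (ridge regression is one reasonable choice; X_train_tfidf, X_test_tfidf, y_train, and y_test come from the TF-IDF code above):

from sklearn.linear_model import Ridge
from scipy.stats import pearsonr

# Fit a simple ridge regressor on the TF-IDF features.
reg = Ridge(alpha=1.0)
reg.fit(X_train_tfidf, y_train)

# Evaluate with the Pearson correlation used throughout this post.
pred = reg.predict(X_test_tfidf)
r, p = pearsonr(pred, y_test)
print("Ridge on TF-IDF, Pearson r:", r)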

3. Word2Vec-Based Approaches

3.1 Training the Word2Vec Model

  A word-vector model is trained on the given corpus and later used to represent the sentences with word vectors; cosine similarity between sentence vectors is then used to score sentence similarity. Unlike the previous methods, the word2vec approach here is unsupervised learning, so the gold scores given in the source data are not used.

(1) First, the corpus that is used:

path_data = "text_small"
path_train_lab = "train_ai-lab.txt"
path_test_lab = "test_ai-lab.txt"
path_other_lab = "sicktest"


def get_sentences():
    """
    Collect the sentences from the data files to use as the corpus.
    :return: list of tokenized sentences
    """
    sentences = []
    with open(path_train_lab) as file:
        for line in file:
            item = line.split('\t')
            sentences.append(prep_sentence(item[1]))
            sentences.append(prep_sentence(item[2]))

    with open(path_test_lab) as file:
        for line in file:
            item = line.split('\t')
            sentences.append(prep_sentence(item[1]))
            sentences.append(prep_sentence(item[2]))

    # # Add extra corpora
    # with open(path_other_lab) as file:
    #     for line in file:
    #         item = line.split('\t')
    #         sentences.append(prep_sentence(item[0]))
    #         sentences.append(prep_sentence(item[1]))

    # sentences += word2vec.Text8Corpus(path_data)
    return sentences
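The helper prep_sentence is not shown in this post; a minimal sketch consistent with how it is used (turning a raw sentence into the lower-cased token list that the word2vec model is trained on) might look like this:

from nltk.tokenize import RegexpTokenizer

def prep_sentence(sentence):
    # Hypothetical reconstruction: lower-case and keep alphabetic tokens,
    # mirroring the tokenization in section 2.1.
    tokenizer = RegexpTokenizer(r'[a-zA-Z]+')
    return tokenizer.tokenize(sentence.lower())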

(2) Training the model:

from gensim.models import Word2Vec


def train_w2v_model(sentences):
    """
    Train the w2v model.
    :param sentences: the corpus
    :return:
    """
    # logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    model = Word2Vec(sentences, size=200, min_count=1, iter=2000, window=10)
    model.save("w2v.mod")
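With the corpus builder above, training is then just train_w2v_model(get_sentences()); the schemes below load the saved model from w2v.mod.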

3.2 Three Basic w2v-Based Schemes

(1) Scoring sentence similarity directly with cosine similarity:
  The sentence vector is the mean of all the word vectors in the sentence, and the cosine similarity between the two sentence vectors is taken directly as the result. Without any extra processing this already reaches a correlation of roughly 0.7.

from scipy.stats import pearsonr


def get_w2v_result():
    """
    Score sentence similarity with cosine similarity over w2v vectors.
    :return: test-set ids, predictions on the training set, predictions on the test set, training-set labels
    """
    model_loaded = Word2Vec.load("w2v.mod")
    x_train = []
    y_train = []
    x_test = []
    with open(path_train_lab) as file:
        for line in file:
            item = line.split('\t')
            # n_similarity is the cosine similarity between the means of the two word sets
            score = model_loaded.n_similarity(prep_sentence(item[1]), prep_sentence(item[2]))
            x_train.append(float(round(score * 5, 2)))
            y_train.append(float(item[3]))
    idx = []
    with open(path_test_lab) as file:
        for line in file:
            item = line.split('\t')
            score = model_loaded.n_similarity(prep_sentence(item[1]), prep_sentence(item[2]))
            x_test.append(float(round(score * 5, 2)))
            idx.append(item[0])

    # Report the training-set correlation
    r, p = pearsonr(x_train, y_train)
    print("Result   w2v:", r)
    return idx, x_train, x_test, y_train

(2) Ridge-regression scheme:
  Since the training data given in the task is labeled, while the method above is completely unsupervised, here we replace the cosine-similarity step with ridge regression over the sentence vectors and their labels. Observing the results, this kind of method does slightly better than the one above.

import numpy as np
from gensim import matutils
from sklearn import linear_model
from sklearn.model_selection import GridSearchCV


def get_LR_result(dataCounter):
    """
    Supervised prediction with ridge regression over word-vector features.
    :return: test-set ids, predictions on the training set, predictions on the test set, training-set labels
    """
    model_loaded = Word2Vec.load("w2v.mod")
    x_train = []
    y_train = []
    # Build features from the labeled training data
    with open(path_train_lab) as file:
        for line in file:
            item = line.split('\t')
            matr1 = get_weighted_sent(dataCounter, prep_sentence(item[1]), model_loaded)
            matr2 = get_weighted_sent(dataCounter, prep_sentence(item[2]), model_loaded)

            data1 = matutils.unitvec(np.array(matr1).mean(axis=0))
            data2 = matutils.unitvec(np.array(matr2).mean(axis=0))
            data = [abs(num) for num in data1 - data2]  # absolute difference of mean word vectors as the feature
            x_train.append(data)
            y_train.append(float(item[3]))
    idx = []
    x_test = []
    with open(path_test_lab) as file:
        for line in file:
            item = line.split('\t')
            matr1 = get_weighted_sent(dataCounter, prep_sentence(item[1]), model_loaded)
            matr2 = get_weighted_sent(dataCounter, prep_sentence(item[2]), model_loaded)
            data1 = matutils.unitvec(np.array(matr1).mean(axis=0))
            data2 = matutils.unitvec(np.array(matr2).mean(axis=0))
            data = [abs(num) for num in data1 - data2]  # absolute difference of mean word vectors as the feature
            x_test.append(data)
            idx.append(item[0])
    x_train, x_test = LR(x_train, y_train, x_test)
    return idx, x_train, x_test, y_train


def LR(x_train, y_train, x_test):
    """
    Train a ridge-regression model for supervised learning.
    :param x_train: training features
    :param y_train: training-set scores
    :param x_test:  test features
    :return:
    """
    clf = linear_model.Ridge()
    alpha_can = np.logspace(-3, 2, 10)
    clf = GridSearchCV(clf, param_grid={'alpha': alpha_can}, cv=5)  # cross-validated search over alpha
    clf.fit(x_train, y_train)
    pred_x_train = clf.predict(x_train)
    pred_x_test = clf.predict(x_test)

    r, p = pearsonr(pred_x_train, y_train)  # report the training-set correlation
    print("Result   LR:", r)
    return pred_x_train, pred_x_test

(3) Weighted word-vector scheme:
  Finally, in the literature we found that the ICLR 2017 paper "A Simple but Tough-to-Beat Baseline for Sentence Embeddings" studies exactly this kind of word-vector weighting. Its key idea is to weight each word vector by word frequency when forming the sentence vector, giving each word the weight a/(a + p(w)), where p(w) is the word's corpus frequency and a is a small constant, and then to remove the first principal component (found by PCA) from the sentence vectors; the paper reports very good results with this. We reimplemented the scheme and used it for this project's predictions, but strangely our experimental results did not reach the expected level and in fact fell below both of the results above.

from sklearn.metrics.pairwise import cosine_similarity


def get_PCA_result(dataCounter):
    """
    Algorithm from the ICLR 2017 paper:
    weight the word vectors by word frequency, extract the first principal
    component with PCA, subtract it from the sentence vectors, then compute
    cosine similarity.
    :return: test-set ids, predictions on the training set, predictions on the test set, training-set labels
    """
    model_loaded = Word2Vec.load("w2v.mod")
    y_train = []
    # Build sentence vectors from the labeled training data
    all_train = []
    with open(path_train_lab) as file:
        for line in file:
            item = line.split('\t')

            matr1 = get_weighted_sent(dataCounter, prep_sentence(item[1]), model_loaded)
            matr2 = get_weighted_sent(dataCounter, prep_sentence(item[2]), model_loaded)

            data1 = matutils.unitvec(np.array(matr1).mean(axis=0))
            data2 = matutils.unitvec(np.array(matr2).mean(axis=0))
            all_train.append(data1)
            all_train.append(data2)
            y_train.append(float(item[3]))
    idx = []
    all_test = []
    with open(path_test_lab) as file:
        for line in file:
            item = line.split('\t')

            matr1 = get_weighted_sent(dataCounter, prep_sentence(item[1]), model_loaded)
            matr2 = get_weighted_sent(dataCounter, prep_sentence(item[2]), model_loaded)

            data1 = matutils.unitvec(np.array(matr1).mean(axis=0))
            data2 = matutils.unitvec(np.array(matr2).mean(axis=0))
            all_test.append(data1)
            all_test.append(data2)
            idx.append(item[0])

    all_train = PCA_prep(all_train)
    all_test = PCA_prep(all_test)
    x_train = []
    # Compute similarities; the two vectors of a pair sit at adjacent even/odd indices
    for i in range(len(all_train)):
        if i % 2 == 0:
            sim = cosine_similarity([all_train[i]], [all_train[i + 1]])
            x_train.append(sim[0][0] * 5)

    x_test = []
    for i in range(len(all_test)):
        if i % 2 == 0:
            sim = cosine_similarity([all_test[i]], [all_test[i + 1]])
            x_test.append(sim[0][0] * 5)

    r, p = pearsonr(x_train, y_train)  # report the training-set correlation
    print("Result   PCA :", r)

    return idx, x_train, x_test, y_train
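The helper PCA_prep is likewise not included in this post; a minimal sketch consistent with the description above (remove each sentence vector's projection onto the first principal component) could be:

from sklearn.decomposition import PCA

def PCA_prep(vectors):
    # Hypothetical reconstruction: subtract the projection onto the
    # first principal component, as in the ICLR 2017 baseline.
    vectors = np.array(vectors)
    pca = PCA(n_components=1)
    pca.fit(vectors)
    u = pca.components_[0]                       # first principal component (unit vector)
    proj = vectors.dot(u)[:, None] * u[None, :]  # per-vector projection onto u
    return vectors - proj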
import collections


def get_freq(sentences):
    """
    Build a word-frequency dictionary over the corpus.
    :param sentences:
    :return:
    """
    dataCounter = []
    for sent in sentences:
        dataCounter += sent
    word_len = len(dataCounter)
    dataCounter = collections.Counter(dataCounter)
    for key in dataCounter:
        dataCounter[key] = dataCounter[key] / word_len  # normalize counts to frequencies
    return dataCounter


def get_weighted_sent(dataCounter, word_list, model_loaded):
    """
    Weight the word vectors of a sentence by word frequency.
    :param dataCounter: word-frequency dictionary
    :param word_list: list of words in the sentence
    :param model_loaded: the word-vector model
    :return:
    """
    a = 1e-3 / 4
    v_word_list = []
    for word in word_list:
        count = dataCounter[word]
        # SIF-style weight: a / (a + p(w))
        v_word_list.append(a / (a + count) * model_loaded.wv[word])
    return v_word_list

3.3 Stacking the Results

from sklearn.linear_model import LinearRegression


def stacking_result(w2v_x_train, LR_x_train, PCA_x_train, w2v_x_test, LR_x_test, PCA_x_test, label):
    """
    Combine the three sets of predictions by stacking.
    :param w2v_x_train: training predictions from the cosine-similarity scheme
    :param LR_x_train: training predictions from the ridge-regression scheme
    :param PCA_x_train: training predictions from the weighted-vector scheme
    :param w2v_x_test: test predictions from the cosine-similarity scheme
    :param LR_x_test: test predictions from the ridge-regression scheme
    :param PCA_x_test: test predictions from the weighted-vector scheme
    :param label: training-set labels
    :return:
    """
    x_train = [[w2v_score, LR_score, PCA_score] for w2v_score, LR_score, PCA_score in zip(w2v_x_train, LR_x_train, PCA_x_train)]
    x_test = [[w2v_score, LR_score, PCA_score] for w2v_score, LR_score, PCA_score in zip(w2v_x_test, LR_x_test, PCA_x_test)]
    model = LinearRegression()
    model.fit(x_train, label)
    predicted_train = model.predict(x_train)
    predicted_test = model.predict(x_test)

    r, p = pearsonr(predicted_train, label)  # report the training-set correlation
    print("Result   stacking:", r)
    return predicted_test
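For reference, a hedged sketch of how the pieces above could be wired together (the function names and return values are all as defined in this post):

sentences = get_sentences()
train_w2v_model(sentences)
dataCounter = get_freq(sentences)

idx, w2v_train, w2v_test, labels = get_w2v_result()
_, lr_train, lr_test, _ = get_LR_result(dataCounter)
_, pca_train, pca_test, _ = get_PCA_result(dataCounter)

predicted_test = stacking_result(w2v_train, lr_train, pca_train,
                                 w2v_test, lr_test, pca_test, labels)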

Results:

Result   w2v: 0.770842157582
Result   LR: 0.761403048811
Result   PCA : 0.728098131446
Result   stacking: 0.820756499196
end...
