(5) Data Processing for N-gram Language Models

阿新 • Published: 2018-12-09

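1. Steps

Dataset: a passage of English text.
(1) Tokenization: split the raw English text into words, preserving only the order of the words. Multiple sentences are tokenized together as a single sequence.
(2) N-gram counting: count frequencies over runs of n consecutive words. For example, with n=2, "I love NLP I enjoy it" is split into ([I love], [love NLP], [NLP I], ...), and the occurrences of [I love], [love NLP], and so on are counted separately. (Naive Bayes counts single words only; here we count n adjacent, order-dependent words.)
(3) Sort the counted n-grams by frequency in descending order and take the top m as the feature vector.
The remaining steps are the same as for text classification.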
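Step (2) can be illustrated on the example sentence. The following is only a minimal sketch, assuming plain whitespace tokenization and using `collections.Counter`; the helper name `count_ngrams` is hypothetical and is not part of the original code.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    grams = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams)


# Assumption: plain whitespace tokenization, no punctuation handling.
tokens = "I love NLP I enjoy it".split()

bigram_counts = count_ngrams(tokens, 2)   # n = 2, as in the example above
unigram_counts = count_ngrams(tokens, 1)  # n = 1, the single-word case

print(bigram_counts)
print(unigram_counts)
```

With n=2 each of the five bigrams occurs once, while with n=1 the word "I" is counted twice. That difference is the point of the comparison with single-word counting.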

2. Code

# -*- coding: utf-8 -*-
import operator
import re
import string
from urllib.request import urlopen


def cleanText(text):
    text = re.sub(r'\n+', " ", text).lower()  # replace newlines with a space, then lowercase
    text = re.sub(r'\[[0-9]*\]', "", text)  # remove citation markers such as [1]
    text = re.sub(r' +', " ", text)  # collapse runs of spaces into a single space
    return text


def cleanInput(text):
    text = cleanText(text)
    words = []
    for item in text.split(' '):  # split on spaces, which returns a list
        item = item.strip(string.punctuation)  # strip leading and trailing punctuation
        # keep words longer than one character, plus the one-letter words 'a' and 'i'
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
            words.append(item)
    return words


def getNgrams(text, n):
    # Split the English text into words, keeping each word's original order
    words = cleanInput(text)

    output = {}  # maps each n-gram to its count
    for i in range(len(words) - n + 1):
        ngramTemp = " ".join(words[i:i + n])
        if ngramTemp not in output:  # count occurrences
            output[ngramTemp] = 0
        output[ngramTemp] += 1
    return output


# Fetch the data; content is a passage of English text
url = "http://pythonscraping.com/files/inaugurationSpeech.txt"
content = urlopen(url).read().decode("utf-8")
# 2-grams
ngrams = getNgrams(content, 2)
sortedNGrams = sorted(ngrams.items(), key=operator.itemgetter(1), reverse=True)  # reverse=True sorts in descending order
print(sortedNGrams)
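Step (3), keeping the m most frequent n-grams as features, could look like the sketch below. It assumes a list of (n-gram, count) pairs sorted in descending order, shaped like `sortedNGrams`. The sample counts, the value of m, and the helper name `to_feature_vector` are all made up for illustration and do not come from the original post.

```python
def to_feature_vector(ngram_counts, vocabulary):
    """Map a dict of n-gram counts onto a fixed-order count vector (hypothetical helper)."""
    return [ngram_counts.get(gram, 0) for gram in vocabulary]


# Illustrative stand-in for sortedNGrams; these counts are invented.
sorted_ngrams = [("of the", 5), ("in the", 3), ("the people", 2), ("we the", 1)]

m = 3  # assumption: number of n-grams kept as features
vocabulary = [gram for gram, _ in sorted_ngrams[:m]]

# Counts for another document, shaped like the dict returned by getNgrams.
doc_counts = {"of the": 2, "we the": 4}
features = to_feature_vector(doc_counts, vocabulary)

print(vocabulary)
print(features)
```

N-grams outside the top m (here "we the") are simply ignored, so every document maps to a vector of the same length m.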
