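Natural Language Processing Assignment A1
- Task 1: Convert the HTML into JSON data, then use Python's json package to turn the JSON data into data structures Python can work with (dicts, lists, …) (chaos2json.py)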
Your implementation should have at least one regular expression (to extract the textual content of each line), and use NLTK’s word_tokenize function as the tokenizer. You may also use built-in string methods/operations and write your own helper functions.
The word_tokenize function does not separate hyphens, but this text uses hyphens in place of dashes, so your code should separate them.
Hint 1: The HTML contains (nonstandard) tags like at the beginning of each line. The number is the line within the stanza (between 1 and 4). Ignore ellipsis lines indicating removed stanzas.
Hint 2: When converting to JSON, use the indent argument to make it more human-readable.
(This script should not take extremely long to implement, but it will probably take you longer than you expect.)
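The hyphen requirement can also be handled after tokenization rather than before it. The following is a minimal sketch under that assumption: `split_hyphens` is a hypothetical helper, and the token list stands in for the output of `word_tokenize`. The script in this post takes a different route and replaces hyphens with spaces before tokenizing.

```python
import re


def split_hyphens(tokens):
    """Split every token on hyphens, keeping each hyphen as its own token."""
    result = []
    for token in tokens:
        # A capturing group in re.split keeps the separator in the output;
        # empty strings (from leading/trailing hyphens) are dropped.
        result.extend(part for part in re.split(r'(-)', token) if part)
    return result


# Stand-in for word_tokenize output on a line that contains a hyphen
print(split_hyphens(["studding-sail", ",", "choir"]))
```

Keeping the hyphen as a token preserves the original text; dropping it (as the script does) gives cleaner tokens for the rhyme lookup.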
from urllib import request
from bs4 import BeautifulSoup
from nltk import word_tokenize
import re
import json
url = 'file:///E:/學習文件/資料集/a1/chaos.html'
# Open the URL and return the raw HTML bytes
def open_url(url):
    # Build a request object for the given URL
    req = request.Request(url)
    # Add a User-Agent header so the request looks like it comes from a browser
    req.add_header('User-Agent',
                   'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36')
    # Send the request
    response = request.urlopen(req)
    # Return the HTML that was fetched
    return response.read()
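# Locate each line of the poem with a regular expression
def find_tag(url, regex = '<xxx.>(.*?)(<br>|</p>)'):
    # (.*?)
    # .  matches any character except \n
    # *  matches zero or more of the preceding item
    # ?  on its own means zero or one of the preceding item; after * it makes the match non-greedy
    # () is a capturing group: re.findall returns one tuple per match,
    #    holding the text captured by each group
    # So (.*?) captures the shortest run of non-newline characters
    # that is followed by <br> or </p>
    html = open_url(url).decode('utf-8')
    # Hyphen filter:
    # turn "Recipe, pipe, studding-sail, choir" into "Recipe, pipe, studding sail, choir"
    html = re.sub('-', ' ', html)
    result = re.findall(regex, html)
    return result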
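# Find the rhyme word: a line may end in punctuation, which is not the rhyme word and must be skipped
import string

def find_rhymeWord(tokens):
    # Walk backwards from the last token
    for token in reversed(tokens):
        # Skip tokens made up only of punctuation (this also covers the `` and '' quote tokens)
        if all(ch in string.punctuation for ch in token):
            continue
        return token
    # Falls through and returns None when the line contains no word at all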
tag = find_tag(url)
setDict = []
numStanza = 1
count = 1  # line number within the stanza; this should really be read from the xxx. tag, but I was lazy
switch = 0
for i in range(len(tag)):
    # A captured line that starts with <p> is the first line of a stanza
    if tag[i][0][0:3] == '<p>':
        if switch == 1:
            # Store the previous stanza and reset the counters
            setDict.append(dictionary)
            numStanza += 1
            switch = 0
            count = 1
        dictionary = dict()
        dictionary['stanza'] = numStanza
        # Strip the markup (including <tt>...</tt>, which only holds HTML escapes) with
        # BeautifulSoup, then remove the leftover non-breaking spaces with re.sub
        text = re.sub('\\xa0', '', BeautifulSoup(tag[i][0], "lxml").get_text())
        tokens = word_tokenize(text)
        dictionary["lines"] = [{"lineId": '{}-{}'.format(numStanza, count), "lineNum": count, "text": text,
                                "tokens": tokens, "rhymeWord": find_rhymeWord(tokens)}]
    else:
        switch = 1
        count += 1
        text = re.sub('\\xa0', '', BeautifulSoup(tag[i][0], "lxml").get_text())
        tokens = word_tokenize(text)
        dictionary["lines"].append({"lineId": '{}-{}'.format(numStanza, count), "lineNum": count, "text": text,
                                    "tokens": tokens, "rhymeWord": find_rhymeWord(tokens)})
# The loop only stores a stanza when the next one begins, so store the last one here
if tag:
    setDict.append(dictionary)
js = json.dumps(setDict, indent=4)
print(js)
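The comment on `count` in the script admits that the line number should really be read from the tag itself. Here is a minimal sketch of that variant. It assumes the tags look like `<xxx1>` to `<xxx4>` (inferred from the regex in the script, not checked against chaos.html); the sample HTML, `LINE_RE`, and `parse_stanzas` are made up for illustration, and tokenization is left out to keep the example dependency-free.

```python
import json
import re

# Hypothetical sample; the tag name and layout are assumptions
SAMPLE = (
    "<p><xxx1>one alpha<br>\n"
    "<xxx2>two beta<br>\n"
    "<xxx3>three gamma<br>\n"
    "<xxx4>four delta</p>\n"
    "<p><xxx1>five epsilon<br>\n"
    "<xxx2>six zeta</p>\n"
)

# Group 1 captures the line number inside the tag, group 2 the line text;
# (?:...) is non-capturing, so the terminator is not returned by findall.
LINE_RE = re.compile(r'<xxx(\d)>(.*?)(?:<br>|</p>)')


def parse_stanzas(html):
    """Group the extracted lines into stanzas, using the number in each tag."""
    stanzas = []
    for num, text in LINE_RE.findall(html):
        line_num = int(num)
        # Line 1 opens a new stanza (so does the first line seen, as a safeguard)
        if line_num == 1 or not stanzas:
            stanzas.append({"stanza": len(stanzas) + 1, "lines": []})
        stanza = stanzas[-1]
        stanza["lines"].append({
            "lineId": "{}-{}".format(stanza["stanza"], line_num),
            "lineNum": line_num,
            "text": text.strip(),
        })
    return stanzas


# indent=4 makes the JSON easier to read, as Hint 2 suggests
print(json.dumps(parse_stanzas(SAMPLE), indent=4))
```

Because each stanza is appended as soon as its first line is seen, no bookkeeping flag is needed and the final stanza cannot be lost.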
- Task 2: Look up every rhyming word in cmudict and add its possible pronunciations to the JSON data (allpron.py)
How many rhyming words are NOT found in cmudict (they are “out-of-vocabulary”, or “OOV”)? In your code, leave a comment indicating how many and give a few examples.
import cmudict
import json
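# cmudict.entries() is a list of (word, pronunciation) tuples;
# cmudict.words() is a list of the words, used here as an index into it:
# index = words.index('apple')
# print(pron[index])
# > ('apple', ['AE1', 'P', 'AH0', 'L'])
# pron[index][1] is the pronunciation we need
pron = cmudict.entries()
words = cmudict.words()
# js is the output of the previous task
setDict = json.loads(js)
list_OOV = []
for i in setDict:
    for j in i['lines']:
        # The word may not be in cmudict
        try:
            # list.index returns the first match, so only the first listed pronunciation is stored
            j['rhymeProns'] = pron[words.index(j['rhymeWord'].lower())][1]
        except (ValueError, AttributeError):
            # ValueError: the word is not in cmudict; AttributeError: no rhyme word was found (None)
            j['rhymeProns'] = 0
            list_OOV.append(j['rhymeWord'])
print(list_OOV)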
['Terpsichore', 'reviles', 'endeavoured', 'tortious', 'clangour', 'hygienic', 'inveigle', 'mezzotint', 'Cholmondeley', 'obsequies', 'dumbly', 'vapour', 'fivers', 'gunwale']
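Since the task asks for all possible pronunciations, one option is to group the entries by word first and then look each rhyme word up in that mapping. The sketch below assumes the entries are (word, phones) pairs as shown in the comments of the script; `ENTRIES` is a tiny illustrative stand-in for `cmudict.entries()`, and `build_lookup` and `annotate` are hypothetical helper names.

```python
from collections import defaultdict

# Illustrative stand-in for cmudict.entries(): one pair per pronunciation
ENTRIES = [
    ("apple", ["AE1", "P", "AH0", "L"]),
    ("read", ["R", "EH1", "D"]),
    ("read", ["R", "IY1", "D"]),
]


def build_lookup(entries):
    """Map each word to the list of all its pronunciations."""
    lookup = defaultdict(list)
    for word, phones in entries:
        lookup[word].append(phones)
    return dict(lookup)


def annotate(lines, lookup):
    """Add 'rhymeProns' to every line and return the out-of-vocabulary words."""
    oov = []
    for line in lines:
        word = (line.get("rhymeWord") or "").lower()
        prons = lookup.get(word, [])
        line["rhymeProns"] = prons  # empty list when the word is OOV
        if not prons:
            oov.append(line["rhymeWord"])
    return oov


lines = [{"rhymeWord": "Read"}, {"rhymeWord": "Terpsichore"}]
print(annotate(lines, build_lookup(ENTRIES)))
```

A dictionary lookup also avoids scanning the whole word list for every rhyme word, which `words.index` does.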
- Task 3: Use a heuristic to decide whether two pronunciations rhyme; near rhymes do not count (exact_rhymes.py)
How many pairs of lines that are supposed to rhyme actually have rhyming pronunciations according to your heuristic? For how many lines does having the rhyming line help you disambiguate between multiple possible pronunciations? What are some reasons that your heuristic is imperfect?
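I did not feel like finishing this one. A possible approach is to compare the final sounds of each line's rhyme word, but how many final sounds should be compared? One rule would be to scan from the end and require every phoneme to match; at the first mismatch, check whether that phoneme is a non-vowel. (A differing final non-vowel can still rhyme, for example s with z; that could be treated as a separate rule.)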
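One common way to pin down "how many final sounds" is to compare everything from the last stressed vowel to the end of the word. The sketch below uses that rule, which is a swapped-in heuristic and not the one outlined above. It relies on the ARPAbet convention that vowels end in a stress digit (0, 1 or 2); `rhyme_part`, `exact_rhyme`, and `rhyming_pairs` are hypothetical names.

```python
def rhyme_part(phones):
    """Return the phonemes from the last stressed vowel to the end."""
    # Prefer the last vowel with primary (1) or secondary (2) stress
    for i in range(len(phones) - 1, -1, -1):
        if phones[i][-1] in "12":
            return phones[i:]
    # Otherwise fall back to the last vowel of any stress
    for i in range(len(phones) - 1, -1, -1):
        if phones[i][-1].isdigit():
            return phones[i:]
    # No vowel at all: compare the whole pronunciation
    return phones


def exact_rhyme(phones_a, phones_b):
    """Two pronunciations rhyme exactly when their rhyme parts are identical."""
    return rhyme_part(phones_a) == rhyme_part(phones_b)


def rhyming_pairs(prons_a, prons_b):
    """All combinations of pronunciations of two words that rhyme exactly."""
    return [(a, b) for a in prons_a for b in prons_b if exact_rhyme(a, b)]


# A word with two pronunciations paired with a word that has one
read = [["R", "EH1", "D"], ["R", "IY1", "D"]]
bed = [["B", "EH1", "D"]]
print(rhyming_pairs(read, bed))
```

When exactly one pair survives, the rhyming line has disambiguated the pronunciation. The rule is imperfect: it rejects near rhymes such as s with z, it also accepts a word paired with itself, and it cannot judge words missing from cmudict.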