Splitting an article with Python
阿新 • Posted: 2018-11-11
A beginner's log, with very simple and basic content. Not sure how much I'll update it; the goal is to automatically solve the matching questions in English reading-comprehension exercises.
Knowledge involved:
- Python
- the pandas and numpy libraries
- applying tf-idf
What you need:
- several ready-made English reading passages and their answers (this isn't machine learning; the answers are just for checking accuracy yourself)
- I personally use jupyter_notebook
cell 1 18/10/17 Splitting the article
```python
import numpy as np
import pandas as pd

symbol = [".", "'", ""]
presymbol = [("'s", " is"), ("'re", " are"), ("n't", " not"),
             ("'ve", " have"), ("'m", " am"),  # contractions
             ("'", ""), (",", " "), ('"', ""), ("?", "."), ("!", ".")]

text = []
with open("1.txt") as file:
    text = file.read().strip().lower()
for j in presymbol:
    text = text.replace(j[0], j[1])

paragraphs = text.split("\n")
raw_passage = []
sentence = []
for i in paragraphs:
    if i != "":  # i is one paragraph
        temp_paragraph = []
        for j in i.split("."):
            if j != "":
                sentence = [word for word in j.split(" ")
                            if word not in symbol and word != ""]
                if sentence != []:
                    print(sentence)
                    temp_paragraph.append(sentence)
        raw_passage.append(temp_paragraph)
```
This splits the text into three nested lists, in order paragraph → sentence → word. Contractions and some punctuation need a bit of preprocessing first; very simple.
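The nesting can be checked on a tiny made-up string (a sketch; the sample text is a placeholder, and the contraction/symbol preprocessing is skipped for brevity):

```python
# Placeholder text standing in for the contents of 1.txt
text = "the cat sat. the cat ran.\nthe dog barked."

paragraphs = text.split("\n")
raw_passage = []
for p in paragraphs:
    if p != "":
        temp_paragraph = []
        for s in p.split("."):
            # drop empty strings produced by leading spaces and trailing dots
            words = [w for w in s.split(" ") if w != ""]
            if words:
                temp_paragraph.append(words)
        raw_passage.append(temp_paragraph)

# raw_passage[paragraph][sentence][word]:
# raw_passage[0] holds two sentences, raw_passage[1] holds one
```

Indexing goes paragraph first, then sentence, then word, which is exactly the order the later cells iterate in.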
cell 2 18/10/18 Counting the words that appear
```python
appeared_word = set()
for rp in raw_passage:
    for rs in rp:
        for word in rs:
            appeared_word.add(word)
```
Very simple and basic; it's only a separate cell so that appeared_word can be inspected directly at the bottom of it.
cell 3 18/10/18 Initializing the DataFrame and counting the words
```python
word_count = pd.DataFrame(
    np.zeros((len(raw_passage) + 1, len(appeared_word))).astype(int),
    index=range(len(raw_passage) + 1),
    columns=list(appeared_word))  # row 0 is left empty; counts go in rows 1..n
for i, rp in enumerate(raw_passage):
    for rs in rp:
        for word in rs:
            word_count[word][i + 1] += 1
# Prepend a row of whole-passage totals. The original hard-coded
# reshape((1, 919)) for the vocabulary size; len(appeared_word) is safer.
# (DataFrame.append was removed in pandas 2.x; pd.concat does the same.)
words_data = pd.DataFrame(
    np.sum(np.array(word_count.iloc[1:]), axis=0).reshape((1, len(appeared_word))),
    index=[1],
    columns=list(appeared_word)).append(word_count)
```
Since I'm pretty green, this part took ages of digging through the pandas docs; there may well be a simpler way to write it.
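One possibly simpler route (a sketch, assuming the same raw_passage nesting as cell 1; the toy data here is made up) builds one Counter per paragraph and lets pandas align the columns:

```python
from collections import Counter
import pandas as pd

# Toy stand-in for raw_passage: paragraphs -> sentences -> words
raw_passage = [
    [["the", "cat", "sat"], ["the", "cat", "ran"]],
    [["a", "dog", "sat"]],
]

# One Counter per paragraph, flattening its sentences
rows = [Counter(word for sent in rp for word in sent) for rp in raw_passage]

# pandas aligns the word columns and fills missing words with NaN
word_count = pd.DataFrame(rows).fillna(0).astype(int)

# Column sums give the whole-passage totals (the role of the extra
# row in words_data above)
totals = word_count.sum(axis=0)
```

This avoids preallocating a zero matrix and indexing cell by cell.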
To reduce the influence of words like "and" during text matching, tf-idf can be used.
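As a rough sketch of how tf-idf would down-weight such words here (the toy count matrix below stands in for the per-paragraph rows of word_count; the formula is the standard log(N/df) variant, not taken from the original post):

```python
import numpy as np
import pandas as pd

# Toy paragraph-by-word count matrix: 3 paragraphs, 4 words
counts = pd.DataFrame(
    [[2, 1, 0, 1],
     [1, 0, 3, 1],
     [1, 2, 0, 1]],
    columns=["and", "cat", "dog", "the"])

# tf: term frequency, normalized by paragraph length
tf = counts.div(counts.sum(axis=1), axis=0)

# idf: log(N / df), df = number of paragraphs containing the word
n_docs = len(counts)
df = (counts > 0).sum(axis=0)
idf = np.log(n_docs / df)

tfidf = tf * idf
# "and" and "the" appear in every paragraph, so idf = log(3/3) = 0
# and their tf-idf weight vanishes, which is exactly the point.
```

Matching a question against paragraphs then compares rows of tfidf instead of raw counts, so ubiquitous words contribute nothing to the score.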