
Splitting an article with Python

A beginner's log. Very simple, basic content, and I'm not sure how much it will be updated. The goal is to automatically solve English reading-comprehension matching questions.

Knowledge involved:
  • Python
  • the pandas and numpy libraries
  • applying tf-idf
What you need:
  • several existing English reading passages and their answers (this is not machine learning; the answers are only for testing accuracy yourself)
  • I personally use Jupyter Notebook
cell 1 18/10/17 Splitting the article
import numpy as np
import pandas as pd
symbol=[".","'",'']
presymbol=[("\'s"," is"),('\'re'," are"),('n\'t'," not"),('\'ve',' have'),("\'m",' am')  ## expand contractions
           ,("\'",""),(","," "),('\"',""),("?","."),("!",".")]

with open("1.txt") as file:
    text=file.read().strip().lower()
    for j in presymbol:
        text=text.replace(j[0],j[1])
    paragraphs=text.split("\n")
    
raw_passage=[]
sentence=[]
for i in paragraphs:
    if i!="":   ## i is one paragraph
        temp_paragraph=[]
        for j in i.split("."):
            if j!="":
                sentence=[ word for word in j.split(" ") if (word not in symbol and word!="")]
                if sentence!=[]:
                    print(sentence)
                    temp_paragraph.append(sentence)
        raw_passage.append(temp_paragraph)

The split produces three nested lists, one level each for paragraph, sentence, and word. Contractions and some punctuation need a little preprocessing first; it is all very simple.
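The nested structure can be seen on a toy text (the sample string here is hypothetical, just to illustrate the same paragraph → sentence → word split):

```python
# A minimal sketch of the same paragraph -> sentence -> word split, on a toy text
toy = "the cat sat. it slept.\nthe end."
passage = []
for para in toy.split("\n"):                                   # one string per paragraph
    if para != "":
        # split into sentences on ".", then each sentence into words
        passage.append([s.split() for s in para.split(".") if s.strip() != ""])
# passage is now [[['the','cat','sat'], ['it','slept']], [['the','end']]]
```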

cell 2 18/10/18 Counting the words that appear
appeared_word=set()
for rp in raw_passage:
    for rs in rp:
        for word in rs:
            appeared_word.add(word)

Very simple and basic; it only lives in its own cell so that `appeared_word` can be inspected directly at the bottom.

cell 3 18/10/18 Initializing the DataFrame and counting word occurrences
word_count=pd.DataFrame(np.zeros((len(raw_passage),len(appeared_word)),dtype=int),
                        index=range(1,len(raw_passage)+1),   ## one row per paragraph, numbered from 1
                        columns=sorted(appeared_word))
for i,rp in enumerate(raw_passage):
    for rs in rp:
        for word in rs:
            word_count.loc[i+1,word]+=1
## prepend a row of per-word totals over the whole passage
totals=pd.DataFrame(word_count.sum(axis=0)).T
words_data=pd.concat([totals,word_count])
Being fairly new to this, I spent a long time looking up pandas functions for this part; there is probably a simpler way to write it.
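One possibly simpler alternative (a sketch only, assuming the same nested `raw_passage` structure; the sample data below is hypothetical) is to build one `Counter` per paragraph and let pandas align the columns:

```python
from collections import Counter
import pandas as pd

# hypothetical nested structure: paragraphs -> sentences -> words
raw_passage = [
    [["the", "cat", "sat"], ["it", "slept"]],
    [["the", "end"]],
]

# one Counter per paragraph, over all words in all its sentences
rows = [Counter(w for sent in para for w in sent) for para in raw_passage]
word_count = pd.DataFrame(rows).fillna(0).astype(int)
word_count.index = range(1, len(raw_passage) + 1)   # paragraphs numbered from 1
# column totals across all paragraphs, as a one-row frame on top
totals = word_count.sum(axis=0).to_frame().T
words_data = pd.concat([totals, word_count])
```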


To reduce the influence of words like "and" during text matching, tf-idf can be used.
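A minimal sketch of how tf-idf would down-weight such words, computed directly from a per-paragraph count table (the toy counts below are hypothetical; rows stand for paragraphs, columns for words):

```python
import numpy as np
import pandas as pd

# toy per-paragraph word counts (rows: paragraphs, columns: words)
counts = pd.DataFrame(
    [[3, 0, 1],
     [1, 2, 0]],
    columns=["and", "science", "river"],
)

tf = counts.div(counts.sum(axis=1), axis=0)   # term frequency within each paragraph
df = (counts > 0).sum(axis=0)                 # number of paragraphs containing each word
idf = np.log(len(counts) / df)                # inverse document frequency
tfidf = tf * idf                              # idf broadcasts across the rows
# "and" appears in every paragraph, so its idf is log(1) = 0 and its tf-idf vanishes
```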