Programming Collective Intelligence, Chapter 3: Counting How Often Words Appear in a Given Blog Feed

阿新 • Published: 2020-08-28

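A common way to prepare data for a clustering algorithm is to define a shared set of numeric attributes that can be used to compare the data items with one another.
In this dataset, the items being clustered are a collection of blogs, and the attributes are the number of times particular words appear in each blog's feed.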
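
To make the idea of shared numeric attributes concrete, here is a minimal sketch with made-up data: two hypothetical blogs, each described by counts over the same word list. The blog names, words, and counts are invented for illustration and do not come from the feeds used in this post.

```python
# Hypothetical word counts for two blogs (illustrative data only)
counts = {
    'Blog A': {'python': 3, 'cluster': 1},
    'Blog B': {'cluster': 2, 'qt': 4},
}

# The shared attributes: one column per word, in a fixed order
vocabulary = sorted({word for wc in counts.values() for word in wc})

# Each blog becomes a vector of counts over the same columns
vectors = {blog: [wc.get(word, 0) for word in vocabulary]
           for blog, wc in counts.items()}

print(vocabulary)
for blog, vec in vectors.items():
    print(blog, vec)
```

Because every blog is measured against the same columns, any two rows can be compared directly, which is what a clustering algorithm needs.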
The book provides a ready-made dataset, but that site is no longer reachable, so I test here with an arbitrary handful of feeds instead.
feedlist.txt

https://blog.csdn.net/liang19890820/rss/list
https://blog.csdn.net/xiaoquantouer/rss/list
https://blog.csdn.net/u011240877/rss/list
https://blog.csdn.net/sunhuaqiangl/rss/list

I covered how to install the feedparser package in my previous post.

Output of the code below:

blogdata.txt

Now for the main part. The code below is the version that actually runs; I changed the parts of the book's code that did not work.

import feedparser as fe
import re

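# Count the words in a feed

def getwordcounts(url):
    # Parse the feed and count words across all of its entries
    d = fe.parse(url)
    wc = {}
    for e in d.entries:
        # Some entries expose 'summary', others only 'description'
        if 'summary' in e:
            summary = e.summary
        else:
            summary = e.description

        # Extract the words from the entry title and summary
        words = getwords(e.title + ' ' + summary)
        for word in words:
            wc.setdefault(word, 0)
            wc[word] += 1
    # Return only after every entry has been counted
    return d.feed.title, wc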
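
The counting loop in getwordcounts can be tried without any network access. The sketch below is a stand-in under stated assumptions: it uses plain dictionaries in place of parsed feed entries, a simplified whitespace split in place of getwords, and collections.Counter in place of setdefault. The helper name count_words and the sample entries are hypothetical.

```python
from collections import Counter

def count_words(entries):
    # entries: list of dicts with 'title' and either 'summary' or 'description'
    wc = Counter()
    for e in entries:
        # Fall back to 'description' when 'summary' is missing
        summary = e['summary'] if 'summary' in e else e['description']
        # Simplified tokenizer for this sketch: lowercase, split on whitespace
        wc.update((e['title'] + ' ' + summary).lower().split())
    # Return after all entries are processed, not inside the loop
    return dict(wc)

# Made-up entries (illustrative data only)
sample_entries = [
    {'title': 'Qt tips', 'summary': 'qt signals and slots'},
    {'title': 'More Qt', 'description': 'layouts in qt'},
]
result = count_words(sample_entries)
print(result)
```

The counts accumulate across both entries; if the return statement sat inside the loop, only the first entry would be counted.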

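# getwordcounts passes each summary to getwords, which strips out all HTML tags, splits the text on non-alphabetic characters, and returns the resulting words as a list.
def getwords(html):
    # Remove all HTML tags
    txt = re.compile(r'<[^>]+>').sub('', html)
    # Split into words on runs of non-alphabetic characters
    words = re.compile(r'[^A-Za-z]+').split(txt)
    # Convert to lowercase, dropping empty strings
    return [word.lower() for word in words if word != '']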
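
As a quick check of the behavior described for getwords, the sketch below restates the function in a self-contained form and runs it on a made-up HTML snippet. The sample string is invented for illustration.

```python
import re

def getwords(html):
    # Remove anything that looks like an HTML tag
    txt = re.sub(r'<[^>]+>', '', html)
    # Split on runs of non-alphabetic characters
    words = re.split(r'[^A-Za-z]+', txt)
    # Lowercase and drop the empty strings left at the edges
    return [w.lower() for w in words if w]

# Made-up snippet (illustrative data only)
sample = '<p>Hello, <b>World</b>! Python 3 rocks.</p>'
tokens = getwords(sample)
print(tokens)
```

Digits and punctuation act as separators here, so the "3" in the sample does not appear among the resulting words.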

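# Main code
# Loop over the feeds and build the dataset. This first part reads each line of feedlist.txt, then computes the word counts for each blog and the number of blogs each word appears in (apcount).
apcount = {}
wordcounts = {}
feedList = []
with open("feedlist.txt") as lines:
    for line in lines:
        # Strip the trailing newline and skip blank lines
        line = line.strip()
        if line:
            feedList.append(line)

for feedurl in feedList:
    title, wc = getwordcounts(feedurl)
    wordcounts[title] = wc
    for word, count in wc.items():
        apcount.setdefault(word, 0)
        # A blog counts toward apcount only when the word appears in it more than once
        if count > 1:
            apcount[word] += 1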
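# Use 10% as the lower bound and 50% as the upper bound, and select the words whose fraction of blogs falls within that range

wordlist = []
for w, bc in apcount.items():
    frac = float(bc) / len(feedList)
    # The percentage filter is commented out here, so every word is kept
    #if frac < 0.5 and frac > 0.1 :
    wordlist.append(w)
print(apcount.items())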
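
For reference, here is a minimal sketch of what the 10% to 50% filter does when it is enabled. It uses four hypothetical blogs with made-up counts rather than the real feeds, and it repeats the apcount step so that it runs on its own.

```python
# Hypothetical per-blog word counts (illustrative data only)
wordcounts = {
    'Blog A': {'the': 9, 'python': 3, 'qt': 1},
    'Blog B': {'the': 7, 'python': 2},
    'Blog C': {'the': 5, 'cluster': 4},
    'Blog D': {'the': 6, 'android': 2},
}

# Number of blogs in which each word appears more than once
apcount = {}
for wc in wordcounts.values():
    for word, count in wc.items():
        apcount.setdefault(word, 0)
        if count > 1:
            apcount[word] += 1

# Keep words whose blog fraction lies strictly between 10% and 50%
wordlist = []
for w, bc in apcount.items():
    frac = bc / len(wordcounts)
    if 0.1 < frac < 0.5:
        wordlist.append(w)

print(sorted(wordlist))
```

With four feeds, as in feedlist.txt above, the only fraction strictly between the two bounds is 0.25, so the filter keeps just the words counted in exactly one blog. Very common words such as "the" and words that never occur more than once in any blog are both dropped.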
# Use the word list and the blog list above to create a text file containing one large matrix that records the count of every word for each blog:
# The book calls open directly; I changed it to the with open form

with open('blogdata.txt', 'w') as out:
    out.write('Blog')
    for word in wordlist: out.write('\t%s' % word)
    out.write('\n')
    for blog, wc in wordcounts.items():
        out.write(blog)
        for word in wordlist:
            if word in wc: out.write('\t%d' % wc[word])
            else: out.write('\t0')
        # End each blog's row with a newline
        out.write('\n')
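
The tab-separated layout of blogdata.txt can be checked by reading it back into row names, column names, and a matrix of counts. The sketch below is one possible way to do that, not code from this post: the helper name read_matrix is hypothetical, and it parses a made-up in-memory sample in the same layout rather than the real file.

```python
import io

def read_matrix(lines):
    # Drop blank lines and trailing newlines
    rows = [line.rstrip('\n') for line in lines if line.strip()]
    # First line holds the column titles: 'Blog' followed by the words
    colnames = rows[0].split('\t')[1:]
    rownames, data = [], []
    for row in rows[1:]:
        parts = row.split('\t')
        # First field is the blog title, the rest are counts
        rownames.append(parts[0])
        data.append([int(x) for x in parts[1:]])
    return rownames, colnames, data

# Made-up sample in the same layout as blogdata.txt (illustrative data only)
sample = io.StringIO('Blog\tpython\tqt\nBlog A\t3\t0\nBlog B\t1\t4\n')
rownames, colnames, data = read_matrix(sample)
print(rownames, colnames, data)
```

To read the real output, pass an open file object instead, for example `with open('blogdata.txt') as f: read_matrix(f)`. Each blog must sit on its own line for this to work, which is why the newline is written once per blog row.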