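Drawing Word Clouds in Python: What Top Bloggers Write About
In "Python Visualization", we used Matplotlib to plot the relationship between publication year and number of posts for several well-known bloggers. Next, let's look at which topics those posts focus on.
We segment only the article titles, count the resulting words, and draw a word cloud from the counts. We tried both jieba and thulac for Chinese word segmentation; the results were broadly similar, but thulac was noticeably slower.
Redefining walk_tree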
import jieba
# import thulac  # only needed for the thulac variant below

# keyword_dict, article_dict and CSDN_BLOG_URL are assumed to be defined elsewhere in the script
def walk_tree_j(html, num):
    for li in html.find_all("li"):
        num = num + 1
        print("%s %s %s%s" % (num, li.h3.a.string, CSDN_BLOG_URL, li.h3.a["href"]))
        # Segment the article title into words
        k_list = jieba.cut(li.h3.a.string)
        # k_list = thulac.thulac().cut(li.h3.a.string)
        for keyword in k_list:
            # for thulac
            # keyword = str.strip(keyword[0])
            # for jieba
            keyword = str.strip(keyword)
            if len(keyword) < 2:
                # Skip single-character tokens
                pass
            elif keyword_dict.get(keyword, 0) == 0:
                keyword_dict[keyword] = 1
            else:
                keyword_dict[keyword] = keyword_dict[keyword] + 1
        for d in li.find_all("div"):
            if "class" in d.attrs and str.strip(d["class"][0]) == "unit-control":
                print(d.div.find_all("div")[0].string + ", published: " + d.div.find_all("div")[1].string + ", views: " +
                      d.div.find_all("div")[2].span.string + ", comments: " + d.div.find_all("div")[3].span.string)
                # Count articles per publication year
                t_value = d.div.find_all("div")[1].string
                year = int(str.strip(t_value)[0:4])
                if article_dict.get(year, 0) == 0:
                    article_dict[year] = 1
                else:
                    article_dict[year] = article_dict[year] + 1
    print(keyword_dict)
    return num
The part that segments the article titles is shown below. Note that, to keep things simple, we only discard single-character tokens:
k_list = jieba.cut(li.h3.a.string)
# k_list = thulac.thulac().cut(li.h3.a.string)
for keyword in k_list:
    # for thulac
    # keyword = str.strip(keyword[0])
    # for jieba
    keyword = str.strip(keyword)
    if len(keyword) < 2:
        pass
    elif keyword_dict.get(keyword, 0) == 0:
        keyword_dict[keyword] = 1
    else:
        keyword_dict[keyword] = keyword_dict[keyword] + 1
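The same counting can be written more compactly with `collections.Counter` from the standard library. The sketch below is only an illustration: `count_keywords` and the sample token list are hypothetical names, and the tokens stand in for what `jieba.cut` might return for a title.

```python
from collections import Counter


def count_keywords(tokens, counts=None, min_len=2):
    """Add tokens to counts, skipping tokens shorter than min_len after stripping."""
    if counts is None:
        counts = Counter()
    for token in tokens:
        keyword = token.strip()
        if len(keyword) >= min_len:
            counts[keyword] += 1
    return counts


# Hypothetical tokens, standing in for the output of jieba.cut on article titles
sample_tokens = ["Python", " ", "詞雲", "圖", "繪製", "Python", "的", "視覺化"]
keyword_counts = count_keywords(sample_tokens)
print(keyword_counts.most_common(3))
```

Because `Counter` is a `dict` subclass, the result can be used wherever the plain `keyword_dict` is used.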
Once we have the segmentation results, we use wordcloud to draw the word cloud:
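from os import path

import matplotlib.pyplot as plt
import multidict
from matplotlib.pyplot import imread  # assumption: the original import for imread is not shown
from wordcloud import WordCloud


def generate_dict(dic):
    # Copy the keyword counts into a MultiDict
    fullTermsDict = multidict.MultiDict()
    for key, value in dic.items():
        fullTermsDict.add(key, value)
    return fullTermsDict


image_path = '1.jpg'
d = path.dirname(__file__)
# The image is used as the mask for the cloud
image = imread(path.join(d, image_path))
# font_path points to a font that contains Chinese glyphs
wc = WordCloud(background_color="white", max_words=1000, font_path="C:/Windows/Fonts/simkai.ttf", mask=image)
# generate word cloud
wc.generate_from_frequencies(generate_dict(keyword_dict))
# show
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()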
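Before rendering, it can help to check which keywords will dominate the image, since the word cloud sizes words by their counts. A minimal sketch, assuming the counts are a plain dict like `keyword_dict`; `top_keywords` and the sample values are hypothetical and not part of the original script.

```python
def top_keywords(counts, n=10):
    """Return the n most frequent (keyword, count) pairs, highest count first."""
    # Sort by count descending, then by keyword so ties come out in a stable order
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


# Hypothetical counts; in the script this would be keyword_dict
sample_counts = {"Python": 12, "Java": 7, "Spring": 7, "Linux": 3}
for word, count in top_keywords(sample_counts, n=3):
    print(word, count)
```

If the top entries turn out to be uninformative filler words, that is a hint that filtering only single-character tokens is too coarse for the data at hand.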