Scraping JSON Content to Generate a Word Cloud (Stepping in Pit After Pit)
阿新 · Published: 2018-04-30
This post scrapes the titles from the first n pages of front-end search results on Juejin. Analyzing the article titles shows what people care about in front-end development, and what the recent hot topics are.
- Import the libraries
```python
import requests
import re
from bs4 import BeautifulSoup
import json
import urllib.request  # import the submodule; a bare "import urllib" breaks urlopen in Python 3
import jieba
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import numpy as np
import xlwt
import jieba.analyse
from PIL import Image, ImageSequence
```
- Scrape the JSON
```python
# Scrape the JSON behind the dynamic page
response = urllib.request.urlopen(ajaxUrl)
ajaxres = response.read().decode('utf-8')
json_str = json.dumps(ajaxres)   # encode
strdata = json.loads(json_str)   # decode
data = eval(strdata)             # convert the str to a dict
```
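Incidentally, the dumps → loads round trip above just re-encodes and then decodes the same string, and `eval` on a body fetched from the network is fragile (it chokes on JSON's `true`/`false`/`null`) as well as unsafe. As a minimal sketch of the same step, assuming the endpoint returns a plain JSON body, `requests` (already imported in this project) can parse it in one call:

```python
import requests

# requests decodes the JSON body directly; no dumps/loads/eval round trip needed
data = requests.get(ajaxUrl).json()
print(type(data))  # dict
```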
- Loop over the titles, print them, and write them to a file
```python
for i in range(0, 25):
    ajaxUrl = ajaxUrlBegin + str(i) + ajaxUrlLast
    # (fetch and parse this page's JSON as in the previous step, then:)
    for j in range(0, 19):
        result = data['d'][j]['title']
        print(result + '\n')
        f = open('finally.txt', 'a', encoding='utf-8')
        f.write(result)
        f.close()
```
- Generate the word cloud
```python
# Word frequency count
f = open('finally.txt', 'r', encoding='utf-8')
text = f.read()  # renamed from "str" so the built-in isn't shadowed
stringList = list(jieba.cut(text))
symbol = {"/", "(", ")", " ", ";", "!", "、", ":", "+", "?", " ", ")", "(", "?", ",", "之", "你", "了", "嗎", "】", "【"}
stringSet = set(stringList) - symbol
title_dict = {}
for w in stringSet:
    title_dict[w] = stringList.count(w)
```
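The complete code below imports `jieba.analyse` but never actually uses it. As a hedged alternative sketch, jieba's built-in TF-IDF extractor can do the noise filtering and the weighting in one call (the `topK=100` value here is an arbitrary choice, not from the original post):

```python
import jieba.analyse

with open('finally.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# extract_tags drops punctuation/stop words itself and returns (word, weight) pairs
keywords = jieba.analyse.extract_tags(text, topK=100, withWeight=True)
title_dict = {word: weight for word, weight in keywords}
```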
- One project, n pits, and you can be stuck in a single pit for ten thousand years
- Getting the actual content of a dynamic page
When scraping a dynamic page, the titles can't be found directly in the HTML; you have to dig through the Network tab in the developer tools. What you find there is the JSON data returned by the ajax request.
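For this post, the endpoint dug out of the Network tab is the search-merger URL used in the complete code below. A minimal sketch of requesting one page of it and peeking at the structure, assuming the endpoint still responds (the `'d'` key and `'title'` field come from that code):

```python
import json
import urllib.request

# The ajax endpoint found in the Network tab; page is the 0-based page index
ajaxUrl = ('https://search-merger-ms.juejin.im/v1/search'
           '?query=%E5%89%8D%E7%AB%AF&page=0&raw_result=false&src=web')
response = urllib.request.urlopen(ajaxUrl)
data = json.loads(response.read().decode('utf-8'))
print(list(data.keys()))      # the result entries live under 'd'
print(data['d'][0]['title'])  # title of the first hit
```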
- Getting a specific piece of data out of the JSON
After we fetch the JSON data (via the URL), we take a look at it and find...
(wtf, what on earth is this???)
At this point we can install a Chrome extension called JSONView, and once it's in place, the response finally speaks a human language!
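If you'd rather not install a browser extension, Python can do the same un-mangling; a small sketch that pretty-prints whatever dict the fetch above returned:

```python
import json

# indent=2 exposes the nesting; ensure_ascii=False keeps Chinese titles readable
print(json.dumps(data, indent=2, ensure_ascii=False))
```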
- Next up: installing wordcloud
I won't go over this one (writing it up would just add one more useless answer to the pile already online). If you want to know how to fix it, head next door, and next door to that, and next door again, to Lao Huang's blog. (芬達 is awesome.)
- Complete code
```python
import requests
import re
from bs4 import BeautifulSoup
import json
import urllib.request
import jieba
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import numpy as np
import xlwt
import jieba.analyse
from PIL import Image, ImageSequence

# The titles aren't in this static HTML (see the pits above); kept from the original exploration
url = 'https://juejin.im/search?query=前端'
res = requests.get(url)
res.encoding = "utf-8"
soup = BeautifulSoup(res.text, "html.parser")

# Loop over n pages of the ajax endpoint
ajaxUrlBegin = 'https://search-merger-ms.juejin.im/v1/search?query=%E5%89%8D%E7%AB%AF&page='
ajaxUrlLast = '&raw_result=false&src=web'
for i in range(0, 25):
    ajaxUrl = ajaxUrlBegin + str(i) + ajaxUrlLast
    # Scrape the JSON behind the dynamic page
    response = urllib.request.urlopen(ajaxUrl)
    ajaxres = response.read().decode('utf-8')
    data = json.loads(ajaxres)  # parse the JSON string into a dict
    # Write this page's titles to the file before fetching the next page
    for j in range(0, 19):
        result = data['d'][j]['title']
        print(result + '\n')
        f = open('finally.txt', 'a', encoding='utf-8')
        f.write(result)
        f.close()

# Word frequency count
f = open('finally.txt', 'r', encoding='utf-8')
text = f.read()  # don't shadow the built-in str
stringList = list(jieba.cut(text))
symbol = {"/", "(", ")", " ", ";", "!", "、", ":", "+", "?", " ", ")", "(", "?", ",", "之", "你", "了", "嗎", "】", "【"}
stringSet = set(stringList) - symbol
title_dict = {}
for w in stringSet:
    title_dict[w] = stringList.count(w)
print(title_dict)

# Export to Excel
di = title_dict
wbk = xlwt.Workbook(encoding='utf-8')
sheet = wbk.add_sheet("wordCount")  # sheet name
k = 0
for item in di.items():
    sheet.write(k, 0, label=item[0])
    sheet.write(k, 1, label=item[1])
    k = k + 1
wbk.save('前端數據.xls')  # save the word counts as an .xls file

font = r'C:\Windows\Fonts\simhei.ttf'
content = ' '.join(title_dict.keys())
# Generate the word cloud in the shape of an image
image = np.array(Image.open('cool.jpg'))
wordcloud = WordCloud(background_color='white', font_path=font, mask=image,
                      width=1000, height=860, margin=2).generate(content)
# Display the generated word cloud
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
wordcloud.to_file('c-cool.jpg')
```
(word cloud image)