[Python3 Crawler: Scraping Articles] Crawling 100,000 Articles from IT之家 (ithome.com)
阿新 • Published: 2018-11-20
-
1. Key techniques:
- Multithreading;
- Python3 GET requests (to fetch comment counts) and POST requests (to push scraped data to the database);
- Data persistence (storing the scraped data in MySQL).
2. Process:
I used brute-force enumeration (why brute force? because article URLs end in numbers but follow no obvious pattern), spawning 600 threads for roughly 900,000 requests in total, with a 0.05-second delay per request; the whole crawl took 47 hours.
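The enumeration itself is easy to picture. Here is a minimal sketch of the URL scheme, with the `week` and `number` ranges taken from the full code later in this post:

```python
# Brute-force URL enumeration (ranges match the full crawler below)
base = 'https://www.ithome.com/0/'
for week in range(0, 600):            # one crawler thread per week index
    for number in range(1, 1500):     # candidate article numbers within that week
        candidate = base + str(week) + '/' + str(number) + '.htm'
        # e.g. 'https://www.ithome.com/0/400/123.htm' -- fetch and parse it if it exists
```

600 week indexes times roughly 1,500 candidate numbers is where the 900,000-request total comes from.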
- BS4 walks the DOM and filters HTML nodes to pull out the text we want;
- GET request, which returns the result directly:
```python
# GET
import requests

get_response = requests.get(url=' ')  # endpoint left blank in the original; the full code below uses ithome's comment-count API
res = get_response.text
```
This returns the comment count.
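For reference, the full crawler below pulls the numeric count out of the response body with a regular expression. A standalone sketch (the `newsid` value here is illustrative; in the full code it is the week index and article number concatenated):

```python
import re
import requests

# Comment-count endpoint used by the full crawler below; '4001' is an example newsid
get_response = requests.get(url='https://dyn.ithome.com/api/comment/count?newsid=4001')
it_comment = re.findall(r'(\d+)', get_response.text)[-1]  # last number in the body is the count
print(it_comment)
```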
-
POST request; calling the backend API is all it takes to write the data to the database:
```python
import requests

post_url = 'http://192.168.13.103/laravel55/public/index.php/it_pages2'
data = {
    "it_title": it_title,
    "it_url": it_url,
    "it_week": it_week,
    "it_number": it_number,
    "it_time": it_time,
    "it_source": it_source,
    "it_editor": it_editor,
    "it_comment": it_comment,
}
post = requests.post(url=post_url, data=data,
                     headers={'Content-Type': 'application/x-www-form-urlencoded'})
```
This saves the data to the database.
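To confirm the write succeeded, it can help to check the response before moving on. A small optional addition (my own, not in the original script), using plain requests calls:

```python
# Optional: verify the API accepted the record (my addition, not in the original)
if post.status_code == 200:
    print('Saved: ' + post.text)
else:
    print('Save failed, HTTP ' + str(post.status_code))
```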
For the backend I used the PHP framework Laravel 5.5; writing a single POST endpoint is all it takes. I looked into Django and found it a bit of a hassle: configuring the database is a hassle, CRUD is a hassle, setting up a project is a hassle, going distributed is a hassle; all in all, a hassle.
3. Complete Python 3.x crawler code:
```python
from bs4 import BeautifulSoup
import urllib.request
import requests
import re
import time
import _thread


# Scrape a single article
def a_page(week, number, weburl):
    it_url = weburl + str(week) + '/' + str(number) + '.htm'  # article URL
    print('\nArticle URL: ' + it_url)
    try:
        # Fetch the page
        response = urllib.request.urlopen(it_url)
        html = response.read().decode('utf-8')  # encoding may be gb2312, utf-8, or GBK
        html_string = str(html)  # as a string it can go straight to the parser
        soup = BeautifulSoup(html_string, "html.parser")  # parse the HTML

        title_div = soup.find_all('div', attrs={'class': 'post_title'})[0]
        it_title = title_div.find('h1').string  # title
        it_time = title_div.find_all(id='pubtime_baidu')[0].string  # publication time
        it_source = title_div.find_all(id='source_baidu')[0].find('a').string  # source
        it_editor = title_div.find_all(id='editor_baidu')[0].find('strong').string  # editor
        get_response = requests.get(
            url='https://dyn.ithome.com/api/comment/count?newsid=' + str(week) + str(number))
        it_comment = re.findall(r'(\d+)', get_response.text)[-1]  # comment count
        it_week = week  # week index
        it_number = number  # article number within the week

        print(it_title)
        print(it_url)
        print(it_week)
        print(it_number)
        print(it_time)
        print(it_source)
        print(it_editor)
        print(it_comment)
        print('Article scraped, number=' + str(number))

        # Persist the article via the backend API
        post_url = 'http://192.168.13.103/laravel55/public/index.php/it_pages2'
        data = {
            "it_title": it_title,
            "it_url": it_url,
            "it_week": it_week,
            "it_number": it_number,
            "it_time": it_time,
            "it_source": it_source,
            "it_editor": it_editor,
            "it_comment": it_comment,
        }
        post = requests.post(url=post_url, data=data,
                             headers={'Content-Type': 'application/x-www-form-urlencoded'})
        print('Data saved, res=' + post.text)
    except Exception:
        print('No article found, number=' + str(number) + '; week=' + str(week))
    # 0.05 s per request adds up to ~12.5 hours over the whole crawl; it keeps the
    # wifi router from being overwhelmed and lowers the failure rate.
    time.sleep(0.05)


# Scrape all articles for one week index
def week_page(week, weburl):
    for number in range(1, 1500):  # upper bound for article numbers within a week
        print('\nArticle No. ' + str(number))
        a_page(week, number, weburl)


all_thread_num = 0


# Helper that counts started threads (left over from testing; not used by the crawl loop)
def page_class(cla, that_num):
    print("Thread started=" + str(that_num))
    global all_thread_num
    all_thread_num += 1
    print("Total threads=" + str(all_thread_num))
    for page in range(1, 30):
        print("inner=" + str(page))


# Spawn one thread per week index
for cla in range(0, 600):
    try:
        _thread.start_new_thread(week_page, (cla, 'https://www.ithome.com/0/'))
    except Exception:
        print("Failed to start thread")

# Keep the main thread alive so the worker threads can run
while 1:
    pass
```
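As a side note, `_thread` plus the trailing `while 1` busy-loop keeps the process alive, but the standard library's `concurrent.futures` achieves the same fan-out and exits cleanly. A hedged alternative sketch, reusing the `week_page` function above:

```python
from concurrent.futures import ThreadPoolExecutor

# Alternative to _thread.start_new_thread: a pool that joins automatically.
# Assumes week_page() from the script above; 600 workers mirrors the original thread count.
with ThreadPoolExecutor(max_workers=600) as pool:
    for week in range(0, 600):
        pool.submit(week_page, week, 'https://www.ithome.com/0/')
# No busy-wait needed: the with-block waits for every submitted task to finish.
```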
MySQL dump download:
Link 1: https://download.csdn.net/download/weixin_41827162/10795556
Link 2: https://makeoss.oss-cn-hangzhou.aliyuncs.com/it/it_pages2.sql
-