
Handling a crawler with multiple threads

This crawler scrapes part of a website. Because there are many pages, it uses multiple threads to speed up the crawl. The complete code is as follows:

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import requests
from bs4 import BeautifulSoup as Bs4
import threading

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36",
    "Accept-Language": "zh-CN,zh;q=0.9"
}

# Lock to serialize file writes from concurrent threads
write_lock = threading.Lock()

def url_text(url, n):
    try:
        response = requests.get(url, headers=headers)
        response.encoding = 'utf-8'
        soup = Bs4(response.text, 'lxml')
        for item in soup.select(".list-item"):
            data_dict = {}  # fresh dict per item so fields from the previous item don't leak
            n = n + 1
            data_dict["id"] = n
            for i in item.select(".dp-b"):
                data_dict["name"] = i.text.strip()
            for j in item.select(".content-img"):
                data_dict["data"] = j.text.strip()
            print(data_dict)
            with write_lock:
                with open("smiles_0716.txt", "a+", encoding="utf-8") as f:
                    f.write(str(data_dict) + "\n")
    except Exception as e:
        print("Request failed:", e)
# How this runs:
# - You write the code.
# - You hand it to the interpreter: python thread1.py
# - The interpreter reads the code and hands it to the operating system, which
#   creates as many threads/processes as your code asks for (here: a single
#   process with multiple threads).
# - The operating system drives the hardware: disk, CPU, network card...
if __name__ == "__main__":
    n = 0
    threads = []
    for num in range(1, 20):
        url = "https://www.xxx.com/index_{}.html".format(num)
        t = threading.Thread(target=url_text, args=(url, n))
        t.start()
        threads.append(t)
        n = n + 10  # each page holds 10 items, so offset the id base per page
    for t in threads:
        t.join()  # wait for every page to finish before exiting
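An alternative to starting one `Thread` per page is a thread pool, which caps how many requests run at once instead of hitting the site with 19 simultaneous connections. Below is a minimal sketch using the standard-library `concurrent.futures` module; `fetch_page` is a hypothetical stand-in for the real `url_text` function above, returning a fake item count so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_page(num):
    # Stand-in for requests.get + parsing; pretend each page yields 10 items.
    return num, 10

# At most 5 pages are fetched concurrently; the pool reuses its worker threads.
with ThreadPoolExecutor(max_workers=5) as pool:
    futures = [pool.submit(fetch_page, num) for num in range(1, 20)]
    results = [f.result() for f in futures]  # gathers results in submission order

total = sum(count for _, count in results)
print(total)  # 19 pages x 10 items
```

`ThreadPoolExecutor` also removes the need for manual `start()`/`join()` bookkeeping: exiting the `with` block waits for all submitted tasks to complete.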