Multithreaded crawling
阿新 • Published: 2021-07-16
This scrapes part of a site's content. Because there are many pages, multiple threads are used to speed up the crawl. The full code is below.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup as Bs4
import threading

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36",
    "Accept-Language": "zh-CN,zh;q=0.9",
}

def url_text(url, n):
    response = requests.get(url, headers=headers)
    response.encoding = "utf-8"
    try:
        soup = Bs4(response.text, "lxml")
        for urls in soup.select(".list-item"):
            data_dict = {}          # fresh dict per item so records don't share state
            n = n + 1
            data_dict["id"] = n
            for i in urls.select(".dp-b"):
                data_dict["name"] = i.text.strip()
            for j in urls.select(".content-img"):
                data_dict["data"] = j.text.strip()
            print(data_dict)
            # note: concurrent appends from several threads may interleave lines
            with open("smiles_0716.txt", "a+", encoding="utf-8") as f:
                f.write(str(data_dict) + "\n")
    except Exception as e:
        print("request failed:", e)

if __name__ == "__main__":
    # - You write the code.
    # - You hand it to the interpreter: python thread1.py
    # - The interpreter reads the code and passes it to the operating system, which
    #   creates however many threads/processes the code asks for (here: a single
    #   process with multiple threads).
    # - The operating system drives the hardware: disk, CPU, network card ...
    n = 0
    threads = []
    for num in range(1, 20):
        url = "https://www.xxx.com/index_{}.html".format(num)
        t = threading.Thread(target=url_text, args=(url, n))
        t.start()
        threads.append(t)
        n = n + 10            # each page is assumed to hold 10 items, so ids don't collide
    for t in threads:
        t.join()              # wait for every page to finish before exiting
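The listing above has every thread append to the same file and pre-partitions the id counter by guessing 10 items per page. A more robust pattern is a thread pool plus explicit locks: one lock protects the shared id counter, another serializes writes to the shared sink. The sketch below is a hedged illustration of that pattern only; the page data is simulated with local lists (the real site, `requests` fetch, and CSS selectors are not reproducible here), and `results` stands in for the output file.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

counter_lock = threading.Lock()   # protects the shared id counter
write_lock = threading.Lock()     # serializes writes to the shared sink
next_id = 0
results = []                      # stands in for the output file

def process_page(page_items):
    # page_items simulates the parsed .list-item elements of one page;
    # in the real crawler it would come from requests + BeautifulSoup.
    global next_id
    for item in page_items:
        with counter_lock:
            next_id += 1
            record = {"id": next_id, "name": item}
        with write_lock:
            results.append(record)   # real crawler: f.write(str(record) + "\n")

# three simulated pages with different item counts
pages = [["a", "b"], ["c"], ["d", "e", "f"]]
with ThreadPoolExecutor(max_workers=4) as pool:
    pool.map(process_page, pages)    # the pool joins its workers on exit

print(len(results))  # 6
```

Because ids are handed out under a lock as items are processed, pages no longer need to contain exactly 10 items each, and the `with` block on the executor replaces the manual start/join loop.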