How to improve crawler efficiency in Python
阿新 • Published 2020-09-29
Single thread + multi-task asynchronous coroutines
- Coroutine
  When a function (a "special function") is defined with the async keyword, calling it does not execute its body immediately; the call returns a coroutine object instead.
- Task object
  A task object is a higher-level coroutine object (a further layer of wrapping around the special function's coroutine).
  Task objects must be registered with the event loop object.
  A callback can be bound to a task object; in a crawler, the callback is where data parsing happens.
- Event loop
  Think of it as a container that holds task objects.
  When the event loop object is started, the task objects stored in it execute asynchronously.
- Inside a special function, do not use modules that lack async support (e.g. time, requests...); no error is raised, but the blocking calls quietly defeat the asynchrony. Swap in the async equivalents (see the sketch after this list):
  time.sleep  --  asyncio.sleep
  requests    --  aiohttp
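A minimal sketch of how these pieces fit together (the function name hello is illustrative; the pre-3.10 loop-management style matches the examples below):

import asyncio

async def hello():    # async makes this a "special function"
    await asyncio.sleep(1)
    return 'done'

c = hello()           # nothing runs yet: c is a coroutine object
print(type(c))        # <class 'coroutine'>

task = asyncio.ensure_future(c)           # wrap the coroutine in a task object
loop = asyncio.get_event_loop()           # the event loop is the task container
result = loop.run_until_complete(task)    # starting the loop runs the task
print(result)                             # done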
import asyncio
import time

start_time = time.time()

async def get_request(url):
    await asyncio.sleep(2)
    print(url, 'download complete!')

urls = [
    'www.1.com',
    'www.2.com',
]

task_lst = []  # list of task objects
for url in urls:
    c = get_request(url)               # coroutine object
    task = asyncio.ensure_future(c)    # task object
    # task.add_done_callback(...)      # bind a callback
    task_lst.append(task)

loop = asyncio.get_event_loop()        # event loop object
loop.run_until_complete(asyncio.wait(task_lst))  # register the tasks and suspend until done
print('Total time:', time.time() - start_time)
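On Python 3.7+ the same program can drop the explicit loop management in favor of asyncio.run and asyncio.gather; a minimal equivalent sketch (not from the original article):

import asyncio
import time

async def get_request(url):
    await asyncio.sleep(2)
    print(url, 'download complete!')

async def main():
    urls = ['www.1.com', 'www.2.com']
    # gather schedules every coroutine as a task and waits for all of them
    await asyncio.gather(*(get_request(url) for url in urls))

start_time = time.time()
asyncio.run(main())   # creates, runs, and closes the event loop
print('Total time:', time.time() - start_time)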
Thread pool + the requests module
# thread pool
import time
from multiprocessing.dummy import Pool

start_time = time.time()

url_list = [
    'www.1.com',
    'www.3.com',
]

def get_request(url):
    print('Downloading...', url)
    time.sleep(2)
    print('Download complete!', url)

pool = Pool(3)                     # pool of 3 worker threads
pool.map(get_request, url_list)    # dispatch each url to a worker
print('Total time:', time.time() - start_time)
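Despite the package name, multiprocessing.dummy.Pool is a thread pool. The standard library's concurrent.futures offers the same pattern; an equivalent sketch, assuming the same two placeholder URLs:

import time
from concurrent.futures import ThreadPoolExecutor

start_time = time.time()

url_list = ['www.1.com', 'www.3.com']

def get_request(url):
    print('Downloading...', url)
    time.sleep(2)
    print('Download complete!', url)

# map dispatches each url to a worker thread, just like Pool.map;
# leaving the with-block waits for every task to finish
with ThreadPoolExecutor(max_workers=3) as executor:
    list(executor.map(get_request, url_list))

print('Total time:', time.time() - start_time)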
Two approaches to speeding up a crawler
Spin up a Flask server (as a local test target)
from flask import Flask
import time

app = Flask(__name__)

@app.route('/bobo')
def index_bobo():
    time.sleep(2)
    return 'hello bobo!'

@app.route('/jay')
def index_jay():
    time.sleep(2)
    return 'hello jay!'

@app.route('/tom')
def index_tom():
    time.sleep(2)
    return 'hello tom!'

if __name__ == '__main__':
    app.run(threaded=True)
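Note that these routes return bare strings, while the xpath example further down looks for a <ul> in the page body. A hypothetical extra route (the path /bobo_html and its markup are assumptions, not from the original) that would give that xpath something to match:

# added to the Flask app above
@app.route('/bobo_html')
def index_bobo_html():
    time.sleep(2)
    # hypothetical markup so '/html/body/ul//text()' finds text nodes
    return '<html><body><ul><li>bobo 1</li><li>bobo 2</li></ul></body></html>'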
aiohttp module + single-thread multi-task async coroutines
import asyncio
import aiohttp
import requests  # only used by the commented-out synchronous variant below
import time

start = time.time()

async def get_page(url):
    # requests is synchronous and would block the whole event loop:
    # page_text = requests.get(url=url).text
    # print(page_text)
    # return page_text
    async with aiohttp.ClientSession() as s:  # create a session object
        async with await s.get(url=url) as response:
            page_text = await response.text()
            print(page_text)
            return page_text

urls = [
    'http://127.0.0.1:5000/bobo',
    'http://127.0.0.1:5000/jay',
    'http://127.0.0.1:5000/tom',
]

tasks = []
for url in urls:
    c = get_page(url)
    task = asyncio.ensure_future(c)
    tasks.append(task)

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))

end = time.time()
print(end - start)

# executed asynchronously!
# hello tom!
# hello bobo!
# hello jay!
# 2.0311079025268555
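Real sites often throttle aggressive clients, so unbounded concurrency can backfire. One common refinement, not shown in the original, is to share a single ClientSession and cap in-flight requests with asyncio.Semaphore; a sketch under those assumptions (the limit of 2 is illustrative):

import asyncio
import aiohttp
import time

MAX_CONCURRENCY = 2  # illustrative cap on simultaneous requests

async def get_page(sem, session, url):
    async with sem:  # at most MAX_CONCURRENCY requests in flight
        async with session.get(url=url) as response:
            return await response.text()

async def main():
    urls = [
        'http://127.0.0.1:5000/bobo',
        'http://127.0.0.1:5000/jay',
        'http://127.0.0.1:5000/tom',
    ]
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    # one shared session is cheaper than one session per request
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(get_page(sem, session, u) for u in urls))
    for page in pages:
        print(page)

start = time.time()
asyncio.run(main())
print(time.time() - start)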
'''
aiohttp module: single thread + multi-task async coroutines,
with xpath for data parsing
'''
import aiohttp
import asyncio
from lxml import etree
import time

start = time.time()

# special function: sends the request and captures the data
# note the async with / await keywords
async def get_request(url):
    async with aiohttp.ClientSession() as s:
        async with await s.get(url=url) as response:
            page_text = await response.text()
            return page_text  # return the page source

# callback function: parses the data
def parse(task):
    page_text = task.result()
    tree = etree.HTML(page_text)
    # the route must return HTML containing a <ul> for this xpath to match
    msg = tree.xpath('/html/body/ul//text()')
    print(msg)

urls = [
    'http://127.0.0.1:5000/bobo',
]

tasks = []
for url in urls:
    c = get_request(url)
    task = asyncio.ensure_future(c)
    task.add_done_callback(parse)  # bind the callback!
    tasks.append(task)

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))

end = time.time()
print(end - start)
requests module + thread pool
import time
import requests
from multiprocessing.dummy import Pool

start = time.time()

urls = [
    'http://127.0.0.1:5000/bobo',
    'http://127.0.0.1:5000/jay',
    'http://127.0.0.1:5000/tom',
]

def get_request(url):
    page_text = requests.get(url=url).text
    print(page_text)
    return page_text

pool = Pool(3)
pool.map(get_request, urls)

end = time.time()
print('Total time:', end - start)

# requests run concurrently
# hello jay!
# hello bobo!
# hello tom!
# Total time: 2.0467123985290527
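Pool.map blocks until every URL has been fetched. To mirror the add_done_callback pattern from the coroutine version, the pool's apply_async can hand each page to a callback as soon as it arrives; a hedged sketch (the parse function here is illustrative):

import time
import requests
from multiprocessing.dummy import Pool

start = time.time()

urls = [
    'http://127.0.0.1:5000/bobo',
    'http://127.0.0.1:5000/jay',
    'http://127.0.0.1:5000/tom',
]

def get_request(url):
    return requests.get(url=url).text

def parse(page_text):  # invoked with each result as it completes
    print('parsed:', page_text.strip())

pool = Pool(3)
for url in urls:
    # apply_async returns immediately; parse receives the return value
    pool.apply_async(get_request, args=(url,), callback=parse)
pool.close()
pool.join()  # wait for all requests and callbacks to finish

print('Total time:', time.time() - start)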
Summary
- Two ways to speed up a crawler so far:
  aiohttp module + single-thread multi-task async coroutines
  requests module + thread pool
- Three modules encountered for making requests:
  requests
  urllib
  aiohttp
- Also briefly used Flask to spin up a test server
That concludes this look at how to improve crawler efficiency in Python. For more on the topic, see our other related articles!