Python asynchronous crawler: an aiohttp example
By 阿新 • Published: 2021-08-11
1. Preface
A previous post recorded my personal understanding of asynchrony: https://www.cnblogs.com/rainbow-tan/p/15081118.html
That post demonstrated asynchronous behavior only with asyncio.sleep(); the code below walks through a real asynchronous crawler.
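Before the crawler itself, the core idea can be shown in a minimal, network-free sketch: asyncio.sleep() stands in for a slow I/O call, and asyncio.gather() overlaps the waits, so three one-second "requests" finish in about one second rather than three.

```python
import asyncio
import time


async def fetch(i):
    # simulate one network request that takes 1 second
    await asyncio.sleep(1)
    return i


async def main():
    # schedule all "requests" concurrently; their waits overlap
    return await asyncio.gather(*(fetch(i) for i in range(3)))


start = time.time()
results = asyncio.run(main())
elapsed = time.time() - start
print(results, round(elapsed, 1))  # total time is ~1s, not ~3s
```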
2. The asynchronous crawler library used is aiohttp
Demonstrated functionality:
Crawl the thumbnail images from https://wall.alphacoders.com/, download them in batch, and compare the download times.
(1) First, a crawler demo with the ordinary requests library, to check the execution time
```python
import os
import time

import requests
from bs4 import BeautifulSoup


def get_html(url):
    ret = requests.get(url)
    return ret


if __name__ == '__main__':
    index = 0
    start = time.time()
    response = get_html('https://wall.alphacoders.com/')
    soup = BeautifulSoup(response.text, 'lxml')
    # each thumbnail sits in an element with class "boxgrid"
    boxgrids = soup.find_all(class_='boxgrid')
    for boxgrid in boxgrids:
        img = boxgrid.find('a').find('picture').find('img')
        link = img.attrs['src']
        # blocking download: one image at a time, each waits for the previous
        content = get_html(link).content
        picture_type = str(link).split('.')[-1]
        index += 1
        path = os.path.abspath('imgs')
        if not os.path.exists(path):
            os.makedirs(path)
        with open('{}/{}.{}'.format(path, index, picture_type), 'wb') as f:
            f.write(content)
    end = time.time()
    print(f'Downloaded {index} images in {end - start} seconds')
```
Run it:
Took 14 seconds to download 30 images.
(2) Download with the asynchronous library aiohttp
```python
import asyncio
import os
import time

import aiohttp
from aiohttp import TCPConnector
from bs4 import BeautifulSoup


async def get_html(url):
    # verify_ssl is deprecated in newer aiohttp releases; ssl=False is the modern spelling
    async with aiohttp.ClientSession(
            connector=TCPConnector(verify_ssl=False)) as session:
        async with session.get(url) as resp:
            text = await resp.text()
            soup = BeautifulSoup(text, 'lxml')
            boxgrids = soup.find_all(class_='boxgrid')
            links = []
            for boxgrid in boxgrids:
                img = boxgrid.find('a').find('picture').find('img')
                link = img.attrs['src']
                links.append(link)
            return links


async def write_file(url, index):
    async with aiohttp.ClientSession(
            connector=TCPConnector(verify_ssl=False)) as session:
        async with session.get(url) as resp:
            text = await resp.read()
            path = os.path.abspath('images')
            if not os.path.exists(path):
                os.makedirs(path)
            with open(f'{path}/{index}.{str(url).split(".")[-1]}', 'wb') as f:
                f.write(text)


if __name__ == '__main__':
    index = 0
    start = time.time()
    loop = asyncio.get_event_loop()
    # first fetch the page and collect all image links
    task = loop.create_task(get_html('https://wall.alphacoders.com/'))
    links = loop.run_until_complete(task)
    tasks = []
    for link in links:
        tasks.append(write_file(link, index))
        index += 1
    # download all images concurrently; their network waits overlap
    loop.run_until_complete(asyncio.gather(*tasks))
    end = time.time()
    print(f'Downloaded {index} images in {end - start} seconds')
```
Run it:
Downloaded 30 images in 4 seconds.
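The version above opens a new ClientSession per download and launches all requests at once, which a busy server may throttle. A common refinement is to share one session and bound concurrency with asyncio.Semaphore. The sketch below shows only the semaphore pattern, with asyncio.sleep() standing in for the real session.get() call, and the limit of 5 chosen arbitrarily:

```python
import asyncio


async def download(sem, url):
    # the semaphore allows at most `limit` downloads to run at the same time
    async with sem:
        await asyncio.sleep(0.1)  # stand-in for the actual aiohttp request
        return url


async def main(urls, limit=5):
    sem = asyncio.Semaphore(limit)
    # gather preserves input order even though tasks finish out of order
    return await asyncio.gather(*(download(sem, u) for u in urls))


urls = [f'img{i}.jpg' for i in range(10)]
results = asyncio.run(main(urls))
print(results)
```

In the real crawler, the shared ClientSession would be created once in main() and passed into download() alongside the semaphore.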
Learning links:
https://www.jianshu.com/p/20ca9daba85f
https://docs.aiohttp.org/en/stable/client_quickstart.html
https://juejin.cn/post/6857140761926828039 (not consulted for this post, but it looks excellent too; bookmarking it)