Scrapy Framework: Using Puppeteer Rendering
阿新 • Published 2020-06-19
1. The Scrapy Framework
Scrapy is an application framework written in pure Python for crawling websites and extracting structured data, and it has a very wide range of uses.
Thanks to the power of the framework, a user only needs to implement a few custom modules to easily build a spider that scrapes web pages and downloads images.
Scrapy uses the Twisted asynchronous networking framework to handle network communication, which speeds up downloads without requiring you to implement asynchrony yourself, and it provides a variety of middleware interfaces so that all kinds of requirements can be met flexibly.
- Scrapy Engine: responsible for the communication, signals, and data transfer among the Spider, Item Pipeline, Downloader, and Scheduler.
- Scheduler: accepts Request objects sent by the Engine, organizes and enqueues them in a defined order, and hands them back to the Engine when it asks for them.
- Downloader: downloads all Requests sent by the Scrapy Engine and returns the Responses it obtains to the Engine, which passes them on to the Spider for processing.
- Spider: processes all Responses, parses and extracts data from them, fills in the fields an Item needs, and submits follow-up URLs to the Engine, which feeds them back into the Scheduler.
- Item Pipeline: handles the Items produced by the Spider and performs post-processing (detailed analysis, filtering, storage, etc.).
- Downloader Middlewares: components you can think of as customizable extensions of the download functionality.
- Spider Middlewares: components for custom extension of the communication between the Engine and the Spider (e.g. Responses entering the Spider and Requests leaving it).
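The request/response cycle these components form can be sketched in plain Python. This is a conceptual illustration only, using the standard library: the class `Scheduler` below and the `fake_download`, `fake_spider_parse`, and `pipeline_process` stand-ins are our own names, not Scrapy APIs, and the real engine is asynchronous rather than a simple loop.

```python
from collections import deque

class Scheduler:
    """Queues requests for the engine, like Scrapy's Scheduler."""
    def __init__(self):
        self.queue = deque()

    def enqueue_request(self, request):
        self.queue.append(request)

    def next_request(self):
        return self.queue.popleft() if self.queue else None

def fake_download(request):
    # stands in for the Downloader: fetch the URL, return a response
    return {'url': request, 'body': '<html>...</html>'}

def fake_spider_parse(response):
    # stands in for the Spider: extract items (and follow-up URLs)
    yield {'item': {'page': response['url']}}

def pipeline_process(item):
    # stands in for an Item Pipeline: post-process and store the item
    return item

scheduler = Scheduler()
scheduler.enqueue_request('https://example.com')  # seed request
items = []
while True:
    request = scheduler.next_request()   # Engine asks the Scheduler
    if request is None:
        break
    response = fake_download(request)            # Downloader
    for output in fake_spider_parse(response):   # Spider
        if 'item' in output:
            items.append(pipeline_process(output['item']))  # Pipeline
print(items)  # → [{'page': 'https://example.com'}]
```

The point of the sketch is the division of labor: the engine only shuttles objects between components, so swapping any one component (e.g. the Downloader, as we do below with Puppeteer) leaves the rest untouched.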
2. Puppeteer Rendering
Puppeteer is a Node.js package released by the Chrome team in 2017 for driving the Chrome browser programmatically.
To crawl HTML pages rendered by JavaScript, we need a browser to execute the JS and produce the final HTML. In Scrapy, the pyppeteer port (a Python reimplementation of Puppeteer) can provide this.
Create or edit the project's middlewares.py file (./project_name/middlewares.py):
```python
import asyncio
import base64
import logging
import random
import sys
from concurrent.futures._base import TimeoutError
from logging import getLogger

import pyppeteer
import websockets
from scrapy.http import HtmlResponse

# silence the noisy pyppeteer/websockets loggers
pyppeteer_level = logging.WARNING
logging.getLogger('websockets.protocol').setLevel(pyppeteer_level)
logging.getLogger('pyppeteer').setLevel(pyppeteer_level)

PY3 = sys.version_info[0] >= 3


def base64ify(bytes_or_str):
    if PY3 and isinstance(bytes_or_str, str):
        input_bytes = bytes_or_str.encode('utf8')
    else:
        input_bytes = bytes_or_str
    output_bytes = base64.urlsafe_b64encode(input_bytes)
    if PY3:
        return output_bytes.decode('ascii')
    else:
        return output_bytes


class ProxyMiddleware(object):
    USER_AGENT = open('useragents.txt').readlines()

    def process_request(self, request, spider):
        # proxy server
        proxyHost = "t.16yun.cn"
        proxyPort = "31111"
        # proxy tunnel credentials
        proxyUser = "username"
        proxyPass = "password"
        request.meta['proxy'] = "http://{0}:{1}".format(proxyHost, proxyPort)
        # add the authentication header
        encoded_user_pass = base64ify(proxyUser + ":" + proxyPass)
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
        # set the IP-rotation header (as needed)
        tunnel = random.randint(1, 10000)
        request.headers['Proxy-Tunnel'] = str(tunnel)
        request.headers['User-Agent'] = random.choice(self.USER_AGENT)


class PyppeteerMiddleware(object):
    def __init__(self, **args):
        """Init logger, event loop, and browser."""
        self.logger = getLogger(__name__)
        self.loop = asyncio.get_event_loop()
        self.browser = self.loop.run_until_complete(
            pyppeteer.launch(headless=True))
        self.args = args

    def __del__(self):
        """Close the event loop."""
        self.loop.close()

    def render(self, url, retries=1, script=None, wait=0.3, scrolldown=False,
               sleep=0, timeout=8.0, keep_page=False):
        """Render a page with pyppeteer.

        :param url: page url
        :param retries: max retry times
        :param script: js script to evaluate
        :param wait: seconds to wait before loading the page, preventing timeouts
        :param scrolldown: how many times to page down
        :param sleep: seconds to sleep after the initial render
        :param timeout: the longest wait time, otherwise raise a timeout error
        :param keep_page: keep the page open instead of closing it
        :return: (content, result, status)
        """
        async def async_render(url, script, scrolldown, sleep, wait, timeout,
                               keep_page):
            page = None
            try:
                # basic render
                page = await self.browser.newPage()
                await asyncio.sleep(wait)
                response = await page.goto(
                    url, options={'timeout': int(timeout * 1000)})
                if response.status != 200:
                    return None, None, response.status
                result = None
                # evaluate the script, if one was given
                if script:
                    result = await page.evaluate(script)
                # scroll down `scrolldown` times
                if scrolldown:
                    for _ in range(scrolldown):
                        await page.keyboard.down('PageDown')
                        await asyncio.sleep(sleep)
                    await page.keyboard.up('PageDown')
                else:
                    await asyncio.sleep(sleep)
                # get the rendered html of the page
                content = await page.content()
                return content, result, response.status
            except TimeoutError:
                return None, None, 500
            finally:
                # unless keep_page is set, close the page
                if page and not keep_page:
                    await page.close()

        content, result, status = [None] * 3
        # retry up to `retries` times
        for _ in range(retries):
            if not content:
                content, result, status = self.loop.run_until_complete(
                    async_render(url=url, script=script, sleep=sleep,
                                 wait=wait, scrolldown=scrolldown,
                                 timeout=timeout, keep_page=keep_page))
            else:
                break
        return content, result, status

    def process_request(self, request, spider):
        """Render requests flagged with meta['render'].

        :param request: request object
        :param spider: spider object
        :return: HtmlResponse
        """
        if request.meta.get('render'):
            try:
                self.logger.debug('rendering %s', request.url)
                html, result, status = self.render(request.url)
                return HtmlResponse(url=request.url, body=html,
                                    request=request, encoding='utf-8',
                                    status=status)
            except websockets.exceptions.ConnectionClosed:
                pass

    @classmethod
    def from_crawler(cls, crawler):
        return cls(**crawler.settings.get('PYPPETEER_ARGS', {}))
```
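The `Proxy-Authorization` header that `ProxyMiddleware` builds is just the standard HTTP Basic scheme: `user:password` base64-encoded and prefixed with `Basic `. A standalone snippet showing exactly what the helper produces for the placeholder credentials above (the `username:password` values are the same placeholders, not real credentials):

```python
import base64

def base64ify(bytes_or_str):
    # same helper as in the middleware above, Python 3 path only
    if isinstance(bytes_or_str, str):
        input_bytes = bytes_or_str.encode('utf8')
    else:
        input_bytes = bytes_or_str
    return base64.urlsafe_b64encode(input_bytes).decode('ascii')

header = 'Basic ' + base64ify('username:password')
print(header)  # → Basic dXNlcm5hbWU6cGFzc3dvcmQ=
```

Because the value is only encoded, not encrypted, such proxy credentials should be kept out of version control (e.g. read from settings or environment variables).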
Then modify the project settings file (./project_name/settings.py):
```python
DOWNLOADER_MIDDLEWARES = {
    'scrapypyppeteer.middlewares.PyppeteerMiddleware': 543,
    'scrapypyppeteer.middlewares.ProxyMiddleware': 100,
}
```
Then we run the crawler.
That concludes this introduction to using Puppeteer rendering with the Scrapy framework. For more on Scrapy and Puppeteer rendering, search our earlier articles or browse the related articles below. We hope you will continue to support us!