Using Selenium as a Scrapy downloader middleware
By 阿新 • Published: 2020-12-30
While working on a Scrapy project recently, I needed to render JavaScript-driven dynamic pages, so I built a Selenium-based rendering downloader middleware.
1. When sending a request, the spider needs to carry a flag in meta:

    Request(
        url=url,
        meta={"is_selenium": True},
        callback=self.parse,
        dont_filter=True,
    )
2. The middleware code:

    import time

    from scrapy import signals
    from scrapy.http import HtmlResponse
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException


    class SeleniumMiddleware(object):
        """Render pages with Selenium."""

        def __init__(self):
            self.options = webdriver.ChromeOptions()
            self.options.add_argument('--headless')
            # Hide the "Chrome is being controlled by automated software" banner
            self.options.add_experimental_option('excludeSwitches', ['enable-automation'])
            # Ignore certificate errors
            self.options.add_argument('--ignore-certificate-errors')
            self.driver = webdriver.Chrome(executable_path=r"C:\chromedriver.exe",
                                           options=self.options)

        @classmethod
        def from_crawler(cls, crawler):
            # Scrapy calls this to create the middleware; we also hook the
            # spider_closed signal so the driver is released when the crawl ends.
            s = cls()
            crawler.signals.connect(s.spider_closed, signal=signals.spider_closed)
            return s

        def spider_closed(self):
            # quit() shuts down the whole browser, not just the current window
            self.driver.quit()

        def process_request(self, request, spider):
            is_selenium = request.meta.get('is_selenium', False)
            if is_selenium:
                try:
                    self.driver.get(request.url)
                except TimeoutException:
                    self.driver.execute_script('window.stop()')
                    # Custom status so the spider can detect a timed-out render
                    return HtmlResponse(url=request.url, status=666, encoding='utf-8')
                time.sleep(1)  # crude wait for JavaScript to finish executing
                url = self.driver.current_url
                body = self.driver.page_source
                return HtmlResponse(url=url, body=body, status=200, encoding='utf-8')
            # Requests without the flag fall through to the default downloader
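The middleware also has to be registered in the project's settings.py before Scrapy will run it. The module path and priority value below are placeholders; adjust them to your project layout:

```python
# settings.py (module path and priority value are placeholders)
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.SeleniumMiddleware": 543,
}
```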
Notes

When using this middleware, be careful about releasing resources: listen for the spider-closed signal and shut down the Selenium driver when the crawl ends, as from_crawler and spider_closed do above.