
scrapy+selenium+chromedriver: parsing dynamically rendered pages


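Background: a dynamic page is rendered by JavaScript in the browser, so Scrapy on its own cannot scrape it: the HTML it downloads does not yet contain the content. The page first has to be rendered into static HTML, after which Scrapy can parse it as usual.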
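To make the problem concrete, here is a minimal sketch that uses only the Python standard library. The two HTML strings and the `item` class name are made up for illustration: the first stands in for what a plain HTTP download returns (an empty container plus a script), the second for the DOM after a browser has run that script.

```python
from html.parser import HTMLParser


class ItemCollector(HTMLParser):
    """Collect the text of every <li class="item"> element."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_item = False

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "item") in attrs:
            self._in_item = True

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_item = False

    def handle_data(self, data):
        if self._in_item and data.strip():
            self.items.append(data.strip())


def extract_items(html):
    parser = ItemCollector()
    parser.feed(html)
    parser.close()
    return parser.items


# Hypothetical raw download: the list is filled in later by JavaScript.
RAW_HTML = """
<html><body>
  <ul id="results"></ul>
  <script>
    document.getElementById("results").innerHTML =
      "<li class='item'>Widget</li>";
  </script>
</body></html>
"""

# Hypothetical DOM after a browser has executed the script.
RENDERED_HTML = """
<html><body>
  <ul id="results"><li class="item">Widget</li></ul>
</body></html>
"""

print(extract_items(RAW_HTML))       # nothing to extract from the raw download
print(extract_items(RENDERED_HTML))  # the item is present once rendered
```

The parser never executes the script, so the markup inside it stays plain text. Scrapy's selectors see a raw download the same way, which is why the rendering step has to happen first.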

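Solution: add a Selenium downloader middleware (SeleniumMiddleware) that loads each request in Chrome and hands the rendered HTML back to Scrapy.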

Code:

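import logging
import time

from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


class SeleniumMiddleware(object):
    def __init__(self, timeout=25):
        chrome_options = Options()
        prefs = {
            'profile.default_content_setting_values': {
                'images': 2,  # block image loading
                # Block JS. Data loaded through JS (e.g. interaction counts) may then
                # fail to scrape; drop this entry if the content you need depends on JS.
                'javascript': 2,
            }
        }
        chrome_options.add_experimental_option("prefs", prefs)
        # Raw string keeps the backslashes of the Windows path literal.
        # executable_path / chrome_options are the Selenium 3 keywords;
        # Selenium 4 uses a Service object and options= instead.
        self.browser = webdriver.Chrome(
            executable_path=r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver",
            chrome_options=chrome_options)
        self.timeout = timeout
        self.browser.maximize_window()
        # self.browser.implicitly_wait(20)
        self.browser.set_page_load_timeout(self.timeout)
        self.wait = WebDriverWait(self.browser, self.timeout)

    def __del__(self):
        # quit() closes every window and also stops the chromedriver process;
        # close() would only close the current window.
        self.browser.quit()

    def process_request(self, request, spider):
        """
        Fetch the page with ChromeDriver.
        :param request: Request object
        :param spider: Spider object
        :return: HtmlResponse
        """
        logging.info('******ChromeDriver is Starting******')
        try:
            self.browser.get(request.url)
            # Wait until the result list is present in the DOM.
            self.wait.until(EC.presence_of_element_located(
                (By.XPATH, '//div[@class="s-result-list sg-row"]')))
            time.sleep(2)  # extra pause so late content can finish rendering
            return HtmlResponse(url=request.url, body=self.browser.page_source,
                                request=request, encoding='utf-8', status=200)
        except TimeoutException:
            # Page load or element wait timed out: report it as a server error.
            return HtmlResponse(url=request.url, status=500, request=request)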

Add the following configuration to the settings file:
DOWNLOADER_MIDDLEWARES = {
    # 'amazon.middlewares.AmazonDownloaderMiddleware': 543,
    'amazon.custom_rewrite.SeleniumMiddlewares.SeleniumMiddleware': 541,  # custom selenium middleware
}
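The number assigned to each middleware sets the order. Scrapy calls process_request in ascending order, and once one middleware returns a Response, the remaining process_request methods and the downloader itself are skipped. Below is a simplified model of that behavior in plain Python. It is not Scrapy's actual implementation, and all function names are hypothetical stand-ins.

```python
def run_chain(middlewares, url, download):
    """Toy model of how process_request hooks are chained.

    middlewares: dict mapping a hook function to its priority number,
                 mirroring the shape of DOWNLOADER_MIDDLEWARES.
    download:    fallback used when no hook returns a response.
    """
    called = []
    for hook, _priority in sorted(middlewares.items(), key=lambda item: item[1]):
        called.append(hook.__name__)
        response = hook(url)
        if response is not None:
            # A hook returned a response: later hooks and the downloader are skipped.
            return response, called
    return download(url), called


def selenium_hook(url):
    # Stands in for SeleniumMiddleware.process_request returning an HtmlResponse.
    return "rendered:" + url


def later_hook(url):
    # Stands in for a middleware with a higher number; None means "pass through".
    return None


def plain_download(url):
    # Stands in for Scrapy's normal downloader.
    return "raw:" + url


response, called = run_chain(
    {later_hook: 543, selenium_hook: 541},
    "https://example.com/search",
    plain_download,
)
print(response, called)
```

In this model the hook registered at 541 answers first, so the hook at 543 and the plain download are never reached. That matches how SeleniumMiddleware takes over page fetching once it returns an HtmlResponse.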
