The Request object in Scrapy
When using the Scrapy framework we often wonder how data flows between its components. I've recently been using scrapy + selenium to crawl Taobao, and since it's Friday and I'm in a good mood, I decided to sort out this part of the framework.
Scrapy's components communicate with each other through Request objects and Response objects. In other words, data is passed between the spider and the middlewares via these two objects. The Request object is created in the spider; look at the code:
from urllib.parse import quote

from scrapy import Request, Spider

from scrapyseleniumtest.items import ProductItem


class TaobaoSpider(Spider):
    name = 'taobao'
    allowed_domains = ['www.taobao.com']
    base_url = 'https://s.taobao.com/search?q='

    def start_requests(self):
        for keyword in self.settings.get('KEYWORDS'):
            for page in range(1, self.settings.get('MAX_PAGE') + 1):
                url = self.base_url + quote(keyword)
                yield Request(url=url, callback=self.parse, meta={'page': page}, dont_filter=True)
This is a Scrapy spider. Look at the last line, yield Request(url=url, callback=self.parse, meta={'page': page}, dont_filter=True): it instantiates the Request class into a request object, and that object is what carries the data. For example, in middleware.py:
class SeleniumMiddleware():
    def __init__(self, timeout=None, service_args=[]):
        self.logger = getLogger(__name__)
        self.timeout = timeout
        self.browser = webdriver.Firefox(executable_path="geckodriver.exe")
        self.browser.set_window_size(1400, 700)
        self.browser.set_page_load_timeout(self.timeout)
        self.wait = WebDriverWait(self.browser, self.timeout)

    def __del__(self):
        self.browser.close()

    def process_request(self, request, spider):
        """
        Fetch the page with the Selenium-driven browser
        :param request: Request object
        :param spider: Spider object
        :return: HtmlResponse
        """
        self.logger.debug('Selenium is starting')
        page = request.meta.get('page', 1)
In process_request(self, request, spider), the second parameter is request, i.e. the request object. A Request object represents one HTTP request; it is normally generated by a Spider and executed by the Downloader, which produces a Response. Here, however, we use selenium, so the response is not produced by the Downloader: the Firefox browser object takes the Downloader's place and performs the download (loads the page), then an HtmlResponse object is constructed and returned to the spider for parsing, and the request object is not processed any further. Here is the complete code of the downloader middleware:
from logging import getLogger

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

from scrapy.http import HtmlResponse


class SeleniumMiddleware():
    def __init__(self, timeout=None, service_args=[]):
        self.logger = getLogger(__name__)
        self.timeout = timeout
        self.browser = webdriver.Firefox(executable_path="geckodriver.exe")
        self.browser.set_window_size(1400, 700)
        self.browser.set_page_load_timeout(self.timeout)
        self.wait = WebDriverWait(self.browser, self.timeout)

    def __del__(self):
        self.browser.close()

    def process_request(self, request, spider):
        """
        Fetch the page with the Selenium-driven browser
        :param request: Request object
        :param spider: Spider object
        :return: HtmlResponse
        """
        self.logger.debug('Selenium is starting')
        page = request.meta.get('page', 1)
        try:
            self.browser.get(request.url)
            if page > 1:
                input = self.wait.until(
                    EC.presence_of_element_located((By.CSS_SELECTOR, '#mainsrp-pager div.form > input')))
                submit = self.wait.until(
                    EC.element_to_be_clickable((By.CSS_SELECTOR, '#mainsrp-pager div.form > span.btn.J_Submit')))
                input.clear()
                input.send_keys(page)
                submit.click()
            self.wait.until(
                EC.text_to_be_present_in_element((By.CSS_SELECTOR, '#mainsrp-pager li.item.active > span'), str(page)))
            self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.m-itemlist .items .item')))
            return HtmlResponse(url=request.url, body=self.browser.page_source, request=request,
                                encoding='utf-8', status=200)
        except TimeoutException:
            return HtmlResponse(url=request.url, status=500, request=request)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(timeout=crawler.settings.get('SELENIUM_TIMEOUT'),
                   service_args=crawler.settings.get('PHANTOMJS_SERVICE_ARGS'))
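For the middleware to intercept requests at all, it has to be registered in the project settings. Here is a minimal sketch of the relevant entries in settings.py; the module path scrapyseleniumtest.middlewares is an assumption based on the import at the top of the spider, and the keyword/page values are made up:

# settings.py -- a sketch; the path scrapyseleniumtest.middlewares.SeleniumMiddleware is assumed
KEYWORDS = ['ipad']        # keywords iterated by start_requests()
MAX_PAGE = 100             # number of result pages per keyword
SELENIUM_TIMEOUT = 20      # seconds, passed to SeleniumMiddleware via from_crawler()

DOWNLOADER_MIDDLEWARES = {
    # every outgoing request now passes through SeleniumMiddleware.process_request()
    'scrapyseleniumtest.middlewares.SeleniumMiddleware': 543,
}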
Now, following the official Scrapy documentation, let's look at the specific parameters of the Request object:
1. Request objects
class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback])
A Request object represents an HTTP request; it is usually generated by a Spider and executed by the Downloader, producing a Response.
Parameters:
url (string): the URL of this request
callback (callable): the function to be called with this request's response (once it is downloaded) as its first argument. If no callback is specified, the spider's parse() method is used by default.
method (string): the HTTP method of this request, 'GET' by default (if GET means nothing to you, go study the urllib or requests module first).
meta (dict): the initial values for the Request.meta attribute. If given, the dict passed in is shallow-copied (if shallow copies are fuzzy for you, go review them).
body (str): the request body, i.e. the payload sent along with the request, such as the form data or JSON of a POST request.
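To make body concrete, here is a minimal sketch of a POST request whose body is a JSON string; the URL and payload below are placeholders, not part of the Taobao project above:

import json

from scrapy import Request

# placeholder URL and payload, for illustration only
payload = {'q': 'ipad', 'page': 1}
request = Request(
    url='http://www.example.com/api/search',
    method='POST',
    body=json.dumps(payload),                      # the raw request body as a string
    headers={'Content-Type': 'application/json'},
)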
headers (dict): the headers of this request.
cookies (dict or list): the request cookies. Two formats are accepted:
1. Using a dict:
request_with_cookies = Request(url="http://www.example.com", cookies={'currency': 'USD', 'country': 'UY'})
2. Using a list of dicts:
request_with_cookies = Request(url="http://www.example.com", cookies=[{'name': 'currency', 'value': 'USD', 'domain': 'example.com', 'path': '/currency'}])
The latter form allows customizing the domain and path attributes of a cookie; this is only useful if the cookies are saved for later requests.
When a site returns cookies in a response, those cookies are stored and sent back in subsequent requests, just as a regular browser does. If you want to avoid merging with the cookies currently in use, you can do so by setting dont_merge_cookies to True in Request.meta:
request_with_cookies = Request(url="http://www.example.com", cookies={'currency': 'USD', 'country': 'UY'}, meta={'dont_merge_cookies': True})
encoding (string): the encoding of this request, 'utf-8' by default.
priority (int): the priority of this request; the scheduler uses it to decide the order in which requests are processed, and requests with higher priority values are executed earlier.
dont_filter (boolean): indicates that this request should not be filtered by the Scheduler (the Scheduler filters out duplicate requests by default). This lets the same request be issued repeatedly. Use it with care!
errback (callable): the callback function to handle exceptions raised while processing the request.
Attributes and methods:
url: a string containing this request's URL.
method: a string representing the HTTP method of the request, e.g. 'GET', 'POST'...
headers: the request headers.
body: the request body.
meta: a dict containing arbitrary metadata for this request. The dict is empty for new Requests and is populated by Scrapy's other extensions when they are enabled. The dict is shallow-copied when the request is passed on.
copy(): returns a copy of this Request.
replace([url, method, headers, body, cookies, meta, encoding, dont_filter, callback, errback]): returns a Request with the same members, except for those given new values via the keyword arguments.
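A minimal sketch of copy() and replace() in use; the URLs here are placeholders:

from scrapy import Request

request = Request(url='http://www.example.com/page1', meta={'page': 1})

# an identical clone of the request (meta is shallow-copied)
clone = request.copy()

# the same request, but pointed at a new URL and exempt from the duplicate filter
retry = request.replace(url='http://www.example.com/page2', dont_filter=True)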
Passing additional data to callback functions
The callback function is called when the response for the request is downloaded, with the Response object as its first argument:
def parse_page1(self, response):
    return scrapy.Request("http://www.example.com/some_page.html",
                          callback=self.parse_page2)

def parse_page2(self, response):
    # this would log http://www.example.com/some_page.html
    self.logger.info("Visited %s", response.url)
In some cases you may want to pass arguments between the callback functions; Request.meta can be used for that (it feels a bit like a global variable):
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    yield request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    yield item
Using errbacks to catch exceptions in request processing
The errback of a request is called whenever an exception is raised while the request is being processed. It receives a Twisted Failure instance as its first argument and can be used to track connection timeouts, DNS errors, and so on.
import scrapy

from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

class ErrbackSpider(scrapy.Spider):
    name = "errback_example"
    start_urls = [
        "http://www.httpbin.org/",              # HTTP 200 expected
        "http://www.httpbin.org/status/404",    # Not found error
        "http://www.httpbin.org/status/500",    # server issue
        "http://www.httpbin.org:12345/",        # non-responding host, timeout expected
        "http://www.httphttpbinbin.org/",       # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                 errback=self.errback_httpbin,
                                 dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        # do something useful here...

    def errback_httpbin(self, failure):
        # log all failures
        self.logger.error(repr(failure))

        # in case you want to do something special for some errors,
        # you may need the failure's type:

        if failure.check(HttpError):
            # these exceptions come from HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)
Special keys in Request.meta
Request.meta can carry arbitrary data, but Scrapy and its built-in extensions recognize some special keys (a short usage sketch follows the list):
dont_redirect
dont_retry
handle_httpstatus_list
handle_httpstatus_all
dont_merge_cookies (see the cookies parameter of the Request constructor)
cookiejar
dont_cache
redirect_urls
bindaddress
dont_obey_robotstxt
download_timeout (download timeout)
download_maxsize
download_latency (download latency)
proxy
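A minimal sketch showing a few of these keys in use; the proxy address and timeout value are made-up examples:

from scrapy import Request

request = Request(
    url='http://www.example.com',
    meta={
        'proxy': 'http://127.0.0.1:8888',  # send this request through a proxy (placeholder address)
        'download_timeout': 30,            # per-request download timeout, in seconds
        'dont_redirect': True,             # do not follow HTTP redirects for this request
        'dont_retry': True,                # do not retry this request on failure
    },
)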