Notes - Scrapy - Request/Response
1. Introduction
Scrapy uses Request and Response objects to crawl web sites: spiders generate Requests, the downloader executes them, and the resulting Responses are handed back to the spiders' callbacks.
2. Request
class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback, flags])
Parameters (a combined usage sketch follows this list):
url (string): the URL of this request.
callback (callable): the function that will be called with the response of this request (once it's downloaded) as its first parameter. For more information see section 2.1 on passing additional data to callback functions. If a Request doesn't specify a callback, the spider's parse() method will be used. Note that if exceptions are raised during processing, errback is called instead.
method (string): the HTTP method of this request. Defaults to 'GET'.
meta (dict): the initial values for the Request.meta attribute. If given, the dict passed in this parameter will be shallow copied. Commonly used to pass data between callbacks; note that it is a shallow copy.
body (str or unicode): the request body. If a unicode is passed, then it's encoded to str using the encoding passed (which defaults to utf-8). If body is not given, an empty string is stored. Regardless of the type of this argument, the final value stored will be a str (never unicode or None).
headers (dict): the HTTP headers of this request.
cookies (dict or list): the cookies to send with the request. They can be passed in two forms.
Dict form:
request_with_cookies = Request(url="http://www.example.com",
                               cookies={'currency': 'USD', 'country': 'UY'})
List-of-dicts form:
request_with_cookies = Request(url="http://www.example.com",
                               cookies=[{'name': 'currency',
                                         'value': 'USD',
                                         'domain': 'example.com',
                                         'path': '/currency'}])
The latter form allows for customizing the domain and path attributes of the cookie. This is only useful if the cookies are saved for later requests.
When some site returns cookies (in a response) those are stored in the cookies for that domain and will be sent again in future requests. That’s the typical behaviour of any regular web browser. However, if, for some reason, you want to avoid merging with existing cookies you can instruct Scrapy to do so by setting the dont_merge_cookies key to True in the Request.meta.
Example of not merging cookies:
request_with_cookies = Request(url="http://www.example.com",
                               cookies={'currency': 'USD', 'country': 'UY'},
                               meta={'dont_merge_cookies': True})
encoding (string): the encoding of this request (defaults to 'utf-8').
priority (int): the priority of this request; the scheduler uses it to order requests, and higher values run earlier. I have not used it so far.
dont_filter (boolean): indicates that this request should not be filtered by the scheduler. Use this when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Defaults to False.
errback (callable): a function that will be called if any exception is raised while processing the request.
flags (list): flags sent to the request, can be used for logging or similar purposes.
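A minimal sketch that combines several of these parameters in one spider; the URL, body, meta keys and callback names below are placeholders for illustration, not part of the Scrapy API:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "request_example"

    def start_requests(self):
        yield scrapy.Request(
            url="http://www.example.com/search",          # placeholder URL
            method='POST',                                # defaults to 'GET'
            headers={'Content-Type': 'application/x-www-form-urlencoded'},
            body='q=scrapy',                              # str body, encoded using `encoding`
            meta={'page': 1},                             # arbitrary data, shallow copied
            priority=10,                                  # higher values are scheduled earlier
            dont_filter=True,                             # bypass the duplicates filter
            callback=self.parse_search,
            errback=self.on_error,
        )

    def parse_search(self, response):
        self.logger.info('Got %s for page %s', response.status, response.meta['page'])

    def on_error(self, failure):
        self.logger.error(repr(failure))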
The class also exposes the following attributes and methods (a short replace() sketch follows this list):
url: the URL of this request. This attribute contains the escaped URL, so it may differ from the URL passed to the constructor. It is read-only; to change the URL, use replace().
method: a string representing the HTTP method in the request, in uppercase, for example: "GET", "POST", "PUT".
headers: a dictionary-like object containing the request headers.
body: a str containing the request body. This attribute is read-only; to change the body, use replace().
meta: a dict containing arbitrary metadata for this request. It is empty for new Requests and is usually populated by different Scrapy components (extensions, middlewares, etc.).
copy(): returns a new Request which is a copy of this Request.
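Since url and body are read-only, changing them means building a new Request with replace(); a minimal sketch (the URLs are placeholders):

import scrapy

original = scrapy.Request(url="http://www.example.com/page1")

# replace() returns a new Request with the given attributes overridden;
# the original Request is left untouched.
modified = original.replace(url="http://www.example.com/page2",
                            method='POST',
                            body='key=value')

print(original.url)   # http://www.example.com/page1
print(modified.url)   # http://www.example.com/page2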
2.1. meta: passing additional data
Passing data between callbacks with the meta parameter:
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    yield request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    yield item
meta has a number of keys with special meaning that Scrapy itself uses when processing the request, for example (see the sketch below):
download_timeout: the amount of time (in seconds) that the downloader will wait before timing out.
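A minimal sketch of setting a per-request timeout through meta; the URL is a placeholder, and the project-wide default comes from the DOWNLOAD_TIMEOUT setting:

import scrapy

class TimeoutSpider(scrapy.Spider):
    name = "timeout_example"

    def start_requests(self):
        # Give this single request a 10-second timeout instead of the
        # project-wide DOWNLOAD_TIMEOUT default.
        yield scrapy.Request("http://www.example.com/slow-page",
                             meta={'download_timeout': 10},
                             callback=self.parse)

    def parse(self, response):
        self.logger.info('Downloaded %s', response.url)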
2.2. errbacks: handling exceptions
The following example spider registers an errback and inspects the type of the failure:
import scrapy
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

class ErrbackSpider(scrapy.Spider):
    name = "errback_example"
    start_urls = [
        "http://www.httpbin.org/",              # HTTP 200 expected
        "http://www.httpbin.org/status/404",    # Not found error
        "http://www.httpbin.org/status/500",    # server issue
        "http://www.httpbin.org:12345/",        # non-responding host, timeout expected
        "http://www.httphttpbinbin.org/",       # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                 errback=self.errback_httpbin,
                                 dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        # do something useful here...

    def errback_httpbin(self, failure):
        # log all failures
        self.logger.error(repr(failure))

        # in case you want to do something special for some errors,
        # you may need the failure's type:
        if failure.check(HttpError):
            # these exceptions come from HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)
        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)
        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)
3. Response
class scrapy.http.Response(url[, status=200, headers=None, body=b'', flags=None, request=None])
Parameters:
url (string): the URL of this response.
status (integer): the HTTP status of the response. Defaults to 200.
headers (dict): the headers of this response. The dict values can be strings (for single-valued headers) or lists (for multi-valued headers).
body (bytes): the response body. Note that it is bytes and has to be decoded before being used as text.
flags (list): a list containing the initial values for the Response.flags attribute. If given, the list will be shallow copied.
request (Request object): the initial value of the Response.request attribute. This represents the Request that generated this response.
Response also has a few subclasses, but they are generally not used directly, so they are not discussed here.
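A minimal sketch of reading the common Response attributes inside a parse callback; the URL is a placeholder, and note that response.body is bytes and has to be decoded explicitly:

import scrapy

class ResponseAttrsSpider(scrapy.Spider):
    name = "response_example"
    start_urls = ["http://www.example.com/"]

    def parse(self, response):
        self.logger.info('url:     %s', response.url)          # URL of this response
        self.logger.info('status:  %s', response.status)       # HTTP status code, e.g. 200
        self.logger.info('headers: %s', response.headers.get('Content-Type'))
        self.logger.info('request: %s', response.request.url)  # the Request that produced it

        # response.body is bytes and must be decoded before being used as text.
        text = response.body.decode('utf-8')
        self.logger.info('first 100 characters: %s', text[:100])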