[Web Crawler] Python Scrapy Basics: Requests and Responses
Typically, Request objects are generated in the spiders and passed across the system until they reach the Downloader, which executes the request and returns a Response object which travels back to the spider that issued the request.
Request objects
class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False])
A Request object represents an HTTP request, which is usually generated in the Spider and executed by the Downloader, thus generating a Response.
Parameters:
url
A string containing the URL of this request. Keep in mind that this attribute contains the escaped URL, so it can differ from the URL passed in the constructor.
This attribute is read-only. To change the URL of a Request use replace().
method
A string representing the HTTP method in the request. This is guaranteed to be uppercase. Example: "GET", "POST", "PUT", etc.
headers
A dictionary-like object which contains the request headers.
body
A str containing the request body.
This attribute is read-only. To change the body of a Request use replace().
meta
A dict that contains arbitrary metadata for this request. This dict is empty for new Requests, and is usually populated by different Scrapy components (extensions, middlewares, etc.), so the data it contains depends on the extensions you have enabled.
See Request.meta special keys for a list of special meta keys recognized by Scrapy.
This dict is shallow copied when the request is cloned using the copy() or replace() methods, and can also be accessed, in your spider, from the response.meta attribute.
copy()
Returns a new Request which is a copy of this Request. See also: Passing additional data to callback functions.
replace([url, method, headers, body, cookies, meta, encoding, dont_filter, callback, errback])
Returns a Request object with the same members, except for those members given new values by whichever keyword arguments are specified. The attribute Request.meta is copied by default (unless a new value is given in the meta argument). See also Passing additional data to callback functions.
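As a rough illustration of the attributes above, here is a minimal sketch (all URLs, header, and meta values are placeholders) that builds a Request and then derives a modified copy with replace():

import scrapy

# Hypothetical request; the URL, header, and meta values are placeholders.
request = scrapy.Request(
    "http://www.example.com/page.html",
    method='POST',
    body=b'name=value',
    headers={'X-Example': 'demo'},
    meta={'label': 'first'},
)

# replace() returns a new Request: members not passed as keyword arguments
# keep their current values, and meta is shallow copied by default.
new_request = request.replace(url="http://www.example.com/other.html",
                              method='GET', body=b'')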
Passing additional data to callback functions
The callback of a request is a function that will be called when the response of that request is downloaded. The callback function will be called with the downloaded Response object as its first argument.
Example:
def parse_page1(self, response):
    return scrapy.Request("http://www.example.com/some_page.html",
                          callback=self.parse_page2)

def parse_page2(self, response):
    # this would log http://www.example.com/some_page.html
    self.logger.info("Visited %s", response.url)
In some cases you may want to pass arguments to those callback functions, so that you can receive them later in the second callback. You can use the Request.meta attribute for that.
Here’s an example of how to pass an item using this mechanism, to populate different fields from different pages:
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    yield request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    yield item
Using errbacks to catch exceptions in request processing
The errback of a request is a function that will be called when an exception is raised while processing it.
It receives a Twisted Failure instance as its first parameter and can be used to track connection establishment timeouts, DNS errors, etc.
Here's an example spider logging all errors and catching some specific errors if needed:
import scrapy
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

class ErrbackSpider(scrapy.Spider):
    name = "errback_example"
    start_urls = [
        "http://www.httpbin.org/",              # HTTP 200 expected
        "http://www.httpbin.org/status/404",    # Not found error
        "http://www.httpbin.org/status/500",    # server issue
        "http://www.httpbin.org:12345/",        # non-responding host, timeout expected
        "http://www.httphttpbinbin.org/",       # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                 errback=self.errback_httpbin,
                                 dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        # do something useful here...

    def errback_httpbin(self, failure):
        # log all failures
        self.logger.error(repr(failure))

        # in case you want to do something special for some errors,
        # you may need the failure's type:
        if failure.check(HttpError):
            # these exceptions come from HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)
        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)
        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)
Request.meta special keys
The Request.meta attribute can contain any arbitrary data, but there are some special keys recognized by Scrapy and its built-in extensions.
Those are:
bindaddress
The outgoing IP address to use for performing the request.
download_timeout
The amount of time (in secs) that the downloader will wait before timing out. See also: DOWNLOAD_TIMEOUT.
download_latency
The amount of time spent to fetch the response, since the request has been started, i.e. the HTTP message sent over the network. This meta key only becomes available once the response has been downloaded. While most other meta keys are used to control Scrapy behavior, this one is supposed to be read-only.
download_fail_on_dataloss
Whether or not to fail on broken responses. See: DOWNLOAD_FAIL_ON_DATALOSS.
max_retry_times
This meta key is used to set the maximum retry times per request. When set, the max_retry_times meta key takes precedence over the RETRY_TIMES setting.
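As a hedged sketch of how these keys are typically used (the spider name, URL, and values are placeholders): the special keys are set through the meta argument when building a Request, and read back from response.meta in the callback:

import scrapy

class MetaKeysSpider(scrapy.Spider):
    # Hypothetical spider, for illustration only.
    name = "meta_keys_example"

    def start_requests(self):
        yield scrapy.Request(
            "http://www.example.com/",
            meta={
                'bindaddress': '10.0.0.5',   # placeholder outgoing IP address
                'download_timeout': 30,      # seconds before the downloader times out
                'max_retry_times': 5,        # takes precedence over RETRY_TIMES
            },
            callback=self.parse_page,
        )

    def parse_page(self, response):
        # download_latency is filled in by Scrapy once the response arrives
        self.logger.info('download_latency: %s',
                         response.meta.get('download_latency'))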
Request subclasses
Here is the list of built-in Request subclasses. You can also subclass it to implement your own custom functionality.
FormRequest objects (omitted)
Request usage examples
Using FormRequest to send data via HTTP POST
If you want to simulate an HTML Form POST in your spider and send a couple of key-value fields, you can return a FormRequest object (from your spider) like this:
return [FormRequest(url="http://www.example.com/post/action",
                    formdata={'name': 'John Doe', 'age': '27'},
                    callback=self.after_post)]
Using FormRequest.from_response() to simulate a user login
Websites usually provide pre-populated form fields through <input type="hidden"> elements, such as session related data or authentication tokens (on login pages). When scraping, you'll want these fields to be automatically pre-populated, overriding only a couple of them, such as the user name and password. You can use the FormRequest.from_response() method for this job. Here's an example spider which uses it:
import scrapy

class LoginSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    def after_login(self, response):
        # check login succeeded before going on
        # (response.body is bytes, so compare against a bytes literal)
        if b"authentication failed" in response.body:
            self.logger.error("Login failed")
            return

        # continue scraping with authenticated session...
Response objects
class scrapy.http.Response(url[, status=200, headers=None, body=b'', flags=None, request=None])
A Response object represents an HTTP response, which is usually downloaded (by the Downloader) and fed to the Spiders for processing.
Parameters:
url
A string containing the URL of the response.
This attribute is read-only. To change the URL of a Response use replace().
status
An integer representing the HTTP status of the response. Example: 200, 404.
headers
A dictionary-like object which 包含了響應頭. 可以通過 get()
獲取值然後返回 the first header value with the specified name 或者通過 getlist()
to return all header values with the specified name. For example, this call will give you all cookies in the headers:
response.headers.getlist('Set-Cookie')
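For instance, assuming response is a downloaded Response, get() and getlist() behave like this (header values are byte strings):

# get() returns the first value for a header name (or None if absent);
# getlist() returns all values for that name.
content_type = response.headers.get('Content-Type')    # e.g. b'text/html'
cookies = response.headers.getlist('Set-Cookie')       # list of byte strings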
body
The body of this Response. Keep in mind that Response.body is always a bytes object. If you want the unicode version use TextResponse.text (only available in TextResponse and subclasses).
This attribute is read-only. To change the body of a Response use replace().
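As a small illustration, assuming response is a TextResponse (e.g. a downloaded HTML page):

# response.body is always bytes; response.text (TextResponse only) is the
# decoded str version, using the response's declared encoding.
raw_bytes = response.body    # e.g. b'<html>...'
text = response.text         # e.g. '<html>...'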
request
The Request object that generated this response. This attribute is assigned in the Scrapy engine, after the response and the request have passed through all Downloader Middlewares. In particular, this means that:
- HTTP redirections will cause the original request (to the URL before redirection) to be assigned to the redirected response (with the final URL after redirection).
- Response.request.url doesn't always equal Response.url.
- This attribute is only available in the spider code, and in the Spider Middlewares, but not in Downloader Middlewares (although you have the Request available there by other means) and handlers of the response_downloaded signal.
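As a small sketch of the redirect caveat above (a hypothetical callback inside a spider), the two URLs can be compared directly:

def parse_page(self, response):
    # After a redirect, response.request.url holds the URL before redirection,
    # while response.url is the final URL the response was fetched from.
    if response.request.url != response.url:
        self.logger.info('redirected from %s to %s',
                         response.request.url, response.url)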
meta
A shortcut to the Request.meta attribute of the Response.request object (i.e. self.request.meta).
Unlike the Response.request attribute, the Response.meta attribute is propagated along redirects and retries, so you will get the original Request.meta sent from your spider.
See also: Request.meta special keys.
flags
A list that contains the flags of this response. Flags are labels used for tagging Responses. For example: 'cached', 'redirected', etc. They are shown in the string representation of the Response (__str__ method), which is used by the engine for logging.
copy()
Returns a new Response which is a copy of this Response.
replace([url, status, headers, body, request, flags, cls])
Returns a Response object with the same members, except for those members given new values by whichever keyword arguments are specified. The attribute Response.meta is copied by default.
urljoin(url)
Constructs an absolute url by combining the Response's url with a possible relative url.
This is a wrapper over urlparse.urljoin; it's merely an alias for making this call:
urlparse.urljoin(response.url, url)
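For example, assuming the response came from the hypothetical URL http://www.example.com/a/b.html:

absolute = response.urljoin('c.html')
# -> 'http://www.example.com/a/c.html'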
follow(url, callback=None, method='GET', headers=None, body=None, cookies=None, meta=None, encoding='utf-8', priority=0, dont_filter=False, errback=None)
Returns a Request instance to follow a link url. It accepts the same arguments as the Request.__init__ method, but url can be a relative URL or a scrapy.link.Link object, not only an absolute URL.
TextResponse provides a follow() method which supports selectors in addition to absolute/relative URLs and Link objects.
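A minimal sketch of following relative links (the CSS selector assumes a TextResponse, as noted above):

def parse(self, response):
    # follow() accepts relative URLs directly, so no manual urljoin() is needed
    for href in response.css('a::attr(href)').extract():
        yield response.follow(href, callback=self.parse)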
Response subclasses
Here is the list of available built-in Response subclasses. You can also subclass the Response class to implement your own functionality.