[Web Crawler] Python Scrapy Basics: Requests and Responses
Typically, Request objects are generated in the spiders and passed across the system until they reach the Downloader, which executes the request and returns a Response object which travels back to the spider that issued the request.
Request objects
class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False])
A Request object represents an HTTP request, which is usually generated in the Spider and executed by the Downloader, thus generating a Response.
Parameters:
url
A string containing the URL of this request. Keep in mind that this attribute contains the escaped URL, so it can differ from the URL passed in the constructor.
This attribute is read-only. To change the URL of a Request use replace().
method
A string representing the HTTP method in the request. This is guaranteed to be uppercase. Example: "GET", "POST", "PUT", etc.
headers
A dictionary-like object which contains the request headers.
body
A str containing the request body.
This attribute is read-only. To change the body of a Request use replace().
meta
A dict that contains arbitrary metadata for this request. This dict is empty for new Requests, and is usually populated by different Scrapy components (extensions, middlewares, etc.), so the data it contains depends on the extensions you have enabled.
See Request.meta special keys for a list of special meta keys recognized by Scrapy.
This dict is shallow copied when the request is cloned using the copy() or replace() methods, and can also be accessed, in your spider, from the response.meta attribute.
copy()
Returns a new Request which is a copy of this Request. See also: Passing additional data to callback functions.
replace([url, method, headers, body, cookies, meta, encoding, dont_filter, callback, errback])
Returns a Request object with the same members, except for those members given new values by whichever keyword arguments are specified. The attribute Request.meta is copied by default (unless a new value is given in the meta argument). See also Passing additional data to callback functions.
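As a rough illustration of the attributes above, here is a minimal sketch (all URLs, header, and meta values are placeholders) that builds a Request and then derives a modified copy with replace():

import scrapy

# Hypothetical request; the URL, header, and meta values are placeholders.
request = scrapy.Request(
    "http://www.example.com/page.html",
    method='POST',
    body=b'name=value',
    headers={'X-Example': 'demo'},
    meta={'label': 'first'},
)

# replace() returns a new Request: members not passed as keyword arguments
# keep their current values, and meta is shallow copied by default.
new_request = request.replace(url="http://www.example.com/other.html",
                              method='GET', body=b'')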
Passing additional data to callback functions
The callback of a request is a function that will be called when the response of that request is downloaded. The callback function will be called with the downloaded Response object as its first argument.
Example:
def parse_page1(self, response):
    return scrapy.Request("http://www.example.com/some_page.html",
                          callback=self.parse_page2)

def parse_page2(self, response):
    # this would log http://www.example.com/some_page.html
    self.logger.info("Visited %s", response.url)
In some cases you may want to pass arguments to those callback functions, so that you can receive them later in the second callback. You can use the Request.meta attribute for that.
Here’s an example of how to pass an item using this mechanism, to populate different fields from different pages:
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    yield request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    yield item
Using errbacks to catch exceptions in request processing
The errback of a request is a function that will be called when an exception is raised while processing it.
It receives a Twisted Failure instance as its first parameter and can be used to track connection establishment timeouts, DNS errors, etc.
Here's an example spider logging all errors and catching some specific errors if needed:
import scrapy
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

class ErrbackSpider(scrapy.Spider):
    name = "errback_example"
    start_urls = [
        "http://www.httpbin.org/",              # HTTP 200 expected
        "http://www.httpbin.org/status/404",    # Not found error
        "http://www.httpbin.org/status/500",    # server issue
        "http://www.httpbin.org:12345/",        # non-responding host, timeout expected
        "http://www.httphttpbinbin.org/",       # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                 errback=self.errback_httpbin,
                                 dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        # do something useful here...

    def errback_httpbin(self, failure):
        # log all failures
        self.logger.error(repr(failure))

        # in case you want to do something special for some errors,
        # you may need the failure's type:
        if failure.check(HttpError):
            # these exceptions come from HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)
        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)
        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)
Request.meta special keys
The Request.meta attribute can contain any arbitrary data, but there are some special keys recognized by Scrapy and its built-in extensions.
Those are:
bindaddress
The outgoing IP address to use for performing the request.
download_timeout
The amount of time (in secs) that the downloader will wait before timing out. See also: DOWNLOAD_TIMEOUT.
download_latency
The amount of time spent to fetch the response, since the request has been started, i.e. the HTTP message sent over the network. This meta key only becomes available once the response has been downloaded. While most other meta keys are used to control Scrapy behavior, this one is supposed to be read-only.
download_fail_on_dataloss
Whether or not to fail on broken responses. See: DOWNLOAD_FAIL_ON_DATALOSS.
max_retry_times
This meta key is used to set the maximum retry times per request. When set, the max_retry_times meta key takes precedence over the RETRY_TIMES setting.
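As a hedged sketch of how these keys are typically used (the spider name, URL, and values are placeholders): the special keys are set through the meta argument when building a Request, and read back from response.meta in the callback:

import scrapy

class MetaKeysSpider(scrapy.Spider):
    # Hypothetical spider, for illustration only.
    name = "meta_keys_example"

    def start_requests(self):
        yield scrapy.Request(
            "http://www.example.com/",
            meta={
                'bindaddress': '10.0.0.5',   # placeholder outgoing IP address
                'download_timeout': 30,      # seconds before the downloader times out
                'max_retry_times': 5,        # takes precedence over RETRY_TIMES
            },
            callback=self.parse_page,
        )

    def parse_page(self, response):
        # download_latency is filled in by Scrapy once the response arrives
        self.logger.info('download_latency: %s',
                         response.meta.get('download_latency'))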
Request subclasses
Here is the list of built-in Request subclasses. You can also subclass it to implement your own custom functionality.
FormRequest objects (omitted)
Request usage examples
Using FormRequest to send data via HTTP POST
If you want to simulate an HTML Form POST in your spider and send a couple of key-value fields, you can return a FormRequest object (from your spider) like this:
return [FormRequest(url="http://www.example.com/post/action",
                    formdata={'name': 'John Doe', 'age': '27'},
                    callback=self.after_post)]
Using FormRequest.from_response() to simulate a user login
Websites usually provide pre-populated form fields through <input type="hidden"> elements, such as session related data or authentication tokens (on login pages). When scraping, you'll want these fields to be automatically pre-populated, overriding only a couple of them, such as the user name and password. You can use the FormRequest.from_response() method for this job. Here's an example spider which uses it:
import scrapy

class LoginSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    def after_login(self, response):
        # check login succeeded before going on
        # (response.body is bytes, so compare against a bytes literal)
        if b"authentication failed" in response.body:
            self.logger.error("Login failed")
            return

        # continue scraping with authenticated session...
Response objects
class scrapy.http.Response(url[, status=200, headers=None, body=b'', flags=None, request=None])
A Response object represents an HTTP response, which is usually downloaded (by the Downloader) and fed to the Spiders for processing.
Parameters:
url
A string containing the URL of the response.
This attribute is read-only. To change the URL of a Response use replace().
status
An integer representing the HTTP status of the response. Example: 200, 404.
headers
A dictionary-like object which 包含了響應頭. 可以通過 get()
獲取值然後返回 the first header value with the specified name 或者通過 getlist()
to return all header values with the specified name. For example, this call will give you all cookies in the headers:
response.headers.getlist('Set-Cookie')
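For instance, assuming response is a downloaded Response, get() and getlist() behave like this (header values are byte strings):

# get() returns the first value for a header name (or None if absent);
# getlist() returns all values for that name.
content_type = response.headers.get('Content-Type')    # e.g. b'text/html'
cookies = response.headers.getlist('Set-Cookie')       # list of byte strings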
body
The body of this Response. Keep in mind that Response.body is always a bytes object. If you want the unicode version use TextResponse.text (only available in TextResponse and subclasses).
This attribute is read-only. To change the body of a Response use replace().
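As a small illustration, assuming response is a TextResponse (e.g. a downloaded HTML page):

# response.body is always bytes; response.text (TextResponse only) is the
# decoded str version, using the response's declared encoding.
raw_bytes = response.body    # e.g. b'<html>...'
text = response.text         # e.g. '<html>...'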
request
The Request object that generated this response. This attribute is assigned in the Scrapy engine, after the response and the request have passed through all Downloader Middlewares. In particular, this means that:
- HTTP redirections will cause the original request (to the URL before redirection) to be assigned to the redirected response (with the final URL after redirection).
- Response.request.url doesn't always equal Response.url.
- This attribute is only available in the spider code, and in the Spider Middlewares, but not in Downloader Middlewares (although you have the Request available there by other means) and handlers of the response_downloaded signal.
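As a small sketch of the redirect caveat above (a hypothetical callback inside a spider), the two URLs can be compared directly:

def parse_page(self, response):
    # After a redirect, response.request.url holds the URL before redirection,
    # while response.url is the final URL the response was fetched from.
    if response.request.url != response.url:
        self.logger.info('redirected from %s to %s',
                         response.request.url, response.url)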
meta
A shortcut to the Request.meta attribute of the Response.request object (i.e. self.request.meta).
Unlike the Response.request attribute, the Response.meta attribute is propagated along redirects and retries, so you will get the original Request.meta sent from your spider.
See also: Request.meta special keys.
flags
A list that contains the flags of this response. Flags are labels used for tagging Responses. For example: 'cached', 'redirected', etc. They are shown in the string representation of the Response (__str__ method), which is used by the engine for logging.
copy()
Returns a new Response which is a copy of this Response.
replace([url, status, headers, body, request, flags, cls])
Returns a Response object with the same members, except for those members given new values by whichever keyword arguments are specified. The attribute Response.meta is copied by default.
urljoin(url)
Constructs an absolute url by combining the Response's url with a possible relative url.
This is a wrapper over urlparse.urljoin; it's merely an alias for making this call:
urlparse.urljoin(response.url, url)
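For example, assuming the response came from the hypothetical URL http://www.example.com/a/b.html:

absolute = response.urljoin('c.html')
# -> 'http://www.example.com/a/c.html'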
follow(url, callback=None, method='GET', headers=None, body=None, cookies=None, meta=None, encoding='utf-8', priority=0, dont_filter=False, errback=None)
Returns a Request instance to follow a link url. It accepts the same arguments as the Request.__init__ method, but url can be a relative URL or a scrapy.link.Link object, not only an absolute URL.
TextResponse provides a follow() method which supports selectors in addition to absolute/relative URLs and Link objects.
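A minimal sketch of following relative links (the CSS selector assumes a TextResponse, as noted above):

def parse(self, response):
    # follow() accepts relative URLs directly, so no manual urljoin() is needed
    for href in response.css('a::attr(href)').extract():
        yield response.follow(href, callback=self.parse)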
Response subclasses
Here is the list of available built-in Response subclasses. You can also subclass the Response class to implement your own functionality.