Scrapy Basics: The Spider Class in Detail (the parse() Function, Selectors, Extracting Data)

阿新 • Published: 2020-08-19

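Installation & Creating a Project

# Install Scrapy
pip install scrapy
# Create a project
scrapy startproject tutorial # "tutorial" is the project name
# Create a spider
scrapy genspider <spider_name> <domain.com>

This produces the following directory structure: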

tutorial/
    scrapy.cfg            # project configuration file
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spiders
            __init__.py
            spider1.py
            ...

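The Spider Class

A spider class must inherit from scrapy.Spider. It needs the following attributes and methods:

1. name = "quotes": the spider's name. It must be unique within the project, because the scrapy crawl <spider_name> command uses it to select which spider to run.

2. start_requests(): must return an iterable of Requests, either a list or a generator. The spider starts crawling from these requests, for example: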

# start_requests()
def start_requests(self):
    urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]
    for url in urls:
        yield scrapy.Request(url=url, callback=self.parse)

3. parse(): handles the Response returned for each Request. parse() usually extracts the scraped data from the Response as dicts, or picks out URLs and issues further Requests to crawl deeper.

# parse()
def parse(self, response):
    page = response.url.split("/")[-2]
    filename = 'quotes-%s.html' % page
    with open(filename, 'wb') as f:
        f.write(response.body)
    self.log('Saved file %s' % filename)

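4. The start_urls list: instead of defining start_requests(), you can define a class attribute named start_urls. It also supplies the spider's initial Requests, but with less code.

When the spider runs, the method named parse() is called automatically to handle the response for each URL in start_urls: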

start_urls = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
]

5. Run the spider:

$ scrapy crawl quotes

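What happens when the spider runs: Scrapy schedules the scrapy.Request objects returned by the spider's start_requests() method. When the response to a request arrives, Scrapy instantiates a Response object and calls the callback associated with that request (here, parse()), passing the Response as its argument.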
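The request/callback cycle described above can be imitated in a few lines of plain Python. This is only a toy model, not how Scrapy is implemented (Scrapy downloads asynchronously); ToyRequest, ToyResponse, ToySpider, run() and the example.com URLs are all hypothetical names made up for the illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToyRequest:
    """Hypothetical stand-in for scrapy.Request: a URL plus a callback."""
    url: str
    callback: Callable


@dataclass
class ToyResponse:
    """Hypothetical stand-in for a Response: the URL and a fake body."""
    url: str
    body: str


class ToySpider:
    name = "toy"
    start_urls = ["http://example.com/page/1/", "http://example.com/page/2/"]

    def start_requests(self):
        # Initial requests, each bound to the parse() callback.
        for url in self.start_urls:
            yield ToyRequest(url, callback=self.parse)

    def parse(self, response):
        # A callback yields scraped data (dicts) and/or further requests.
        yield {"url": response.url, "length": len(response.body)}


def run(spider, fetch):
    """Drain the request queue, calling each request's callback with its response."""
    queue = deque(spider.start_requests())
    items = []
    while queue:
        request = queue.popleft()
        response = ToyResponse(request.url, fetch(request.url))
        for result in request.callback(response):
            if isinstance(result, ToyRequest):
                queue.append(result)  # follow-up request: schedule it
            else:
                items.append(result)  # scraped item: collect it
    return items


# Fake "download" so the sketch needs no network access.
items = run(ToySpider(), fetch=lambda url: "<html>%s</html>" % url)
print(items)
```

The point of the model is the branch inside run(): whatever a callback yields is either scheduled as a new request or collected as scraped data.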

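The parse() Function

parse() is the most important function in a spider class: it holds the main logic for parsing responses.

The best way to learn Scrapy selectors is the Scrapy shell. The following command opens an interactive session with the page already downloaded: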

scrapy shell 'http://quotes.toscrape.com/page/1/'

The interactive examples below show how to use selectors on the Response:

CSS Selectors

response.css() returns a SelectorList object, which is a list of Selector objects, for example:

>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

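getall() returns a list of all matching strings, and get() returns only the first match. The ::text pseudo-element selects the text inside the element, so the tags (<tag>) are left out.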

>>> response.css('title::text').getall()
['Quotes to Scrape']
>>> response.css('title::text').get()
'Quotes to Scrape'
>>> response.css('title::text')[0].get()
'Quotes to Scrape'

re() works like getall(), but additionally filters the content with a regular expression:

>>> response.css('title::text').re(r'Q\w+')
['Quotes']
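To see what re() adds on top of getall(), the behaviour can be approximated with the standard re module. The helper re_like() below is a hypothetical name and only an approximation of the selector method: it applies the pattern to every extracted string and flattens the matches into one list.

```python
import re


def re_like(strings, pattern):
    """Approximate SelectorList.re(): apply the regex to every string, flatten the matches."""
    results = []
    for text in strings:
        for match in re.findall(pattern, text):
            # findall() returns tuples when the pattern has several groups.
            if isinstance(match, tuple):
                results.extend(match)
            else:
                results.append(match)
    return results


texts = ['Quotes to Scrape']               # what getall() returned above
print(re_like(texts, r'Q\w+'))             # ['Quotes']
print(re_like(texts, r'(\w+) to (\w+)'))   # ['Quotes', 'Scrape']
```

The second call shows that capture groups come back as separate list entries rather than as a tuple.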

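XPath Selectors

XPath selectors are more powerful than CSS selectors. In fact, Scrapy converts CSS selectors to XPath internally, which is why the CSS example above prints xpath='descendant-or-self::title'.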

>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').get()
'Quotes to Scrape'

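Generating Data Dicts

To turn the data scraped from a Response into dicts, write parse() as a generator that yields one dict per item, for example:

def parse(self, response):
    for quote in response.css('div.quote'):  # each quote is a Selector object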
        yield {
            'text': quote.css('span.text::text').get(),
            'author': quote.css('small.author::text').get(),
            'tags': quote.css('div.tags a.tag::text').getall(),
        }

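Saving Data to a File

The simplest way is to use Feed exports. The -o option names a JSON file that stores everything parse() yields.

$ scrapy crawl quotes -o quotes.json -s FEED_EXPORT_ENCODING=utf-8
# If the data contains Chinese or other non-ASCII text, be sure to add -s FEED_EXPORT_ENCODING=utf-8

Alternatively, save in JSON Lines format. For historical reasons, the -o option appends to an existing file instead of overwriting it, so a second run leaves the .json file with invalid JSON. The JSON Lines format (.jl) avoids this problem, because every line is a separate JSON object: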

$ scrapy crawl quotes -o quotes.jl
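The difference between the two formats under appending can be reproduced with the standard json module. The sketch below uses in-memory strings instead of real output files, and the two batches are made-up sample data standing in for two crawl runs.

```python
import json

batch_1 = [{"author": "Albert Einstein"}]   # sample output of a first run
batch_2 = [{"author": "Jane Austen"}]       # sample output of a second run

# Appending a second JSON array to the first gives "[...][...]", which is not valid JSON.
appended_json = json.dumps(batch_1) + json.dumps(batch_2)
try:
    json.loads(appended_json)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

# JSON Lines stores one object per line, so appended runs stay parseable line by line.
appended_jl = ""
for batch in (batch_1, batch_2):
    appended_jl += "".join(json.dumps(item) + "\n" for item in batch)
records = [json.loads(line) for line in appended_jl.splitlines()]

print(json_ok)    # False
print(records)
```

Reading a .jl file back therefore means parsing it line by line rather than with a single json.load() call.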

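To do more with the data (for example validating scraped items or removing duplicates), write an Item Pipeline in pipelines.py. If you only need to store the scraped data, no pipeline is required.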
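A minimal sketch of such a pipeline follows, combining a validation check with deduplication. The class name DuplicateTextPipeline and the choice of the 'text' field as the uniqueness key are assumptions made for this example. The import fallback exists only so the snippet can be tried where Scrapy is not installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback so the sketch can run where Scrapy is not installed.
    class DropItem(Exception):
        pass


class DuplicateTextPipeline:
    """Hypothetical pipeline: drop quotes whose 'text' is missing or already seen."""

    def __init__(self):
        self.seen_texts = set()

    def process_item(self, item, spider):
        text = item.get("text")
        if not text:
            raise DropItem("missing text")              # validation
        if text in self.seen_texts:
            raise DropItem("duplicate quote: %r" % text)  # deduplication
        self.seen_texts.add(text)
        return item


# Quick check outside Scrapy; in a project, Scrapy calls process_item() itself.
pipeline = DuplicateTextPipeline()
first = pipeline.process_item({"text": "A", "author": "X"}, spider=None)
try:
    pipeline.process_item({"text": "A", "author": "X"}, spider=None)
    dropped = False
except DropItem:
    dropped = True
print(first, dropped)
```

In a real project the class would live in pipelines.py and be enabled through the ITEM_PIPELINES setting in settings.py, for example with an entry such as 'tutorial.pipelines.DuplicateTextPipeline': 300.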

Extracting URLs to Crawl Deeper

For example, to extract the URL of the next page and continue crawling from it:

<li class="next">
    <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a> <!-- &rarr; is a right arrow -->
</li>

Either of the following extracts the href attribute of the <a> tag:

>>> response.css('li.next a::attr(href)').get()
'/page/2/'
>>> response.css('li.next a').attrib['href']
'/page/2/'

When parse() yields a Request object, Scrapy schedules that request automatically and, once it completes, calls the function given in its callback argument, as shown below:

def parse(self, response):
    for quote in response.css('div.quote'):  # each quote is a Selector object
        yield {
            'text': quote.css('span.text::text').get(),
            'author': quote.css('small.author::text').get(),
            'tags': quote.css('div.tags a.tag::text').getall(),
        }

    next_page = response.css('li.next a::attr(href)').get()
    if next_page is not None:
        next_page = response.urljoin(next_page)  # urljoin() turns the relative path into an absolute URL
        yield scrapy.Request(next_page, callback=self.parse)  # schedule the next page
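The resolution that response.urljoin() performs can be tried without Scrapy, since the standard library's urllib.parse.urljoin follows the same rules for combining a base URL with a relative one. The sketch assumes the page declares no <base> tag, so the response URL itself acts as the base.

```python
from urllib.parse import urljoin

current_page = 'http://quotes.toscrape.com/page/1/'   # plays the role of response.url

root_relative = urljoin(current_page, '/page/2/')
dot_relative = urljoin(current_page, '../2/')
absolute = urljoin(current_page, 'http://example.com/other/')

print(root_relative)   # http://quotes.toscrape.com/page/2/
print(dot_relative)    # http://quotes.toscrape.com/page/2/
print(absolute)        # http://example.com/other/
```

An href that is already absolute passes through unchanged, so it is safe to join every extracted link.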

response.follow()

response.follow() is a more convenient replacement for scrapy.Request() because it accepts relative URLs directly. The code above simplifies to:

next_page = response.css('li.next a::attr(href)').get()
if next_page is not None:
    yield response.follow(next_page, callback=self.parse)  # next_page = '/page/2/'

response.follow() also accepts a Selector object directly, so there is no need to extract the URL string first. The code simplifies further:

for href in response.css('li.next a::attr(href)'):
    yield response.follow(href, callback=self.parse)  # href is a Selector for the href attribute

Note that a SelectorList object cannot be passed directly. The following usage is wrong:

yield response.follow(response.css('li.next a::attr(href)'), callback=self.parse)

For a selector that matches an <a> element, response.follow() uses its href attribute automatically, so the shortest version is:

# CSS selector
for a in response.css('li.next a'):
    yield response.follow(a, callback=self.parse)

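With an XPath selector, the example below selects the href attribute explicitly. response.follow() inspects the selected node rather than the selector syntax, so selecting the <a> element itself (//div[@class='p_name']/a) should work as well:

# XPath selector that selects the href attribute explicitly
for a in response.xpath("//div[@class='p_name']/a/@href"):
    yield response.follow(a, callback=self.parse)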

By default, Scrapy filters out requests to URLs it has already visited. This behavior is controlled by the DUPEFILTER_CLASS setting.
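The idea behind duplicate filtering can be sketched as a set of already-seen requests. ToyDupeFilter is a hypothetical, heavily simplified class written for this illustration; Scrapy's own filter works on request fingerprints rather than on a plain (method, URL) pair.

```python
class ToyDupeFilter:
    """Hypothetical, simplified duplicate filter keyed on (method, URL)."""

    def __init__(self):
        self.seen = set()

    def request_seen(self, method, url):
        """Return True if this request was seen before, otherwise remember it."""
        key = (method.upper(), url)
        if key in self.seen:
            return True
        self.seen.add(key)
        return False


dupefilter = ToyDupeFilter()
urls = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
    'http://quotes.toscrape.com/page/1/',   # repeated on purpose
]
scheduled = [u for u in urls if not dupefilter.request_seen('GET', u)]
print(scheduled)   # the repeated page/1/ URL is skipped
```

When a page really has to be fetched again, a request can opt out of the filter by passing dont_filter=True to scrapy.Request.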

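Passing Arguments to scrapy crawl

Use the -a option to pass extra arguments to a spider. Each argument becomes an attribute of the spider instance (read it with self.tag or getattr(self, 'tag', None)). In the example below, -a tag=humor makes the spider crawl only quotes with the humor tag, and -o saves the results to quotes-humor.json: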

$ scrapy crawl quotes -o quotes-humor.json -a tag=humor

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
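How the -a arguments end up as attributes can be imitated with a plain class. ArgSpider and start_url() are hypothetical names for this sketch; the URL-building logic is copied from start_requests() above.

```python
class ArgSpider:
    """Hypothetical stand-in showing how -a key=value pairs become attributes."""

    def __init__(self, **kwargs):
        # Copy every keyword argument onto the instance as an attribute.
        self.__dict__.update(kwargs)

    def start_url(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)   # None when no tag argument was given
        if tag is not None:
            url = url + 'tag/' + tag
        return url


print(ArgSpider().start_url())              # http://quotes.toscrape.com/
print(ArgSpider(tag='humor').start_url())   # http://quotes.toscrape.com/tag/humor
```

Using getattr() with a default keeps the spider usable when the argument is omitted. Values passed on the command line always arrive as strings, so numeric arguments need an explicit conversion.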
