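Scrapy Crawler Learning Series, Part 1: Tutorial

阿新 • Published: 2018-12-30

Preface

I plan to write a series of articles recording what I learn while studying and using Scrapy. Throughout the series I use Python 3.6 as the runtime environment for Scrapy.

This article is my original work; please credit the source when reposting.

Note: this article is reposted from my own blog, 傷神的部落格: http://www.shangyang.me/2017/06/29/scrapy-learning-1-tutorial/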

Installing Scrapy

I have two versions of Python installed locally, 2.7 and 3.6. As stated in the preface, I want to set up Scrapy on Python 3.6.

    $ pip install Scrapy

By default this installs Scrapy into the Python 2.7 environment.

    $ pip3 install Scrapy

This installs Scrapy for Python 3.x. During the installation I ran into a problem: HTTPSConnectionPool(host='pypi.python.org', port=443): Read timed out. The fix is to extend the timeout for the installation:

    $ pip3 install -U --timeout 1000 Scrapy
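Because pip and pip3 can point at different interpreters, it may help to confirm which Python can actually import Scrapy. The following is a minimal sketch using only the standard library; the helper name describe_install is made up for illustration and is not part of Scrapy.

```python
import importlib.util
import sys


def describe_install(package="scrapy"):
    """Report whether `package` can be imported by the running interpreter."""
    # find_spec returns None when the top-level package is not installed here.
    spec = importlib.util.find_spec(package)
    version = "%d.%d" % sys.version_info[:2]
    return {"python": version, "package": package, "installed": spec is not None}


print(describe_install())
```

Running the script once with python and once with python3 shows which of the two environments has Scrapy available.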

Scrapy Tutorial

Creating the tutorial project

Run

    $ scrapy startproject tutorial
    New Scrapy project 'tutorial', using template directory '/usr/local/lib/python2.7/site-packages/scrapy/templates/project', created in:
        /Users/mac/workspace/scrapy/tutorial

The template directory shows that Python 2.7 is used by default. To create the tutorial project under Python 3.x instead, invoke Scrapy through python3 -m:

    $ python3 -m scrapy startproject tutorial
    New Scrapy project 'tutorial', using template directory '/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/templates/project', created in:
        /Users/mac/workspace/scrapy/tutorial

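Importing into PyCharm

Open the project directory /Users/mac/workspace/scrapy/tutorial directly. Note that by default PyCharm uses my local Python 2.7 as its interpreter, so it has to be switched to Python 3.6. The steps are:

1. Click PyCharm Community Edition in the top-left corner and open Preferences.
2. Click Project: tutorial, select Project Interpreter, and set the interpreter to Python 3.6.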

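Project structure

The command generates a project skeleton along these lines:

    tutorial/
        scrapy.cfg
        tutorial/
            __init__.py
            items.py
            middlewares.py
            pipelines.py
            settings.py
            spiders/
                __init__.py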

The first Spider

Let's create a new Spider in a file named quotes_spider.py and place it in the tutorial/spiders directory:

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"

        def start_requests(self):
            urls = [
                'http://quotes.toscrape.com/page/1/',
                'http://quotes.toscrape.com/page/2/',
            ]
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            page = response.url.split("/")[-2]
            filename = 'quotes-%s.html' % page
            with open(filename, 'wb') as f:
                f.write(response.body)
            self.log('Saved file %s' % filename)

The new QuotesSpider class inherits from scrapy.Spider. Its attributes and methods mean the following:

name
The identifier of the Spider; it must be unique across the whole project.
start_requests()
Must return an iterable of Requests (a list or a generator) for the Spider to crawl; each Request is built from a URL and a callback function.
parse()
The callback given to each Request, called with the Response downloaded for that Request. It parses the returned content, and URLs found while parsing can be turned into new Requests to continue crawling.

Now for the concrete implementation:

1. start_requests(self) creates two Request objects, one for http://quotes.toscrape.com/page/1/ and one for http://quotes.toscrape.com/page/2/, and returns them one at a time through yield. Note: because of yield, start_requests is a generator function, so the Requests are produced lazily as Scrapy iterates over it.
2. parse(self, response) parses the Response returned for a Request. The logic here is simple: it creates one local file per page (two in total) and writes response.body into it.
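The two ideas used by the spider, a generator that yields values lazily and a file name derived from the page URL, can be tried without Scrapy. This is only an illustrative sketch in plain Python; the function name page_filenames is hypothetical.

```python
def page_filenames(urls):
    """Yield (url, filename) pairs lazily, mirroring how the spider names its files."""
    for url in urls:
        # 'http://quotes.toscrape.com/page/1/'.split("/") ends with
        # [..., 'page', '1', ''], so index -2 is the page number.
        page = url.split("/")[-2]
        yield url, "quotes-%s.html" % page


urls = [
    "http://quotes.toscrape.com/page/1/",
    "http://quotes.toscrape.com/page/2/",
]
pairs = page_filenames(urls)   # nothing has run yet; this is a generator object
first = next(pairs)            # the loop body runs only when a value is requested
print(first)
print(list(pairs))             # the remaining pair
```

The trailing slash in the URL matters: without it, index -2 would pick up 'page' rather than the page number.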

How to run it

The spider is run from the command line; note that it has to be started through the scrapy command:

    $ cd /Users/mac/workspace/scrapy/tutorial
    $ python3 -m scrapy crawl quotes

The output looks roughly like this:

    ...
    2016-12-16 21:24:05 [scrapy.core.engine] INFO: Spider opened
    2016-12-16 21:24:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2016-12-16 21:24:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
    2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
    2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
    2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
    2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-1.html
    2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-2.html
    2016-12-16 21:24:05 [scrapy.core.engine] INFO: Closing spider (finished)
    ...

The crawl has produced two local HTML files, quotes-1.html and quotes-2.html.

How to extract data

Extracting from the command line

Scrapy provides a command-line shell for debugging extraction efficiently. Entering the Scrapy shell lets you quickly try out extraction expressions against the content you want to crawl.

How to enter the Scrapy shell

Let's use the Scrapy shell to extract data from "http://quotes.toscrape.com/page/1/". Run the following command to enter the shell:

    $ scrapy shell "http://quotes.toscrape.com/page/1/"

Output:

    [ ... Scrapy log here ... ]
    2016-09-19 12:09:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
    [s] Available Scrapy objects:
    [s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
    [s]   crawler    <scrapy.crawler.Crawler object at 0x7fa91d888c90>
    [s]   item       {}
    [s]   request    <GET http://quotes.toscrape.com/page/1/>
    [s]   response   <200 http://quotes.toscrape.com/page/1/>
    [s]   settings   <scrapy.settings.Settings object at 0x7fa91d888c10>
    [s]   spider     <DefaultSpider 'default' at 0x7fa91c8af990>
    [s] Useful shortcuts:
    [s]   shelp()           Shell help (print this help)
    [s]   fetch(req_or_url) Fetch request (or URL) and update local objects
    [s]   view(response)    View response in a browser
    >>>

We are now inside the Scrapy shell. The output shows details of the request and its response; the status code 200 on the response means the page was returned successfully.

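Extracting with CSS selectors

Here we mainly follow the CSS selectors standard, https://www.w3.org/TR/selectors/, to extract elements from the page.

1. Use css() to select the elements to extract. The following shows how to select the <title/> element:

    >>> response.css('title')
    [<Selector xpath=u'descendant-or-self::title' data=u'<title>Quotes to Scrape</title>'>]

It returns a list-like SelectorList object that has captured the <title/> of http://quotes.toscrape.com/page/1/; the matched markup is held in the data attribute of the Selector object.

2. Extract the text content of the selected elements. There are generally two ways to do this.

Use extract() or extract_first() to extract the content of the elements. The following shows how to extract the text of the <title/> element returned in step 1:

    >>> response.css('title::text').extract_first()
    'Quotes to Scrape'

extract_first() extracts the first Selector in the returned list. The same result can be obtained like this:

    >>> response.css('title::text')[0].extract()
    'Quotes to Scrape'

However, extract_first() returns None when nothing on the page matches, whereas indexing with [0] raises an IndexError.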
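The difference between the two access styles can be shown with an ordinary list. The sketch below is not Scrapy's implementation; first_or_default is a hypothetical helper that only mimics the behaviour described above.

```python
def first_or_default(values, default=None):
    """Return the first element of `values`, or `default` when it is empty."""
    for value in values:
        return value
    return default


matches = []                       # stands in for a selector that matched nothing
print(first_or_default(matches))   # None instead of an exception

try:
    matches[0]                     # what indexing with [0] does on an empty result
except IndexError as exc:
    print("IndexError:", exc)
```

In a spider this means a missing element shows up as a None value in the item rather than aborting the callback.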

Use the re() method to extract the text content with a regular expression:

    >>> response.css('title::text').re(r'Quotes.*')
    ['Quotes to Scrape']
    >>> response.css('title::text').re(r'Q\w+')
    ['Quotes']
    >>> response.css('title::text').re(r'(\w+) to (\w+)')
    ['Quotes', 'Scrape']

Note that the last regular expression contains two groups, so it returns the two matched groups.
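The same three patterns can be checked with Python's own re module. This is a rough approximation of the results shown above, assuming that groups are returned flattened into one list; regex_extract is a hypothetical helper, not a Scrapy function.

```python
import re


def regex_extract(pattern, text):
    """Return all matches; when the pattern has groups, return the groups flattened."""
    flat = []
    for match in re.findall(pattern, text):
        # re.findall yields tuples when the pattern has more than one group.
        if isinstance(match, tuple):
            flat.extend(match)
        else:
            flat.append(match)
    return flat


title = "Quotes to Scrape"
print(regex_extract(r"Quotes.*", title))
print(regex_extract(r"Q\w+", title))
print(regex_extract(r"(\w+) to (\w+)", title))
```

Testing a pattern this way before using it in a spider avoids repeated requests to the site.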

Using XPath

Besides CSS selectors, elements can also be extracted with XPath, for example:

    >>> response.xpath('//title')
    [<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
    >>> response.xpath('//title/text()').extract_first()
    'Quotes to Scrape'

XPath is more powerful than CSS selectors because, besides navigating the HTML structure, it can also look at the text content. It therefore covers cases that CSS selectors cannot, such as selecting the link element that contains the text "Next Page", which makes building a crawler with XPath simpler.
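Selecting by text can be tried with the standard library alone. The sketch below uses xml.etree.ElementTree, which supports only a small subset of XPath (the [.='text'] predicate needs Python 3.7 or later), so it is an approximation of what Scrapy's full XPath engine offers. The markup is invented for the example.

```python
import xml.etree.ElementTree as ET

# Made-up markup with two links that differ only in their text.
html = (
    "<ul>"
    '<li><a href="/page/1/">Previous Page</a></li>'
    '<li><a href="/page/2/">Next Page</a></li>'
    "</ul>"
)
root = ET.fromstring(html)

# Select <a> elements by their text rather than by class or position.
next_links = root.findall(".//a[.='Next Page']")
hrefs = [a.get("href") for a in next_links]
print(hrefs)
```

ElementTree only matches the complete text here; partial matches need a full XPath 1.0 function such as contains(), which ElementTree does not provide.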

Extracting quotes and authors

Next we show how to extract the content of the http://quotes.toscrape.com homepage. First, look at how the page is structured.

Each block on the page contains one quote from a famous person. How do we extract all of this information?

We start with the first entry, a quote by Albert Einstein, and try to extract the quote text, the author and the associated tags:

    <div class="quote">
        <span class="text">“The world as we have created it is a process of our
        thinking. It cannot be changed without changing our thinking.”</span>
        <span>
            by <small class="author">Albert Einstein</small>
            <a href="/author/Albert-Einstein">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <a class="tag" href="/tag/change/page/1/">change</a>
            <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
            <a class="tag" href="/tag/thinking/page/1/">thinking</a>
            <a class="tag" href="/tag/world/page/1/">world</a>
        </div>
    </div>

Let's extract the information step by step.

First, enter the Scrapy shell:

    $ scrapy shell 'http://quotes.toscrape.com'

Then get the list of <div class="quote" /> elements:

    >>> response.css("div.quote")

This returns a list of Selectors. Since we are only parsing the first quote for now, we take the first element and store it in the variable quote:

    >>> quote = response.css("div.quote")[0]

Next we extract title, author and tags in turn.

Extract title:

    >>> title = quote.css("span.text::text").extract_first()
    >>> title
    '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

Extract author:

    >>> author = quote.css("small.author::text").extract_first()
    >>> author
    'Albert Einstein'

Extract tags; note that tags is a list of strings, so extract() is used instead of extract_first():

    >>> tags = quote.css("div.tags a.tag::text").extract()
    >>> tags
    ['change', 'deep-thoughts', 'thinking', 'world']

That completes the extraction for a single quote. How do we extract the quotes of all the authors on the page?

    >>> for quote in response.css("div.quote"):
    ...     text = quote.css("span.text::text").extract_first()
    ...     author = quote.css("small.author::text").extract_first()
    ...     tags = quote.css("div.tags a.tag::text").extract()
    ...     print(dict(text=text, author=author, tags=tags))
    {'tags': ['change', 'deep-thoughts', 'thinking', 'world'], 'author': 'Albert Einstein', 'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}
    {'tags': ['abilities', 'choices'], 'author': 'J.K. Rowling', 'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'}
    ... a few more of these, omitted for brevity

The loop visits every quote block and puts its information into a Python dictionary.

Extracting with a Python program

This section continues the example from "Extracting quotes and authors" and shows how to perform the same crawl from a Python program.

Extracting data

Change the earlier quotes_spider.py to the following:

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]

        def parse(self, response):
            for quote in response.css('div.quote'):
                yield {
                    'text': quote.css('span.text::text').extract_first(),
                    'author': quote.css('small.author::text').extract_first(),
                    'tags': quote.css('div.tags a.tag::text').extract(),
                }

Run the spider named quotes:

    $ scrapy crawl quotes

The result looks like this:

    2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
    {'tags': ['life', 'love'], 'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”'}
    2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
    {'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}

The quotes spider returns the scraped records one by one.

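Storing data

The simplest way to store the scraped data is to use Feed exports, with the following command.

Using the JSON format

    $ scrapy crawl quotes -o quotes.json

This command generates a file named quotes.json containing all the scraped data. For historical reasons, however, Scrapy appends to an existing file instead of overwriting it, so running the command twice leaves you with a broken JSON file.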
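Why a second run breaks the file can be reproduced with the json module alone. The sketch below only simulates the append behaviour described above with a temporary file and made-up sample data; it does not call Scrapy.

```python
import json
import os
import tempfile

# Made-up sample data standing in for scraped items.
items = [{"author": "Albert Einstein", "tags": ["change", "world"]}]

path = os.path.join(tempfile.mkdtemp(), "quotes.json")
for _ in range(2):                      # simulate running the crawl twice
    with open(path, "a", encoding="utf-8") as f:
        json.dump(items, f)             # each run appends one complete JSON array

with open(path, encoding="utf-8") as f:
    content = f.read()

try:
    json.loads(content)
    valid = True
except json.JSONDecodeError as exc:
    valid = False
    print("not valid JSON:", exc)
```

Two arrays written back to back do not form a single JSON document, so the parser rejects the file even though each half is valid on its own.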

Using the JSON Lines format

    $ scrapy crawl quotes -o quotes.jl

The data is now stored in the JSON Lines format. Note that the only change is the file extension, which is now .jl.

As an aside, JSON Lines is a JSON-based format in which every line is a valid JSON value. It can serve as a friendlier alternative to CSV, for example:

    ["Name", "Session", "Score", "Completed"]
    ["Gilbert", "2013", 24, true]
    ["Alexa", "2013", 29, true]
    ["May", "2012B", 14, false]
    ["Deloise", "2012A", 19, true]

It also supports nested data:

    {"name": "Gilbert", "wins": [["straight", "7♣"], ["one pair", "10♥"]]}
    {"name": "Alexa", "wins": [["two pair", "4♠"], ["two pair", "9♠"]]}
    {"name": "May", "wins": []}
    {"name": "Deloise", "wins": [["three of a kind", "5♣"]]}

JSON Lines is well suited to files holding large amounts of data, because the file can be processed one record per line. Note that Scrapy appends to a JSON Lines file as well, but since each scraped record is written on its own line, appending does not break the file format the way it does with plain JSON.
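A short sketch of that behaviour: records are appended in two separate runs and then read back one line at a time. It uses a temporary file and made-up sample data; read_json_lines is a hypothetical helper name.

```python
import json
import os
import tempfile

# Made-up sample data standing in for scraped items.
items = [
    {"author": "Albert Einstein", "tags": ["change", "world"]},
    {"author": "J.K. Rowling", "tags": ["abilities", "choices"]},
]

path = os.path.join(tempfile.mkdtemp(), "quotes.jl")
for _ in range(2):                            # two runs, both appending
    with open(path, "a", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item) + "\n")  # one JSON value per line


def read_json_lines(file_path):
    """Yield one decoded record per non-empty line."""
    with open(file_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


records = list(read_json_lines(path))
print(len(records))
```

The file stays parseable after the second run, but it now holds every record twice, so repeated runs may still call for deduplication or a fresh output file.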

For a small project, JSON Lines is enough. If you are facing a more complex project with more complex data to scrape, you can use an Item Pipeline; a placeholder pipelines file has already been created for you at tutorial/pipelines.py.
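As a rough idea of what such a pipeline might look like, here is a minimal sketch. An item pipeline is an ordinary class with a process_item(item, spider) method; the class name QuoteCleanupPipeline and the cleanup rule are assumptions made up for this example, and the class would still have to be enabled through the ITEM_PIPELINES setting in settings.py.

```python
class QuoteCleanupPipeline:
    """Hypothetical pipeline that normalises the scraped quote text."""

    def process_item(self, item, spider):
        # Scrapy calls process_item once per scraped item; returning the item
        # passes it on to the next pipeline stage.
        text = item.get("text") or ""
        item["text"] = text.strip("“”").strip()
        return item


# Exercise the class directly, without running a crawl.
pipeline = QuoteCleanupPipeline()
cleaned = pipeline.process_item({"text": "“It is our choices, Harry.”"}, spider=None)
print(cleaned["text"])
```

Keeping cleanup in a pipeline leaves the spider's parse() method focused on locating data on the page.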

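Following the next page (extracting links)

The "How to extract data" section described in detail how to scrape information from a page. To scrape the whole site, we also have to extract and follow its links. We again use http://quotes.toscrape.com as the example.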

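The link to the next page is the following HTML element:

    <ul class="pager">
        <li class="next">
            <a href="/page/2/">Next <span aria-hidden="true">→</span></a>
        </li>
    </ul>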

We can grab it in the shell:

    >>> response.css('li.next a').extract_first()
    '<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'

This gives us the anchor element, but what we want is its href attribute. Scrapy supports a CSS extension for selecting attributes, so we can extract the attribute value directly:

    >>> response.css('li.next a::attr(href)').extract_first()
    '/page/2/'

We now know how to get the relative URL of the next page. How do we change our Python program so that it crawls all pages automatically?

Using scrapy.Request

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = [
            'http://quotes.toscrape.com/page/1/',
        ]

        def parse(self, response):
            for quote in response.css('div.quote'):
                yield {
                    'text': quote.css('span.text::text').extract_first(),
                    'author': quote.css('small.author::text').extract_first(),
                    'tags': quote.css('div.tags a.tag::text').extract(),
                }

            next_page = response.css('li.next a::attr(href)').extract_first()
            if next_page is not None:
                next_page = response.urljoin(next_page)
                yield scrapy.Request(next_page, callback=self.parse)

A brief description of the logic: the for loop finishes scraping the current page, then the code looks for the next page. It first stores the relative path of the next page in next_page, then turns it into an absolute URL with response.urljoin(next_page). Finally it builds a scrapy.Request from that absolute URL and yields it, which adds it to the crawl queue for the next round. When the last page has no next link, next_page is None and the crawl ends. This is how all related pages are crawled dynamically.
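The relative-to-absolute step can be tried on its own with urllib.parse.urljoin from the standard library. This is a sketch of the URL resolution only, with the page URL taken from the example; it is not a claim about how response.urljoin is implemented internally.

```python
from urllib.parse import urljoin

page_url = "http://quotes.toscrape.com/page/1/"
next_page = "/page/2/"                      # the relative href extracted from li.next a

absolute = urljoin(page_url, next_page)
print(absolute)                             # http://quotes.toscrape.com/page/2/

# A href without a leading slash is resolved against the current directory instead.
print(urljoin(page_url, "2/"))              # http://quotes.toscrape.com/page/1/2/
```

The second call shows why joining against the page URL matters: the same href can resolve differently depending on the page it was found on.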

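On this basis you can build very complex crawlers. You can also define different parse callbacks for different kinds of links, so that each type of page is handled separately.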

Using response.follow

Unlike scrapy.Request, which needs an absolute URL built from the relative path, response.follow accepts relative URLs directly, so there is no need to call urljoin. Note that response.follow returns a Request instance, which can be yielded directly. The code above can therefore be simplified to:

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = [
            'http://quotes.toscrape.com/page/1/',
        ]

        def parse(self, response):
            for quote in response.css('div.quote'):
                yield {
                    'text': quote.css('span.text::text').extract_first(),
                    'author': quote.css('span small::text').extract_first(),
                    'tags': quote.css('div.tags a.tag::text').extract(),
                }

            next_page = response.css('li.next a::attr(href)').extract_first()
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)

In addition, when response.follow is given an <a> element it uses the element's href attribute automatically, so the code can be shortened further:

    for a in response.css('li.next a'):
        yield response.follow(a, callback=self.parse)

The selector no longer has to name the href attribute explicitly. Note that response.follow must be given the Selector itself here; the string returned by extract_first() is the element's HTML markup, which would be treated as a URL.

Defining more parsers

    import scrapy


    class AuthorSpider(scrapy.Spider):
        name = 'author'

        start_urls = ['http://quotes.toscrape.com/']

        def parse(self, response):
            # follow links to author pages
            for href in response.css('.author + a::attr(href)'):
                yield response.follow(href, self.parse_author)

            # follow pagination links
            for href in response.css('li.next a::attr(href)'):
                yield response.follow(href, self.parse)

        def parse_author(self, response):
            def extract_with_css(query):
                return response.css(query).extract_first().strip()

            yield {
                'name': extract_with_css('h3.author-title::text'),
                'birthdate': extract_with_css('.author-born-date::text'),
                'bio': extract_with_css('.author-description::text'),
            }
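One detail of the extract_with_css helper is worth a closer look: extract_first() returns None when the query matches nothing, and calling .strip() on None raises an AttributeError. The sketch below shows a defensive variant in plain Python; strip_or_default is a hypothetical name, and whether to fall back to an empty string is a design choice, not something the spider above prescribes.

```python
def strip_or_default(value, default=""):
    """Strip whitespace from `value`, falling back to `default` when it is None."""
    if value is None:
        return default
    return value.strip()


print(repr(strip_or_default("  Albert Einstein \n")))
print(repr(strip_or_default(None)))

try:
    None.strip()            # what extract_first().strip() does when nothing matched
except AttributeError as exc:
    print("AttributeError:", exc)
```

Inside the spider, the helper could wrap the extracted value in such a function so that one author page with a missing field does not abort the callback.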

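This example defines two parse methods, parse() and parse_author(): one controls the overall crawl flow and the other parses the author information. Let's first walk through the execution flow.

1. Enter parse(). It collects from the current page the href attribute of every author link, creates a new Request for each link through response.follow, and has parse_author() called back to parse the downloaded page further; here that means extracting the author's information.
2. Once step 1 for the current page …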
