Learning Scrapy from its documentation

阿新 • Published: 2021-02-05

Introductory example



import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Run it with the command scrapy runspider quotes_spider.py
If you append -o quotes.jl to the command, the scraped data is written to a .jl file in JSON Lines format, with each item on its own line.
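
Because each line of a .jl file is a complete JSON object, the file can be read back one line at a time. A minimal sketch using only the standard library; the function name read_json_lines and the shortened sample rows are assumptions made for illustration, and io.StringIO stands in for the real quotes.jl file:

```python
import io
import json


def read_json_lines(lines):
    """Parse an iterable of JSON Lines text into a list of dicts."""
    items = []
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            items.append(json.loads(line))
    return items


# Stand-in for open('quotes.jl', encoding='utf-8'); the text values are shortened.
sample = io.StringIO(
    '{"author": "Jane Austen", "text": "..."}\n'
    '{"author": "Steve Martin", "text": "..."}\n'
)
for item in read_json_lines(sample):
    print(item['author'])
```

An open file object is also an iterable of lines, so the same function should work unchanged on a real file.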

Output

{'author': 'Jane Austen', 'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”'}
{'author': 'Steve Martin', 'text': '“A day without sunshine is like, you know, night.”'}
{'author': 'Garrison Keillor', 'text': '“Anyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.”'}…

How it runs

When you run the command, Scrapy looks for a spider definition inside the file and runs it through its crawler engine.

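The crawl starts by sending requests to the URLs defined in start_urls and calls the default callback method parse(), passing the response object of each request as its argument. The callback looks for the link to the next page and schedules another request with the same callback, and so on until there is no next page.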
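
The request/callback loop described above can be pictured with a toy model. This is only a sketch of the idea, not how Scrapy's engine is implemented: FAKE_SITE, parse and crawl are hypothetical names, the "site" is an in-memory dict, and a plain string stands in for a follow-up request.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (quotes on the page, link to next page).
FAKE_SITE = {
    '/page/1/': (['quote A', 'quote B'], '/page/2/'),
    '/page/2/': (['quote C'], None),
}


def parse(url, page):
    """Callback: yield scraped items, then the next URL to request, if any."""
    quotes, next_page = page
    for text in quotes:
        yield {'text': text, 'url': url}
    if next_page is not None:
        yield next_page  # a plain string stands in for a new request


def crawl(start_urls):
    """Fetch each queued URL, run the callback, and queue follow-up URLs."""
    queue = deque(start_urls)
    items = []
    while queue:
        url = queue.popleft()
        page = FAKE_SITE[url]  # stands in for the HTTP download
        for result in parse(url, page):
            if isinstance(result, dict):
                items.append(result)  # a scraped item
            else:
                queue.append(result)  # a follow-up request
    return items


print(crawl(['/page/1/']))
```

The loop ends on its own once a page has no next link, which mirrors the "if next_page is not None" check in the spider.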

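An advantage of Scrapy: requests are scheduled and processed asynchronously. This means Scrapy does not need to wait for a request to finish and be processed; it can send another request or do other things in the meantime. It also means that other requests can keep going even if some request fails or an error occurs while handling it.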
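
As an analogy only (Scrapy itself is built on Twisted, not on the code below), the same behaviour can be shown with the standard asyncio module: several simulated requests are in flight at once, and one failure does not stop the others. The fetch and fetch_all helpers, the URLs and the delays are all made up for the illustration.

```python
import asyncio


async def fetch(url, delay, fail=False):
    """Pretend to download a URL; asyncio.sleep stands in for network wait."""
    await asyncio.sleep(delay)
    if fail:
        raise RuntimeError(f'failed: {url}')
    return f'ok: {url}'


async def fetch_all():
    # All three "requests" are in flight at once; return_exceptions=True
    # keeps one failure from cancelling the others.
    return await asyncio.gather(
        fetch('/page/1/', 0.02),
        fetch('/page/2/', 0.01, fail=True),
        fetch('/page/3/', 0.01),
        return_exceptions=True,
    )


results = asyncio.run(fetch_all())
for result in results:
    print(repr(result))
```

The failed request shows up as an exception object in the results, while the other two complete normally.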

Scrapy installation is skipped here (link).

Creating a project

# Create a Scrapy project named tutorial
scrapy startproject tutorial

Directory layout

tutorial/
- scrapy.cfg-------------deploy configuration file
- tutorial/
  - __init__.py
  - items.py-----------item (entity) definitions
  - middlewares.py--------------project middlewares
  - pipelines.py-------------item pipelines
  - settings.py---------project settings
  - spiders/------------directory where the spiders live
    - __init__.py

The first spider

import scrapy


class QuotesSpider(scrapy.Spider):
    # must be unique within a project
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        # start_requests() can be omitted entirely: parse() is the
        # default callback, so the URLs can instead be listed in a
        # class attribute:
        # start_urls = []
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    # the default callback method;
    # here it saves each downloaded page to an HTML file
    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')

To run the spider, go to the project's top-level directory and run:
scrapy crawl quotes

Two new files have now been created: quotes-1.html and quotes-2.html (two HTML files saved without any processing).
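
The file names come from the page number embedded in each URL. The snippet below isolates that one step from parse(); the helper name page_filename is an assumption for illustration, while the expression itself is the one used in the spider.

```python
def page_filename(url):
    """Build the output filename the way parse() does."""
    # 'http://quotes.toscrape.com/page/1/'.split('/') ends with
    # [..., 'page', '1', ''], so index -2 is the page number.
    page = url.split('/')[-2]
    return f'quotes-{page}.html'


print(page_filename('http://quotes.toscrape.com/page/1/'))
print(page_filename('http://quotes.toscrape.com/page/2/'))
```

Note that this relies on the trailing slash: without it, index -2 would pick up the word page instead of the number.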

Crawling all pages

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        # e.g. /page/2/
        next_page = response.css('li.next a::attr(href)').get()
        print('==================')
        print(next_page)
        if next_page is not None:
            # urljoin resolves the link, which may be relative, into an absolute URL:
            # http://quotes.toscrape.com/page/2/
            # http://quotes.toscrape.com/page/3/
            next_page = response.urljoin(next_page)
            print("----------------")
            print(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
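
response.urljoin combines the URL of the current response with a possibly relative link. The standard library's urllib.parse.urljoin resolves links in much the same way, so it can be used to see what to expect without running a crawl. In this sketch the variable base plays the role of response.url:

```python
from urllib.parse import urljoin

base = 'http://quotes.toscrape.com/page/1/'

# An absolute path replaces the path of the base URL.
print(urljoin(base, '/page/2/'))
# A full URL is returned unchanged.
print(urljoin(base, 'http://quotes.toscrape.com/page/3/'))
```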

A shortcut for creating requests

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            # shortcut for creating the request
            yield response.follow(next_page, callback=self.parse)

Unlike scrapy.Request, response.follow supports relative URLs directly, so there is no need to call urljoin.

You can also pass a selector to response.follow instead of a string; the selector should extract the necessary attribute:

for href in response.css('ul.pager a::attr(href)'):
    yield response.follow(href, callback=self.parse)

For <a> elements there is a further shortcut: response.follow uses their href attribute automatically, so the code can be shortened further:

for a in response.css('ul.pager a'):
    yield response.follow(a, callback=self.parse)

To create multiple requests from an iterable, you can use response.follow_all instead:

anchors = response.css('ul.pager a')
yield from response.follow_all(anchors, callback=self.parse)

Or, shortening it further:

yield from response.follow_all(css='ul.pager a', callback=self.parse)

Using response.follow_all

follow_all is only available from Scrapy 2.0 onward; earlier versions do not have it, which is why I got an error when I tested it.

import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'

    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        author_page_links = response.css('.author + a')
        yield from response.follow_all(author_page_links, self.parse_author)

        pagination_links = response.css('li.next a')
        yield from response.follow_all(pagination_links, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).get(default='').strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }

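Another interesting thing this spider demonstrates is that, even if there are many quotes from the same author, we don't need to worry about visiting the same author page multiple times. By default, Scrapy filters out duplicate requests to URLs it has already visited, which avoids hitting the server too often because of a programming mistake. This can be configured with the DUPEFILTER_CLASS setting.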
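
The idea behind duplicate filtering can be sketched in a few lines. This is a simplified stand-in, not Scrapy's own filter: the class name SeenFilter is hypothetical, it fingerprints the URL string only (the real filter takes more of the request into account), and the author links are illustrative.

```python
import hashlib


class SeenFilter:
    """Simplified stand-in for a duplicate filter, keyed on the URL only."""

    def __init__(self):
        self.fingerprints = set()

    def request_seen(self, url):
        """Return True if this URL was already requested; record it otherwise."""
        fingerprint = hashlib.sha1(url.encode('utf-8')).hexdigest()
        if fingerprint in self.fingerprints:
            return True
        self.fingerprints.add(fingerprint)
        return False


seen = SeenFilter()
# Illustrative links: the same author appears twice on a page.
author_links = [
    '/author/Albert-Einstein',
    '/author/Jane-Austen',
    '/author/Albert-Einstein',
]
to_fetch = [url for url in author_links if not seen.request_seen(url)]
print(to_fetch)
```

Only the first occurrence of each link survives, so each author page would be requested once.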

Using spider arguments to build the URL

Pass arguments at run time with -a:

scrapy crawl quotes -O quotes-humor.json -a tag=humor

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

If you pass tag=humor to this spider, you will notice that it only visits URLs from the humor tag, such as http://quotes.toscrape.com/tag/humor
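
Arguments given with -a are passed to the spider's __init__ and, by default, become attributes of the spider, which is why getattr(self, 'tag', None) works. The sketch below imitates that with a plain class so it runs without Scrapy; ArgsDemo and start_url are hypothetical names, and only the URL-building logic is taken from the spider above.

```python
class ArgsDemo:
    """Hypothetical stand-in for a spider; keyword arguments become attributes."""

    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

    def start_url(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)  # None when no tag argument was given
        if tag is not None:
            url = url + 'tag/' + tag
        return url


print(ArgsDemo().start_url())
print(ArgsDemo(tag='humor').start_url())
```

Values passed on the command line arrive as strings, so anything numeric would need converting before use.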

Basic concepts

Spider classes

class scrapy.spiders.Spider

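This is the simplest spider, and the one every other spider must inherit from. It does not provide any special functionality. It just provides a default start_requests() implementation, which sends requests for the URLs in the start_urls list attribute and calls the spider's parse method for each resulting response.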

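name
The spider's name; unique and required.
allowed_domains
The domains the spider is allowed to crawl. If the target URL is https://www.example.com/1.html, add 'example.com' to the list.
If the attribute is not set, the crawl is not restricted.
start_urls
The list of URLs to start crawling from.
custom_settings
A dictionary of settings that override the project-wide configuration when this spider runs. It must be defined as a class attribute, because the settings are updated before the spider is instantiated. In other words, a value assigned in __init__ would come too late to take effect.
logger
A Python logger created with the spider's name. You can use it to send log messages.
start_requests()
This is the method to override if you want to change the requests used to start crawling a domain. For example, if you need to start by logging in with a POST request, you could do: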

import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        return [scrapy.FormRequest("http://www.example.com/login",
                                   formdata={'user': 'john', 'pass': 'secret'},
                                   callback=self.logged_in)]

    def logged_in(self, response):
        # here you would extract links to follow and return Requests for
        # each of them, with another callback
        pass

closed(reason)
Called when the spider closes. This method provides a shortcut to signals.connect() for the spider_closed signal.
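
To make the allowed_domains rule above concrete, here is a rough approximation of the check it implies: a URL passes if its host is one of the listed domains or a subdomain of one. This is a sketch for intuition, not Scrapy's code, and the function name is_allowed is an assumption.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    if not allowed_domains:
        return True  # no restriction configured
    host = urlparse(url).hostname or ''
    return any(
        host == domain or host.endswith('.' + domain)
        for domain in allowed_domains
    )


print(is_allowed('https://www.example.com/1.html', ['example.com']))
print(is_allowed('https://other.org/', ['example.com']))
```

Listing the bare domain (example.com rather than www.example.com) is what lets subdomains through in this sketch.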

There are many more attributes; I'll note them down as I use them.

All content is reproduced from:
https://www.osgeo.cn/scrapy/intro/overview.html#walk-through-of-an-example-spider
