Python Crawling with Scrapy: Basic Usage of rules

阿新 • Published: 2017-12-04


Link Extractors

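Link Extractors are objects whose only purpose is to extract, from web pages (scrapy.http.Response objects), the links that will eventually be followed.

Scrapy ships with two ready-to-use Link Extractors by default (as of the 0.24 documentation cited at the end), but you can create your own custom Link Extractor by implementing a simple interface.

The only public method every Link Extractor has is extract_links, which receives a Response object and returns a list of scrapy.link.Link objects. A Link Extractor is meant to be instantiated once, and its extract_links method is then called many times, once per response, to extract the links to follow.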
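To make the "one public method, instantiate once, call many times" idea concrete, here is a minimal sketch that uses only the standard library. It is a toy model, not Scrapy's implementation: ToyLinkExtractor, Link and _AnchorCollector are hypothetical names, the HTML snippets are made up, and unlike the real extract_links (which takes a Response) this version takes the HTML text and the page URL directly.

```python
from html.parser import HTMLParser
from typing import NamedTuple
from urllib.parse import urljoin


class Link(NamedTuple):
    """Toy stand-in for scrapy.link.Link: an absolute URL plus its anchor text."""
    url: str
    text: str


class _AnchorCollector(HTMLParser):
    """Collects (href, text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.found = []     # finished (href, text) pairs
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.found.append((self._href, "".join(self._text).strip()))
            self._href = None


class ToyLinkExtractor:
    """Hypothetical extractor exposing a single public method, extract_links."""

    def extract_links(self, html, base_url):
        collector = _AnchorCollector()
        collector.feed(html)
        collector.close()
        # Resolve relative hrefs against the page URL.
        return [Link(urljoin(base_url, href), text) for href, text in collector.found]


# Instantiate once, then reuse the same object for every page.
extractor = ToyLinkExtractor()
page_one = '<a href="/subject/1/">Book one</a> <a href="/tag/">Tags</a>'
page_two = '<p><a href="subject/2/">Book two</a></p>'
links_one = extractor.extract_links(page_one, "https://book.douban.com/")
links_two = extractor.extract_links(page_two, "https://book.douban.com/")
for link in links_one + links_two:
    print(link.url, link.text)
```

The extractor keeps no per-page state, which is why a single instance can safely be shared across all responses.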

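Link Extractors are used in the CrawlSpider class (available in Scrapy) through a set of rules, but you can also use them in your own spiders, even ones that do not subclass CrawlSpider, because their purpose is very simple: to extract links.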

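All of the above comes from the official documentation, so a quick read is enough. In practice, a Rule is the way to write a spider that crawls a whole site. The spider no longer inherits directly from scrapy.Spider but from CrawlSpider; a look at the source shows that CrawlSpider is itself a subclass of scrapy.Spider.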

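Parameters:

LinkExtractor: as the name suggests, a filter for links. It first selects the links we want to crawl.

allow: filters links with regular expressions (re). In effect, the spider crawls start_urls plus the pages behind the links that match here.

deny: the exact opposite of allow. It defines the links we do not want to crawl.

follow: whether links should also be extracted from the pages reached through this rule. If callback is None, follow defaults to True; otherwise it defaults to False. With False, links on pages reached through this rule are not extracted further; with True, the spider keeps applying the rules to every page it reaches.

restrict_xpaths: an XPath expression that limits the regions of the page links are taken from; it works together with allow to filter links. There is also a similar restrict_css.

callback: the method to run once we have a crawlable URL; it is called with the response (that is, the page content) of each extracted link. Avoid naming it parse, because CrawlSpider uses the parse method for its own logic.

Note: whether or not a rule has a callback, its responses are handled by the same _parse_response function, which simply checks whether follow and callback are set.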
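The interaction between callback and follow can be sketched as follows. This is a reduced model for illustration, not the Scrapy source: SketchRule and handle_response are hypothetical names, and the handler only reports which steps would run instead of issuing requests.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchRule:
    """Hypothetical, reduced model of a Rule: only callback and follow."""
    callback: Optional[str] = None
    follow: Optional[bool] = None

    def __post_init__(self):
        # When follow is not given, it depends on whether a callback is set.
        if self.follow is None:
            self.follow = self.callback is None


def handle_response(rule):
    """Return the steps a shared handler would take for a matched response."""
    steps = []
    if rule.callback:
        steps.append("call " + rule.callback)
    if rule.follow:
        steps.append("extract and follow links")
    return steps


print(handle_response(SketchRule(callback="parse_items")))
print(handle_response(SketchRule()))
print(handle_response(SketchRule(callback="parse_items", follow=True)))
```

A rule with only a callback parses the page and stops there; a rule with neither argument only keeps crawling; setting both does both.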

from scrapy.spiders.crawl import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor

Example:

from whole_website.items import DoubanSpider_Book  # Item class defined in this project
from scrapy.spiders.crawl import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor


class DoubanSpider(CrawlSpider):
    name = "douban"
    allowed_domains = ["book.douban.com"]
    start_urls = ["https://book.douban.com/"]

    rules = [
        # Send every link matching subject/<digits> to parse_items.
        Rule(LinkExtractor(allow=r"subject/\d+"), callback="parse_items"),
    ]

    def parse_items(self, response):
        items = DoubanSpider_Book()
        # Book title and the link texts found in the info block.
        items["name"] = response.xpath('//*[@id="wrapper"]/h1/span/text()').extract_first()
        items["author"] = response.xpath('//*[@id="info"]//a/text()').extract()
        data = {
            "book_name": items["name"],
            "book_author": items["author"],
        }
        # Printed for demonstration; yield items instead to pass it to the item pipelines.
        print(data)
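To see which links a pattern such as subject/\d+ lets through, the filtering can be approximated with the re module alone. This is a simplified sketch under stated assumptions: is_allowed is a hypothetical helper, the deny pattern and the candidate URLs are made up for illustration, and the patterns are applied with a search anywhere in the URL, which is how Scrapy is understood to apply them.

```python
import re

ALLOW = re.compile(r"subject/\d+")  # pattern from the rule in the example
DENY = re.compile(r"/reviews")      # hypothetical deny pattern, for illustration only


def is_allowed(url, allow=ALLOW, deny=None):
    """Keep a URL if it matches allow and does not match deny."""
    if allow is not None and not allow.search(url):
        return False
    if deny is not None and deny.search(url):
        return False
    return True


# Made-up URLs used only to exercise the filter.
candidates = [
    "https://book.douban.com/subject/1234567/",
    "https://book.douban.com/subject/1234567/reviews",
    "https://book.douban.com/tag/",
    "https://book.douban.com/subject/abc/",
]
for url in candidates:
    print(url, is_allowed(url), is_allowed(url, deny=DENY))
```

Because the pattern is searched rather than matched against the whole URL, subject/\d+ also accepts longer URLs that merely contain it, which is where a deny pattern becomes useful.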

Reference: http://scrapy-chs.readthedocs.io/zh_CN/0.24/topics/link-extractors.html
