
Notes - scrapy-extensions



1. extensions

1.1. Overview

The extensions framework provides a mechanism for inserting your own custom functionality into Scrapy.

Extensions are just regular classes that are instantiated at Scrapy startup, when extensions are initialized.

To register an extension in Scrapy, add it to the EXTENSIONS setting. Each entry in that dict represents one extension class, keyed by the class's full Python path:

EXTENSIONS = {
    'scrapy.extensions.corestats.CoreStats': 500,
    'scrapy.extensions.telnet.TelnetConsole': 500,
}

The value of each entry sets the extension's load order; extensions usually don't depend on each other, so in most cases it doesn't matter.
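When the order does matter, entries are sorted by value ascending before loading, the same as middleware orders, so lower values are instantiated first. A minimal sketch with hypothetical paths:

# hypothetical extension paths, for illustration only;
# lower order values are instantiated first
EXTENSIONS = {
    'myproject.extensions.MetricsExt': 100,  # loaded before AlertsExt
    'myproject.extensions.AlertsExt': 500,
}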

To disable an extension, including one enabled by default via EXTENSIONS_BASE, set its entry to None in EXTENSIONS:

EXTENSIONS = {
    'scrapy.extensions.corestats.CoreStats': None,
}

1.2. Custom extension classes

First, it helps to know where Scrapy picks up these custom extension classes.

The chain starts in crawler.py with self.extensions = ExtensionManager.from_crawler(self).
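For reference, the ExtensionManager being called there (scrapy/extension.py) is only a thin subclass of MiddlewareManager; in Scrapy 1.x it looks roughly like this:

from scrapy.middleware import MiddlewareManager
from scrapy.utils.conf import build_component_list

class ExtensionManager(MiddlewareManager):

    component_name = 'extension'

    @classmethod
    def _get_mwlist_from_settings(cls, settings):
        # merges EXTENSIONS with EXTENSIONS_BASE and sorts by order value
        return build_component_list(settings.getwithbase('EXTENSIONS'))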

Following from_crawler down, we land in the inherited MiddlewareManager.from_settings:

@classmethod
def from_settings(cls, settings, crawler=None):
    mwlist = cls._get_mwlist_from_settings(settings)
    middlewares = []
    enabled = []
    for clspath in mwlist:
        try:
            mwcls = load_object(clspath)
            if crawler and hasattr(mwcls, 'from_crawler'):
                mw = mwcls.from_crawler(crawler)
            elif hasattr(mwcls, 'from_settings'):
                mw = mwcls.from_settings(settings)
            else:
                mw = mwcls()
            middlewares.append(mw)
            enabled.append(clspath)
        # (the except NotConfigured clause that disables an
        #  extension is omitted in this excerpt)

The core of it is the line mw = mwcls.from_crawler(crawler). The official documentation puts it this way:

Each extension is a Python class. The main entry point for a Scrapy extension (this also includes middlewares and pipelines) is the from_crawler class method which receives a Crawler instance. Through the Crawler object you can access settings, signals, stats, and also control the crawling behaviour.

Typically, extensions connect to signals and perform tasks triggered by them.

Finally, if the from_crawler method raises the NotConfigured exception, the extension will be disabled. Otherwise, the extension will be enabled.

In other words, an extension class is expected to provide a from_crawler class method; this is where Scrapy instantiates it (falling back to from_settings or the plain constructor, as the code above shows).

1.2.1. A worked example

Below is an example extension class (taken from the Scrapy documentation):

import logging

from scrapy import signals
from scrapy.exceptions import NotConfigured

logger = logging.getLogger(__name__)

class SpiderOpenCloseLogging(object):

    def __init__(self, item_count):
        self.item_count = item_count
        self.items_scraped = 0

    @classmethod
    def from_crawler(cls, crawler):
        # first check if the extension should be enabled and raise
        # NotConfigured otherwise
        if not crawler.settings.getbool('MYEXT_ENABLED'):
            raise NotConfigured

        # get the number of items from settings
        item_count = crawler.settings.getint('MYEXT_ITEMCOUNT', 1000)

        # instantiate the extension object
        ext = cls(item_count)

        # connect the extension object to signals
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)

        # return the extension object
        return ext

    def spider_opened(self, spider):
        logger.info("opened spider %s", spider.name)

    def spider_closed(self, spider):
        logger.info("closed spider %s", spider.name)

    def item_scraped(self, item, spider):
        self.items_scraped += 1
        if self.items_scraped % self.item_count == 0:
            logger.info("scraped %d items", self.items_scraped)

Let's walk through what it does. from_crawler first checks whether the extension is enabled (raising NotConfigured if MYEXT_ENABLED is not set) and instantiates the class. Then these three lines determine when the extension's methods are invoked:

crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)

The connected methods themselves define what happens when each signal fires.
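To actually run this example, you still have to register and configure it in settings.py. A minimal sketch, where 'myproject.extensions' is a hypothetical module path:

# settings.py -- 'myproject.extensions' is a hypothetical path,
# adjust it to wherever the class actually lives
EXTENSIONS = {
    'myproject.extensions.SpiderOpenCloseLogging': 500,
}

MYEXT_ENABLED = True     # otherwise from_crawler raises NotConfigured
MYEXT_ITEMCOUNT = 100    # log every 100 items instead of the default 1000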

For more on signals, see the scrapy signals documentation.
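Beyond the built-in signals, an extension can define and fire its own through the same SignalManager API. A minimal sketch, where my_signal and SignalDemo are hypothetical names:

# my_signal and SignalDemo are hypothetical, for illustration only
my_signal = object()  # any unique object works as a signal

class SignalDemo(object):

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        ext.crawler = crawler
        # handlers for custom signals are connected like built-in ones
        crawler.signals.connect(ext.on_my_signal, signal=my_signal)
        return ext

    def fire(self):
        # send_catch_log dispatches the signal and logs (instead of
        # propagating) any exception raised inside a handler
        self.crawler.signals.send_catch_log(signal=my_signal, payload=42)

    def on_my_signal(self, payload):
        print("my_signal fired with payload", payload)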
