
5 Ways to Run Scrapy Spiders from a Script

I. Running a spider from the command line

1. Running a spider (2 ways)

Run a spider inside a project:
$ scrapy crawl spidername

Run a spider without creating a project:
$ scrapy runspider spidername.py
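
For scrapy runspider the spider has to live in a single self-contained file. A minimal sketch of such a file (the file name baidu_standalone.py and the URL are only illustrative):

# baidu_standalone.py -- run with: scrapy runspider baidu_standalone.py
from scrapy import Spider


class BaiduSpider(Spider):
    name = 'baidu'
    start_urls = ['http://baidu.com/']

    def parse(self, response):
        # just log that the page was fetched
        self.log("run baidu")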

II. Running a spider from a file


1. Running a spider with cmdline

# -*- coding: utf-8 -*-

from scrapy import cmdline, Spider


class BaiduSpider(Spider):
    name = 'baidu'

    start_urls = ['http://baidu.com/']

    def parse(self, response):
        self.log("run baidu")


if __name__ == '__main__':
    cmdline.execute("scrapy crawl baidu".split())


2. Running a spider with CrawlerProcess

# -*- coding: utf-8 -*-

from scrapy import Spider
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

class BaiduSpider(Spider):
    name = 'baidu'

    start_urls = ['http://baidu.com/']

    def parse(self, response):
        self.log("run baidu")


if __name__ == '__main__':
    # get_project_settings() loads the configuration from the project's settings.py
    process = CrawlerProcess(get_project_settings())
    process.crawl(BaiduSpider)
    process.start()
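
CrawlerProcess also accepts a plain settings dict instead of get_project_settings(), which is handy when there is no project at all. A minimal sketch (the setting values are only illustrative, and BaiduSpider is the class defined above):

from scrapy.crawler import CrawlerProcess

# settings passed as a dict; the values are illustrative
process = CrawlerProcess(settings={
    'LOG_LEVEL': 'INFO',
    'DOWNLOAD_DELAY': 1,
})
process.crawl(BaiduSpider)
process.start()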

3. Running a spider with CrawlerRunner

# -*- coding: utf-8 -*-

from scrapy import Spider
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import reactor


class BaiduSpider(Spider):
    name = 'baidu'

    start_urls = ['http://baidu.com/']

    def parse(self, response):
        self.log("run baidu")


if __name__ == '__main__':
    # without this, running the script directly prints no logs to the console
    configure_logging(
        {
            'LOG_FORMAT': '%(message)s'
        }
    )

    runner = CrawlerRunner()

    d = runner.crawl(BaiduSpider)
    d.addBoth(lambda _: reactor.stop())
    reactor.run()
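
Note that CrawlerRunner() created without arguments ignores settings.py. Like CrawlerProcess, it accepts a settings object, so the project settings can be applied here as well; a minimal sketch (assumes the script lives inside the Scrapy project and reuses the BaiduSpider defined above):

from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor

configure_logging()
# pass the project settings so settings.py is honoured
runner = CrawlerRunner(get_project_settings())
d = runner.crawl(BaiduSpider)
d.addBoth(lambda _: reactor.stop())
reactor.run()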

III. Running multiple spiders from a file


Create a new spider, SinaSpider, in the project:

# -*- coding: utf-8 -*-

from scrapy import Spider


class SinaSpider(Spider):
    name = 'sina'

    start_urls = ['https://www.sina.com.cn/']

    def parse(self, response):
        self.log("run sina")


1. cmdline cannot run multiple spiders
If the two statements are placed one after the other, the process exits as soon as the first one finishes, so the second is never executed.

# -*- coding: utf-8 -*-

from scrapy import cmdline

cmdline.execute("scrapy crawl baidu".split())
cmdline.execute("scrapy crawl sina".split())


A script that uses cmdline to run multiple spiders. Each crawl is launched in its own child process, so the sys.exit() inside cmdline.execute() only ends that child:

from multiprocessing import Process
from scrapy import cmdline
import time
import logging

# configuration: spider name and interval (seconds) between runs
confs = [
    {
        "spider_name": "unit42",
        "frequency": 2,
    },
    {
        "spider_name": "cybereason",
        "frequency": 2,
    },
    {
        "spider_name": "Securelist",
        "frequency": 2,
    },
    {
        "spider_name": "trendmicro",
        "frequency": 2,
    },
    {
        "spider_name": "yoroi",
        "frequency": 2,
    },
    {
        "spider_name": "weibi",
        "frequency": 2,
    },
]


def start_spider(spider_name, frequency):
    args = ["scrapy", "crawl", spider_name]
    while True:
        start = time.time()
        p = Process(target=cmdline.execute, args=(args,))
        p.start()
        p.join()
        logging.debug("### use time: %s" % (time.time() - start))
        time.sleep(frequency)


if __name__ == '__main__':
    for conf in confs:
        process = Process(target=start_spider,
                          args=(conf["spider_name"], conf["frequency"]))  # start_spider loops forever, re-running the spider at each interval
        process.start()
        time.sleep(10)


The following two approaches are more elegant replacements.

2. Running multiple spiders with CrawlerProcess
Note: the spider files in the project are:
scrapy_demo/spiders/baidu.py
scrapy_demo/spiders/sina.py

# -*- coding: utf-8 -*-

from scrapy.crawler import CrawlerProcess

from scrapy_demo.spiders.baidu import BaiduSpider
from scrapy_demo.spiders.sina import SinaSpider

process = CrawlerProcess()
process.crawl(BaiduSpider)
process.crawl(SinaSpider)
process.start()

Running it this way, the log shows that the middlewares start only once and the requests go out almost simultaneously, so the two spiders do not run independently and may interfere with each other.
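
One way to keep the runs a little more independent is to give each spider its own custom_settings (a standard Spider class attribute), so each crawler gets its own settings copy. A sketch with illustrative values:

from scrapy import Spider


class SinaSpider(Spider):
    name = 'sina'
    start_urls = ['https://www.sina.com.cn/']
    # per-spider overrides; the values are illustrative
    custom_settings = {
        'DOWNLOAD_DELAY': 1,
        'CONCURRENT_REQUESTS': 8,
    }

    def parse(self, response):
        self.log("run sina")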

3. Running multiple spiders with CrawlerRunner

# -*- coding: utf-8 -*-

from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import reactor

from scrapy_demo.spiders.baidu import BaiduSpider
from scrapy_demo.spiders.sina import SinaSpider


configure_logging()
runner = CrawlerRunner()
runner.crawl(BaiduSpider)
runner.crawl(SinaSpider)
d = runner.join()
d.addBoth(lambda _: reactor.stop())

reactor.run()

This approach also loads the middlewares only once; the crawls interfere with each other less, and the official documentation also recommends CrawlerRunner for running multiple spiders.
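
If the spiders should run strictly one after another rather than concurrently, the crawl deferreds can be chained, following the pattern shown in the official documentation:

from twisted.internet import defer, reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

from scrapy_demo.spiders.baidu import BaiduSpider
from scrapy_demo.spiders.sina import SinaSpider

configure_logging()
runner = CrawlerRunner()


@defer.inlineCallbacks
def crawl():
    # each yield waits for the previous crawl to finish before starting the next
    yield runner.crawl(BaiduSpider)
    yield runner.crawl(SinaSpider)
    reactor.stop()


crawl()
reactor.run()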

Summary


Method                         Reads settings.py    Number of spiders
$ scrapy crawl baidu           yes                  single
$ scrapy runspider baidu.py    yes                  single
cmdline.execute                yes                  single (recommended)
CrawlerProcess                 no                   single or multiple
CrawlerRunner                  no                   single or multiple (recommended)

cmdline.execute is the simplest way to run a single spider file: configure once, run as many times as you like.