Scrapy: 5 Ways to Run a Spider from a Script
1. Running a spider from the command line
1.1 Running a spider (2 ways)

Run a spider inside a project:

$ scrapy crawl spidername

Run a spider without creating a project:

$ scrapy runspider spidername.py
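For runspider, the .py file must contain the spider class itself, since there is no project to import from. A minimal sketch of such a file (the filename myspider.py and its contents are illustrative, not from the original project):

# myspider.py -- a standalone spider file for `scrapy runspider`
from scrapy import Spider


class MySpider(Spider):
    name = 'myspider'
    start_urls = ['http://example.com/']

    def parse(self, response):
        self.log("got %s" % response.url)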
2. Running a spider from a Python file
2.1 Running a spider with cmdline
# -*- coding: utf-8 -*-
from scrapy import cmdline, Spider


class BaiduSpider(Spider):
    name = 'baidu'
    start_urls = ['http://baidu.com/']

    def parse(self, response):
        self.log("run baidu")


if __name__ == '__main__':
    cmdline.execute("scrapy crawl baidu".split())
2.2 Running a spider with CrawlerProcess
# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class BaiduSpider(Spider):
    name = 'baidu'
    start_urls = ['http://baidu.com/']

    def parse(self, response):
        self.log("run baidu")


if __name__ == '__main__':
    # get_project_settings() loads the project's settings.py
    process = CrawlerProcess(get_project_settings())
    process.crawl(BaiduSpider)
    process.start()
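Without get_project_settings(), CrawlerProcess falls back to Scrapy's built-in defaults. A minimal sketch of passing settings explicitly as a dict instead (the setting values here are illustrative assumptions):

# A sketch: run the same BaiduSpider without touching settings.py by
# passing settings as a plain dict (values below are example assumptions)
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    'LOG_LEVEL': 'INFO',
    'DOWNLOAD_DELAY': 0.5,
})
process.crawl(BaiduSpider)
process.start()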
2.3 Running a spider with CrawlerRunner
# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import reactor


class BaiduSpider(Spider):
    name = 'baidu'
    start_urls = ['http://baidu.com/']

    def parse(self, response):
        self.log("run baidu")


if __name__ == '__main__':
    # without configure_logging() nothing shows up on the console
    configure_logging({'LOG_FORMAT': '%(message)s'})
    runner = CrawlerRunner()
    d = runner.crawl(BaiduSpider)
    # stop the reactor once the crawl finishes
    d.addBoth(lambda _: reactor.stop())
    reactor.run()
3. Running multiple spiders from a Python file
Create a second spider, SinaSpider, in the project:
# -*- coding: utf-8 -*-
from scrapy import Spider


class SinaSpider(Spider):
    name = 'sina'
    start_urls = ['https://www.sina.com.cn/']

    def parse(self, response):
        self.log("run sina")
3.1 cmdline cannot run multiple spiders
If the two calls are placed back to back, the second is never reached: cmdline.execute() ends by calling sys.exit(), so the process exits as soon as the first crawl finishes:
# -*- coding: utf-8 -*-
from scrapy import cmdline

cmdline.execute("scrapy crawl baidu".split())
# never reached: execute() exits the process after the first crawl
cmdline.execute("scrapy crawl sina".split())
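If you just need several spiders to run one after another, a simple workaround is to launch each `scrapy crawl` as a child process so the parent script survives each exit. A minimal sketch using the standard-library subprocess module (assumes it is run from the project root, next to scrapy.cfg):

# Each crawl runs in its own child process; the parent script continues.
import subprocess

for name in ['baidu', 'sina']:
    subprocess.run(['scrapy', 'crawl', name], check=True)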
A script that uses cmdline to run multiple spiders by giving each one its own process:
from multiprocessing import Process
from scrapy import cmdline
import time
import logging

# Configuration: spider name and how often to re-run it (seconds)
confs = [
    {"spider_name": "unit42", "frequency": 2},
    {"spider_name": "cybereason", "frequency": 2},
    {"spider_name": "Securelist", "frequency": 2},
    {"spider_name": "trendmicro", "frequency": 2},
    {"spider_name": "yoroi", "frequency": 2},
    {"spider_name": "weibi", "frequency": 2},
]


def start_spider(spider_name, frequency):
    args = ["scrapy", "crawl", spider_name]
    # loops forever by design: the spider is re-run every `frequency` seconds,
    # each time in a fresh process, since cmdline.execute() exits its process
    while True:
        start = time.time()
        p = Process(target=cmdline.execute, args=(args,))
        p.start()
        p.join()
        logging.debug("### use time: %s" % (time.time() - start))
        time.sleep(frequency)


if __name__ == '__main__':
    for conf in confs:
        process = Process(target=start_spider,
                          args=(conf["spider_name"], conf["frequency"]))
        process.start()
        time.sleep(10)
That said, the two approaches below are more elegant replacements.
3.2 Running multiple spiders with CrawlerProcess
Note: the spider files in the project are:
scrapy_demo/spiders/baidu.py
scrapy_demo/spiders/sina.py
# -*- coding: utf-8 -*-
from scrapy.crawler import CrawlerProcess
from scrapy_demo.spiders.baidu import BaiduSpider
from scrapy_demo.spiders.sina import SinaSpider

process = CrawlerProcess()
process.crawl(BaiduSpider)
process.crawl(SinaSpider)
process.start()
Running it this way, the log shows the middlewares being initialized only once, and the requests go out almost simultaneously: the two spiders do not run independently and may interfere with each other.
3.3 Running multiple spiders with CrawlerRunner
# -*- coding: utf-8 -*-
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import reactor
from scrapy_demo.spiders.baidu import BaiduSpider
from scrapy_demo.spiders.sina import SinaSpider

configure_logging()
runner = CrawlerRunner()
runner.crawl(BaiduSpider)
runner.crawl(SinaSpider)
d = runner.join()  # fires when both crawls have finished
d.addBoth(lambda _: reactor.stop())
reactor.run()
This approach also loads the middlewares only once. Note that, as written, both crawls are still scheduled on the same reactor and run concurrently; to run them strictly one after another and further reduce interference, chain the deferreds as in the sketch below. CrawlerRunner is also the method the official documentation recommends for running multiple spiders.
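The sequential variant, following the pattern in the official Scrapy documentation, yields each crawl's deferred in turn so SinaSpider starts only after BaiduSpider has finished:

# -*- coding: utf-8 -*-
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy_demo.spiders.baidu import BaiduSpider
from scrapy_demo.spiders.sina import SinaSpider

configure_logging()
runner = CrawlerRunner()


@defer.inlineCallbacks
def crawl():
    yield runner.crawl(BaiduSpider)   # sina starts only after baidu is done
    yield runner.crawl(SinaSpider)
    reactor.stop()


crawl()
reactor.run()  # blocks here until the last crawl finishes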
Summary
Method                         Reads settings.py   Spiders per run
$ scrapy crawl baidu           yes                 one
$ scrapy runspider baidu.py    yes                 one
cmdline.execute                yes                 one (recommended)
CrawlerProcess                 no *                one or more
CrawlerRunner                  no *                one or more (recommended)

* unless you pass get_project_settings() (or a settings dict) to the constructor, as in the CrawlerProcess example above.
cmdline.execute is the simplest way to run a single spider from a file: configure it once, run it as many times as you like.