Scrapy from Getting Started to Giving Up, Part 4: Using Pipelines

阿新 • Published: 2021-01-01

Tags: web crawler, database, python, mongodb

Using Scrapy Pipelines


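Learning objective:

Master the use of Scrapy pipelines (pipelines.py)

The earlier section on getting started with Scrapy covered the basics of pipelines. This section looks at them in more depth.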

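1. Commonly used pipeline methods:

process_item(self, item, spider):
- The one method every pipeline class must define
- Processes the item data
- Must return the item (or raise DropItem to discard it), so that the next pipeline receives it
open_spider(self, spider): runs exactly once, when the spider is opened
close_spider(self, spider): runs exactly once, when the spider is closed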
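
The call order of these three methods can be shown without running a crawl. The sketch below is only an illustration: `CountingPipeline` and the `run_pipeline` driver are hypothetical names, and the driver merely imitates the order in which Scrapy invokes the hooks.

```python
class CountingPipeline:
    """Stand-in pipeline that records the order in which its hooks run."""

    def open_spider(self, spider):
        # Runs once, before any item arrives.
        self.calls = ["open_spider"]

    def process_item(self, item, spider):
        # Runs once per item; return the item so it can be passed on.
        self.calls.append("process_item")
        return item

    def close_spider(self, spider):
        # Runs once, after the last item.
        self.calls.append("close_spider")


def run_pipeline(pipeline, items, spider=None):
    """Hypothetical driver imitating the order in which Scrapy calls the hooks."""
    pipeline.open_spider(spider)
    results = [pipeline.process_item(item, spider) for item in items]
    pipeline.close_spider(spider)
    return results


pipeline = CountingPipeline()
processed = run_pipeline(pipeline, [{"name": "a"}, {"name": "b"}])
print(pipeline.calls)
```

However many items pass through, open_spider and close_spider each appear once in the recorded calls, while process_item appears once per item.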

2. Editing the pipeline file

Continuing with the wangyi crawler, flesh out the code in pipelines.py:

import json
from pymongo import MongoClient

class WangyiFilePipeline(object):
    def open_spider(self, spider):  # runs once, when the spider is opened
        if spider.name == 'itcast':
            self.f = open('json.txt', 'a', encoding='utf-8')

    def close_spider(self, spider):  # runs once, when the spider is closed
        if spider.name == 'itcast':
            self.f.close()

    def process_item(self, item, spider):
        if spider.name == 'itcast':
            self.f.write(json.dumps(dict(item), ensure_ascii=False, indent=2) + ',\n')
        # without this return, the pipeline that runs next does not receive the item
        return item

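class WangyiMongoPipeline(object):
    def open_spider(self, spider):  # runs once, when the spider is opened
        # isinstance() on the spider class is another way to tell spiders apart
        if spider.name == 'itcast':
            self.client = MongoClient(host='127.0.0.1', port=27017)  # create the MongoClient
            # handle for the teachers collection in the itcast database
            self.collection = self.client.itcast.teachers

    def close_spider(self, spider):  # runs once, when the spider is closed
        if spider.name == 'itcast':
            self.client.close()

    def process_item(self, item, spider):
        if spider.name == 'itcast':
            # insert_one() needs a dict, so convert Item objects with dict(item);
            # Collection.insert() was removed in PyMongo 4
            self.collection.insert_one(dict(item))
        # without this return, the pipeline that runs next does not receive the item
        return item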
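
The comment in open_spider mentions isinstance as an alternative to comparing spider.name. A minimal sketch of that idea follows. `ItcastSpider`, `OtherSpider` and `TeacherPipeline` are hypothetical stand-ins so the example runs without Scrapy installed; in a real project the spider classes would subclass scrapy.Spider and be imported from the spiders package.

```python
class ItcastSpider:
    """Stand-in for the spider class (would subclass scrapy.Spider in a project)."""
    name = "itcast"


class OtherSpider:
    """A second stand-in spider whose items this pipeline should ignore."""
    name = "other"


class TeacherPipeline:
    def open_spider(self, spider):
        self.seen = []

    def process_item(self, item, spider):
        # Check the spider's class instead of comparing spider.name strings.
        if isinstance(spider, ItcastSpider):
            self.seen.append(dict(item))
        # Items from every spider are still passed on unchanged.
        return item


pipeline = TeacherPipeline()
pipeline.open_spider(ItcastSpider())
pipeline.process_item({"name": "t1"}, ItcastSpider())
pipeline.process_item({"name": "t2"}, OtherSpider())
print(pipeline.seen)
```

A class check also matches subclasses of the spider, which a string comparison on the name does not.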

3. Enabling the pipelines

Enable the pipelines in settings.py:

......
ITEM_PIPELINES = {
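    'myspider.pipelines.WangyiFilePipeline': 400, # 400 is the order value; the path must match the class name in pipelines.py
    'myspider.pipelines.WangyiMongoPipeline': 500, # the smaller the value, the earlier the pipeline runs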
}
......
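
To see which pipeline an item reaches first, sort the setting by its numeric values. This is only a sketch of the ordering rule, not Scrapy's own loading code; the dictionary assumes the two classes defined in pipelines.py above and the myspider project name used in the settings.

```python
# Same shape as the ITEM_PIPELINES setting, deliberately written out of order.
ITEM_PIPELINES = {
    "myspider.pipelines.WangyiMongoPipeline": 500,
    "myspider.pipelines.WangyiFilePipeline": 400,
}

# Sort by the numeric value: smaller numbers come first.
execution_order = [
    path
    for path, priority in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])
]
print(execution_order)
```

The order in which the entries are written in the dictionary does not matter; only the numbers do.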

Don't forget to start the MongoDB service: sudo service mongodb start
Then check the stored data from the MongoDB shell: mongo

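Question: settings.py lets you enable several pipelines at once. Why would you need more than one?

Different pipelines can handle data from different spiders, told apart by the spider.name attribute

Different pipelines can apply different processing to the data of one or more spiders, for example one that cleans the data and another that stores it
A single pipeline class can also handle data from several spiders, again told apart by the spider.name attribute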
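
The split between cleaning and storing might look like the sketch below. `CleanPipeline`, `MemoryStorePipeline` and the `run_chain` helper are hypothetical names used for illustration; the storage step appends to a list rather than writing to a database so the example runs on its own.

```python
class CleanPipeline:
    """Hypothetical cleaning step: strips surrounding whitespace from string fields."""

    def process_item(self, item, spider):
        return {
            key: value.strip() if isinstance(value, str) else value
            for key, value in item.items()
        }


class MemoryStorePipeline:
    """Hypothetical storage step: collects items in a list instead of a database."""

    def __init__(self):
        self.rows = []

    def process_item(self, item, spider):
        self.rows.append(item)
        return item


def run_chain(pipelines, item, spider=None):
    """Pass one item through the pipelines in order; each receives the previous return value."""
    for pipeline in pipelines:
        item = pipeline.process_item(item, spider)
    return item


store = MemoryStorePipeline()
result = run_chain([CleanPipeline(), store], {"name": "  Alice ", "age": 30})
print(store.rows)
```

Because the cleaning step runs first, the storage step only ever sees cleaned items, and either step can be swapped out without touching the other.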

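4. Points to note when using pipelines

A pipeline must be enabled in settings.py before it is used
In the setting, each key is the pipeline's location (its import path, so the class can live anywhere in the project), and each value indicates how close the pipeline sits to the engine. The closer it is, the earlier the data passes through it: pipelines with smaller values run first
When there are several pipelines, process_item must return the item, otherwise the next pipeline receives None
Every pipeline must define process_item, otherwise it has no way to receive and process items
process_item takes item and spider, where spider is the spider that passed the item along
open_spider(self, spider): runs once when the spider is opened
close_spider(self, spider): runs once when the spider is closed
These two methods are commonly used when a crawler works with a database: open the database connection when the spider starts and close it when the spider finishes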
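
The connect-on-open, disconnect-on-close pattern can be tried without a database server. In the sketch below, SQLite from the standard library is swapped in for MongoDB purely so the example is runnable; `SqlitePipeline` and the teachers table are hypothetical names.

```python
import sqlite3


class SqlitePipeline:
    """Sketch: SQLite stands in for MongoDB so the example runs without a server."""

    def open_spider(self, spider):
        # Connect once when the spider starts.
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE teachers (name TEXT)")

    def process_item(self, item, spider):
        # Reuse the same connection for every item.
        self.conn.execute("INSERT INTO teachers (name) VALUES (?)", (item["name"],))
        return item

    def close_spider(self, spider):
        # Commit and disconnect once when the spider finishes.
        self.conn.commit()
        self.conn.close()


pipeline = SqlitePipeline()
pipeline.open_spider(spider=None)
for item in [{"name": "t1"}, {"name": "t2"}]:
    pipeline.process_item(item, spider=None)
# Count the rows while the connection is still open.
stored = pipeline.conn.execute("SELECT COUNT(*) FROM teachers").fetchone()[0]
pipeline.close_spider(spider=None)
print(stored)
```

Opening the connection once per crawl, rather than once per item, avoids repeating the connection setup for every item.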

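Summary

Pipelines clean and store data, and you can define several of them to handle different tasks. Three methods are involved:
- process_item(self, item, spider): processes the item data
- open_spider(self, spider): runs exactly once, when the spider is opened
- close_spider(self, spider): runs exactly once, when the spider is closed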