Python Crawlers: Persistent Storage in Scrapy

阿新 • Published: 2022-03-23

Scrapy offers two ways to persist scraped data: through a terminal command and through item pipelines.

Storage via the terminal command

Limitations:

Only the return value of the parse method can be stored, and only in a local file.
The output format must be one of: json, jsonlines, jl, csv, xml, marshal, pickle.

scrapy crawl <spider name> -o <output path>
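
The format restriction can be made concrete with a small helper that builds the command line and rejects extensions that are not in the list above. This is only a sketch: `build_crawl_command` and `SUPPORTED_FORMATS` are hypothetical names, not part of Scrapy, and the format set is copied from the list in this article. The spider name `gushi` is borrowed from the spider defined later in this post.

```python
from pathlib import Path

# Formats listed in this article; a hypothetical constant, not a Scrapy API.
SUPPORTED_FORMATS = {"json", "jsonlines", "jl", "csv", "xml", "marshal", "pickle"}


def build_crawl_command(spider_name, output_path):
    """Return the argument list for `scrapy crawl <spider_name> -o <output_path>`."""
    # The export format is inferred from the file extension of the output path.
    extension = Path(output_path).suffix.lstrip(".").lower()
    if extension not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported output format: {extension!r}")
    return ["scrapy", "crawl", spider_name, "-o", output_path]


if __name__ == "__main__":
    # 'gushi' is the spider's `name` attribute, not the name of its .py file.
    print(" ".join(build_crawl_command("gushi", "./gushi.csv")))
```

The returned list could be passed to `subprocess.run` from the project directory, or simply typed into the terminal as a single command.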

Storage via pipelines

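Coding workflow:

Parse the data.
Define the fields to be stored in the item class.
Store the parsed data in an object of the item type.
Hand the item object to the pipeline for persistent storage.
In the pipeline class, process_item handles each item object and persists the data.
Enable the pipeline in the settings.py configuration file.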
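
The order of these steps can be sketched without Scrapy itself. The code below is a simplified imitation, not Scrapy's engine: `CollectPipeline`, `parse`, and `run` are hypothetical stand-ins, plain dicts take the place of item objects, and the sample row is made up for illustration.

```python
class CollectPipeline:
    """Hypothetical pipeline that keeps items in memory instead of a file."""

    def open_spider(self, spider):
        # Called once before any item arrives
        self.saved = []

    def process_item(self, item, spider):
        # Called once per item handed over by the spider
        self.saved.append(f"{item['content']} - {item['author']}")
        return item

    def close_spider(self, spider):
        # Called once after the last item
        self.closed = True


def parse(rows):
    """Stand-in for a spider's parse method: yield one item per parsed row."""
    for content, author in rows:
        yield {"author": author, "content": content}


def run(rows, pipeline):
    """Hypothetical driver mimicking the order in which the hooks are called."""
    pipeline.open_spider(spider=None)
    for item in parse(rows):
        pipeline.process_item(item, spider=None)
    pipeline.close_spider(spider=None)
    return pipeline.saved


if __name__ == "__main__":
    print(run([("a sample line", "a sample author")], CollectPipeline()))
```

In a real project Scrapy performs the role of `run`: it calls the pipeline hooks itself once the pipeline has been enabled in settings.py.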

The item object

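The project contains an items.py file. It holds a class that we will use to instantiate items.
The class is empty at first, so we have to fill it in ourselves.
Suppose the data we want to store is an author and a piece of text; we then add the corresponding fields to the item class.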

import scrapy


class StudyScrapy02Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()

The spider file changes as well.
The IDE may flag the import in red, but this does not affect execution.

import scrapy
from study_scrapy02.items import StudyScrapy02Item


class GushiSpider(scrapy.Spider):
    name = 'gushi'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://so.gushiwen.cn/mingjus/']

    def parse(self, response):
        div_list = response.xpath('//*[@id="html"]/body/div[2]/div[1]/div[2]/div')
        for div in div_list:
            # extract() pulls the stored data out of a Selector object
            content = div.xpath('./a[1]/text()')[0].extract()
            author = div.xpath('./a[2]/text()')[0].extract()
            # Instantiate the item
            item = StudyScrapy02Item()
            # Put the parsed data into the item
            item['author'] = author
            item['content'] = content
            # Submit the item to the pipeline
            yield item

Pipelines

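The project also contains a pipelines.py file. It likewise holds a class, which defines a process_item method.
We use this method to process item objects.
If you need to store the data in more than one place, you can create several pipeline classes.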

class StudyScrapy02Pipeline:
    fp = None

    # This method runs only once, when the spider starts
    def open_spider(self, spider):
        print('Spider started====================')
        self.fp = open('./gushi.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        author = item['author']
        content = item['content']
        self.fp.write(content+' ——  '+author+'\n')
        return item

    # This method runs only once, when the spider finishes
    def close_spider(self, spider):
        self.fp.close()
        print('Spider finished====================')

We also need to enable the pipeline in the settings.py configuration file.

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    # 300 is the priority; the smaller the value, the higher the priority
    'study_scrapy02.pipelines.StudyScrapy02Pipeline': 300,
}
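
To illustrate the earlier remark about several pipeline classes, here is a minimal sketch of a second pipeline that could sit next to StudyScrapy02Pipeline in pipelines.py. `JsonLinesPipeline`, its `path` attribute, and the priority 301 are assumptions made for this example, not part of the original project.

```python
import json


class JsonLinesPipeline:
    """Hypothetical second pipeline: writes each item as one JSON object per line."""

    path = './gushi.jl'  # assumed output path
    fp = None

    def open_spider(self, spider):
        self.fp = open(self.path, 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # dict(item) works for both scrapy.Item objects and plain dicts
        self.fp.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        # Returning the item hands it on to the next pipeline, if there is one
        return item

    def close_spider(self, spider):
        self.fp.close()


# In settings.py: both pipelines enabled; the one with the lower number runs first.
ITEM_PIPELINES = {
    'study_scrapy02.pipelines.StudyScrapy02Pipeline': 300,
    'study_scrapy02.pipelines.JsonLinesPipeline': 301,
}
```

With this configuration each item passes through StudyScrapy02Pipeline first and then through JsonLinesPipeline, which is why every process_item should end with `return item`.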
