1. 程式人生 > 爬蟲學習之-文件管道重寫

爬蟲學習之-文件管道重寫

.org url The brush 學習 zoom 保存 raw files

如果要文件管道保存為原有的文件名 需要重寫文件管道的方法

pipeitem文件

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

from scrapy.pipelines.files import FilesPipeline
class OveridePipeline(FilesPipeline):
    """FilesPipeline subclass that keeps the original file name from the URL.

    The stock FilesPipeline stores downloads under a SHA1 hash of the URL;
    overriding file_path() preserves the URL's last path segment instead.
    """

    def file_path(self, request, response=None, info=None):
        """Return the storage path (relative to FILES_STORE) for *request*.

        Uses the last URL path segment as the file name; names without an
        extension get ".png" appended. Everything is stored under "pexels/".
        """
        file_name = request.url.split('/')[-1]
        # No dot means no extension — assume a PNG image (site-specific guess).
        if "." not in file_name:
            file_name = file_name + '.png'
        return "pexels/" + file_name


class ImagesPipeline(object):
    """Item pipeline that strips query strings from each URL in file_urls.

    Registered with priority 1 in ITEM_PIPELINES so it runs before the
    download pipeline and hands it clean, ?-free URLs.
    """

    def process_item(self, item, spider):
        """Rewrite item['file_urls'] without query strings and return the item."""
        # split("?")[0] drops everything from the first '?' on;
        # a URL with no query string passes through unchanged.
        item['file_urls'] = [url.split("?")[0] for url in item['file_urls']]
        print("下載圖片:", item['file_urls'])
        return item

  setting配置

# Pipeline registry: lower number runs first, so the URL-cleaning
# ImagesPipeline (1) rewrites file_urls before OveridePipeline (3)
# downloads the files.
ITEM_PIPELINES = {
    # Built-in alternatives, disabled in favour of the custom pipelines:
    # 'scrapy.pipelines.images.ImagesPipeline': 2,
    # 'scrapy.pipelines.files.FilesPipeline': 3,
    'images.pipelines.OveridePipeline': 3,
    'images.pipelines.ImagesPipeline': 1,
}

# Root directory where FilesPipeline writes downloaded files.
FILES_STORE = 'd:/crawl'

  spider文件

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from ..items import ImagesItem
from scrapy.pipelines.images import ImagesPipeline
from scrapy.pipelines.files import FilesPipeline
from scrapy.pipelines.media import MediaPipeline
class PexSpider(CrawlSpider):
    """Crawl pexels.com photo pages and collect full-size image URLs."""
    name = 'pex'
    allowed_domains = ['www.pexels.com']
    start_urls = ['https://www.pexels.com/photo/vehicle-on-road-along-green-grass-during-night-714023/']

    # Follow every /photo/ link found on a page and parse it with parse_item.
    rules = (
        Rule(LinkExtractor(allow=r'/photo/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        """Extract the zoomable photo's src into a file-pipeline item."""
        i = ImagesItem()
        # The zoomable <img> element carries the full-resolution photo URL.
        i['file_urls'] = response.xpath("//img[@class='image-section__image js-photo-zoom']/@src").extract()
        return i

  item文件

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ImagesItem(scrapy.Item):
    """Item consumed by the Scrapy files pipeline.

    The field names follow the FilesPipeline convention:
    file_urls is the input list of URLs to download, files is
    filled in by the pipeline with the download results.
    """
    file_urls = scrapy.Field()  # input: URLs to fetch
    files = scrapy.Field()      # output: download metadata

  

爬蟲學習之-文件管道重寫