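Scraping images, GIFs, and videos with Scrapy

阿新 • Published: 2019-02-20

Scraping data with Scrapy (images/GIFs/videos)

Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a range of programs, including data mining, information processing, and archiving historical data.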

1. Install Scrapy

I use Anaconda, so I ran

conda install scrapy

2. Create a project

Change to the target folder, then run

scrapy startproject one_night_in_shanghai

This generates the following directory structure:

one_night_in_shanghai/
    scrapy.cfg
    one_night_in_shanghai/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py

where

- scrapy.cfg: the project's configuration file;
- one_night_in_shanghai/: the project's Python module, where you will add your code;
- one_night_in_shanghai/items.py: the project's item definitions;
- one_night_in_shanghai/pipelines.py: the project's pipelines;
- one_night_in_shanghai/settings.py: the project's settings;
- one_night_in_shanghai/spiders/: the directory that holds the spider code.

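3. Define the Item

An Item is a container for the scraped data. It is used much like a Python dictionary, and it adds protection against undefined-field errors caused by typos.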

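import scrapy


class OneNightInShanghaiItem(scrapy.Item):
    img = scrapy.Field()  # I want to scrape images here, so define a field for the image URLs
    # video = scrapy.Field()  # uncomment if you also want to scrape videos later
    gif = scrapy.Field()  # used for scraping GIFs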

4. Create the spider file under spiders/ (last_day_in_September.py)

# -*- coding: utf-8 -*-
import scrapy

from one_night_in_shanghai.items import OneNightInShanghaiItem


class last_day_in_September_spider(scrapy.Spider):

    # Spider name; must be unique, it tells this spider apart from ones created later
    name = "img"

    # Optional: restricts the crawl to these domains; links outside them are not followed
    allowed_domains = ["so.redocn.com"]  # can be omitted if you have no special requirements

    # Pages where the crawl starts
    start_urls = ["http://so.redocn.com/shuiguo/cbaeb9fb.htm"]

    def parse(self, response):  # parse is the default callback for start_urls, so keep this name
        # Get the image src values with XPath; for the syntax, see the Scrapy tutorial
        urls = response.xpath('//div[@class="wrap g-bd"]/div/dl/dd/a/img[not(contains(@class, "lazy"))]/@src').extract()

        for url in urls:
            # Instantiate the item defined earlier
            imgItem = OneNightInShanghaiItem()

            # Assign the URL to the item's field
            imgItem['img'] = [url]
            imgItem['gif'] = []  # the pipeline reads item['gif'], so initialize it

            # Hand the result to the pipeline
            yield imgItem

        # Pagination
        # response.xpath('//a[@class="next"]//@href').extract_first() would also work
        nexturl = response.xpath(u'//a[contains(text(),"下一頁")]/@href').extract_first()

        # extract_first() returns None when the page has no next-page link
        if nexturl:
            # urljoin resolves a relative href against the current page URL
            yield scrapy.Request(response.urljoin(nexturl), callback=self.parse)
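The href of the next-page link may be relative, so it has to be turned into a full URL before it can be requested. The sketch below uses only the standard library to show how such an href resolves against the page URL. The three href values are made up for illustration and are not taken from the real site. Scrapy responses offer a urljoin method that performs the same kind of resolution against the response's own URL.

```python
from urllib.parse import urljoin

page_url = "http://so.redocn.com/shuiguo/cbaeb9fb.htm"  # start URL from the spider

# Hypothetical href values, for illustration only
root_relative = "/shuiguo/page2.htm"
page_relative = "page2.htm"
absolute = "http://so.redocn.com/shuiguo/page2.htm"

# All three forms resolve to the same absolute URL
resolved = [urljoin(page_url, href) for href in (root_relative, page_relative, absolute)]

for url in resolved:
    print(url)
```

Plain string concatenation of the domain and the href only handles the root-relative form, which is why a join function is the safer choice.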

Also modify pipelines.py to further process the scraped data

# -*- coding: utf-8 -*-

# Define your item pipelines here
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import os
from urllib.request import urlopen  # Python 3; on Python 2 this was urllib.urlopen


class OneNightInShanghaiPipeline(object):
    def process_item(self, item, spider):
        dir_path = '/home/archer/for_fun/one_night_in_shanghai/result/'
        os.makedirs(dir_path, exist_ok=True)  # make sure the output directory exists

        # Images and GIFs are saved the same way, so handle both fields in one loop
        for field in ('img', 'gif'):
            for image_url in item.get(field) or []:
                file_name = image_url.split('/')[-1]  # file name: last segment of the URL
                file_path = os.path.join(dir_path, file_name)

                # Skip files that were already downloaded
                if os.path.exists(file_path):
                    continue

                data = urlopen(image_url).read()  # download the file

                # The with block closes the file automatically
                with open(file_path, 'wb') as file_writer:
                    file_writer.write(data)

        return item
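The pipeline names each file after the last segment of its URL and skips files that are already on disk. Here is a minimal sketch of that logic in isolation, without any network access. The helper names target_path and needs_download, the example URL, and the fake file content are all assumptions made for illustration. Note that the existence check has to look at the full path inside the output directory, not at the bare file name.

```python
import os
import tempfile
from urllib.parse import urlsplit


def target_path(dir_path, url):
    """Return the local path for a URL: output directory + last path segment."""
    # urlsplit separates the query string, so "apple.jpg?size=big" maps to "apple.jpg"
    file_name = os.path.basename(urlsplit(url).path)
    return os.path.join(dir_path, file_name)


def needs_download(dir_path, url):
    """True if the file for this URL is not on disk yet."""
    return not os.path.exists(target_path(dir_path, url))


with tempfile.TemporaryDirectory() as out_dir:
    url = "http://example.com/pics/apple.jpg?size=big"  # hypothetical URL
    first = needs_download(out_dir, url)   # nothing saved yet
    with open(target_path(out_dir, url), "wb") as f:
        f.write(b"fake image bytes")       # stand-in for the downloaded content
    second = needs_download(out_dir, url)  # file now exists, so it is skipped
    saved_name = os.path.basename(target_path(out_dir, url))

print(first, second, saved_name)
```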

You can also edit settings.py; the ITEM_PIPELINES entry there is what enables the pipeline above

# -*- coding: utf-8 -*-

# Scrapy settings for one_night_in_shanghai project
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'one_night_in_shanghai'

SPIDER_MODULES = ['one_night_in_shanghai.spiders']
NEWSPIDER_MODULE = 'one_night_in_shanghai.spiders'

# Enable the pipeline; when there are several pipelines, those with lower numbers run first
ITEM_PIPELINES={
    'one_night_in_shanghai.pipelines.OneNightInShanghaiPipeline':1,
}

DOWNLOAD_DELAY=1

ROBOTSTXT_OBEY = True
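The ordering rule for ITEM_PIPELINES can be illustrated with a small sketch. GifPipeline is a hypothetical second pipeline that does not exist in this project, and the sorting below only illustrates the rule; it is not how Scrapy is implemented internally. By convention the numbers are chosen in the range 0 to 1000.

```python
ITEM_PIPELINES = {
    "one_night_in_shanghai.pipelines.GifPipeline": 300,  # hypothetical second pipeline
    "one_night_in_shanghai.pipelines.OneNightInShanghaiPipeline": 1,
}

# Sort the class paths by their number, lowest first: this is the order
# in which items pass through the pipelines
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)

for path in run_order:
    print(ITEM_PIPELINES[path], path)
```

The position of an entry in the dictionary does not matter; only the number does.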

5. Run

Change to the project root directory and run

scrapy crawl img

6.

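I finished tidying up this post while listening to "Wake Me Up When September Ends", and I can hardly wait to celebrate the motherland's birthday.
The GIF part of the pipeline is already written. Because the GIFs and the images are not on the same pages, you need to define a separate spider for the GIFs.
If the code looks messy here, the complete project is on GitHub.
Most of this post draws on the Scrapy tutorial and on another post. I wrote it mainly to organize things in my own way, so that I can pick it up faster the next time I need it.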
