Scrapy: Crawling Images/GIFs/Videos
阿新 • Published: 2019-02-20
Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a wide range of programs, including data mining, information processing, and archiving historical data.
1. Install Scrapy
I use Anaconda, so I run:
conda install scrapy
(If you are not using Anaconda, pip install scrapy works as well.)
2. Create a new project
Switch to the target directory, then run:
scrapy startproject one_night_in_shanghai
This generates the following directory structure:
one_night_in_shanghai/
    scrapy.cfg
    one_night_in_shanghai/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
where:
- scrapy.cfg: the project's configuration file
- one_night_in_shanghai/: the project's Python module; this is where the code goes
- one_night_in_shanghai/items.py: the project's item definitions
- one_night_in_shanghai/pipelines.py: the project's pipelines
- one_night_in_shanghai/settings.py: the project's settings
- one_night_in_shanghai/spiders/: the directory holding the spider code
3. Define the Item
An Item is a container for the scraped data. It is used much like a Python dict, but adds an extra layer of protection against undefined-field errors caused by typos.
import scrapy

class OneNightInShanghaiItem(scrapy.Item):
    img = scrapy.Field()    # I want to crawl images here, so define a field for them
    #video = scrapy.Field() # in case we crawl videos later
    gif = scrapy.Field()    # used for crawling gifs
4. Create a spider file under spiders/ (last_day_in_September.py)
# -*- coding: utf-8 -*-
import scrapy
from one_night_in_shanghai.items import OneNightInShanghaiItem

class last_day_in_September_spider(scrapy.Spider):
    # the spider's name; must be unique, used to tell it apart from spiders created later
    name = "img"
    # optional: restricts crawling to these domains; links outside them are not followed
    allowed_domains = ["so.redocn.com"]  # can be omitted if you have no such restriction
    # the page(s) to start crawling from
    start_urls = ["http://so.redocn.com/shuiguo/cbaeb9fb.htm"]

    def parse(self, response):  # friendly reminder: do not rename this method, or suffer the consequences )= =(
        # grab the image srcs with XPath; for the syntax, see the Scrapy tutorial (link at the end)
        urls = response.xpath('//div[@class="wrap g-bd"]/div/dl/dd/a/img[not(contains(@class, "lazy"))]/@src').extract()
        for url in urls:
            # instantiate the item we defined earlier
            imgItem = OneNightInShanghaiItem()
            # assign the extracted url to the item
            imgItem['img'] = [url]
            imgItem['gif'] = []  # if you defined the gif field above, it must be initialized
            # hand the result to the pipeline
            yield imgItem
        # pagination
        ## response.xpath('//a[@class="next"]//@href').extract()  # this works too
        nexturl = response.xpath(u'//a[contains(text(),"下一頁")]/@href').extract()
        if nexturl:  # only follow when a "next page" link actually exists
            nexturl_all = 'http://so.redocn.com' + nexturl[0]
            yield scrapy.Request(nexturl_all, callback=self.parse)
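A note on the pagination step above: prepending the domain by hand only works when the extracted href is site-absolute (starts with /). A more robust way is to resolve the href against the current page's URL; Scrapy's own response.urljoin wraps exactly this standard-library logic. A small sketch (the page2.htm hrefs are hypothetical examples):

```python
from urllib.parse import urljoin

base = 'http://so.redocn.com/shuiguo/cbaeb9fb.htm'

# site-absolute href: resolved against the domain root
print(urljoin(base, '/shuiguo/page2.htm'))   # → http://so.redocn.com/shuiguo/page2.htm
# relative href: resolved against the current page's directory
print(urljoin(base, 'page2.htm'))            # → http://so.redocn.com/shuiguo/page2.htm
# already-absolute href: returned unchanged
print(urljoin(base, 'http://so.redocn.com/shuiguo/page3.htm'))
```

All three forms come out as full URLs, so the spider never builds a broken request out of a relative link.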
At the same time, edit pipelines.py to do further processing on the scraped data:
# -*- coding: utf-8 -*-
# Define your item pipelines here
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import os
from urllib.request import urlopen  # on Python 2, use urllib.urlopen instead

class OneNightInShanghaiPipeline(object):
    def process_item(self, item, spider):
        dir_path = '/home/archer/for_fun/one_night_in_shanghai/result/'
        # the img and gif fields are handled identically
        for key in ('img', 'gif'):
            if item[key] is None:
                continue
            for image_url in item[key]:
                list_name = image_url.split('/')
                file_name = list_name[-1]  # the image's file name
                file_path = '%s/%s' % (dir_path, file_name)
                if os.path.exists(file_path):  # skip files we already downloaded
                    continue
                with open(file_path, 'wb') as file_writer:
                    conn = urlopen(image_url)  # download the image
                    file_writer.write(conn.read())
        return item
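One caveat about deriving the file name in the pipeline above: splitting the URL on '/' keeps any query string (e.g. photo.jpg?size=small) in the name. A slightly more robust sketch using only the standard library, with a hypothetical image URL for illustration:

```python
import os
from urllib.parse import urlparse

def file_name_from_url(url):
    # parse the URL first so the query string and fragment are dropped,
    # then take the last path segment as the file name
    return os.path.basename(urlparse(url).path)

print(file_name_from_url('http://img.redocn.com/shuiguo/photo_123.jpg?size=small'))
# → photo_123.jpg
```

Dropping the query string also means two URLs that differ only in their parameters map to the same file, which matches the "skip if it already exists" check in the pipeline.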
Of course, if you feel like it, you can also tweak settings.py:
# -*- coding: utf-8 -*-
# Scrapy settings for one_night_in_shanghai project
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'one_night_in_shanghai'
SPIDER_MODULES = ['one_night_in_shanghai.spiders']
NEWSPIDER_MODULE = 'one_night_in_shanghai.spiders'
# enable the pipeline; with multiple pipelines, lower numbers run first
ITEM_PIPELINES = {
    'one_night_in_shanghai.pipelines.OneNightInShanghaiPipeline': 1,
}
DOWNLOAD_DELAY = 1
ROBOTSTXT_OBEY = True
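As an aside: for plain image downloads, Scrapy also ships a built-in ImagesPipeline that covers what the custom pipeline above does (downloading, storage, skipping already-fetched files). A minimal settings.py sketch for it, assuming the item exposes its URLs under the default image_urls field (the built-in pipeline also requires Pillow):

```python
# settings.py fragment: use Scrapy's built-in images pipeline instead of a custom one
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
# directory where downloaded images are stored
IMAGES_STORE = '/home/archer/for_fun/one_night_in_shanghai/result/'
```

The custom pipeline in this article is still worth writing once to understand what happens under the hood, but the built-in one saves code for the common case.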
5. Run it
- Switch to the project root directory
- Run:
scrapy crawl img