Crawling Douyu Images with Scrapy

阿新 • Published: 2019-02-17

1. Project file structure

2. Code:

items

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class DouyuItem(scrapy.Item):
    # define the fields for your item here like:
    nickname = scrapy.Field()
    imagelink = scrapy.Field()
    imagePath = scrapy.Field()
    #pass

douyumeinv

# -*- coding: utf-8 -*-
import scrapy
from douyu.items import DouyuItem
import json


class DouyumeinvSpider(scrapy.Spider):
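    # Spider name
    name = "douyumeinv"
    # Domains the spider is allowed to crawl
    allowed_domains = ["capi.douyucdn.cn"]

    # Build the request URL from a base URL and a paging offset
    offset = 0
    url = "http://capi.douyucdn.cn/api/v1/getVerticalRoom?limit=20&offset="
    start_urls = [url + str(offset)]

    def parse(self, response):
        # Convert the JSON body to Python objects; the "data" field is a list
        data = json.loads(response.text)["data"]
        # An empty list is assumed to mean there are no more pages, so stop paging
        if not data:
            return
        # Walk the list and extract the fields we need
        for each in data:
            item = DouyuItem()
            item["nickname"] = each["nickname"]
            item["imagelink"] = each["vertical_src"]
            yield item

        # Request the next page
        self.offset += 20
        yield scrapy.Request(self.url + str(self.offset), callback=self.parse)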
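
The paging logic of the spider can be tried out without Scrapy. The following is a minimal sketch, not the author's code: `extract_rooms` and `next_page_url` are hypothetical helper names, the sample payload is made up to match only the fields the spider reads (`data`, `nickname`, `vertical_src`), and treating an empty `data` list as the last page is an assumption about the API.

```python
import json

BASE_URL = "http://capi.douyucdn.cn/api/v1/getVerticalRoom?limit=20&offset="
PAGE_SIZE = 20


def extract_rooms(body):
    """Return (nickname, image link) pairs from one JSON response body."""
    data = json.loads(body).get("data") or []
    return [(room["nickname"], room["vertical_src"]) for room in data]


def next_page_url(offset, rooms):
    """Return the URL of the next page, or None when the current page was empty."""
    if not rooms:
        return None
    return BASE_URL + str(offset + PAGE_SIZE)


# Made-up payload shaped like the fields the spider reads (an assumption).
sample = json.dumps({"data": [
    {"nickname": "alice", "vertical_src": "http://example.com/a.jpg"},
    {"nickname": "bob", "vertical_src": "http://example.com/b.jpg"},
]})

rooms = extract_rooms(sample)
following = next_page_url(0, rooms)
last = next_page_url(20, extract_rooms('{"data": []}'))
print(rooms, following, last)
```

Keeping the extraction and the "next URL" decision in small pure functions like these makes it possible to check them against a saved response before running a full crawl.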

pipelines

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

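import os

import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.project import get_project_settings


class DouyuPipeline(ImagesPipeline):
    # Directory where images are stored (IMAGES_STORE in settings.py)
    IMAGES_STORE = get_project_settings().get("IMAGES_STORE")

    def get_media_requests(self, item, info):
        # Read the image link from the item
        image_url = item["imagelink"]
        # Request the image
        yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        # Paths (relative to IMAGES_STORE) of the images that downloaded successfully
        image_path = [x["path"] for ok, x in results if ok]
        # Drop the item if the download failed, instead of hitting an IndexError below
        if not image_path:
            raise DropItem("Image download failed for " + item["nickname"])
        # Rename the file after the streamer's nickname
        new_path = os.path.join(self.IMAGES_STORE, item["nickname"] + ".jpg")
        os.rename(os.path.join(self.IMAGES_STORE, image_path[0]), new_path)
        # Record the new image path on the item
        item["imagePath"] = new_path
        return item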
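
The rename step in the pipeline can be exercised on its own with only the standard library. This is a sketch under assumptions, not part of the original project: `safe_filename` and `rename_image` are hypothetical helpers, and sanitizing the nickname is an addition (the original uses the nickname as-is, which fails if it contains a character such as `/`). The `full/<hash>.jpg` layout mirrors where `ImagesPipeline` stores downloads by default; the file name used here is made up.

```python
import os
import re
import tempfile


def safe_filename(nickname):
    """Replace characters that are unsafe in file names with underscores."""
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "_", nickname).strip("_")
    return cleaned or "unnamed"


def rename_image(store, relative_path, nickname):
    """Move store/relative_path to store/<nickname>.jpg and return the new path."""
    source = os.path.join(store, relative_path)
    target = os.path.join(store, safe_filename(nickname) + ".jpg")
    os.replace(source, target)  # overwrites an existing file with the same name
    return target


# Demonstration in a temporary directory that stands in for IMAGES_STORE.
with tempfile.TemporaryDirectory() as store:
    os.makedirs(os.path.join(store, "full"))
    with open(os.path.join(store, "full", "0a1b2c.jpg"), "wb") as f:
        f.write(b"fake image bytes")
    new_path = rename_image(store, os.path.join("full", "0a1b2c.jpg"), "room/one")
    renamed_name = os.path.basename(new_path)
    renamed_exists = os.path.exists(new_path)

print(renamed_name, renamed_exists)
```

Note that two streamers with the same nickname would map to the same file; `os.replace` silently keeps the later download in that case.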

settings

# -*- coding: utf-8 -*-

# Scrapy settings for douyu project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'douyu'

SPIDER_MODULES = ['douyu.spiders']
NEWSPIDER_MODULE = 'douyu.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'douyu (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
  "User-Agent" : "DYZB/1 CFNetwork/808.2.16 Darwin/16.3.0"
  # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  # 'Accept-Language': 'en',
}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'douyu.middlewares.DouyuSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'douyu.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'douyu.pipelines.DouyuPipeline': 300,
}
IMAGES_STORE = "Images"
# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

Run:

cd into the project directory

scrapy crawl <spider name>   (for this project: scrapy crawl douyumeinv)

Result:
