
Scrapy Project (Douyu Live): Using a Spider to Scrape Streamer Info from the Yanzhi ("Looks") Category

1. Create the Scrapy project

scrapy startproject douyu

2. Enter the project directory and create the Spider with the genspider command

scrapy genspider douyumeinv "capi.douyucdn.cn"

3. Define the data to scrape (edit items.py)

# -*- coding: utf-8 -*-
import scrapy


class DouyuItem(scrapy.Item):
    name = scrapy.Field()  # streamer nickname, used as the photo's file name
    imagesUrls = scrapy.Field()  # URL of the photo
    imagesPath = scrapy.Field()  # local path where the photo is saved

4. Write the Spider that extracts the item data (spiders/douyumeinv.py)

# -*- coding: utf-8 -*-
import json

import scrapy

# If PyCharm underlines the import below in red, see this setup guide:
# https://blog.csdn.net/z564359805/article/details/80650843
from douyu.items import DouyuItem


class DouyumeinvSpider(scrapy.Spider):
    name = 'douyumeinv'
    allowed_domains = ['capi.douyucdn.cn']
    offset = 0
    url = "http://capi.douyucdn.cn/api/v1/getVerticalRoom?limit=20&offset="
    start_urls = [url + str(offset)]

    def parse(self, response):
        # The API returns JSON; the room list is under the 'data' key
        data = json.loads(response.text)['data']
        # Stop paginating once a page comes back empty; otherwise the spider
        # would keep requesting ever-larger offsets forever
        if not data:
            return
        for each in data:
            item = DouyuItem()
            item['name'] = each['nickname']
            item['imagesUrls'] = each['vertical_src']
            yield item
        # Request the next page of 20 rooms
        self.offset += 20
        yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
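The offset-based pagination in parse can be tried out without Scrapy or the network. The following is only a minimal sketch under stated assumptions: crawl_pages, fake_fetch, and the max_pages cap are hypothetical names introduced for illustration, and the sample records are made up; only the 'data', 'nickname', and 'vertical_src' keys come from the spider above.

```python
import json

BASE_URL = "http://capi.douyucdn.cn/api/v1/getVerticalRoom?limit=20&offset="
PAGE_SIZE = 20


def crawl_pages(fetch, max_pages=100):
    """Yield (nickname, image_url) pairs until an empty page is returned.

    `fetch` is a hypothetical callable that takes a URL and returns the
    response body as a JSON string; in the real spider Scrapy does this.
    """
    offset = 0
    for _ in range(max_pages):  # hard cap as an extra safety net
        payload = json.loads(fetch(BASE_URL + str(offset)))
        rooms = payload.get("data") or []
        if not rooms:
            break  # empty page: nothing left to request
        for room in rooms:
            yield room["nickname"], room["vertical_src"]
        offset += PAGE_SIZE


def fake_fetch(url):
    """Stand-in for the network: two pages of made-up rooms, then empty."""
    offset = int(url.rsplit("=", 1)[1])
    if offset >= 2 * PAGE_SIZE:
        return json.dumps({"data": []})
    rooms = [
        {
            "nickname": f"user{offset + i}",
            "vertical_src": f"http://example.com/{offset + i}.jpg",
        }
        for i in range(PAGE_SIZE)
    ]
    return json.dumps({"data": rooms})


results = list(crawl_pages(fake_fetch))
print(len(results))  # 40: two full pages, then the empty page ends the loop
```

The empty-page check is what keeps the loop finite; without it, each response would schedule one more request at a larger offset.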

5. Write the item pipeline that saves the data; here it downloads the images and saves them to disk (pipelines.py)

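# -*- coding: utf-8 -*-
import os

import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.project import get_project_settings


# Subclass ImagesPipeline to download and save the images; see
# https://blog.csdn.net/z564359805/article/details/80693578
class ImagePipeline(ImagesPipeline):
    # Image storage directory, read from IMAGES_STORE in settings.py
    IMAGES_STORE = get_project_settings().get("IMAGES_STORE")

    def get_media_requests(self, item, info):
        # Schedule a download for the item's image URL
        image_url = item['imagesUrls']
        yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        # results is a list of (success, info) tuples; keep the successful paths
        image_path = [x['path'] for ok, x in results if ok]
        if not image_path:
            # The download failed, so there is nothing to rename
            raise DropItem("Image download failed: %s" % item['imagesUrls'])
        # Rename the downloaded file to <nickname>.jpg
        new_path = os.path.join(self.IMAGES_STORE, item['name'] + '.jpg')
        os.rename(os.path.join(self.IMAGES_STORE, image_path[0]), new_path)
        item['imagesPath'] = new_path
        return item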
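Renaming files after a nickname can go wrong when the nickname contains characters that are not valid in file names, or when two rooms share a name. The sketch below shows one possible way to guard against both using only the standard library. The helpers safe_filename and rename_download are hypothetical and not part of Scrapy or the original pipeline; the "full/" subfolder mirrors where ImagesPipeline typically places downloads, and the files here are dummies.

```python
import os
import re
import tempfile


def safe_filename(name, default="unnamed"):
    """Replace characters that are unsafe in file names with underscores."""
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "_", name).strip("._")
    return cleaned or default


def rename_download(store, downloaded_rel_path, nickname):
    """Move a downloaded file to <store>/<nickname>.jpg without overwriting."""
    base = safe_filename(nickname)
    target = os.path.join(store, base + ".jpg")
    counter = 1
    while os.path.exists(target):  # add a numeric suffix on name collisions
        target = os.path.join(store, f"{base}_{counter}.jpg")
        counter += 1
    os.rename(os.path.join(store, downloaded_rel_path), target)
    return target


# Demo in a throwaway directory with two dummy "downloads"
with tempfile.TemporaryDirectory() as store:
    os.makedirs(os.path.join(store, "full"))
    for digest in ("aaa", "bbb"):
        with open(os.path.join(store, "full", digest + ".jpg"), "wb") as fh:
            fh.write(b"fake image bytes")
    first = rename_download(store, "full/aaa.jpg", "a/b: c")
    second = rename_download(store, "full/bbb.jpg", "a/b: c")
    print(os.path.basename(first), os.path.basename(second))
```

In a pipeline, logic like this would sit inside item_completed in place of the bare os.rename call.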

6. Configure the settings file (settings.py)

# Obey robots.txt rules; for details see https://blog.csdn.net/z564359805/article/details/80691677
ROBOTSTXT_OBEY = False

# Override the default request headers: add a User-Agent
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'DYZB/2.290 (iPhone; iOS 9.3.4; Scale/2.00)',
}

# Image storage directory; this creates an images folder under the directory
# the crawl is run from. An absolute path also works.
IMAGES_STORE = "./images"

# Configure item pipelines
ITEM_PIPELINES = {
    'douyu.pipelines.ImagePipeline': 300,
}

# Optional: also write the log to a local file
LOG_FILE = "douyulog.log"
LOG_LEVEL = "DEBUG"
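To see what the User-Agent header amounts to outside Scrapy, a request for one page of the API can be built with the standard library. This is only a sketch: build_request is a hypothetical helper, and nothing is sent over the network here.

```python
import urllib.request

API_URL = "http://capi.douyucdn.cn/api/v1/getVerticalRoom?limit=20&offset="
USER_AGENT = "DYZB/2.290 (iPhone; iOS 9.3.4; Scale/2.00)"


def build_request(offset=0):
    """Build (but do not send) a request for one page of the room list."""
    return urllib.request.Request(
        API_URL + str(offset),
        headers={"User-Agent": USER_AGENT},
    )


req = build_request(20)
print(req.full_url)
# urllib stores header names via str.capitalize(), hence "User-agent"
print(req.get_header("User-agent"))
```

Passing the request to urllib.request.urlopen would actually fetch the page; whether the endpoint still responds as it did when the article was written is an open question.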

7. With the setup above complete, start the crawl by running the crawl command, which launches the Spider:

scrapy crawl douyumeinv

---------------------

Author: 執筆冩回憶

Source: CSDN

Original article: https://blog.csdn.net/z564359805/article/details/80707165