Scrapy Project (Douyu Live Streaming): Using a Spider to Scrape Streamer Info from the Yanzhi (顏值) Category
1. Create the Scrapy project
scrapy startproject douyu
2. Enter the project directory and create the Spider with the genspider command
scrapy genspider douyumeinv "capi.douyucdn.cn"
3. Define the data to scrape (edit items.py)
# -*- coding: utf-8 -*-
import scrapy


class DouyuItem(scrapy.Item):
    name = scrapy.Field()        # name used for the saved photo
    imagesUrls = scrapy.Field()  # URL of the photo
    imagesPath = scrapy.Field()  # local path where the photo is saved
4. Write the Spider that extracts the item data (in the spiders folder: douyumeinv.py)
# -*- coding: utf-8 -*-
import json

import scrapy

# If PyCharm underlines the import below in red, see this setup: https://blog.csdn.net/z564359805/article/details/80650843
from douyu.items import DouyuItem


class DouyumeinvSpider(scrapy.Spider):
    name = 'douyumeinv'
    allowed_domains = ['capi.douyucdn.cn']
    offset = 0
    url = "http://capi.douyucdn.cn/api/v1/getVerticalRoom?limit=20&offset="
    start_urls = [url + str(offset)]

    def parse(self, response):
        data = json.loads(response.text)['data']
        # Stop paging on an empty page (assumes the API returns an empty list
        # past the last room); without this check the crawl never ends
        if not data:
            return
        for each in data:
            item = DouyuItem()
            item['name'] = each['nickname']
            item['imagesUrls'] = each['vertical_src']
            yield item
        # Request the next page of 20 rooms
        self.offset += 20
        yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
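The parsing and paging logic of parse can be tried without Scrapy or network access. The following is a minimal sketch using only the standard library. The helper names extract_items and next_page_url and the sample response are made up for illustration; only the data, nickname, and vertical_src keys come from the Spider in this step.

```python
import json

API_URL = "http://capi.douyucdn.cn/api/v1/getVerticalRoom?limit=20&offset="
PAGE_SIZE = 20


def extract_items(body):
    """Return one dict per room, using the same fields the Spider reads."""
    rooms = json.loads(body)["data"]
    return [
        {"name": room["nickname"], "imagesUrls": room["vertical_src"]}
        for room in rooms
    ]


def next_page_url(offset):
    """Return the new offset and the URL of the following page."""
    new_offset = offset + PAGE_SIZE
    return new_offset, API_URL + str(new_offset)


# Made-up sample response; real responses carry many more fields per room
sample = json.dumps({
    "data": [
        {"nickname": "alice", "vertical_src": "http://example.com/a.jpg"},
        {"nickname": "bob", "vertical_src": "http://example.com/b.jpg"},
    ],
})

items = extract_items(sample)
offset, url = next_page_url(0)
print(items[0]["name"], len(items), offset)
```

An empty data list yields no items, which is a natural signal to stop requesting further pages.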
5. Handle the pipeline file to save the data; the results can be saved to files (pipelines.py)
# -*- coding: utf-8 -*-
import os

import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.project import get_project_settings


# Subclass ImagesPipeline to download and save the images; see: https://blog.csdn.net/z564359805/article/details/80693578
class ImagePipeline(ImagesPipeline):
    # Read the image storage directory IMAGES_STORE from the settings file
    IMAGES_STORE = get_project_settings().get("IMAGES_STORE")

    def get_media_requests(self, item, info):
        image_url = item['imagesUrls']
        yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_path = [x['path'] for ok, x in results if ok]
        # Drop the item if the download failed, instead of hitting an IndexError below
        if not image_path:
            raise DropItem("image download failed: %s" % item['imagesUrls'])
        # Rename the downloaded file to "<name>.jpg" and record its final path
        new_path = os.path.join(self.IMAGES_STORE, item['name'] + '.jpg')
        os.rename(os.path.join(self.IMAGES_STORE, image_path[0]), new_path)
        item['imagesPath'] = new_path
        return item
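The rename step in item_completed uses the streamer's nickname directly as a file name, which can fail if the nickname contains characters such as a slash or colon. Below is a minimal sketch of a safer rename, tested against a temporary directory instead of a real crawl. The helpers safe_filename and rename_download are hypothetical and not part of Scrapy, and the full/ subfolder with a hash-like file name is an assumption about how the downloaded file is laid out.

```python
import os
import re
import tempfile


def safe_filename(name):
    """Replace characters that are unsafe in file names with underscores."""
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "_", name).strip("_")
    return cleaned or "unnamed"


def rename_download(store, relative_path, name):
    """Move store/relative_path to store/<safe name>.jpg and return the new path."""
    new_path = os.path.join(store, safe_filename(name) + ".jpg")
    # os.replace overwrites an existing target, so duplicate nicknames do not raise
    os.replace(os.path.join(store, relative_path), new_path)
    return new_path


# Simulate a downloaded file inside a temporary image store
store = tempfile.mkdtemp()
os.makedirs(os.path.join(store, "full"))
with open(os.path.join(store, "full", "0a1b2c.jpg"), "wb") as f:
    f.write(b"fake image bytes")

new_path = rename_download(store, os.path.join("full", "0a1b2c.jpg"), "a/b: c")
print(os.path.basename(new_path))
```

Whether overwriting on duplicate nicknames is acceptable depends on the use case; appending a room identifier to the file name would be one alternative.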
6. Configure the settings file (settings.py)
# Obey robots.txt rules; for what this setting means, see: https://blog.csdn.net/z564359805/article/details/80691677
ROBOTSTXT_OBEY = False
# Override the default request headers: add the User-Agent header
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'DYZB/2.290 (iPhone; iOS 9.3.4; Scale/2.00)',
}
# Image storage directory; this creates an images folder under the directory the crawl is run from. An absolute path also works
IMAGES_STORE = "./images"
# Configure item pipelines
ITEM_PIPELINES = {
    'douyu.pipelines.ImagePipeline': 300,
}
# Optional: also write the log to a local file
LOG_FILE = "douyulog.log"
LOG_LEVEL = "DEBUG"
7. With everything above configured, start the crawl by running the project's crawl command to launch the Spider:
scrapy crawl douyumeinv
---------------------
Author: 執筆冩回憶
Source: CSDN
Original: https://blog.csdn.net/z564359805/article/details/80707165