
Scraping Meipai Videos with the Scrapy Framework

Scraping data from Meipai is not especially hard. The tricky part is the obfuscation applied to its video URLs.

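Open a Meipai page and view its source. The page loads much like any other site, and the video_url is present in the source as well. Look closely, though: it appears as one long, unreadable string. Anyone can guess that this is the address of the video, but it has been encrypted or encoded in some way. From experience, my guess was Base64, with the site's own scheme inserting some random characters into the encoded string.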
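
To make that guess concrete, here is a small sketch of why a plain Base64 decode is not enough once extra characters have been inserted. The URL, the junk characters, and the insertion offset are all made up for illustration; they are not taken from Meipai.

```python
import base64
import binascii

# Hypothetical URL, used only for illustration.
url = "http://example.com/video.mp4"
clean = base64.b64encode(url.encode("utf-8")).decode("ascii")

# Insert four junk characters at offset 8 (both values are arbitrary).
obfuscated = clean[:8] + "QUJD" + clean[8:]


def try_decode(text):
    """Return the decoded text, or None when Base64 decoding fails."""
    try:
        return base64.b64decode(text).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return None


print(try_decode(clean))                              # the original URL
print(try_decode(obfuscated))                         # a different string, not the original URL
print(try_decode(obfuscated[:8] + obfuscated[12:]))   # junk removed: the original URL again
```

So the real work is figuring out where the junk sits and how long it is; once it is removed, an ordinary Base64 decode gives the URL back.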

I am pasting the code I wrote below; anyone interested can try it out:

items.py

import scrapy


class MeipaiItem(scrapy.Item):
    # Fields collected for each Meipai video.
    cut_url = scrapy.Field()      # cover image URL
    create_time = scrapy.Field()  # creation timestamp
    video_url = scrapy.Field()    # decoded MP4 URL
    title = scrapy.Field()        # video caption
    author = scrapy.Field()       # uploader's screen name

spider

# -*- coding: utf-8 -*-
import scrapy
import json
import base64
from Meipai.items import MeipaiItem
import logging


class MeipaiSpider(scrapy.Spider):
    name = 'meipai'
    allowed_domains = ['meipai.com']
    start_urls = ['http://www.meipai.com/']
    offset = 1
    MeiPai = [
        ('搞笑', '13'),
        ('明星', '16'),
        ('高顏值','474'),
        ('舞蹈', '5872239354896137479'),
        ('精選', '488'),
        ('音樂', '5871155236525660080'),
        ('美食', '5870490265939297486'),
        ('時尚', '27'),
        ('美妝', '6161763227134314911'),
        ('吃秀', '5871963671268989887'),
        ('寶寶', '5864549574576746574'),
        ('創意', '5875185672678760586'),
        ('遊戲', '5879621667768487138'),
        ('體育', '5872639793429995335'),
        ('娛樂','6204189999771523532'),
    ]
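    def parse(self, response):
        # Request the first page of the hot timeline for every channel.
        # `tid` replaces `id` so the built-in id() is not shadowed.
        for channel, tid in self.MeiPai:
            JsonUrl = 'http://www.meipai.com/topics/hot_timeline?page=1&count=24&tid={}'.format(tid)
            yield scrapy.Request(url=JsonUrl, callback=self.parse_item)

    def system(self, string_num):
        # Convert a hexadecimal string to its decimal string form.
        return str(int(string_num.upper(), 16))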


    def parse_item(self, response):
        OriginalHtml = json.loads(response.body.decode('utf-8'))
        NowHtml = OriginalHtml.get('medias') or []
        for NowData in NowHtml:
            # Create a fresh item per video so yielded items do not share state.
            item = MeipaiItem()
            item['cut_url'] = NowData.get('cover_pic')
            item['create_time'] = NowData.get('created_at')
            Title = NowData.get('caption')
            User = (NowData.get('user') or {}).get('screen_name')
            # Skip this entry, rather than abandoning the rest of the page,
            # when the caption or the author is missing.
            if not Title or not User:
                continue
            item['title'] = Title
            item['author'] = User
            try:
                EncryptionVideoUrl = NowData.get('video')
                # The first four characters, reversed, are a hex number whose
                # decimal digits describe where the junk characters sit.
                Num = self.system(EncryptionVideoUrl[:4][::-1])
                StartNum = int(Num[0])
                StartCount = int(Num[1])
                EndNum = int(Num[2])
                EndCount = int(Num[3])
                Body = EncryptionVideoUrl[4:]
                # Keep the first StartNum characters, drop the next StartCount.
                Body = Body[:StartNum] + Body[StartNum + StartCount:]
                # Keep the last EndNum characters, drop the EndCount before them.
                # Explicit indices avoid the Body[-0:] pitfall when EndNum is 0.
                Cut = len(Body) - EndNum - EndCount
                Body = Body[:Cut] + Body[Cut + EndCount:]
                item['video_url'] = base64.b64decode(Body).decode('utf-8')
            except Exception as e:
                logging.info(e)
                continue

            if not item['video_url']:
                continue
            yield item
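
For anyone who wants to experiment with the decoding step outside Scrapy, here is a standalone sketch of the same logic. It only mirrors what the spider above does; it is not based on any official description of Meipai's scheme. The `obfuscate` helper is hypothetical: it is my own inverse of the decoder, written purely so the round trip can be checked without touching the site.

```python
import base64


def decode_video_url(encoded):
    """Undo the junk-insertion scheme as the spider above interprets it."""
    # The first four characters, reversed, form a hex number. Its decimal
    # digits are read as: start offset, start junk length, tail length kept,
    # end junk length.
    digits = str(int(encoded[:4][::-1], 16))
    if len(digits) != 4:
        raise ValueError("unexpected header: %r" % encoded[:4])
    start, start_junk, tail, end_junk = (int(d) for d in digits)
    body = encoded[4:]
    # Drop `start_junk` characters after the first `start` characters.
    body = body[:start] + body[start + start_junk:]
    # Drop `end_junk` characters sitting just before the last `tail` characters.
    cut = len(body) - tail - end_junk
    body = body[:cut] + body[cut + end_junk:]
    return base64.b64decode(body).decode("utf-8")


def obfuscate(url, start=3, start_junk=2, tail=4, end_junk=1):
    """Hypothetical inverse of decode_video_url, for testing only."""
    clean = base64.b64encode(url.encode("utf-8")).decode("ascii")
    # Insert junk after the first `start` characters.
    body = clean[:start] + "x" * start_junk + clean[start:]
    # Insert junk just before the last `tail` characters.
    cut = len(body) - tail
    body = body[:cut] + "y" * end_junk + body[cut:]
    # Pack the four digits into a reversed four-character hex header.
    number = int("%d%d%d%d" % (start, start_junk, tail, end_junk))
    return format(number, "04x")[::-1] + body


sample = "http://example.com/video.mp4"  # made-up URL
print(decode_video_url(obfuscate(sample)) == sample)
```

Keeping the decoder as a plain function makes it easy to unit test and to reuse from the spider's parse_item.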

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
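import json


class MeipaiPipeline(object):

    def __init__(self):
        # Open with an explicit encoding so non-ASCII captions are written correctly.
        self.name = open('data.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes.
        content = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.name.write(content)
        return item

    def close_spider(self, spider):
        # Scrapy calls close_spider (not close_item) when the spider finishes.
        self.name.close()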
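
As the comment at the top of the pipeline file says, the pipeline only runs once it is registered in the ITEM_PIPELINES setting. A minimal fragment for settings.py might look like the following. The module path is an assumption: it takes the package name "Meipai" from the spider's import and assumes the class lives in Meipai/pipelines.py.

```python
# settings.py (fragment)
# Assumed module path: package "Meipai", file pipelines.py, class MeipaiPipeline.
# The number sets the order in which pipelines run (lower values run first).
ITEM_PIPELINES = {
    "Meipai.pipelines.MeipaiPipeline": 300,
}
```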

Looking at the log output, the video URLs are now in the familiar MP4 format. They can be requested and opened directly in a browser.

Pick any one of them and paste it into a browser, and it plays without problems.

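As for what to do with the data, we can save it locally or store it in a database, and anyone interested can go on to download the videos. Here I simply saved it locally as a file.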
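
For the database option, here is a rough sketch using SQLite from the standard library. The table name `videos` and the `save_items` helper are my own choices for illustration; the column names simply follow the item fields defined in items.py.

```python
import sqlite3


def save_items(conn, items):
    """Store scraped items (plain dicts) in a `videos` table, skipping duplicates."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS videos ("
        "video_url TEXT PRIMARY KEY, title TEXT, author TEXT, "
        "cut_url TEXT, create_time TEXT)"
    )
    # INSERT OR IGNORE leaves existing rows untouched when a URL is seen again.
    conn.executemany(
        "INSERT OR IGNORE INTO videos "
        "(video_url, title, author, cut_url, create_time) "
        "VALUES (:video_url, :title, :author, :cut_url, :create_time)",
        items,
    )
    conn.commit()


# Made-up sample row; a real run would pass dict(item) for each scraped item.
sample = {
    "video_url": "http://example.com/video.mp4",
    "title": "demo",
    "author": "someone",
    "cut_url": "http://example.com/cover.jpg",
    "create_time": "1500000000",
}
conn = sqlite3.connect(":memory:")
save_items(conn, [sample, sample])
print(conn.execute("SELECT COUNT(*) FROM videos").fetchone()[0])
```

Using video_url as the primary key means re-running the spider does not pile up duplicate rows. The same function could be called from a pipeline's process_item with `[dict(item)]`.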