Scraping Meipai Videos with the Scrapy Framework
阿新 • Published: 2019-01-22
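Scraping data from Meipai is not especially hard. The tricky part is the obfuscation applied to its video URLs.
Open a Meipai page and view its source. The page loads much like any other site, and the video_url is right there in the source. Look closely, though, and you will find a long opaque string: anyone would guess it is the video_url address, but it has been put through some kind of encryption or encoding. From experience I guessed Base64, with the site's own scheme inserting some random characters into the encoded string.
Here is the code I wrote; anyone interested can give it a try: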
items.py
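import scrapy


class MeipaiItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    cut_url = scrapy.Field()      # cover image URL
    create_time = scrapy.Field()  # creation time reported by the API
    video_url = scrapy.Field()    # decoded MP4 URL
    title = scrapy.Field()        # video caption
    author = scrapy.Field()       # uploader's screen name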
spider
# -*- coding: utf-8 -*-
import base64
import json
import logging

import scrapy

from Meipai.items import MeipaiItem


class MeipaiSpider(scrapy.Spider):
    name = 'meipai'
    allowed_domains = ['meipai.com']
    start_urls = ['http://www.meipai.com/']
    offset = 1
    # (channel name, topic id) pairs used to build the JSON feed URLs
    MeiPai = [
        ('搞笑', '13'),
        ('明星', '16'),
        ('高顏值', '474'),
        ('舞蹈', '5872239354896137479'),
        ('精選', '488'),
        ('音樂', '5871155236525660080'),
        ('美食', '5870490265939297486'),
        ('時尚', '27'),
        ('美狀', '6161763227134314911'),
        ('吃秀', '5871963671268989887'),
        ('寶寶', '5864549574576746574'),
        ('創意', '5875185672678760586'),
        ('遊戲', '5879621667768487138'),
        ('體育', '5872639793429995335'),
        ('娛樂', '6204189999771523532'),
    ]

    def parse(self, response):
        # Request the first page of the hot timeline for every channel.
        for channel, tid in self.MeiPai:
            json_url = 'http://www.meipai.com/topics/hot_timeline?page=1&count=24&tid={}'.format(tid)
            yield scrapy.Request(url=json_url, callback=self.parse_item)

    def system(self, string_num):
        # Convert a hexadecimal string to its decimal representation.
        return str(int(string_num.upper(), 16))

    def parse_item(self, response):
        data = json.loads(response.body.decode('utf-8'))
        for media in data.get('medias') or []:
            # Build a fresh item per video so yielded items do not share state.
            item = MeipaiItem()
            item['cut_url'] = media.get('cover_pic')
            item['create_time'] = media.get('created_at')

            title = media.get('caption')
            user = (media.get('user') or {}).get('screen_name')
            if not title or not user:
                # Skip incomplete entries instead of abandoning the rest of the page.
                continue
            item['title'] = title
            item['author'] = user

            try:
                encrypted = media.get('video')
                # The first four characters, reversed, are a hex number whose
                # decimal digits locate the two runs of inserted characters.
                num = self.system(encrypted[:4][::-1])
                start_num, start_count, end_num, end_count = (int(d) for d in num[:4])
                body = encrypted[4:]
                head = body[:start_num]
                middle = body[start_num + start_count:len(body) - end_num - end_count]
                # Explicit indices so that end_num == 0 yields an empty tail.
                tail = body[len(body) - end_num:]
                item['video_url'] = base64.b64decode(head + middle + tail).decode('utf-8')
            except Exception as e:
                logging.info(e)
                continue

            if item['video_url']:
                yield item
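The decoding steps inside parse_item can also be pulled out into a standalone function, which makes them easier to test without hitting the site. The sketch below follows the same slicing logic as the spider. The names `decode_video_url` and `obfuscate_for_demo` are illustrative, and `obfuscate_for_demo` is a hypothetical inverse written only so the decoder can be exercised offline; it is not Meipai's actual code, and the scheme itself is inferred from the spider rather than documented by the site.

```python
import base64


def decode_video_url(encoded):
    """Strip the two inserted junk runs, then Base64-decode the remainder."""
    # The first four characters are a hexadecimal number written backwards.
    digits = str(int(encoded[:4][::-1], 16))
    if len(digits) < 4:
        raise ValueError("expected at least four decimal digits, got %r" % digits)
    start_pos, start_len, end_pos, end_len = (int(d) for d in digits[:4])

    body = encoded[4:]
    # First junk run: start_len characters sitting at index start_pos.
    body = body[:start_pos] + body[start_pos + start_len:]
    # Second junk run: end_len characters that stop end_pos characters
    # before the end of the string.
    cut = len(body) - end_pos - end_len
    body = body[:cut] + body[cut + end_len:]
    return base64.b64decode(body).decode("utf-8")


def obfuscate_for_demo(url, digits="2315", junk="x"):
    """Hypothetical inverse of decode_video_url, for offline testing only."""
    start_pos, start_len, end_pos, end_len = (int(d) for d in digits)
    body = base64.b64encode(url.encode("utf-8")).decode("ascii")
    # Insert the trailing junk first, measured from the end of the clean string.
    tail_at = len(body) - end_pos
    body = body[:tail_at] + junk * end_len + body[tail_at:]
    # Then insert the leading junk at start_pos.
    body = body[:start_pos] + junk * start_len + body[start_pos:]
    # Header: the digits as a number, in 4-character hex, reversed.
    header = format(int(digits), "04x")[::-1]
    return header + body


sample = obfuscate_for_demo("http://example.com/video.mp4")
print(sample)
print(decode_video_url(sample))
```

Working with explicit indices rather than negative slices means an end offset of 0 produces an empty tail instead of the whole string.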
pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json


class MeipaiPipeline(object):
    def __init__(self):
        # Write UTF-8 explicitly so non-ASCII titles are stored as-is.
        self.file = open('data.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # One JSON object per line; ensure_ascii=False keeps Chinese text readable.
        content = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(content)
        return item

    def close_spider(self, spider):
        # Scrapy calls close_spider (not close_item) when the spider finishes.
        self.file.close()
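As the comment at the top of the pipeline file says, the pipeline only runs once it is listed in the ITEM_PIPELINES setting. A minimal sketch of that entry follows. It assumes the project package is named Meipai (as the spider's `from Meipai.items import MeipaiItem` suggests) and that the class lives in Meipai/pipelines.py; adjust the dotted path if your layout differs.

```python
# settings.py (fragment)
# The dotted path is an assumption based on the spider's import of Meipai.items.
# The integer sets the order when several pipelines are enabled (lower runs first).
ITEM_PIPELINES = {
    "Meipai.pipelines.MeipaiPipeline": 300,
}
```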
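Looking at the log output, the video URLs are now in the familiar MP4 format, and they can be requested and opened in a browser.
I picked one at random and opened it in a browser without any problem.
As for what to do with the data, it can be saved locally or stored in a database, and anyone interested can go on to download the videos. Here I have simply saved it to a local file.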
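For readers who do want to download the videos, here is a minimal sketch using only the standard library. The function names `filename_from_url` and `download_video`, the `videos` directory, and the `video.mp4` fallback name are all illustrative assumptions, not part of the original project.

```python
import os
import shutil
import urllib.request
from urllib.parse import urlparse


def filename_from_url(url):
    """Use the last path segment of the URL as the local file name."""
    name = os.path.basename(urlparse(url).path)
    # Fall back to a generic name when the URL has no file component.
    return name or "video.mp4"


def download_video(url, dest_dir="videos"):
    """Stream one video into dest_dir and return the path it was saved to."""
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, filename_from_url(url))
    # Copy in chunks so a large file is never held in memory all at once.
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest
```

Calling `download_video(item['video_url'])` for each line of data.json would be one way to use it. Scrapy also ships a FilesPipeline that can fetch files as part of the crawl, which may be a better fit for larger runs.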