Building an automated YouTube downloader in Python: the code
阿新 • Published: 2021-01-12
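> Per the [savefrom terms](https://en.savefrom.net/terms.html),
> this example and tutorial are for learning and discussion only; all rights belong to **savefrom.net**.
> The final code, comments included, is about 100 lines. Treat the code on GitHub as authoritative (bugs may get fixed there); this article only walks through the details.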
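# Project repository
[GitHub repository](https://github.com/Nambers/YoutubeDownloader)
# Approach
[Building an automated YouTube downloader in Python: the approach](https://blog.csdn.net/qq_40832960/article/details/112470584)
# Walkthrough
## 1. POST
Following the first step of the approach, we first need a `POST` request to fetch the encrypted JS blob. I use the third-party `requests` library for this; for background on web scraping, see [my earlier article](https://blog.csdn.net/qq_40832960/article/details/103854145).
### i. Format the POST headers first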
```python
# set the headers, otherwise the website will not return any information
# you may need to change the cookie below
headers = {
"cache-Control": "no-cache",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,"
"*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"accept-encoding": "gzip, deflate, br",
"accept-language": "zh-CN,zh;q=0.9,en;q=0.8",
"content-type": "application/x-www-form-urlencoded",
"cookie": "lang=en; country=CN; uid=fd94a82a406a8dd4; sfHelperDist=72; reference=14; "
"clickads-e2=90; poropellerAdsPush-e=63; promoBlock=64; helperWidget=92; "
"helperBanner=42; framelessHdConverter=68; inpagePush2=68; popupInOutput=9; "
"_ga=GA1.2.799702638.1610248969; _gid=GA1.2.628904587.1610248969; "
"PHPSESSID=030393eb0776d20d0975f99b523a70d4; x-requested-with=; "
"PHPSESSUD=islilfjn5alth33j9j8glj9776; _gat_helperWidget=1; _gat_inpagePush2=1",
"origin": "https://en.savefrom.net",
"pragma": "no-cache",
"referer": "https://en.savefrom.net/1-youtube-video-downloader-4/",
"sec-ch-ua": "\"Google Chrome\";v=\"87\", \"Not;A Brand\";v=\"99\",\"Chromium\";v=\"87\"",
"sec-ch-ua-mobile": "?0",
"sec-fetch-dest": "iframe",
"sec-fetch-mode": "navigate",
"sec-fetch-site": "same-origin",
"sec-fetch-user": "?1",
"upgrade-insecure-requests": "1",
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/87.0.4280.88 Safari/537.36"}
```
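The `cookie` value will probably need changing; ideally, copy the headers from your own browser. What each header means is outside the scope of this article; a search engine will tell you.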
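Since the `cookie` value has to come from your own browser, it may be convenient to paste the raw string copied from the developer tools and split it into a dictionary in code. The sketch below does this with plain string handling. The helper name `parse_cookie_header` is hypothetical (it is not part of the original project), and the shortened cookie string is only sample input.

```python
def parse_cookie_header(raw):
    """Split a raw `Cookie` header string into a {name: value} dict."""
    cookies = {}
    for pair in raw.split(";"):
        pair = pair.strip()
        if not pair:
            # skip the empty piece left by a trailing ";"
            continue
        # split on the first "=" only, since values may contain "=" themselves
        name, _, value = pair.partition("=")
        cookies[name] = value
    return cookies


# sample input: a shortened version of the cookie string used in this article
raw_cookie = "lang=en; country=CN; sfHelperDist=72; x-requested-with=; "
cookies = parse_cookie_header(raw_cookie)
print(cookies)
```

The resulting dict could be passed as `requests.post(..., cookies=cookies)` instead of hard-coding a `cookie` header, which keeps the per-user part separate from the fixed headers.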
### ii. Then format the parameters
```python
# set the form parameters (copied from the request as seen in Chrome)
kv = {"sf_url": url,
"sf_submit": "",
"new": "1",
"lang": "en",
"app": "",
"country": "cn",
"os": "Windows",
"browser": "Chrome"}
```
Here `sf_url` is the URL of the YouTube video to download; the other parameters stay the same.
### iii. Finally, send the POST request with `requests`
```python
# do the POST request
r = requests.post(url="https://en.savefrom.net/savefrom.php", headers=headers,
data=kv)
r.raise_for_status()
```
Note that the parameters are passed as `data=kv` (a form-encoded request body), not as `params=`.
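To see what `data=kv` actually puts on the wire, the sketch below builds the same form body with the standard library's `urlencode`. This is only an illustration of the `application/x-www-form-urlencoded` format that matches the `content-type` header above; the helper name `build_form_body` is hypothetical.

```python
from urllib.parse import urlencode


def build_form_body(video_url):
    """Return the form-encoded body for the parameters used in this article."""
    fields = {
        "sf_url": video_url,
        "sf_submit": "",
        "new": "1",
        "lang": "en",
        "app": "",
        "country": "cn",
        "os": "Windows",
        "browser": "Chrome",
    }
    # urlencode percent-escapes reserved characters such as ":" "/" "?" "="
    return urlencode(fields)


body = build_form_body("https://www.youtube.com/watch?v=YPvtz1lHRiw")
print(body)
```

With `params=kv` the same pairs would be appended to the URL as a query string instead, which is not what the site's form submits.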
### iv. Wrap it in a function
```python
import requests


def gethtml(url):
    # set the headers, otherwise the website will not return any information
    # you may need to change the cookie below
    headers = {
        "cache-Control": "no-cache",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,"
                  "*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "zh-CN,zh;q=0.9,en;q=0.8",
        "content-type": "application/x-www-form-urlencoded",
        "cookie": "lang=en; country=CN; uid=fd94a82a406a8dd4; sfHelperDist=72; reference=14; "
                  "clickads-e2=90; poropellerAdsPush-e=63; promoBlock=64; helperWidget=92; "
                  "helperBanner=42; framelessHdConverter=68; inpagePush2=68; popupInOutput=9; "
                  "_ga=GA1.2.799702638.1610248969; _gid=GA1.2.628904587.1610248969; "
                  "PHPSESSID=030393eb0776d20d0975f99b523a70d4; x-requested-with=; "
                  "PHPSESSUD=islilfjn5alth33j9j8glj9776; _gat_helperWidget=1; _gat_inpagePush2=1",
        "origin": "https://en.savefrom.net",
        "pragma": "no-cache",
        "referer": "https://en.savefrom.net/1-youtube-video-downloader-4/",
        "sec-ch-ua": "\"Google Chrome\";v=\"87\", \"Not;A Brand\";v=\"99\",\"Chromium\";v=\"87\"",
        "sec-ch-ua-mobile": "?0",
        "sec-fetch-dest": "iframe",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "same-origin",
        "sec-fetch-user": "?1",
        "upgrade-insecure-requests": "1",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/87.0.4280.88 Safari/537.36"}
    # set the form parameters (copied from the request as seen in Chrome)
    kv = {"sf_url": url,
          "sf_submit": "",
          "new": "1",
          "lang": "en",
          "app": "",
          "country": "cn",
          "os": "Windows",
          "browser": "Chrome"}
    # do the POST request
    r = requests.post(url="https://en.savefrom.net/savefrom.php", headers=headers,
                      data=kv)
    r.raise_for_status()
    # return the response body
    return r.text
```
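## 2. Call the decryption function
### i. Analysis
The difficulty here is running JavaScript code from Python. Solutions found online include `PyV8` among others; this article uses `execjs`. In the approach article we saw that the last few lines of the JS are the decryption call, so we only need to run the whole script once in `execjs` and then evaluate the decryption call on its own.
### ii. Extract the JS part first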
```python
# target(youtube address) url
url = "https://www.youtube.com/watch?v=YPvtz1lHRiw"
# get the target text
reo = gethtml(url)
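# Keep only the JavaScript part (the encrypted information is stored in the JS).
# NOTE: the separator strings were lost from this listing; split("") raises ValueError.
# Assumption: the JS sits inside the first <script type="text/javascript"> element;
# check the GitHub repository for the exact separators.
reo = reo.split('<script type="text/javascript">')[1].split("</script>")[0]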
```
A regular expression would also work here, but since I am not yet fluent with regex I simply used `split`.
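For comparison, here is a minimal sketch of the regex alternative mentioned above. It assumes the JavaScript we want is the body of the first `<script>` element in the response; the real page may contain several script elements, in which case the pattern would need to be narrowed. The helper name `extract_first_script` and the sample HTML are hypothetical.

```python
import re


def extract_first_script(html):
    """Return the body of the first <script> element, or None if there is none."""
    # re.S lets "." match newlines, since the script body spans several lines
    match = re.search(r"<script\b[^>]*>(.*?)</script>", html, re.S | re.I)
    return match.group(1) if match else None


# hypothetical sample standing in for the savefrom response
sample_html = """<html><body>
<script type="text/javascript">
var a = 1;
var b = a + 1;
</script>
<p>footer</p>
</body></html>"""

js = extract_first_script(sample_html)
print(js)
```

Unlike chained `split` calls, this returns `None` when nothing matches instead of raising `IndexError`, which makes a changed page layout easier to detect.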
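### iii. Take the first decryption call as our decryption function
If you fetch the results for several different videos, you will notice that the decryption function is different every time, but it always sits at the same line position.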
```python
# split into lines (this helps us find the decrypt function in the last few lines)
reA = reo.split("\n")
# get the decrypt function (first statement on the third line from the end)
name = reA[len(reA) - 3].split(";")[0] + ";"
```
So `name` is our decryption call (not the best variable name, haha).
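The line selection can be tried on a toy input. The sketch below applies the same indexing (third line from the end, first statement only) to a made-up script; the sample JavaScript and the helper name `pick_decrypt_statement` are hypothetical and only mimic the shape described above.

```python
def pick_decrypt_statement(js_source):
    """Return the first statement on the third line from the end."""
    lines = js_source.split("\n")
    # lines[-3] is the same element as lines[len(lines) - 3]
    return lines[-3].split(";")[0] + ";"


# made-up script: the interesting call sits on the third line from the end
sample_js = "\n".join([
    "(function(){",
    "/* obfuscated body */",
    "var result=decode('abc');console.log(result);",
    "})();",
    "",
])

name = pick_decrypt_statement(sample_js)
print(name)
```

Because the position is counted from the end, a trailing newline in the response matters: it produces a final empty element, as in the sample.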
### iv. Run it with execjs
```python
# compile the js code with execjs
ct = execjs.compile(reo)
# do the decryption
text = ct.eval(name.split("=")[1].replace(";", ""))
```
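Taking only what follows the `=` and dropping the semicolon means we evaluate just the function call, without the assignment. Running the assignment and decryption first and then reading the variable would also work.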
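One detail of `name.split("=")[1]` is worth knowing: it keeps only the text between the first and second `=`, so a call whose arguments contain `=` would be cut short. The sketch below shows a slightly more defensive variant using `str.partition`. The statement is a made-up example, and whether the real payload ever contains an extra `=` is not something this article establishes.

```python
def call_expression(statement):
    """Return the right-hand side of `var x=<call>;` without the trailing semicolon."""
    # partition splits on the first "=" only and keeps the rest intact
    _, _, rhs = statement.partition("=")
    return rhs.strip().rstrip(";")


statement = "var result=decode('a=b');"  # hypothetical statement with "=" in an argument

# the approach used in the article stops at the second "="
print(statement.split("=")[1].replace(";", ""))
# the partition-based variant keeps the whole call
print(call_expression(statement))
```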
However, this fails right away (if only it were that easy).
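#### 1. `this` (i.e. the window variable) does not exist
If I remember correctly, the error mentions `this` or `$b`. I tried removing every `this`, and also wrapping everything in a `class` (so that `this` would refer to that class), but neither worked. I then found that the `jsdom` package from `npm` can emulate the window variable inside `execjs` (there is probably a better way), so we need to install `npm` and, through it, `jsdom`, then rewrite the code above: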
```python
addition = """
const jsdom = require("jsdom");
const { JSDOM } = jsdom;
const dom = new JSDOM(` `);
window = dom.window;
document = window.document;
XMLHttpRequest = window.XMLHttpRequest;
"""
# compile the js with execjs; cwd is the output of `npm root -g` (npm's global node_modules path on your machine)
ct = execjs.compile(addition + reo, cwd=r'C:\Users\xxx\AppData\Roaming\npm\node_modules')
```
Here:
- `cwd` is the output of `npm root -g`, i.e. the path of npm's global modules directory
- `addition` is what emulates `window`
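Rather than hard-coding the `cwd` path, the output of `npm root -g` could be queried at run time. The sketch below is one way to do that with the standard library; the helper name `npm_global_root` is hypothetical, and it returns `None` when `npm` is not on `PATH` or the command fails. It avoids `capture_output=` so that it also runs on Python 3.6, the version listed for this project.

```python
import shutil
import subprocess


def npm_global_root():
    """Return the output of `npm root -g`, or None if npm is unavailable."""
    npm = shutil.which("npm")  # resolves npm.cmd on Windows as well
    if npm is None:
        return None
    try:
        done = subprocess.run([npm, "root", "-g"], stdout=subprocess.PIPE,
                              stderr=subprocess.PIPE, universal_newlines=True,
                              check=True)
    except (subprocess.CalledProcessError, OSError):
        return None
    return done.stdout.strip()


root = npm_global_root()
print(root)
```

The returned string could then be passed as `execjs.compile(addition + reo, cwd=root)`.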
Running the code again, however, surfaces the next error.
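#### 2. `alert` does not exist
This error arises because calling `alert` under `execjs` is meaningless: there is no browser to show a popup. Besides, `alert` is normally defined on `window`, and we have supplied our own `window`. So we need to define (override) `alert` at the start of the code: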
```python
# override the alert function: the code calls it in one place, and alerting is
# meaningless under execjs; without the override, the code raises an error
reo = reo.replace("(function(){", "(function(){\nthis.alert=function(){};")
```
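To see what this replacement does, the sketch below applies it to a toy script. Note that `str.replace` patches every occurrence of `(function(){` by default; the optional `count` argument limits it to the first ones. The sample JavaScript and the helper name `stub_alert` are hypothetical, and whether patching only the first wrapper would be enough for the real payload is untested here.

```python
ALERT_STUB = "(function(){\nthis.alert=function(){};"


def stub_alert(js_source, count=-1):
    """Inject a no-op alert after `(function(){`; count=-1 patches every occurrence."""
    return js_source.replace("(function(){", ALERT_STUB, count)


# toy script with two immediately-invoked function wrappers
sample = "(function(){alert('hi');(function(){})();})();"

patched_all = stub_alert(sample)             # same behavior as the article's code
patched_first = stub_alert(sample, count=1)  # only the outermost wrapper
print(patched_all)
print(patched_first)
```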
### v. Putting the code together
```python
# target(youtube address) url
url = "https://www.youtube.com/watch?v=YPvtz1lHRiw"
# get the target text
reo = gethtml(url)
# Keep only the JavaScript part (the encrypted information is stored in the JS).
# NOTE: the separator strings were lost from this listing; split("") raises ValueError.
# Assumption: the JS sits inside the first <script type="text/javascript"> element;
# check the GitHub repository for the exact separators.
reo = reo.split('<script type="text/javascript">')[1].split("</script>")[0]
# override the alert function: the code calls it in one place, and alerting is
# meaningless under execjs; without the override, the code raises an error
reo = reo.replace("(function(){", "(function(){\nthis.alert=function(){};")
# split into lines (this helps us find the decrypt function in the last few lines)
reA = reo.split("\n")
# get the decrypt function (first statement on the third line from the end)
name = reA[len(reA) - 3].split(";")[0] + ";"
# add jsdom because the code needs a window object (there may be a solution without jsdom, but I don't know one)
addition = """
const jsdom = require("jsdom");
const { JSDOM } = jsdom;
const dom = new JSDOM(` `);
window = dom.window;
document = window.document;
XMLHttpRequest = window.XMLHttpRequest;
"""
# compile the js with execjs; cwd is the output of `npm root -g` (npm's global node_modules path on your machine)
ct = execjs.compile(addition + reo, cwd=r'C:\Users\19308\AppData\Roaming\npm\node_modules')
# do the decryption
text = ct.eval(name.split("=")[1].replace(";", ""))
```
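## 3. Parse the decrypted result
### i. Extract the key JSON
After running the part above, the decrypted result is stored in `text`. As noted in the approach article, what really matters to us is the JSON passed to `window.parent.sf.videoResult.show()`, so we use a regular expression to extract that JSON.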
```python
# get the JSON argument of show(...); a raw string avoids invalid-escape warnings
result = re.search(r'show\((.*?)\);;', text, re.I | re.M).group(1)
```
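The extraction can be checked offline on a small sample. The sketch below wraps the regex and `json.loads` in one function and raises a clear error when no `show(...)` call is found. The sample text is made up to mimic the shape described above (a `show(...)` call ending in `);;`), and the helper name `extract_show_payload` is hypothetical.

```python
import json
import re


def extract_show_payload(text):
    """Return the parsed JSON argument of the first show(...);; call."""
    match = re.search(r"show\((.*?)\);;", text, re.I | re.M)
    if match is None:
        raise ValueError("no show(...) call found in the decrypted text")
    # group(1) is the text between "show(" and ");;"
    return json.loads(match.group(1))


# made-up stand-in for the decrypted text
sample_text = 'window.parent.sf.videoResult.show({"url": [{"url": "https://example.com/v.mp4"}]});;'

j = extract_show_payload(sample_text)
print(j["url"][0]["url"])
```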
### ii. Parse the JSON
Python has many libraries that can parse JSON; here I use the built-in `json` module (remember to import it).
```python
# use `json` to load json
j = json.loads(result)
```
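### iii. Get the download URL
Now for the last step. From the approach article and a JSON formatter, we can see that `j["url"][num]["url"]` is the download link, where `num` selects the video format we want (different resolutions and types).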
```python
# select the video format; in this case num = 1 means the video is:
# - 360p, known from j["url"][num]["quality"]
# - MP4, known from j["url"][num]["type"]
# - audio, known from j["url"][num]["audio"]
num = 1
downurl = j["url"][num]["url"]
# do some download
# thanks :)
# - EOF -
```
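Since the index `num` may not point at the same format for every video, one option is to pick the entry by its `quality` and `type` fields instead, and then stream the file to disk. The sketch below shows this under stated assumptions: the sample entries and their exact value formats (`"360p"`, `"MP4"`) are guesses based on the comments above, and the helper names `pick_format` and `download` are hypothetical.

```python
import shutil
import urllib.request


def pick_format(entries, quality, file_type):
    """Return the first entry matching quality and type (case-insensitive), or None."""
    for entry in entries:
        if (str(entry.get("quality", "")).lower() == quality.lower()
                and str(entry.get("type", "")).lower() == file_type.lower()):
            return entry
    return None


def download(url, path):
    """Stream the response body to `path` without loading it all into memory."""
    with urllib.request.urlopen(url) as response, open(path, "wb") as out:
        shutil.copyfileobj(response, out)


# made-up stand-in for the parsed JSON `j`
sample = {"url": [
    {"url": "https://example.com/720.mp4", "quality": "720p", "type": "MP4"},
    {"url": "https://example.com/360.mp4", "quality": "360p", "type": "MP4"},
]}

chosen = pick_format(sample["url"], "360p", "mp4")
print(chosen["url"])
# download(chosen["url"], "video.mp4")  # uncomment to fetch a real link
```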
# Full code
```python
# -*- coding: utf-8 -*-
# @Time: 2021/1/10
# @Author: Eritque arcus
# @File: Youtube.py
# @License: MIT
# @Environment:
# - windows 10
# - python 3.6.2
# @Dependence:
# - jsdom in npm(windows also can use)
# - requests, execjs, re, json in python
import requests
import execjs
import re
import json


def gethtml(url):
    # set the headers, otherwise the website will not return any information
    # you may need to change the cookie below
    headers = {
        "cache-Control": "no-cache",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,"
                  "*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "zh-CN,zh;q=0.9,en;q=0.8",
        "content-type": "application/x-www-form-urlencoded",
        "cookie": "lang=en; country=CN; uid=fd94a82a406a8dd4; sfHelperDist=72; reference=14; "
                  "clickads-e2=90; poropellerAdsPush-e=63; promoBlock=64; helperWidget=92; "
                  "helperBanner=42; framelessHdConverter=68; inpagePush2=68; popupInOutput=9; "
                  "_ga=GA1.2.799702638.1610248969; _gid=GA1.2.628904587.1610248969; "
                  "PHPSESSID=030393eb0776d20d0975f99b523a70d4; x-requested-with=; "
                  "PHPSESSUD=islilfjn5alth33j9j8glj9776; _gat_helperWidget=1; _gat_inpagePush2=1",
        "origin": "https://en.savefrom.net",
        "pragma": "no-cache",
        "referer": "https://en.savefrom.net/1-youtube-video-downloader-4/",
        "sec-ch-ua": "\"Google Chrome\";v=\"87\", \"Not;A Brand\";v=\"99\",\"Chromium\";v=\"87\"",
        "sec-ch-ua-mobile": "?0",
        "sec-fetch-dest": "iframe",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "same-origin",
        "sec-fetch-user": "?1",
        "upgrade-insecure-requests": "1",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/87.0.4280.88 Safari/537.36"}
    # set the form parameters (copied from the request as seen in Chrome)
    kv = {"sf_url": url,
          "sf_submit": "",
          "new": "1",
          "lang": "en",
          "app": "",
          "country": "cn",
          "os": "Windows",
          "browser": "Chrome"}
    # do the POST request
    r = requests.post(url="https://en.savefrom.net/savefrom.php", headers=headers,
                      data=kv)
    r.raise_for_status()
    # return the response body
    return r.text


if __name__ == '__main__':
    # target (YouTube address) url
    url = "https://www.youtube.com/watch?v=YPvtz1lHRiw"
    # get the target text
    reo = gethtml(url)
    # Keep only the JavaScript part (the encrypted information is stored in the JS).
    # NOTE: the separator strings were lost from this listing; split("") raises ValueError.
    # Assumption: the JS sits inside the first <script type="text/javascript"> element;
    # check the GitHub repository for the exact separators.
    reo = reo.split('<script type="text/javascript">')[1].split("</script>")[0]
    # override the alert function: the code calls it in one place, and alerting is
    # meaningless under execjs; without the override, the code raises an error
    reo = reo.replace("(function(){", "(function(){\nthis.alert=function(){};")
    # split into lines (this helps us find the decrypt function in the last few lines)
    reA = reo.split("\n")
    # get the decrypt function (first statement on the third line from the end)
    name = reA[len(reA) - 3].split(";")[0] + ";"
    # add jsdom because the code needs a window object (there may be a solution without jsdom, but I don't know one)
    addition = """
    const jsdom = require("jsdom");
    const { JSDOM } = jsdom;
    const dom = new JSDOM(` `);
    window = dom.window;
    document = window.document;
    XMLHttpRequest = window.XMLHttpRequest;
    """
    # compile the js with execjs; cwd is the output of `npm root -g` (npm's global node_modules path on your machine)
    ct = execjs.compile(addition + reo, cwd=r'C:\Users\19308\AppData\Roaming\npm\node_modules')
    # do the decryption
    text = ct.eval(name.split("=")[1].replace(";", ""))
    # get the JSON argument of show(...); a raw string avoids invalid-escape warnings
    result = re.search(r'show\((.*?)\);;', text, re.I | re.M).group(1)
    # use `json` to load json
    j = json.loads(result)
    # select the video format; in this case num = 1 means the video is:
    # - 360p, known from j["url"][num]["quality"]
    # - MP4, known from j["url"][num]["type"]
    # - audio, known from j["url"][num]["audio"]
    num = 1
    downurl = j["url"][num]["url"]
    # do some download
    # thanks :)
    # - EOF -
```
- 102 lines in total
- Development environment
```python
# @Environment:
# - windows 10
# - python 3.6.2
```
- Dependencies
```python
# @Dependence:
# - jsdom in npm(windows also can use)
# - requests, execjs, re, json in python
```
-end-
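> For web scraping
> Copyright notice: this is an original article by the author, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.
> Author: [https://www.cnblogs.com/Eritque-arcus/](https://www.cnblogs.com/Eritque-arcus/) or [https://blog.csdn.net/qq_40832960](https://blog.csdn.net/qq_40832960)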