Using requests to scrape Pear Video, Bilibili videos, and Autohome; bs4 document tree traversal and searching; CSS selectors
阿新 · Published: 2022-05-09
Today's topics
- Scraping Pear Video with requests
- Scraping Autohome with requests + bs4
- bs4 document tree traversal
- bs4 document tree searching
- CSS selectors
Details
1. Scraping Pear Video with requests
# requests: a library for sending HTTP requests--->it only sends requests and has no parser built in--->pair it with re, bs4, or lxml
# requests-html: sends the request + parses the HTML
# m3u8-format video is split into segments--->a members-only 6-minute preview only loads those 6 minutes
# Paid videos: "video parsing" services
### Save the complete video files locally
# Use re to parse out the data we want
# import requests
# res = requests.get("https://www.pearvideo.com/")
# print(res.text)

### Downloaded video files go into the local ./video directory
# https://www.pearvideo.com/category_loading.jsp?reqType=5&categoryId=8&start=24
import requests
import re

res = requests.get('https://www.pearvideo.com/category_loading.jsp?reqType=5&categoryId=8&start=24')
# print(res.text)

# Parse out every video address on the page
video_list = re.findall('<a href="(.*?)" class="vervideo-lilink actplay">', res.text)
# print(video_list)

for video in video_list:
    video_url = 'https://www.pearvideo.com/' + video
    video_id = video_url.split('_')[-1]
    header = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36',
        'Referer': video_url
    }
    # First anti-scraping layer: the Referer header is required
    res_video = requests.get('https://www.pearvideo.com/videoStatus.jsp?contId=%s&mrd=0.7113776105084832' % video_id,
                             headers=header)
    mp4_url = res_video.json()['videoInfo']['videos']['srcUrl']
    # Second anti-scraping layer: rewrite the unplayable address into a playable one
    mp4_url = mp4_url.replace(mp4_url.split('/')[-1].split('-')[0], 'cont-%s' % video_id)
    print(mp4_url)

    # Download to local disk
    res_video_detail = requests.get(mp4_url)
    with open('./video/%s.mp4' % video_id, 'wb') as f:
        for line in res_video_detail.iter_content(1024):
            f.write(line)

# Single-threaded downloading is slow; it is almost all IO, so threads speed it up noticeably--->use a thread pool to crawl videos for the whole site
# Unplayable (original) address:
# https://video.pearvideo.com/mp4/third/20220314/1652060493892-10097838-231626-hd.mp4
# Playable (rewritten) address:
# https://video.pearvideo.com/mp4/third/20220314/cont-1754713-10097838-231626-hd.mp4
# mp4_url = 'https://video.pearvideo.com/mp4/third/20220314/1652060493892-10097838-231626-hd.mp4'
# mp4_url = mp4_url.replace(mp4_url.split('/')[-1].split('-')[0], 'cont-%s' % video_id)
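Since the per-video download is pure IO, a thread pool makes the whole-site crawl noticeably faster. A minimal sketch, assuming the (mp4_url, video_id) pairs found above have been collected into a list; download_one is a hypothetical helper, not part of the original code:

from concurrent.futures import ThreadPoolExecutor
import requests

def download_one(mp4_url, video_id):
    # Download a single mp4 into the local ./video directory
    res = requests.get(mp4_url)
    with open('./video/%s.mp4' % video_id, 'wb') as f:
        for chunk in res.iter_content(1024):
            f.write(chunk)

# tasks: (mp4_url, video_id) pairs collected in the loop above (assumed)
tasks = []
with ThreadPoolExecutor(max_workers=10) as pool:
    for mp4_url, video_id in tasks:
        pool.submit(download_one, mp4_url, video_id)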
2. Scraping Bilibili videos
# A Bilibili video scraped to disk comes as two separate files:
a video file
an audio file
# Only after merging them with a third-party tool do you have the complete video
# Watermark removal--->ffmpeg--->can also add watermarks, concatenate, trim, matte, transcode, and more
# Install it and call it from Python to do the processing
# Import requests to send the HTTP requests
import requests
# Import json to parse the embedded JSON
import json
# Import re for the regular expressions
import re

# Request headers
headers = {
    'Accept': '*/*',
    'Accept-Language': 'en-US,en;q=0.5',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36'
}

# Match a value out of the page text with a regular expression
def my_match(text, pattern):
    match = re.search(pattern, text)
    print(match.group(1))
    print()
    return json.loads(match.group(1))

def download_video(old_video_url, video_url, audio_url, video_name):
    headers.update({"Referer": old_video_url})
    print("Start downloading video: %s" % video_name)
    video_content = requests.get(video_url, headers=headers)
    print('%s video size:' % video_name, video_content.headers['content-length'])
    audio_content = requests.get(audio_url, headers=headers)
    print('%s audio size:' % video_name, audio_content.headers['content-length'])
    # Download the video track
    received_video = 0
    with open('%s_video.mp4' % video_name, 'ab') as output:
        while int(video_content.headers['content-length']) > received_video:
            # Chunked download via Range requests
            headers['Range'] = 'bytes=' + str(received_video) + '-'
            response = requests.get(video_url, headers=headers)
            output.write(response.content)
            received_video += len(response.content)
    # Video track done

    # Download the audio track
    audio_content = requests.get(audio_url, headers=headers)
    received_audio = 0
    with open('%s_audio.mp4' % video_name, 'ab') as output:
        while int(audio_content.headers['content-length']) > received_audio:
            # Chunked download via Range requests
            headers['Range'] = 'bytes=' + str(received_audio) + '-'
            response = requests.get(audio_url, headers=headers)
            output.write(response.content)
            received_audio += len(response.content)
    # Audio track done
    return video_name

if __name__ == '__main__':
    # Replace with the video address you want to scrape
    url = 'https://www.bilibili.com/video/BV17F411M7pQ'
    # Send the request and get the page back
    res = requests.get(url, headers=headers)
    # JSON with the playback details
    playinfo = my_match(res.text, '__playinfo__=(.*?)</script><script>')
    # JSON with the video metadata
    initial_state = my_match(res.text, r'__INITIAL_STATE__=(.*?);\(function\(\)')
    # Several formats are offered; take the highest resolution (1080p) directly
    video_url = playinfo['data']['dash']['video'][0]['baseUrl']
    # Audio address
    audio_url = playinfo['data']['dash']['audio'][0]['baseUrl']
    video_name = initial_state['videoData']['title']
    print('Video title:', video_name)
    print('Video URL:', video_url)
    print('Audio URL:', audio_url)
    download_video(url, video_url, audio_url, video_name)
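To merge the two tracks from Python, ffmpeg can be called through subprocess. A minimal sketch, assuming ffmpeg is installed and on the PATH; the helper name and output filename are illustrative:

import subprocess

def merge_tracks(video_name):
    # Copy the video and audio streams into one mp4 without re-encoding
    subprocess.run([
        'ffmpeg',
        '-i', '%s_video.mp4' % video_name,  # video track downloaded above
        '-i', '%s_audio.mp4' % video_name,  # audio track downloaded above
        '-c', 'copy',                       # stream copy, no re-encode
        '%s.mp4' % video_name,              # merged output file
    ], check=True)

# merge_tracks(video_name)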
3. Scraping Autohome with requests + bs4
### All scraped images are saved locally
### You can also simply print the news URLs and image URLs in PyCharm and open them individually
import requests
# pip3 install beautifulsoup4
from bs4 import BeautifulSoup

res = requests.get('https://www.autohome.com.cn/news/1/#liststart')
# print(res.text)

# html.parser is bs4's default parser
soup = BeautifulSoup(res.text, 'html.parser')
# Search with bs4
ul_list = soup.find_all(name='ul', class_='article')
# print(len(ul_list))
for ul in ul_list:
    # Find all li tags under the ul tag
    li_list = ul.find_all(name='li')
    for li in li_list:
        h3 = li.find(name='h3')
        if h3:
            title = h3.text  # text content of the h3 tag
            desc = li.find(name='p').text
            img = li.find(name='img')['src']
            if not img.startswith('http'):
                img = 'https:' + img
            url = 'https:' + li.find('a')['href']
            print('''
            News title: %s
            News summary: %s
            News image: %s
            News URL: %s
            ''' % (title, desc, img, url))
            # Save the image locally
            res_img = requests.get(img)
            img_name = img.split('/')[-1]
            with open('./img/%s' % img_name, 'wb') as f:
                for line in res_img.iter_content(1024):
                    f.write(line)

# Store the data in a database with pymysql--->create the database and table--->cursor.execute(insert ...)--->commit
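A minimal sketch of that last step with pymysql; the connection parameters, database name, and table columns are assumptions for illustration, and the news table is assumed to already exist:

import pymysql

# Assumed connection settings and an existing `news` table
conn = pymysql.connect(host='127.0.0.1', user='root', password='123',
                       database='spider', charset='utf8mb4')
cursor = conn.cursor()
cursor.execute(
    'insert into news (title, summary, img, url) values (%s, %s, %s, %s)',
    (title, desc, img, url)  # values produced by the loop above
)
conn.commit()  # without commit the inserted row is not persisted
cursor.close()
conn.close()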
4. bs4 document tree traversal (fast, direct lookups)
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" id="id_p">lqz<b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
### html.parser: built in, average speed, tolerant of malformed HTML
# lxml: third party, fast, tolerant of malformed HTML
# soup = BeautifulSoup(html_doc, 'html.parser')
# pip3 install lxml
soup = BeautifulSoup(html_doc, 'lxml')
# print(soup.prettify())  # pretty-print the HTML into fully indented form
# 1. Traversing the document tree with dot notation -- fast
print(soup.title)
print(soup.body.p)
print(soup.body.p.b)
# 2. Get a tag's name
print(soup.title.name)  # title
print(soup.body.name)  # body
# 3. Get a tag's attributes
print(soup.body.p)  # find the p tag; walking down from body is a bit faster
print(soup.p['class'])  # ['title']  class can hold several values, so it is a list
print(soup.p['id'])  # id_p
print(soup.p.attrs)  # {'class': ['title'], 'id': 'id_p'}  all attributes in a dict
# 4. Get a tag's content (its text)
print(soup.p.text)  # lqzThe Dormouse's story  text of the tag and all its descendants joined together
print(soup.p.string)  # None  only returned when the tag has exactly one text child; with several descendants it is None
print(list(soup.p.strings))  # ['lqz', "The Dormouse's story"]  text of every descendant, yielded by a generator
# 5. Nested selection
# Dots can be chained to select nested tags
print(soup.head.title.string)  # The Dormouse's story
# 6. Children and descendants
print(soup.p.contents)  # all direct children of p, in a list
print(list(soup.p.children))  # an iterator over p's direct children; same as contents, just lighter on memory
print(list(soup.p.descendants))  # all descendants of p -- every nested tag and string
for i, child in enumerate(soup.p.children):
print(i, child)
for i, child in enumerate(soup.p.descendants):
print(i, child)
# 7. Parent and ancestors
print(soup.a.parent)  # the a tag's parent node
print(list(soup.a.parents))  # all ancestors of the a tag: its parent, that parent's parent, and so on
# 8. Siblings
print(soup.a.next_sibling)  # next sibling
print(soup.a.previous_sibling)  # previous sibling
print(list(soup.a.next_siblings))  # all following siblings => generator object
print(list(soup.a.previous_siblings))  # all preceding siblings => generator object
### Key points to remember:
. traversal
Attributes: [] or attrs.get()
Text: text, string, strings
5. bs4 document tree searching
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" id="id_p">lqz<b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')
##### Five kinds of filters: string, regular expression, list, True, function
# find: returns the first match   find_all: returns all matches
# 1. String --> the filter value is a plain string
res = soup.find_all(name='p')
# res = soup.find(id='id_p')
# res = soup.find_all(class_='story')
# res = soup.find_all(name='p', class_='story')  # both conditions must match (AND)
# res = soup.find(name='a', id='link2').text
# res = soup.find(name='a', id='link2').attrs.get('href')
# res = soup.find(attrs={'id': 'link2', 'class': 'sister'}).attrs.get('href')
print(res)
# 2. Regular expression --> the filter value is a compiled regex
import re
# res = soup.find_all(name=re.compile('^b'))
# res = soup.find_all(href=re.compile('^http'))
res = soup.find_all(class_=re.compile('^s'))
print(res)
# 3. List --> the filter value is a list
# res = soup.find_all(name=['body', 'a'])
# res = soup.find_all(class_=['sister', 'story'])
res = soup.find_all(id=['link2', 'link3'])
print(res)
# 4. True --> the filter value is True (match any tag that has that attribute)
# res = soup.find_all(name=True)
# res = soup.find_all(id=True)
res = soup.find_all(href=True)
print(res)
# 5. Function
def has_class_but_no_id(tag):
return tag.has_attr('class') and not tag.has_attr('id')
print(soup.find_all(name=has_class_but_no_id))  # tags that have a class but no id
### Additional notes:
# 1. Anything that appears in the HTML page can be extracted with bs4
# 2. Traversing and searching the document tree can be mixed freely
def has_class_but_no_id(tag):
return tag.has_attr('class') and not tag.has_attr('id')
print(soup.find(name=has_class_but_no_id).a.text)
# 3. Other find_all parameters -- limit: cap the number of results; recursive: whether to search recursively
def has_class_but_no_id(tag):
return tag.has_attr('class') and not tag.has_attr('id')
res = soup.find_all(name=has_class_but_no_id, limit=1)
print(res)
res = soup.find_all(name='a', recursive=False)  # non-recursive search -- faster, only looks one level down
print(res)
6. CSS selectors
### CSS and XPath selectors are universal--->essentially every parsing library (bs4, lxml, pyquery, selenium's) supports CSS selectors--->and CSS is used everywhere in front-end work
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" id="id_p">lqz<b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')
# soup.select()      # find all matches
# soup.select_one()  # find one match
'''
div        find div tags
div>a      find a tags that are direct children of a div
div a      find a tags anywhere under a div, at any depth
.sister    find tags with class sister
#id_p      find the tag with id id_p
'''
# res = soup.select('#id_p')
# res = soup.select('.sister')
res = soup.select_one('.story>a').attrs.get('href')
print(res)
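The same selectors carry over to the other libraries mentioned above. A small sketch with selenium, assuming it and a Chrome driver are installed; illustrative only, not from the original post:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://www.pearvideo.com/')
# same kind of link selector used with re earlier in this post
links = driver.find_elements(By.CSS_SELECTOR, 'a.vervideo-lilink')
print([a.get_attribute('href') for a in links])
driver.close()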
# The ultimate trick
import requests
response = requests.get('https://www.runoob.com/cssref/css-selectors.html')
soup = BeautifulSoup(response.text, 'lxml')
res = soup.select_one('#content > table > tbody > tr:nth-child(2) > td:nth-child(3)').text  # in the browser dev tools, right-click the tag --> Copy --> Copy selector
print(res)
# Anything that appears on the page can be extracted with bs4