Python API Automation (26): The requests-html Scraping Framework

阿新 • Published: 2020-08-07

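Preface

Python has no shortage of scraping frameworks, and requests-html is among the simplest. It comes from the same author as the well-known HTTP library requests and focuses on extracting data from HTML, which makes it arguably the easiest scraping framework to pick up.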

Installation

pip install requests-html

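The library aims to make parsing HTML (for example, scraping the web) as simple and intuitive as possible. Its advantages:

Full JavaScript support
CSS selectors (a.k.a. jQuery-style, thanks to PyQuery)
XPath selectors, for the faint of heart
Mocked user agent (like a real web browser)
Automatic following of redirects
Connection pooling and cookie persistence
The familiar Requests experience, with magical parsing abilities
Async support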

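Grabbing the URLs on a page

Send a GET request to get a response object. r.html.links returns every link on the page, including relative ones; if you want the links resolved to absolute URLs, use r.html.absolute_links instead.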

from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://www.baidu.com/")
# All links on the page (relative ones included)
all_links = r.html.links
print(all_links)
print(type(all_links))
print(len(all_links))
# The same links as absolute URLs
all_absolute_links = r.html.absolute_links
print(all_absolute_links)
print(len(all_absolute_links))

Result

The output shows that both attributes return a set, so duplicate links are removed automatically.
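To make the difference between relative and absolute links concrete without hitting the network, here is a minimal sketch that uses only the standard library. It is not how requests-html is implemented; the LinkCollector class, the sample HTML, and the base URL are all made up for illustration. It shows the two ideas described above: storing links in a set removes duplicates, and relative links are resolved against the page URL to become absolute.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects every href found on <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = set()  # a set drops duplicate hrefs automatically

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(value)


# Made-up sample page and base URL, for illustration only.
SAMPLE_HTML = """
<a href="/about">About</a>
<a href="/about">About (duplicate)</a>
<a href="news/today.html">News</a>
<a href="https://example.com/help">Help</a>
"""
BASE_URL = "https://example.com/home/"

collector = LinkCollector()
collector.feed(SAMPLE_HTML)

# Links exactly as written in the page (relative ones included).
links = collector.links
# The same links resolved against the page URL.
absolute_links = {urljoin(BASE_URL, link) for link in links}

print(sorted(links))
print(sorted(absolute_links))
```

The four anchors collapse to three links because /about appears twice, which mirrors the set behavior noted above.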

Locating elements with XPath

If you have used the lxml library, you know it can look up elements with XPath. requests-html supports XPath just as well.

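The target site is https://www.cnblogs.com/canglongdao/category/1588084.html (note: everything below was tried while logged out, so this URL does not require a login). The goal is to get the titles of all the articles.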

Open the site in Chrome, press F12 to bring up DevTools, then right-click the title element in the Elements panel and copy its XPath.

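With the mouse on the line containing the href, the copied XPath is //*[@id="main"]/div[2]/h5/a. Trying another article title gives //*[@id="main"]/div[1]/h5/a. Only the index after div differs, so to match the href line of every article, drop the index: //*[@id="main"]/div/h5/a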
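The effect of dropping the index can be checked offline. The sketch below uses the standard library's xml.etree.ElementTree (which supports a small XPath subset) on a made-up, well-formed stand-in for the article list; the markup and hrefs are assumptions, not the real page. It also reads the title and the link from the same a element, one possible way to keep the two paired.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed stand-in for the article list page.
SAMPLE = """
<html><body>
  <div id="main">
    <div class="post"><h5><a href="/p/1.html"><span>First post</span></a></h5></div>
    <div class="post"><h5><a href="/p/2.html"><span>Second post</span></a></h5></div>
    <div class="post"><h5><a href="/p/3.html"><span>Third post</span></a></h5></div>
  </div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# With an index, the path matches only the second <div> under #main.
second_only = root.findall(".//*[@id='main']/div[2]/h5/a")

# Without the index, the same path matches every <div> under #main.
all_anchors = root.findall(".//*[@id='main']/div/h5/a")

print(len(second_only), len(all_anchors))

# Read title and link from the same <a>, so the two always stay paired.
posts = [(a.find("span").text, a.get("href")) for a in all_anchors]
for title, href in posts:
    print(title, href)
```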

Reference code:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://www.cnblogs.com/canglongdao/category/1588084.html")
# First match only
one = r.html.xpath("//*[@id='main']/div/h5/a/span", first=True).text
print("one:{}".format(one))

# All matches
f = r.html.xpath("//*[@id='main']/div/h5/a/span")
print(type(f), len(f))
u = r.html.xpath("//*[@id='main']/div/h5/a")
for i, j in zip(f, u):
    print(i.text, j.absolute_links)  # title text, link

Result

The xpath method returns a list; passing first=True returns only the first match.
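One thing worth keeping in mind with first=True: as far as I can tell, when nothing matches the result is None rather than an empty list, so calling .text on it directly would raise an AttributeError. The sketch below imitates that convention with a hypothetical first_or_list helper (not part of requests-html) and shows a simple guard.

```python
def first_or_list(matches, first=False):
    """Hypothetical helper imitating the first=True convention.

    Returns the whole list by default; with first=True returns the
    first item, or None when the list is empty.
    """
    if not first:
        return matches
    return matches[0] if matches else None


titles = ["First post", "Second post"]

everything = first_or_list(titles)          # the whole list
first = first_or_list(titles, first=True)   # only the first item
missing = first_or_list([], first=True)     # nothing matched

# Guard before using the result, instead of assuming a match exists.
safe_title = missing.upper() if missing is not None else ""

print(everything, first, missing, repr(safe_title))
```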

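CSS selector support

requests-html also supports CSS selectors through the find method; the example above can be rewritten with CSS selectors to the same effect.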

from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://www.cnblogs.com/canglongdao/category/1588084.html")
# First match only, CSS syntax
one = r.html.find(".post>h5>a>span", first=True).text
print("one:{}".format(one))

# All matches, CSS syntax
f = r.html.find(".post>h5>a>span")
print(type(f), len(f))
u = r.html.find(".post>h5>a")
for i, j in zip(f, u):
    print(i.text, j.absolute_links)  # title text, link

Other methods

from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://www.cnblogs.com/canglongdao/category/1588084.html")
a = r.html.find(".post>h5", first=True)
# 1. Get the text
print(a.text)
# 2. Get the HTML content
print(a.html)
# 3. Get all attributes
print(a.attrs)
# 4. Get the links
print(a.absolute_links)
# 5. Search the text with a template; {} captures the variable part
print(a.search('python介面自動化（{}）  urlencode編碼%E7%94%9F%E6%B4%BB與解碼')[0])
# 6. Keep locating child elements
print(a.find('a'))
# 7. containing: fuzzy match, keeps only elements whose text contains 'python'
b = a.find('a', containing='python')
print(b)

Result

python介面自動化（二十五） urlencode編碼%E7%94%9F%E6%B4%BB與解碼
<h5>
<a class="entrylistItemTitle" href="https://www.cnblogs.com/canglongdao/p/13451393.html">
<span>python介面自動化（二十五）  urlencode編碼%E7%94%9F%E6%B4%BB與解碼</span>
</a>
</h5>
{}
{'https://www.cnblogs.com/canglongdao/p/13451393.html'}
二十五
[<Element 'a' class=('entrylistItemTitle',) href='https://www.cnblogs.com/canglongdao/p/13451393.html'>]
[<Element 'a' class=('entrylistItemTitle',) href='https://www.cnblogs.com/canglongdao/p/13451393.html'>]
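The search call in step 5 is what prints 二十五 above: the {} in the template marks the part to capture. To illustrate the idea, here is a rough sketch using the standard library's re module. It is an approximation, not the library's own code (to my knowledge requests-html delegates this to the third-party parse package), and template_search is a hypothetical helper.

```python
import re


def template_search(template, text):
    """Hypothetical helper: find the first match of a '{}' template in text.

    Each '{}' placeholder becomes a non-greedy capture group; everything
    else in the template is matched literally.
    """
    literal_parts = [re.escape(part) for part in template.split("{}")]
    pattern = "(.+?)".join(literal_parts)
    match = re.search(pattern, text)
    return match.groups() if match else None


# The span content is taken from the output shown above.
html = "<span>python介面自動化（二十五）  urlencode編碼%E7%94%9F%E6%B4%BB與解碼</span>"

result = template_search("python介面自動化（{}）", html)
print(result[0])

# A template that does not occur in the text yields None.
no_match = template_search("nothing {} here", html)
print(no_match)
```

Because the captured group is non-greedy, the template needs some literal text after the {} to delimit the capture, as the closing bracket does here.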