requests-html: The Latest Web Scraping Tool

阿新 • Published: 2018-04-22


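Anyone who develops in Python has surely heard of the Requests library, a library for sending HTTP requests. For example, if we do HTTP-based API testing in Python, Requests will be the first choice, because it is both simple and powerful. Now its author, Kenneth Reitz, has developed requests-html for web scraping.

GitHub project page: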

https://github.com/kennethreitz/requests-html

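requests-html is a wrapper built on top of existing libraries such as PyQuery, Requests, lxml, and beautifulsoup4. The author has carried the simplicity and power of the Requests design over to this project.

Installation: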

pip install requests-html

Let's first look at the basic usage of requests-html.

from requests_html import HTMLSession
session = HTMLSession()

r = session.get('https://python.org/')

# Get all links on the page, as they appear in the HTML.
all_links = r.html.links
print(all_links)

# Get all links on the page as absolute URLs.
all_absolute_links = r.html.absolute_links
print(all_absolute_links)
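
To make the difference between the two properties concrete, here is a minimal sketch of how relative hrefs can be resolved against a page URL using only the standard library. The helper name `to_absolute` and the sample hrefs are made up for illustration; this shows the general idea, not the library's internal code.

```python
from urllib.parse import urljoin


def to_absolute(base_url, hrefs):
    """Resolve each href against base_url and return the set of absolute URLs."""
    return {urljoin(base_url, href) for href in hrefs}


# Hypothetical hrefs as they might appear in a page's <a> tags.
sample_hrefs = ["/about/", "downloads/", "https://docs.python.org/3/"]
absolute = to_absolute("https://www.python.org/", sample_hrefs)
for link in sorted(absolute):
    print(link)
```

Relative hrefs are joined onto the base URL, while hrefs that are already absolute pass through unchanged.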

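A First Try

As an IT professional, you probably want to keep an eye on tech news from time to time. Let's go to the news channel of 博客園 (cnblogs) and scrape the latest recommended news.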

from requests_html import HTMLSession

session = HTMLSession()

r = session.get("https://news.cnblogs.com/n/recommend")

# Find the news title links with a CSS selector
news = r.html.find('h2.news_entry > a')

for new in news:
    print(new.text)  # news title
    print(new.absolute_links)  # news link, as a set of absolute URLs

Output:

外賣小哥擊敗北大碩士，獲《中國詩詞大會》冠軍！董卿點贊
{'https://news.cnblogs.com/n/593573/'}
一圖看懂中國科學家如何用“魔法藥水”制備幹細胞
{'https://news.cnblogs.com/n/593572/'}
再見Windows：你曾是我的全部
{'https://news.cnblogs.com/n/593559/'}
復盤摩拜賣身美團：美女創始人背後有3個男人
{'https://news.cnblogs.com/n/593536/'}
不要把認錯當成一種PR！
{'https://news.cnblogs.com/n/593494/'}
胡瑋煒的胳膊擰不過馬化騰的大腿
{'https://news.cnblogs.com/n/593450/'}
被廢四年奇跡逆轉！XP用戶竟越來越多
{'https://news.cnblogs.com/n/593445/'}
天宮一號：我這一輩子
{'https://news.cnblogs.com/n/593414/'}
微軟重組的背後：未來屬於“微軟”，而不只是Windows
{'https://news.cnblogs.com/n/593375/'}
低俗、鬼畜受限令出臺後，內容原創者的日子會好過點嗎？
{'https://news.cnblogs.com/n/593370/'}
《頭號玩家》：一封獻給極客的情書
{'https://news.cnblogs.com/n/593307/'}
翻了下Google日語輸入法愚人節作品集，這個團隊真的超有病…
{'https://news.cnblogs.com/n/593285/'}
C++委員會決定在C++20中棄用原始指針
{'https://news.cnblogs.com/n/593271/'}
一文讀懂阿裏收購餓了麽：餓了麽和美團外賣決戰之日到了
{'https://news.cnblogs.com/n/593261/'}
剛剛天宮一號墜落 而中國空間站即將騰飛！
{'https://news.cnblogs.com/n/593248/'}
小抖怡情適可而止 不要為了博眼球而去做一些危險的事情
{'https://news.cnblogs.com/n/593238/'}
拼多多淘寶低價阿膠背後：放馬皮牛皮 摻禁用明膠
{'https://news.cnblogs.com/n/593232/'}
吳曉波對話劉強東：太保守 讓我只能看著機會流走
{'https://news.cnblogs.com/n/593176/'}

Extension: we could go a step further and persist this data to build our own "headlines" feed.
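
As one possible way to do that, the sketch below stores (title, URL) pairs in SQLite using the standard library. The function name `save_news`, the table layout, and the placeholder headlines are my own assumptions for illustration; in the scraper the rows would come from `new.text` and `new.absolute_links`.

```python
import sqlite3


def save_news(conn, items):
    """Insert (title, url) pairs, skipping URLs that are already stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS news (url TEXT PRIMARY KEY, title TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO news (url, title) VALUES (?, ?)",
        [(url, title) for title, url in items],
    )
    conn.commit()


# In-memory database for demonstration; pass a file path to keep the data.
conn = sqlite3.connect(":memory:")
# Placeholder rows standing in for scraped results.
items = [
    ("Example headline A", "https://news.cnblogs.com/n/593573/"),
    ("Example headline B", "https://news.cnblogs.com/n/593572/"),
]
save_news(conn, items)
save_news(conn, items)  # second call adds nothing because the URLs already exist
count = conn.execute("SELECT COUNT(*) FROM news").fetchone()[0]
print(count)
```

Using the URL as the primary key means the scraper can be re-run on a schedule without storing the same story twice.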

A Bit More Ambitious

Next, let's download wallpapers from a website, using 美桌網 (www.win4000.com) as the example.

import os

import requests
from requests_html import HTMLSession


# Save an image into the bg/ directory
def save_image(url, title):
    os.makedirs('./bg', exist_ok=True)  # create the directory if it is missing
    img_response = requests.get(url)
    with open('./bg/' + title + '.jpg', 'wb') as file:
        file.write(img_response.content)

# Wallpaper listing page; 1920*1080 wallpapers are selected here
url = "http://www.win4000.com/wallpaper_2358_0_10_1.html"

session = HTMLSession()
r = session.get(url)

# Find the wallpaper thumbnails on the page, follow each link to its
# detail page, and read the address of the full-size image there
items_img = r.html.find('ul.clearfix > li > a')
for img in items_img:
    detail_url = img.attrs['href']
    if "/wallpaper_detail" in detail_url:
        detail = session.get(detail_url)
        item_img = detail.html.find('img.pic-large', first=True)
        img_url = item_img.attrs['src']
        title = item_img.attrs['title']
        print(img_url + title)
        save_image(img_url, title)

The images on this site are quite easy to get. I have added comments in the code above, so I won't explain it further here.
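
One detail the script does not handle: the image title goes straight into a file name, so a title containing a slash or other special character could break the save. Below is a minimal sketch of a sanitizing helper using only the standard library. The name `safe_image_path`, its `default` parameter, and the sample title are hypothetical and not part of the original script.

```python
import re
from pathlib import Path


def safe_image_path(directory, title, default="untitled"):
    """Build a .jpg path under directory from a title, replacing risky characters."""
    # Replace whitespace, path separators, and characters that Windows
    # forbids in file names with a single underscore.
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "_", title).strip("_")
    return Path(directory) / f"{cleaned or default}.jpg"


print(safe_image_path("bg", "sunset / beach: 1920*1080"))
```

The returned path could be passed to `open(..., 'wb')` in place of the string concatenation in `save_image`.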

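Select an element with a CSS selector (here r is the response for https://python.org/ from the first example):

>>> about = r.html.find('#about', first=True)

Get the element's text content:

>>> print(about.text)

Get the element's attributes:

>>> about.attrs
{'id': 'about', 'class': ('tier-1', 'element-1'), 'aria-haspopup': 'true'}

Render the element's HTML:

>>> about.html

Select elements within the element:

>>> about.find('a')

Search for text on the page:

>>> r.html.search('Python is a {} language')[0]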
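
To illustrate what the `{}` placeholder in such a template means, here is a rough standard-library approximation that turns a template into a regular expression. The function `template_to_regex` and the sample text are my own illustration; this is not how requests-html implements `search`, which as far as I know relies on the `parse` library.

```python
import re


def template_to_regex(template):
    """Translate a template with {} or {name} fields into a compiled regex."""
    pattern = ""
    # Split on the fields, keeping them, so literal text and fields alternate.
    for part in re.split(r"(\{\w*\})", template):
        if part == "{}":
            pattern += "(.+?)"  # anonymous field: positional group
        elif re.fullmatch(r"\{\w+\}", part):
            pattern += f"(?P<{part[1:-1]}>.+?)"  # named field: named group
        else:
            pattern += re.escape(part)  # literal text must match exactly
    return re.compile(pattern)


text = "Python is a programming language that lets you work quickly."
match = template_to_regex("Python is a {} language").search(text)
print(match.group(1))
```

Each field captures the shortest run of text that lets the surrounding literal parts match, which is why index `[0]` in the call above refers to the first captured field.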

A more complex CSS selector example (copied from Chrome DevTools):

>>> r = session.get('https://github.com/')
>>> sel = 'body > div.application-main > div.jumbotron.jumbotron-codelines > div > div > div.col-md-7.text-center.text-md-left > p'

>>> print(r.html.find(sel, first=True).text)

XPath is also supported:

>>> r.html.xpath('/html/body/div[1]/a')

JavaScript Support

Let's grab some text that is rendered by JavaScript:

>>> r = session.get('http://python-requests.org')

>>> r.html.render()

>>> r.html.search('Python 2 will retire in only {months} months!')['months']
'<time>25</time>'

Note that the first time you run the render() method, it downloads Chromium into your home directory (e.g. ~/.pyppeteer/).

You can also use the library without Requests, by parsing an HTML string directly:

>>> from requests_html import HTML
>>> doc = """<a href='https://httpbin.org'>"""

>>> html = HTML(html=doc)
>>> html.links
{'https://httpbin.org'}
