Python Web Scraping: Crawling the Zhihu Hot List (a Static Page)
阿新 • Published 2022-03-05
1. Requesting the Zhihu hot-list page
Reference code:
import requests

url = 'https://www.zhihu.com/hot'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36',
    # Paste the cookie value from the request headers of a logged-in Zhihu session
    'cookie': 'cookie value from a logged-in Zhihu account',
}
rsp2 = requests.get(url=url, headers=headers)
Because Zhihu requires you to be logged in before browsing any of its content, a cookie field and its value are added to the request headers.
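Hardcoding the cookie in source code makes it easy to leak; one alternative is to read it from an environment variable. A minimal sketch (the variable name ZHIHU_COOKIE is an assumption for illustration, not anything Zhihu defines):

```python
import os

def build_headers(cookie=None):
    """Assemble the Zhihu request headers.

    Falls back to the ZHIHU_COOKIE environment variable (a hypothetical
    convention, not part of the original post) when no cookie is passed.
    """
    return {
        'user-agent': ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                       '(KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36'),
        'cookie': cookie if cookie is not None else os.environ.get('ZHIHU_COOKIE', ''),
    }

headers = build_headers('z_c0=example')
print(headers['cookie'])  # z_c0=example
```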
2. Parsing the hot-list entries
2.1 Parsing with the pyquery module
from pyquery import PyQuery as pq

pq2 = pq(rsp2.text)
list1 = pq2('.HotList-list section .HotItem-content a')
for index, ele in enumerate(list1.items()):
    print(index + 1, ele.attr('title'), ele.attr('href'))
2.2 Parsing with the lxml module (XPath syntax)
from lxml import etree

html2 = etree.HTML(rsp2.text)
list2 = html2.xpath('//*[@class="HotList-list"]/section/*[@class="HotItem-content"]/a')
for i in range(len(list2)):
    ele = list2[i]
    print(i + 1, ele.xpath('./@title')[0], ele.xpath('./@href')[0])
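As a quick offline check of the XPath expression, the same kind of invented fragment can be fed to etree.HTML (sample markup and values are illustrative only):

```python
from lxml import etree

# Invented HTML fragment mimicking the hot-list structure
sample = '''
<div class="HotList-list">
  <section><div class="HotItem-content">
    <a title="Topic A" href="https://www.zhihu.com/question/1">Topic A</a>
  </div></section>
</div>
'''
tree = etree.HTML(sample)
links = tree.xpath('//*[@class="HotList-list"]/section/*[@class="HotItem-content"]/a')
for i, a in enumerate(links):
    # ./@title and ./@href return lists of attribute values; take the first
    print(i + 1, a.xpath('./@title')[0], a.xpath('./@href')[0])
```

Note that `[@class="HotItem-content"]` matches the class attribute string exactly; if the real page adds extra classes to that element, a `contains(@class, ...)` predicate would be needed instead.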
2.3 Parsing with the bs4 module
from bs4 import BeautifulSoup

html2 = BeautifulSoup(rsp2.text, 'lxml')
list3 = html2.select('.HotList-list section .HotItem-content a')
for i in range(len(list3)):
    ele = list3[i]
    print(i + 1, ele['title'], ele['href'])  # i + 1 so ranks start at 1, matching the sections above
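The same CSS selector works in BeautifulSoup. An offline sketch on an invented fragment, using the standard library's 'html.parser' backend so lxml is not required:

```python
from bs4 import BeautifulSoup

# Invented HTML fragment mimicking the hot-list structure
sample = '''
<div class="HotList-list">
  <section><div class="HotItem-content">
    <a title="Topic A" href="https://www.zhihu.com/question/1">Topic A</a>
  </div></section>
</div>
'''
soup = BeautifulSoup(sample, 'html.parser')
for i, a in enumerate(soup.select('.HotList-list section .HotItem-content a')):
    # Tag objects support dict-style access to attributes
    print(i + 1, a['title'], a['href'])
```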
2.4 Execution results
3. Full reference code
from pyquery import PyQuery as pq
from lxml import etree
from bs4 import BeautifulSoup
import requests

url = 'https://www.zhihu.com/hot'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36',
    # Paste the cookie value from the request headers of a logged-in Zhihu session
    'cookie': 'cookie value from a logged-in Zhihu account',
}
rsp2 = requests.get(url=url, headers=headers)

# 2.1 pyquery
pq2 = pq(rsp2.text)
list1 = pq2('.HotList-list section .HotItem-content a')
for index, ele in enumerate(list1.items()):
    print(index + 1, ele.attr('title'), ele.attr('href'))

# 2.2 lxml (XPath)
html2 = etree.HTML(rsp2.text)
list2 = html2.xpath('//*[@class="HotList-list"]/section/*[@class="HotItem-content"]/a')
for i in range(len(list2)):
    ele = list2[i]
    print(i + 1, ele.xpath('./@title')[0], ele.xpath('./@href')[0])

# 2.3 bs4
html2 = BeautifulSoup(rsp2.text, 'lxml')
list3 = html2.select('.HotList-list section .HotItem-content a')
for i in range(len(list3)):
    ele = list3[i]
    print(i + 1, ele['title'], ele['href'])
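Instead of only printing, the scraped (rank, title, href) tuples are usually persisted. A minimal sketch writing them to CSV with the standard library (the filename hot_list.csv and the sample rows are assumptions for illustration):

```python
import csv

# Sample rows standing in for real scraped results (invented for illustration)
rows = [
    (1, 'Topic A', 'https://www.zhihu.com/question/1'),
    (2, 'Topic B', 'https://www.zhihu.com/question/2'),
]
# newline='' prevents blank lines on Windows; utf-8 keeps Chinese titles intact
with open('hot_list.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['rank', 'title', 'href'])
    writer.writerows(rows)
```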