
Python Web Scraping: Fetching the Zhihu Hot List (a Static Page)

1. Requesting the Zhihu hot list page

Reference code:

import requests


url = 'https://www.zhihu.com/hot'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36',
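    # Replace with the cookie string from the request headers of a logged-in Zhihu session
    'cookie': 'cookie value from the request headers of a logged-in Zhihu account'
}
# Fetch the hot list page; the parsing steps below read rsp2.text
rsp2 = requests.get(url=url, headers=headers)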

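Zhihu requires you to be logged in to browse any of its content, so the request headers include a cookie field carrying the value from a logged-in session.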
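Since a missing or expired cookie is a likely reason for this request to fail, it can help to validate the headers and check where the request ended up. The sketch below is illustrative only: `build_headers`, `looks_like_login_redirect`, and the `ZHIHU_COOKIE` environment variable are hypothetical names, and the idea that an unauthenticated request lands on a URL whose path contains `signin` is an assumption, not something confirmed here.

```python
import os
from urllib.parse import urlsplit

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36"
)


def build_headers(cookie):
    """Return the request headers, refusing an empty cookie (hypothetical helper)."""
    if not cookie:
        raise ValueError("cookie is empty; copy it from a logged-in browser session")
    return {"user-agent": USER_AGENT, "cookie": cookie}


def looks_like_login_redirect(final_url):
    """Assumption: an unauthenticated request ends on a path containing 'signin'."""
    return "signin" in urlsplit(final_url).path


# Usage with requests (not executed here, since it needs a real cookie):
#   headers = build_headers(os.environ.get("ZHIHU_COOKIE", ""))
#   rsp2 = requests.get("https://www.zhihu.com/hot", headers=headers)
#   if looks_like_login_redirect(rsp2.url):   # rsp2.url is the final URL after redirects
#       raise RuntimeError("cookie missing or expired")
```

Reading the cookie from an environment variable also keeps the account credential out of the source file.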

2. Parsing the hot list entries

2.1 Parsing with pyquery

from pyquery import PyQuery as pq

pq2 = pq(rsp2.text)
# Select every link inside a hot list entry
list1 = pq2('.HotList-list section .HotItem-content a')
for index, ele in enumerate(list1.items()):
    print(index + 1, ele.attr('title'), ele.attr('href'))

2.2 Parsing with lxml (XPath syntax)

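from lxml import etree

html2 = etree.HTML(rsp2.text)
# Same target elements as in 2.1, written as an XPath expression
list2 = html2.xpath('//*[@class="HotList-list"]/section/*[@class="HotItem-content"]/a')
for i, ele in enumerate(list2):
    # xpath() returns a list, so [0] picks the single attribute value
    print(i + 1, ele.xpath('./@title')[0], ele.xpath('./@href')[0])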

2.3 Parsing with bs4

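from bs4 import BeautifulSoup

# The 'lxml' parser backend requires the lxml package to be installed
html2 = BeautifulSoup(rsp2.text, 'lxml')
list3 = html2.select('.HotList-list section .HotItem-content a')
for i, ele in enumerate(list3):
    # Number entries from 1, matching the pyquery and lxml versions
    print(i + 1, ele['title'], ele['href'])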
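All three variants need a live response and a valid cookie before the selector can be tried out. As a rough way to check the extraction logic offline, here is a dependency-free sketch built on the standard library's `html.parser` instead of pyquery, lxml, or bs4. The sample markup is hypothetical (only the class names are taken from the selectors above), `HotItemParser` and `extract_hot_items` are made-up names, and the sketch simplifies the selector to "any `a` inside `.HotItem-content`". It also returns `None` for a missing `title`, whereas `ele['title']` in bs4 raises `KeyError` and `xpath('./@title')[0]` raises `IndexError` when the attribute is absent.

```python
from html.parser import HTMLParser

# Hypothetical sample markup: only the class names come from the selectors
# in this article; the real page structure may differ.
SAMPLE_HTML = """
<div class="HotList-list">
  <section class="HotItem">
    <div class="HotItem-content">
      <a href="https://www.zhihu.com/question/1" title="First question">First</a>
    </div>
  </section>
  <section class="HotItem">
    <div class="HotItem-content">
      <a href="https://www.zhihu.com/question/2">Link without a title</a>
    </div>
  </section>
</div>
"""

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class HotItemParser(HTMLParser):
    """Collect (title, href) from <a> tags nested inside .HotItem-content."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # nesting depth inside .HotItem-content (0 = outside)
        self.items = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth:
            if tag == "a":
                # dict.get returns None when the attribute is missing
                self.items.append((attrs.get("title"), attrs.get("href")))
            if tag not in VOID_TAGS:
                self.depth += 1
        elif "HotItem-content" in (attrs.get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1


def extract_hot_items(html):
    """Return a list of (title, href) tuples found in the given HTML string."""
    parser = HotItemParser()
    parser.feed(html)
    return parser.items


for rank, (title, href) in enumerate(extract_hot_items(SAMPLE_HTML), start=1):
    print(rank, title, href)
```

The depth counter assumes reasonably well-formed markup; for real pages the three libraries above are the more robust choice.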

2.4 Output

3. Complete reference code

from pyquery import PyQuery as pq
from lxml import etree
from bs4 import BeautifulSoup
import requests


url = 'https://www.zhihu.com/hot'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36',
    # Replace with the cookie string from the request headers of a logged-in Zhihu session
    'cookie': 'cookie value from the request headers of a logged-in Zhihu account'
}
rsp2 = requests.get(url=url, headers=headers)

# 2.1 pyquery: CSS selector
pq2 = pq(rsp2.text)
list1 = pq2('.HotList-list section .HotItem-content a')
for index, ele in enumerate(list1.items()):
    print(index + 1, ele.attr('title'), ele.attr('href'))

# 2.2 lxml: XPath expression
html2 = etree.HTML(rsp2.text)
list2 = html2.xpath('//*[@class="HotList-list"]/section/*[@class="HotItem-content"]/a')
for i, ele in enumerate(list2):
    print(i + 1, ele.xpath('./@title')[0], ele.xpath('./@href')[0])

# 2.3 bs4: CSS selector (the 'lxml' backend requires the lxml package)
html2 = BeautifulSoup(rsp2.text, 'lxml')
list3 = html2.select('.HotList-list section .HotItem-content a')
for i, ele in enumerate(list3):
    # Number entries from 1, matching the two loops above
    print(i + 1, ele['title'], ele['href'])