Python Web Scraping: Crawling the Zhihu Hot List (Static Page)

阿新 • Published: 2022-03-05

1. Requesting the Zhihu hot list page

Reference code:

import requests


url = 'https://www.zhihu.com/hot'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36',
    # Replace with the cookie value from the request headers of a logged-in Zhihu session
    'cookie': '<cookie value from a logged-in Zhihu request>'
}
rsp2 = requests.get(url=url, headers=headers)

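Zhihu requires you to be logged in to view any of its content, so the request headers include a cookie field carrying the value from a logged-in session.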
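As a hedged sketch of handling that cookie: rather than pasting it into the source file, one option is to read it from an environment variable. The names `ZHIHU_COOKIE`, `build_headers`, and `looks_like_login_redirect` are hypothetical and do not appear in the original code. The idea that an unauthenticated request ends up on a `/signin` URL is an assumption that should be checked against the live site.

```python
import os
from urllib.parse import urlparse

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36"
)


def build_headers(cookie_env="ZHIHU_COOKIE"):
    """Build request headers, reading the cookie from an environment variable.

    The variable name is a hypothetical choice for this sketch.
    """
    cookie = os.environ.get(cookie_env, "").strip()
    if not cookie:
        raise RuntimeError(f"Set the {cookie_env} environment variable first")
    return {"user-agent": USER_AGENT, "cookie": cookie}


def looks_like_login_redirect(final_url):
    """Return True if the final URL path starts with /signin.

    Assumption: a missing or expired cookie sends the request to a sign-in page.
    """
    return urlparse(final_url).path.startswith("/signin")
```

With the request shown earlier, this could be used as `rsp2 = requests.get(url, headers=build_headers())`, followed by a check of `looks_like_login_redirect(rsp2.url)` before parsing.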

2. Parsing the hot list entries

2.1 Parsing with pyquery

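from pyquery import PyQuery as pq

# Select every link inside a hot-list item and print its rank, title, and URL
pq2 = pq(rsp2.text)
list1 = pq2('.HotList-list section .HotItem-content a')
for index, ele in enumerate(list1.items()):
    print(index + 1, ele.attr('title'), ele.attr('href'))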

2.2 Parsing with lxml (XPath syntax)

from lxml import etree

# The same selection expressed as an XPath query
html2 = etree.HTML(rsp2.text)
list2 = html2.xpath('//*[@class="HotList-list"]/section/*[@class="HotItem-content"]/a')
for i, ele in enumerate(list2, start=1):
    print(i, ele.xpath('./@title')[0], ele.xpath('./@href')[0])
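One thing to watch in the lxml version: an XPath attribute query returns a list, and indexing it with `[0]` raises `IndexError` if a link happens to lack that attribute. A minimal sketch of a guard follows. The helper name `first_or_default` is hypothetical and not part of lxml.

```python
def first_or_default(values, default=""):
    """Return the first item of an XPath result list, or a default if it is empty."""
    return values[0] if values else default
```

Inside the loop it could be used as `first_or_default(ele.xpath('./@title'))` in place of `ele.xpath('./@title')[0]`.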

2.3 Parsing with bs4

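from bs4 import BeautifulSoup

# Same CSS selector as the pyquery version; ranks start at 1 to match the other two
html2 = BeautifulSoup(rsp2.text, 'lxml')
list3 = html2.select('.HotList-list section .HotItem-content a')
for i, ele in enumerate(list3, start=1):
    print(i, ele['title'], ele['href'])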

2.4 Output

Each snippet prints one line per hot-list entry: its rank, title, and link.
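The live hot list changes constantly and needs a logged-in cookie, so the shape of the output is easiest to illustrate offline. The sketch below is not one of the three methods above. It swaps in the standard library's `html.parser` so that it runs without extra packages. `SAMPLE` is a hypothetical, simplified guess at the markup the selectors assume, with placeholder titles and URLs, and `HotListParser` and `parse_hot_items` are names invented for this example.

```python
from html.parser import HTMLParser

# Hypothetical, simplified markup; the real page is larger and may differ.
SAMPLE = """
<div class="HotList-list">
  <section class="HotItem">
    <div class="HotItem-index">1</div>
    <div class="HotItem-content">
      <a href="https://example.com/question/1" title="First sample question"><h2>First sample question</h2></a>
    </div>
  </section>
  <section class="HotItem">
    <div class="HotItem-content">
      <a href="https://example.com/question/2" title="Second sample question"><h2>Second sample question</h2></a>
    </div>
    <a href="https://example.com/other" title="Outside the content block">x</a>
  </section>
</div>
"""

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class HotListParser(HTMLParser):
    """Collect (title, href) from <a> tags nested inside a .HotItem-content element."""

    def __init__(self):
        super().__init__()
        self._stack = []  # one bool per open element: is it a HotItem-content block?
        self.items = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        # any(self._stack) is True when some open ancestor is a HotItem-content block
        if tag == "a" and any(self._stack) and "title" in attrs:
            self.items.append((attrs["title"], attrs.get("href")))
        if tag not in VOID_TAGS:
            self._stack.append("HotItem-content" in classes)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._stack:
            self._stack.pop()


def parse_hot_items(html):
    """Parse well-formed HTML and return a list of (title, href) tuples."""
    parser = HotListParser()
    parser.feed(html)
    parser.close()
    return parser.items


for rank, (title, href) in enumerate(parse_hot_items(SAMPLE), start=1):
    print(rank, title, href)
```

On this sample the loop prints two lines, `1 First sample question https://example.com/question/1` and `2 Second sample question https://example.com/question/2`. The third link is skipped because it sits outside a `HotItem-content` block. The stack-based tracking assumes every non-void tag is properly closed, which holds for the sample but not necessarily for arbitrary pages.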

3. Complete reference code

from pyquery import PyQuery as pq
from lxml import etree
from bs4 import BeautifulSoup
import requests


url = 'https://www.zhihu.com/hot'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36',
    # Replace with the cookie value from the request headers of a logged-in Zhihu session
    'cookie': '<cookie value from a logged-in Zhihu request>'
}
rsp2 = requests.get(url=url, headers=headers)

# Method 1: pyquery (CSS selector)
pq2 = pq(rsp2.text)
list1 = pq2('.HotList-list section .HotItem-content a')
for index, ele in enumerate(list1.items()):
    print(index + 1, ele.attr('title'), ele.attr('href'))


# Method 2: lxml (XPath)
html2 = etree.HTML(rsp2.text)
list2 = html2.xpath('//*[@class="HotList-list"]/section/*[@class="HotItem-content"]/a')
for i, ele in enumerate(list2, start=1):
    print(i, ele.xpath('./@title')[0], ele.xpath('./@href')[0])

# Method 3: bs4 (CSS selector); ranks start at 1 to match the other two
html2 = BeautifulSoup(rsp2.text, 'lxml')
list3 = html2.select('.HotList-list section .HotItem-content a')
for i, ele in enumerate(list3, start=1):
    print(i, ele['title'], ele['href'])
