Python Web Scraping (Dynamic Pages)

阿新 • Published: 2021-08-08

Scraping Dynamic Web Pages


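Beginners learning web scraping usually start with Douban. Once I could scrape Douban, I wanted to try another site and picked Maoyan Movies. As with Douban, when we open Maoyan in a browser and inspect the page, we can see plenty of the data we want. But the response fetched with requests contained none of it. At first I thought I had made a mistake, and only after asking around did I learn that this is a dynamic page: its HTML is generated only when a browser loads the page and runs its JavaScript, or after user actions such as mouse movement. To scrape the information we need, we therefore have to use a tool that imitates a user's browser so that the HTML actually gets generated.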

Parsing the Site

To parse the site, we fetch the page with the requests library; either XPath or BeautifulSoup can then be used to parse the response.

import requests
from bs4 import BeautifulSoup

def get_one_page(url):
    # Send a browser-like User-Agent with the request
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'}
    response = requests.get(url, headers=headers)
    if response.status_code != 200:  # a status code of 200 means the request succeeded
        return None
    soup = BeautifulSoup(response.text, 'html.parser')
    titles = soup.find_all('title')
    print(titles)
    return response.text

def main():
    url = 'https://maoyan.com/board/4?offset=0'
    html = get_one_page(url)
    # print(html)  # uncomment to inspect the raw response

main()

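After running the code above, the response turned out not to contain the content we need.

The last line of the output was </script><script src="/public/js/verify.e92127.js" type="text/javascript"></script></body></html>

Nothing useful can be parsed out of it. The site most likely uses JavaScript to generate its content dynamically, so I went on to study how to scrape JavaScript-driven pages.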
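A quick way to check whether a response is just such a script shell is to compare how much visible text it has with how many external scripts it loads. The sketch below uses only the standard library. The names `PageSummary` and `looks_script_driven`, the 200-character threshold, and the sample page (apart from the script tag quoted above) are my own assumptions, and the check is a rough heuristic, not a reliable detector.

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Collect external script URLs and visible text from an HTML document."""

    def __init__(self):
        super().__init__()
        self.script_srcs = []
        self.text_chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.script_srcs.append(src)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only count text that is outside <script> and <style>
        if self._skip_depth == 0 and data.strip():
            self.text_chunks.append(data.strip())


def looks_script_driven(html, min_text_chars=200):
    """Heuristic: external scripts are present but almost no visible text."""
    parser = PageSummary()
    parser.feed(html)
    parser.close()
    visible = sum(len(chunk) for chunk in parser.text_chunks)
    return bool(parser.script_srcs) and visible < min_text_chars


# Hypothetical sample built around the script tag seen in the output above
sample = ('<html><head><title>Maoyan</title></head><body>'
          '<script src="/public/js/verify.e92127.js" type="text/javascript"></script>'
          '</body></html>')
print(looks_script_driven(sample))  # True
```

If this returns True for `response.text`, the page probably needs a real browser engine to render, which is what the next section sets up.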

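Running JavaScript in Python with Selenium

Selenium is a powerful web scraping tool that was originally developed for automated website testing. It can make a browser load a site automatically, retrieve the data you need, take screenshots of pages, and check whether certain actions have happened on the site.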

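Installing Selenium

First install Selenium. The code in this post uses webdriver.PhantomJS and the find_element_by_* methods, which were removed in Selenium 4, so a Selenium 3.x release is needed to run it as written.

pip install "selenium<4"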

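Downloading PhantomJS

With Selenium installed, we still need PhantomJS. Because it is a full-featured (though headless) browser rather than a Python library, it cannot be installed with pip.

Open https://phantomjs.org/download.html in a browser.

Download the build for your platform, extract it, and add the bin folder inside the extracted package to the PATH environment variable.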
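To confirm that the PATH change took effect, you can ask Python where it finds the executable. This is a small sketch using the standard library's `shutil.which`; the helper name `find_executable` is my own.

```python
import shutil


def find_executable(name):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)


path = find_executable("phantomjs")
if path is None:
    print("phantomjs not found on PATH; pass executable_path explicitly instead")
else:
    print("phantomjs found at", path)
```

If the lookup fails, the scripts below still work as long as `executable_path` points at the binary directly.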

from selenium import webdriver
import time

# Point executable_path at your own phantomjs binary
driver = webdriver.PhantomJS(executable_path=r'D:/phantomjs/phantomjs-2.1.1-windows/bin/phantomjs.exe')
driver.get('https://pythonscraping.com/pages/javascript/ajaxDemo.html')
time.sleep(3)  # give the page's Ajax call time to finish
print(driver.find_element_by_id('content').text)
driver.close()
'''
Here is some important text you want to retrieve!
A button to click!
'''

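If this test prints the text shown above, the installation is working.

Scraping the Page

We set a wait time because the JavaScript has to finish loading before all of the information is available.

The code is as follows: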

from selenium import webdriver  # WebDriver bindings
import time
from selenium.common.exceptions import StaleElementReferenceException

def waitForLoad(driver):
    # Keep a reference to the current <html> element and poll it every
    # 0.5 s; give up after 20 polls (about 10 seconds).
    elem = driver.find_element_by_tag_name("html")
    count = 0
    while True:
        count += 1
        if count > 20:
            print('Timing out after 10 seconds and returning')
            return
        time.sleep(.5)
        try:
            # Touching the old element raises StaleElementReferenceException
            # once the page has replaced its <html> element
            elem.tag_name
        except StaleElementReferenceException:
            return

driver = webdriver.PhantomJS(executable_path=r'D:/phantomjs/phantomjs-2.1.1-windows/bin/phantomjs.exe')
url = 'https://maoyan.com/board/4?offset=0'
driver.get(url)
waitForLoad(driver)  # wait for the page's JavaScript to finish loading
print(driver.page_source)
driver.close()  # always close the driver when done

(Screenshot: part of the page's HTML.)
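The waiting idea can also be written as a general polling helper: call a check repeatedly until it succeeds or a deadline passes. The sketch below is standard-library only. `wait_until`, `content_ready`, and the `state` dictionary are hypothetical names, and the fake condition merely stands in for "the content has appeared on the page".

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or None if `timeout` seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


# Stand-in for a page whose content shows up on the third poll (hypothetical)
state = {"polls": 0}


def content_ready():
    state["polls"] += 1
    return "loaded" if state["polls"] >= 3 else None


print(wait_until(content_ready, timeout=2.0, interval=0.1))  # loaded
```

With a real driver, the condition might be something like `lambda: driver.find_elements_by_tag_name('dd')`, where the tag is only a guess at what the page contains. Selenium also ships its own `WebDriverWait` class in `selenium.webdriver.support.ui` for this purpose.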

The movie data is now retrieved successfully; from here, parsing works just as it does for Douban. The code is based on 《Python網路爬蟲權威指南》.
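As a rough illustration of that last step, the rendered HTML can be fed to any parser, exactly as with a static page. The sketch below uses the standard library's `html.parser`. The `page_source` string and the `<p class="name">` markup are made up for the example, so the real page will need its own selectors.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collect the text of every <p class="name"> element (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_name = False

    def handle_starttag(self, tag, attrs):
        if tag == "p" and dict(attrs).get("class") == "name":
            self._in_name = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_name = False

    def handle_data(self, data):
        if self._in_name:
            self.titles[-1] += data.strip()


# Made-up stand-in for driver.page_source; the real markup may differ
page_source = """
<dl>
  <dd><p class="name"><a href="/films/1">Film A</a></p></dd>
  <dd><p class="name"><a href="/films/2">Film B</a></p></dd>
</dl>
"""

extractor = TitleExtractor()
extractor.feed(page_source)
print(extractor.titles)  # ['Film A', 'Film B']
```

In the real script, `driver.page_source` would take the place of the sample string, and BeautifulSoup or XPath could be used instead of `HTMLParser`.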
