Web scraping: Python 3.6 + selenium + BeautifulSoup for scraping data from dynamic web pages (suited to low crawl frequencies)

阿新 • Published: 2018-12-28

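A note up front: this article shows how to scrape data and images that a page loads through JavaScript after the initial page load.

The approach combines selenium (a Python package), Chrome (Google's browser), and chromedriver (the driver that lets selenium control Chrome).

It is best to download the latest version of both Chrome and chromedriver so that the two are compatible (reference: https://blog.csdn.net/yoyocat915/article/details/80580066)

Headless mode is also supported, so no browser window needs to open.

Straight to the code. site_url is the address to scrape, and CHROME_DRIVER_PATH is the location of the chromedriver executable.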

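from bs4 import BeautifulSoup
from selenium import webdriver

# Placeholder: set this to where chromedriver is stored on your machine
CHROME_DRIVER_PATH = '/path/to/chromedriver'


def get_dynamic_html(site_url):
    print('Loading dynamic page', site_url)
    chrome_options = webdriver.ChromeOptions()
    # disable the sandbox
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')
    # headless mode: no browser window is opened
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--disable-gpu')
    chrome_options.add_argument('--ignore-ssl-errors')
    # Selenium 3 style arguments, matching the Python 3.6 setup used here.
    # Selenium 4 instead takes service=Service(CHROME_DRIVER_PATH) and options=chrome_options.
    driver = webdriver.Chrome(executable_path=CHROME_DRIVER_PATH, chrome_options=chrome_options)
    try:
        driver.set_page_load_timeout(100)
        # driver.set_script_timeout(100)
        try:
            driver.get(site_url)
        except Exception as e:
            # driver.execute_script('window.stop()')  # stop loading once the timeout is exceeded
            print(e, 'dynamic web load timeout')
        # parse whatever has been rendered so far, even after a timeout
        data = driver.page_source
        return BeautifulSoup(data, 'html.parser')
    finally:
        # always shut the browser down, ignoring errors raised while quitting
        try:
            driver.quit()
        except Exception:
            pass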

The function returns a soup (a BeautifulSoup object). You can search it with methods such as select, find, and find_all to locate the nodes or data you want.
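As an illustration of pulling data out of the returned soup, the sketch below collects the address of every image on a page. The helper name collect_image_urls and the sample HTML are made up for this example. urljoin from the standard library resolves relative src values against the page address.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def collect_image_urls(soup, page_url):
    """Return absolute URLs for every <img> tag that has a src attribute."""
    urls = []
    for img in soup.find_all('img', src=True):
        urls.append(urljoin(page_url, img['src']))
    return urls


# Stand-in for the soup that get_dynamic_html() would return
sample_html = '''
<div id="gallery">
  <img src="/static/a.png">
  <img src="https://cdn.example.com/b.jpg">
  <img alt="placeholder without a src">
</div>
'''
sample_soup = BeautifulSoup(sample_html, 'html.parser')
image_urls = collect_image_urls(sample_soup, 'https://example.com/page')
print(image_urls)
```

The third img tag is skipped because src=True only matches tags that actually carry a src attribute.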

If you would rather save the page as text, write it to a file:

try:
    with open('xxx.html', 'w', encoding='utf-8') as f:
        # prettify() returns the parsed document as an indented string
        f.write(get_dynamic_html('https://xxx.com').prettify())
except Exception as e:
    print(e)

Now for a closer look at searching with BeautifulSoup.

First, how to locate a tag.

1. Use find (this blogger covers it in detail: https://www.jb51.net/article/109782.htm)

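find() returns the first match, e.g. soup.find(name='ul', attrs={'class': 'hh'}) returns the first ul with class='hh'
find_all() returns all matches
find_parent() searches the parent tags and returns the first match
find_parents() searches the parent tags and returns all matches
find_next_sibling() returns the next sibling tag
find_next_siblings() returns all following sibling tags

find_previous_sibling() returns the previous sibling tag
find_previous() returns the first matching tag before the current one
find_all_previous() returns all matching tags before the current one
find_next() returns the first matching tag after the current one
find_all_next() returns all matching tags after the current one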
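To make the methods listed above concrete, here is a small sketch run against a hand-written HTML snippet. The snippet and the variable names are invented for illustration only.

```python
from bs4 import BeautifulSoup

html = '''
<div id="box">
  <ul class="hh"><li>one</li><li>two</li><li>three</li></ul>
  <ul class="other"><li>four</li></ul>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

first_ul = soup.find(name='ul', attrs={'class': 'hh'})  # first <ul class="hh">
all_li = soup.find_all('li')                            # every <li> in the document
second_li = all_li[1]                                   # the <li> containing "two"

parent = second_li.find_parent('ul')              # nearest enclosing <ul>
ancestors = second_li.find_parents()              # enclosing tags, innermost first
next_li = second_li.find_next_sibling('li')       # the following sibling <li>
prev_li = second_li.find_previous_sibling('li')   # the preceding sibling <li>
later_li = second_li.find_all_next('li')          # every <li> later in the document

print(first_ul['class'])
print([li.get_text() for li in all_li])
print(parent is first_ul)
print([tag.name for tag in ancestors])
print(next_li.get_text(), prev_li.get_text())
print([li.get_text() for li in later_li])
```

The sibling methods only look at tags under the same parent, while find_all_next keeps going through the rest of the document, which is why it also reaches the li in the second ul.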

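2. Use select, which takes tag names, class names, and ids, much like jQuery selectors. For example, soup.select('p.link#link1') matches <p class='link' id='link1'></p>. Writing the parts with spaces between them, as in soup.select('p .link #link1'), means something different: the spaces are descendant combinators, so that selector matches an element with id link1 inside an element with class link inside a <p>.

You can also search by attribute, such as href, title, or link. For example, soup.select('p a[href="http://example.com/elsie"]') matches the innermost <a href='http://example.com/elsie'></a> that sits inside a <p></p>.

In short, BeautifulSoup is a good way to filter elements. The next post covers filtering scraped content with regular expressions.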
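The select() selectors described above can be tried on a small hand-written snippet. This is only a sketch: the HTML is invented for the example, and it deliberately reuses the id link1 on two elements (not valid in a real page) so that the compound and descendant forms can be compared side by side.

```python
from bs4 import BeautifulSoup

html = '''
<p class="link" id="link1">compound match</p>
<p><span class="link"><b id="link1">nested match</b></span></p>
<p class="story"><a href="http://example.com/elsie" title="Elsie">Elsie</a></p>
<div><a href="http://example.com/elsie">outside any p</a></div>
'''
soup = BeautifulSoup(html, 'html.parser')

# No spaces: one element that is a <p>, has class "link", and has id "link1"
compound = soup.select('p.link#link1')

# Spaces: #link1 inside .link inside <p> (descendant combinators)
nested = soup.select('p .link #link1')

# Attribute selector: only <a> tags with this href that sit inside a <p>
links_in_p = soup.select('p a[href="http://example.com/elsie"]')

# Attribute selector on title
titled = soup.select('a[title="Elsie"]')

print([tag.name for tag in compound])
print([tag.name for tag in nested])
print([a.get_text() for a in links_in_p])
print([a.get_text() for a in titled])
```

select() always returns a list, so an empty list means nothing matched. The a tag inside the div is left out of links_in_p because it has no p ancestor.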
