A Summary of Selenium Pitfalls in Python

Axin • Published: 2019-01-04

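Tang Tang, the "Lock King", has been on fire lately and I am hooked. After lurking on Xiaodai's Weibo for two days I went from passer-by to fan, a die-hard one, so with some time on my hands I decided to start analyzing our Tang Tang. I crawled Xiaodai's Sina Weibo: liked posts, post content, post time, and the repost, comment, and like counts. I fell into plenty of pitfalls along the way but got past every one of them. That, I have to say, is the power of Xiaodai.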

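Simulating a human login and remembering cookies

Capturing all current windows

Crawling Sina Weibo is, I have to say, not easy.
First, open the login page at "https://weibo.com/" and let the automation take over. Clicking "Log in" makes the page jump to a new window, so at that point you need to capture all currently open windows.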

handles = browser.window_handles  # get the handles of all currently open windows

Page jumps / switching windows

With all the windows captured, lock onto the one you need.

browser.switch_to.window(handles[1])  # switch to the second window's handle
print(browser.current_window_handle)
print(browser.current_url)
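
Indexing `handles[1]` assumes the login window is always the second handle. As a hedged alternative, the sketch below switches to whichever handle was not present before the click. The helper name `switch_to_new_window` is hypothetical; it relies only on `window_handles` and `switch_to.window`, the same two WebDriver members used above.

```python
def switch_to_new_window(browser, known_handles):
    """Switch to the first window handle that is not in known_handles.

    Returns the new handle, or None (without switching) if no new
    window has opened yet.
    """
    for handle in browser.window_handles:
        if handle not in known_handles:
            browser.switch_to.window(handle)
            return handle
    return None
```

A typical use would be to record `before = browser.window_handles` just before clicking "Log in", then call `switch_to_new_window(browser, before)` once the click has gone through.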

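Handling the iframe login window

After countless attempts to locate the element by XPath, I kept hitting NoSuchElementException and could not get past it. After reading a lot of material I stumbled on the iframe element type, checked the page source, and found that the login window is an iframe. Delighted! Elements inside an iframe are invisible to the locators until the driver switches into that frame, which explains the exception.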

browser.implicitly_wait(5)
# First switch into the login iframe (by its id)
browser.switch_to.frame("ptlogin_iframe")
browser.find_element_by_xpath('//*[@id="qlogin_list"]/a[1]').click()

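Locate the login entry and click it to log in. Note that I registered my Sina Weibo account through QQ, so QQ has to be logged in at the same time.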
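
After working inside an iframe, the driver stays in that frame until it is told to leave, so later locators against the main page can fail in the same confusing way. The sketch below wraps the switch in a context manager that always returns to the top-level document. The name `inside_frame` is hypothetical; `switch_to.frame` and `switch_to.default_content` are the WebDriver calls it assumes.

```python
from contextlib import contextmanager


@contextmanager
def inside_frame(browser, frame_reference):
    """Switch into a frame for the duration of a with-block.

    frame_reference may be whatever switch_to.frame accepts,
    for example the iframe's id such as "ptlogin_iframe".
    """
    browser.switch_to.frame(frame_reference)
    try:
        yield browser
    finally:
        # Always return to the top-level document, even if the body raised.
        browser.switch_to.default_content()
```

With this in place, the login click would sit inside `with inside_frame(browser, "ptlogin_iframe"):` and everything after the block runs against the main page again.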

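Automating the search box

Locate the search box, type the query, and click search. After that it is just a series of automated clicks until you reach the home page of the target account.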

browser.implicitly_wait(5)
# Type the query into the search box, then click the search button
browser.find_element_by_xpath('//*[@id="plc_top"]/div/div/div[2]/input').send_keys("唐禹哲")
browser.find_element_by_xpath('//*[@id="plc_top"]/div/div/div[2]/a').click()


Save any cookies you need; next time you can log in with the cookies directly.
Also, pages may take time to load while crawling, so set a wait time.
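
The post does not show how the cookies were kept, so the following is only a minimal sketch of one way to do it: dump the list returned by `get_cookies()` to a JSON file and replay it with `add_cookie()` on the next run. The helper names and the file path are assumptions for illustration.

```python
import json


def save_cookies(browser, path):
    """Write the browser's current cookies to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(browser.get_cookies(), f)


def load_cookies(browser, path):
    """Add every cookie from the JSON file to the browser.

    Returns the number of cookies that were added.
    """
    with open(path, encoding="utf-8") as f:
        cookies = json.load(f)
    for cookie in cookies:
        browser.add_cookie(cookie)
    return len(cookies)
```

Selenium only accepts a cookie for the domain that is currently loaded, so on the next run the order would be: open "https://weibo.com/", call `load_cookies(browser, path)`, then refresh the page.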

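Asynchronous loading

Sina Weibo pages are a bit unusual: it takes roughly three scrolls to the bottom before the first page is fully loaded, and the same goes for every other page. None of the many articles I found on asynchronous loading with Selenium solved the problem, until I came across the solution below. What is asynchronous loading? When you scroll to the bottom, the page reveals content that is still waiting to be loaded.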

import time

from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

time.sleep(10)  # give the page time to finish its initial load

def selenium_loading_more(browser, method_index=0):
  if method_index == 0:
    browser.implicitly_wait(3)  # use a short 3-second implicit wait so scrolling stays fast
    for i in range(1, 4):  # at most 3 times
      print("loading more, window.scrollTo bottom for the", i, "time ...")
      browser.execute_script("window.scrollTo(0,document.body.scrollHeight);")
      try:
        # locate the pagination tab at the bottom of the page
        browser.find_element_by_css_selector("div[class='W_pages']")
        break  # no exception means the bottom marker was found, so leave the loop
      except NoSuchElementException:
        pass  # the bottom marker is not there yet, so keep scrolling
    browser.implicitly_wait(4)  # set the implicit wait to 4 seconds for what follows
  elif method_index == 1:
    # click the "load more" placeholder twice, pausing after each click
    browser.find_element_by_css_selector("div[class='empty_con clearfix']").click()
    print("loading more, sleep 4 seconds ... 1")
    time.sleep(4)
    browser.find_element_by_css_selector("div[class='empty_con clearfix']").click()
    print("loading more, sleep 2 seconds ... 2")
    time.sleep(2)
  elif method_index == 2:
    # same as method 1, but click through ActionChains
    load_more_1 = browser.find_element_by_css_selector("div[class='empty_con clearfix']")
    ActionChains(browser).click(load_more_1).perform()
    print("loading more, sleep 4 seconds ... 1")
    time.sleep(4)
    load_more_2 = browser.find_element_by_css_selector("div[class='empty_con clearfix']")
    ActionChains(browser).click(load_more_2).perform()
    print("loading more, sleep 2 seconds ... 2")
    time.sleep(2)
  elif method_index == 3:
    # explicit waits: click as soon as the placeholder becomes clickable
    print("loading more, waiting up to 4 seconds ... 1")
    element = WebDriverWait(browser, 4).until(
      EC.element_to_be_clickable((By.CSS_SELECTOR, "div[class='empty_con clearfix']"))
    )
    element.click()
    print("loading more, waiting up to 2 seconds ... 2")
    WebDriverWait(browser, 2).until(
      EC.element_to_be_clickable((By.CSS_SELECTOR, "div[class='empty_con clearfix']"))
    ).click()
  return browser

browser = selenium_loading_more(browser, method_index=0)
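
The first method above stops after a fixed three scrolls or when the `W_pages` marker appears. A common variation, sketched here as an assumption rather than as the approach the post used, is to keep scrolling until `document.body.scrollHeight` stops growing. The helper name `scroll_until_stable` and its `max_rounds` and `pause` parameters are hypothetical.

```python
import time


def scroll_until_stable(browser, max_rounds=10, pause=2.0):
    """Scroll to the bottom until the page height stops changing.

    Returns the number of scrolls performed (at most max_rounds).
    """
    last_height = browser.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the lazily loaded content time to arrive
        new_height = browser.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # nothing new was appended, so the page is complete
        last_height = new_height
    return max_rounds
```

The `pause` value is a trade-off: too short and a slow response looks like the end of the page, too long and every page costs several extra seconds.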

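OK! With asynchronous loading solved, most of the problem really is solved. Next comes element location.

Locating elements with Selenium

The simplest approach is to select the element and look up its XPath. XPath syntax comes in many forms.
The first is a partial match using contains():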

 browser.find_elements_by_xpath('//div[contains(@class,"WB_info")]//a[contains(@href,"//weibo.com/")]')

The second is the full-path form:

browser.find_element_by_xpath('//*[@id="Pl_Official_MyProfileFeed__21"]/div/div[{0}]/div[2]/div/ul/li[2]/a/span/span/span/em[2]'.format(i+1))

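With full paths, a liked post and a post the account wrote itself may have different XPath paths, so you need a conditional to handle each case.
There are many other methods as well; I only used these two here, so look up the rest yourself.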
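
One way to express the conditional mentioned above is to try a list of candidate XPaths in order and take the first that matches. The sketch below is an assumption about how that could look, not the post's own code. It uses `find_elements(by, value)`, which returns an empty list instead of raising when nothing matches; the helper name `first_match_text` is hypothetical.

```python
XPATH = "xpath"  # same string value as selenium.webdriver.common.by.By.XPATH


def first_match_text(browser, xpaths, default=None):
    """Return the text of the first element matched by any of the XPaths.

    The XPaths are tried in order; default is returned if none match.
    """
    for xpath in xpaths:
        matches = browser.find_elements(XPATH, xpath)
        if matches:
            return matches[0].text
    return default
```

If an implicit wait is set, each XPath that matches nothing waits out the full timeout before the next one is tried, so it pays to list the most common layout first.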

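Handling exceptions

A crawler needs appropriate exception handling. Otherwise you crawl for ages only to find that it errored out at the very end, which is enough to make you swear out loud. You know the feeling!
The usual exception handling is simply try/except: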

from requests.exceptions import ReadTimeout,HTTPError,ConnectionError,RequestException
from selenium.common.exceptions import NoSuchElementException

try:
    weibo_content.append(browser.find_element_by_xpath('//*[@id="Pl_Official_MyProfileFeed__21"]/div/div[{0}]/div[2]/div[3]/div[4]'.format(i+1)).text)

except NoSuchElementException:
    weibo_content.append(browser.find_element_by_xpath('//*[@id="Pl_Official_MyProfileFeed__21"]/div/div[{0}]/div[2]/div[4]/div[4]'.format(i+1)).text)

These are only a few examples; look up the rest as needed.
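
To keep one bad post from ending a long crawl, the extraction can be wrapped per item so that failures are recorded and the loop carries on. This is a minimal sketch under that assumption; `collect_safely` and its `extract` callback are hypothetical names, and catching the broad `Exception` is a deliberate simplification that real code would narrow to the exceptions it expects, such as NoSuchElementException.

```python
def collect_safely(items, extract):
    """Apply extract to every item, recording failures instead of stopping.

    Returns (results, failures): results has one entry per item, with None
    where extraction failed; failures lists (index, error description).
    """
    results, failures = [], []
    for index, item in enumerate(items):
        try:
            results.append(extract(item))
        except Exception as exc:  # broad on purpose in this sketch
            results.append(None)
            failures.append((index, repr(exc)))
    return results, failures
```

Because `results` keeps one slot per item, the collected columns stay aligned, and `failures` shows afterwards which rows need a second look.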
