
[Web Scraping] How to scrape web pages with Python + Selenium


1. Prerequisites

Target site (for demonstration only; please do not send frequent requests): https://www.kaola.com/

Required knowledge: Python, the Selenium library, and PyQuery

Reference: https://selenium-python-zh.readthedocs.io/en/latest/waits.html
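The explicit waits described in that reference poll a condition at a fixed interval until it returns a truthy value or a timeout expires. A minimal pure-Python sketch of the idea (the `wait_until` helper is hypothetical, for illustration only; no Selenium required):

```python
import time

def wait_until(condition, timeout=20, poll=0.5):
    # Poll `condition` until it returns a truthy value, or raise
    # TimeoutError once `timeout` seconds have elapsed.
    end = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= end:
            raise TimeoutError('condition not met within %s seconds' % timeout)
        time.sleep(poll)
```

Selenium's `WebDriverWait(driver, 20).until(...)` works the same way: `until` repeatedly evaluates an expected condition such as `presence_of_element_located` and returns the element as soon as it is found.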

2. A quick look at the site


3. Steps

  1. Goal:

    1. Open the browser

    2. Open the URL

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
import time
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from pyquery import PyQuery as py

brower = webdriver.Chrome()        # declare the WebDriver and instantiate Chrome
wait = WebDriverWait(brower, 20)   # a global explicit wait of up to 20 seconds
brower.get("https://www.kaola.com/")

  

  2. Search for 年貨 (New Year goods)

def search():
    try:
        brower.get("https://www.kaola.com/")
        # close button of the red-envelope popup
        close_windows = wait.until(
            EC.presence_of_element_located((By.XPATH, '//div[@class="cntbox"]//div[@class="u-close"]'))
        )
        # search input box
        input = wait.until(
            EC.presence_of_element_located((By.CSS_SELECTOR, '#topSearchInput'))
        )
        # search button
        submit = wait.until(
            EC.presence_of_element_located((By.XPATH, '//*[@id="topSearchBtn"]'))
        )
        close_windows.click()
        input.send_keys('年貨')
        time.sleep(2)
        submit.click()
        # total number of result pages for 年貨
        total = wait.until(
            EC.presence_of_element_located((By.CSS_SELECTOR, '#resultwrap > div.splitPages > a:nth-child(11)'))
        )
        return total.text
    except TimeoutException:
        return 'error'
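`search()` returns the total page count as text. A hypothetical helper (the URL pattern and parameter names below are my assumptions for illustration, not taken from kaola.com) shows how that return value could drive pagination:

```python
def page_urls(total_text, keyword='年貨'):
    """Build one search URL per result page from the pagination text, e.g. '20'."""
    try:
        total = int(total_text)
    except ValueError:   # search() returns 'error' on timeout
        return []
    # Hypothetical URL pattern; the real search endpoint and parameters may differ.
    base = 'https://search.kaola.com/search.html?key={}&pageNo={}'
    return [base.format(keyword, n) for n in range(1, total + 1)]
```

Each URL could then be loaded with `brower.get(...)` and parsed in turn.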

  

  3. Extract product information from the page

# parse the page with PyQuery
def get_product():
    wait.until(
        EC.presence_of_element_located((By.XPATH, '//*[@id="result"]//li[@class="goods"]'))
    )
    html = brower.page_source
    doc = py(html)
    goods = doc('#result .goods .goodswrap')
    for good in goods.items():
        product = {
            'image': good.find('a').attr('href'),
            'title': good.find('a').attr('title'),
            'price': good.find('.price .cur').text()
        }
        print(product)

def main():
    get_product()
    brower.close()
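The extraction loop above can be illustrated on a static snippet. For readers without PyQuery installed, here is a stand-in sketch using the standard library's `xml.etree.ElementTree` on a simplified, well-formed version of the goods markup (the snippet and its values are invented for illustration; the real kaola.com markup will differ):

```python
import xml.etree.ElementTree as ET

# A simplified, invented stand-in for one search-result item.
snippet = """<ul id="result">
  <li class="goods"><div class="goodswrap">
    <a href="/product/1.html" title="Nut gift box">
      <span class="price"><span class="cur">99.00</span></span>
    </a>
  </div></li>
</ul>"""

root = ET.fromstring(snippet)
products = []
for good in root.iter('a'):  # each goods link carries href, title, and a price span
    products.append({
        'image': good.get('href'),
        'title': good.get('title'),
        'price': good.find(".//span[@class='cur']").text,
    })
print(products)
```

The resulting dicts have the same shape as the `product` dict built by `get_product()`.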

  

..... more to come in a follow-up update
