[selenium] Driving Chrome with selenium to crawl web pages / headless Chrome / using a proxy

阿新 • Published: 2019-01-02

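Installing selenium and chromedriver

Install Chrome first. There are version requirements, and they differ between Linux and Windows, so check them for your platform; use Chrome 61 or later where possible.
Then install the selenium library, download the chromedriver release that matches your Chrome version, and put chromedriver in a directory on the PATH environment variable so that selenium can call it.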
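
Before starting a crawl it can help to confirm that chromedriver really is reachable through PATH. The following is a minimal sketch using only the standard library; the function name `find_chromedriver` is made up for this illustration and is not part of selenium.

```python
import shutil


def find_chromedriver(name='chromedriver'):
    """Return the full path of the driver if it is on PATH, otherwise None."""
    # shutil.which searches the directories listed in PATH, the same lookup
    # selenium relies on when no explicit driver path is given.
    return shutil.which(name)


if __name__ == '__main__':
    path = find_chromedriver()
    print(path or 'chromedriver not found on PATH')
```

If this prints "chromedriver not found on PATH", selenium will most likely fail to start Chrome for the same reason.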

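Calling Chrome from selenium

The code below can be run directly; only the proxy parameters need to be changed. It implements:
1. Crawling a single JD (京東) product without a proxy: selenium + headless chrome
2. Crawling a single JD product through an ordinary proxy: selenium + headless chrome + proxy
3. Crawling a single JD product through a proxy that requires authentication: selenium + headless chrome + proxy(auth) (for now this cannot be used in headless mode)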
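
The three modes differ mainly in which command-line arguments Chrome receives. As a rough sketch, the helper names `chrome_arguments` and `build_options` below are hypothetical and only illustrate the idea; the flags themselves are the ones used in the script that follows. For mode 3 you would pass `headless=False` and additionally load the proxy extension with `add_extension`.

```python
def chrome_arguments(proxy_address=None, headless=True):
    """Collect the Chrome command-line flags for one crawl mode."""
    args = []
    if headless:
        args.append('--headless')
        args.append('--disable-gpu')
    if proxy_address:
        # Only suitable for proxies that need no username/password.
        args.append('--proxy-server=%s' % proxy_address)
    return args


def build_options(proxy_address=None, headless=True):
    """Turn the flags into a selenium Options object."""
    # Imported here so chrome_arguments() also works without selenium installed.
    from selenium.webdriver.chrome.options import Options
    options = Options()
    for arg in chrome_arguments(proxy_address, headless):
        options.add_argument(arg)
    return options
```

For example, `chrome_arguments()` corresponds to mode 1 and `chrome_arguments('http-pro.abuyun.com:9010')` to mode 2.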

#!/usr/bin/env python3
# coding=utf-8
import logging
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException, TimeoutException, StaleElementReferenceException
from selenium.webdriver.chrome.options import Options
import time
import json
from json import decoder
import os
import re
import zipfile
# Download the extension from https://github.com/revotu/selenium-chrome-auth-proxy and put it in the folder below
CHROME_PROXY_HELPER_DIR = 'proxy_helper/'
CUSTOM_CHROME_PROXY_EXTENSIONS_DIR = 'proxy_helper/'
def get_chrome_proxy_extension():  # for a proxy that requires authentication, this function builds the extension; fill in your own username and password
    username = 'xxxxxxxxxx'
    password = 'xxxxxxxxxx' 

    ip = 'xxxxxxxxx'
    port = 'xxxx'
    # Create a custom Chrome proxy extension (zip file)
    if not os.path.exists(CUSTOM_CHROME_PROXY_EXTENSIONS_DIR):
        os.mkdir(CUSTOM_CHROME_PROXY_EXTENSIONS_DIR)
    extension_file_path = os.path.join(CUSTOM_CHROME_PROXY_EXTENSIONS_DIR, '{}.zip'.format('[email protected]_9010'))
    if not os.path.exists(extension_file_path):
        # The extension file does not exist yet, so create it
        with zipfile.ZipFile(extension_file_path, mode='w') as zf:
            zf.write(os.path.join(CHROME_PROXY_HELPER_DIR, 'manifest.json'), 'manifest.json')
            # Substitute the proxy parameters into the template
            with open(os.path.join(CHROME_PROXY_HELPER_DIR, 'background.js')) as f:
                background_content = f.read()
            background_content = background_content.replace('%proxy_host', ip)
            background_content = background_content.replace('%proxy_port', port)
            background_content = background_content.replace('%username', username)
            background_content = background_content.replace('%password', password)
            zf.writestr('background.js', background_content)
    return extension_file_path


class Crawler(object):

    def __init__(self, proxy=None):
        chrome_options = Options()
        chrome_options.add_argument('--headless')
        chrome_options.add_argument('--disable-gpu')
        prefs = {"profile.managed_default_content_settings.images": 2}  # do not load images
        chrome_options.add_experimental_option("prefs", prefs)
        if proxy:
            proxy_address = proxy['https']
            chrome_options.add_argument('--proxy-server=%s' % proxy_address)  # proxy that needs no authentication
            # chrome_options.add_extension(get_chrome_proxy_extension())  # proxy that requires authentication
            logging.info('Chrome using proxy: %s', proxy['https'])
        self.chrome = webdriver.Chrome(options=chrome_options)  # 'chrome_options=' is the older, deprecated keyword
        # implicit wait: poll for up to 5 seconds when locating elements (optional)
        self.chrome.implicitly_wait(5)
        # page-load timeout, similar to the timeout of requests.get()
        # JD pages sometimes load Google-hosted images that take a long time
        self.chrome.set_page_load_timeout(60)
        # set timeout for script
        self.chrome.set_script_timeout(60)

    def get_jd_item(self, item_id):
        item_info = ()
        url = 'https://item.jd.com/' + item_id + '.html'
        try:
            self.chrome.get(url)
            # find_element('xpath', ...) works on both older and newer selenium releases
            name = self.chrome.find_element('xpath', "//*[@class='sku-name']").text
            price = self.chrome.find_element('xpath', "//*[@class='p-price']").text
            subtitle = self.chrome.find_element('xpath', "//*[@id='p-ad']").text
            plus_price = self.chrome.find_element('xpath', "//*[@class='p-price-plus']").text
            # a list is used because a tuple cannot be modified later; [1:] drops the first character of each price text
            item_info = [name, price[1:], subtitle, plus_price[1:]]
            logging.debug('item_info:{}, {}, {}, {}'.format(name, price, subtitle, plus_price))
            logging.info('Crawl SUCCESS: {}'.format(item_info))
        except NoSuchElementException as e:
            logging.warning('Crawl failure: {}'.format(e))
        except TimeoutException as e:
            logging.warning('Crawl failure: {}'.format(e))
        self.chrome.quit()
        return item_info




if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)
    while True:
        start = time.time()
        # c = Crawler()
        c = Crawler({'http': 'http-pro.abuyun.com:9010', 'https': 'http-pro.abuyun.com:9010'})
        logging.debug(c.get_jd_item('3133927'))
        end = time.time()
        print(end-start)
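
The part of the script that builds the authenticated-proxy extension boils down to filling four placeholders in a template and zipping the result. The sketch below isolates that step so it can be tried without Chrome. The `TEMPLATE` string is a made-up stand-in for the real `background.js` from the selenium-chrome-auth-proxy repository, and the helper names are hypothetical; only the placeholder names (`%proxy_host`, `%proxy_port`, `%username`, `%password`) come from the script above.

```python
import io
import zipfile

# Hypothetical, shortened stand-in for the real background.js template.
TEMPLATE = (
    'var config = {host: "%proxy_host", port: parseInt("%proxy_port")};\n'
    'var auth = {username: "%username", password: "%password"};\n'
)


def render_background_js(template, host, port, username, password):
    """Replace the four placeholders with concrete proxy settings."""
    values = {
        '%proxy_host': host,
        '%proxy_port': str(port),
        '%username': username,
        '%password': password,
    }
    for placeholder, value in values.items():
        template = template.replace(placeholder, value)
    return template


def build_extension_zip(manifest_text, background_js):
    """Pack manifest.json and background.js into an in-memory zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, mode='w') as zf:
        zf.writestr('manifest.json', manifest_text)
        zf.writestr('background.js', background_js)
    return buffer.getvalue()
```

Writing the returned bytes to a `.zip` file gives something that could be passed to `add_extension`, assuming the manifest is a valid Chrome extension manifest.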

If you have questions, leave a comment. Please credit the source when reposting.
