Advanced Web Scraping

阿新 • Published: 2019-04-20

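Crawling data with multiple threads
Crawling dynamic data
- selenium quick start and basic operations
  - Closing pages
  - Locating elements
  - Working with form elements
  - Action chains
  - Cookie operations
- Page waits
  - Implicit wait
  - Explicit wait
- Switching pages
- Setting an IP proxy
Image captcha recognition and tesseract
- Installing and configuring tesseract
  - Setting system variables
  - Using it from cmd
  - Using it from Python code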

Crawling data with multiple threads

import os
import re
import threading
from queue import Empty, Queue
from urllib import request

import requests
from lxml import etree


class HtmlSpider(threading.Thread):
    """Producer: parses list pages and queues (image URL, file name) pairs."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
    }

    def __init__(self, page_queue, img_queue, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.page_queue = page_queue
        self.img_queue = img_queue

    def run(self):
        while True:
            try:
                # get_nowait() avoids the race between empty() and get()
                url = self.page_queue.get_nowait()
            except Empty:
                break
            self.parse_page(url)

    def parse_page(self, url):
        response = requests.get(url, headers=self.headers)
        html = etree.HTML(response.text)
        imgs = html.xpath("//div[@class='page-content text-center']//a//img")
        for img in imgs:
            # Skip GIF images
            if img.get('class') == 'gif':
                continue
            img_url = img.xpath(".//@data-original")[0]
            # Build the file extension, dropping the "!dta" tail on the URL
            suffix = os.path.splitext(img_url)[1]
            suffix = re.sub(r"!dta", "", suffix)
            alt = img.xpath(".//@alt")[0]
            # Remove punctuation that is awkward or illegal in file names
            alt = re.sub(r'[，。？?,/\\·]', '', alt)
            img_name = alt + suffix
            self.img_queue.put((img_url, img_name))


class DownloadPicture(threading.Thread):
    """Consumer: downloads every image taken from img_queue."""

    def __init__(self, page_queue, img_queue, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.page_queue = page_queue
        self.img_queue = img_queue

    def run(self):
        while True:
            try:
                # A timeout keeps the thread from blocking forever on an empty queue
                url, filename = self.img_queue.get(timeout=5)
            except Empty:
                # Heuristic: stop once no pages are left to parse either
                if self.page_queue.empty():
                    return
                continue
            request.urlretrieve(url, os.path.join('images', filename))
            print(filename + ' downloaded!')


def main():
    page_queue = Queue(100)
    img_queue = Queue(500)
    # urlretrieve fails if the target directory does not exist
    os.makedirs('images', exist_ok=True)
    for x in range(1, 10):
        url = "http://www.doutula.com/photo/list/?page=%d" % x
        page_queue.put(url)

    for x in range(5):
        t = HtmlSpider(page_queue, img_queue)
        t.start()

    for x in range(5):
        t = DownloadPicture(page_queue, img_queue)
        t.start()


if __name__ == '__main__':
    main()
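
Stopping consumer threads by checking whether the queues are empty can race with producers that are still working. As an alternative, here is a minimal, self-contained sketch of the same producer/consumer layout that shuts down with sentinel values. The names `run_pipeline`, `fetch_items` and `handle_item`, and the sample data, are illustrative stand-ins and not part of the script above; real code would parse pages and download images in their place.

```python
import threading
from queue import Queue

STOP = object()  # sentinel that tells a worker to exit


def producer(page_queue, item_queue, fetch_items):
    """Take pages until a sentinel arrives, queueing every item found."""
    while True:
        page = page_queue.get()
        if page is STOP:
            break
        for item in fetch_items(page):
            item_queue.put(item)


def consumer(item_queue, handle_item):
    """Handle items until a sentinel arrives."""
    while True:
        item = item_queue.get()
        if item is STOP:
            break
        handle_item(item)


def run_pipeline(pages, fetch_items, handle_item, n_producers=2, n_consumers=2):
    page_queue, item_queue = Queue(), Queue()
    for page in pages:
        page_queue.put(page)
    # One sentinel per producer, placed after all real pages
    for _ in range(n_producers):
        page_queue.put(STOP)

    producers = [
        threading.Thread(target=producer, args=(page_queue, item_queue, fetch_items))
        for _ in range(n_producers)
    ]
    consumers = [
        threading.Thread(target=consumer, args=(item_queue, handle_item))
        for _ in range(n_consumers)
    ]
    for t in producers + consumers:
        t.start()

    # Only after every producer has finished is it safe to stop the consumers
    for t in producers:
        t.join()
    for _ in range(n_consumers):
        item_queue.put(STOP)
    for t in consumers:
        t.join()


if __name__ == '__main__':
    results = []
    lock = threading.Lock()

    def fake_fetch(page):
        # Hypothetical stand-in for parsing a list page
        return ["page%d-img%d" % (page, i) for i in range(3)]

    def fake_handle(item):
        # Hypothetical stand-in for downloading an image
        with lock:
            results.append(item)

    run_pipeline(range(1, 4), fake_fetch, fake_handle)
    print(len(results))
```

Because each worker exits only when it receives its own sentinel, no thread has to guess whether more work is still on the way.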

Crawling dynamic data

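AJAX (Asynchronous JavaScript And XML) exchanges small amounts of data with the server in the background, which lets a web page update asynchronously. This means part of a page can be updated without reloading the whole page. There are two ways to scrape such pages:

- Analyze the interface that the Ajax call requests and call it directly. This takes little code, but it requires solid JavaScript knowledge; otherwise the requests are easily identified as coming from a crawler.
- Use selenium and chromedriver to simulate a user in a real browser, for example logging in. This is more cumbersome but very stable. Selenium acts like a robot that imitates human actions in the browser, such as clicking, filling in data, and deleting cookies. chromedriver is the driver program for the Chrome browser; selenium needs it to control the browser.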
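
For the first approach, once the browser's developer tools show which URL the page's Ajax call requests, that URL can often be requested directly and the JSON response parsed. A minimal sketch, assuming a hypothetical endpoint `https://example.com/api/list` that returns JSON shaped like `{"data": [{"title": ...}]}`. The URL, the `page` parameter, the response shape and all three function names are assumptions for illustration, not any real site's API.

```python
import json
from urllib import parse, request

API_URL = "https://example.com/api/list"  # hypothetical endpoint


def build_request(page):
    """Build a request that resembles the browser's Ajax call."""
    query = parse.urlencode({"page": page})
    return request.Request(
        API_URL + "?" + query,
        headers={
            "User-Agent": "Mozilla/5.0",
            # Many Ajax calls send this header; some servers check for it
            "X-Requested-With": "XMLHttpRequest",
        },
    )


def extract_titles(body):
    """Pull the titles out of the (assumed) JSON response body."""
    payload = json.loads(body)
    return [item["title"] for item in payload.get("data", [])]


def fetch_titles(page):
    """Request one page of the interface and return its titles."""
    with request.urlopen(build_request(page), timeout=10) as resp:
        return extract_titles(resp.read().decode("utf-8"))
```

Keeping the parsing in `extract_titles` separate from the network call makes it easy to check against a response body saved from the browser.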

`selenium` quick start and basic operations

Official selenium documentation

from selenium import webdriver
import time

driver_path = "D:\\chromedriver\\chromedriver.exe"  # path to the chromedriver executable
driver = webdriver.Chrome(executable_path=driver_path)  # pass in the chromedriver path
driver.get("http://www.baidu.com")  # open the page
inputTag = driver.find_element_by_id("kw")
inputTag.send_keys("python")
time.sleep(4)

Closing pages

driver.close(): closes the current page
driver.quit(): quits the browser

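Locating elements

find_element_by_id: find an element by its id
find_element_by_class_name: find an element by its class name
find_element_by_name: find an element by the value of its name attribute
find_element_by_tag_name: find an element by its tag name
find_element_by_xpath: find an element with an XPath expression
find_element_by_css_selector: find an element with a CSS selector

Note: changing element to elements in the method names above (for example find_elements_by_id) returns all matching elements. These are the Selenium 3 method names; recent Selenium 4 releases removed them in favor of find_element(By.ID, ...) and find_elements(By.ID, ...).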

Working with form elements

Typing into an input box

# 1. Locate the form field to fill in
# 2. Send the text with send_keys
inputTag = driver.find_element_by_id("kw")
inputTag.send_keys("python")

Clearing an input box

inputTag.clear()

Clicking

rememberTag = driver.find_element_by_name("rememberMe")
rememberTag.click()

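Selecting from a select tag

Wrap the element in selenium.webdriver.support.ui.Select and choose an option with select_by_index, select_by_value, or select_by_visible_text.

Other mouse operations (methods of ActionChains)
- click_and_hold(element): press the mouse button without releasing it
- context_click(element): right-click
- double_click(element): double-click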

Action chains

from selenium.webdriver.common.action_chains import ActionChains

inputTag = driver.find_element_by_id('kw')
submitTag = driver.find_element_by_id('su')
# The ActionChains class queues up mouse and keyboard actions
actions = ActionChains(driver)
actions.move_to_element(inputTag)
actions.send_keys_to_element(inputTag, 'python')
actions.move_to_element(submitTag)
actions.click(submitTag)
# Nothing is executed until perform() is called
actions.perform()

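`Cookie` operations

Get all cookies:

for cookie in driver.get_cookies():
    print(cookie)

Get a cookie's value by its key:

cookie = driver.get_cookie(key)  # a dict, or None if no such cookie exists
value = cookie['value']

Delete all cookies:

driver.delete_all_cookies()

Delete a single cookie:

driver.delete_cookie(key)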
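
`driver.get_cookies()` returns a list of dictionaries, each with at least a `name` and a `value` key. A common follow-up is to flatten that list into a simple name-to-value mapping, for example to reuse a logged-in session with `requests`. A small sketch; the helper name `cookies_to_dict` and the sample cookies are made up for illustration.

```python
def cookies_to_dict(cookies):
    """Flatten selenium-style cookie dicts into {name: value}."""
    return {cookie["name"]: cookie["value"] for cookie in cookies}


# Sample data in the shape returned by driver.get_cookies()
sample = [
    {"name": "sessionid", "value": "abc123", "domain": ".example.com"},
    {"name": "theme", "value": "dark", "domain": ".example.com"},
]
print(cookies_to_dict(sample))
```

The resulting dict can be passed to `requests.get(url, cookies=...)`.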

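Page waits

More and more pages use Ajax, so a program cannot know when a given element has finished loading. If the page takes too long and a DOM element has not appeared by the time the code uses it, selenium raises an exception such as NoSuchElementException. selenium offers the following two ways to deal with this: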

Implicit wait

driver = webdriver.Chrome(executable_path=driver_path)
# Wait up to 10 seconds when locating an element that is not there yet
driver.implicitly_wait(10)
# Request the page
driver.get("https://www.baidu.com/")

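Explicit wait

An explicit wait fetches the element only after a given condition holds. A maximum wait time can also be specified; if it is exceeded, an exception (TimeoutException) is raised.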

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("http://somedomain/url_that_delays_loading")
try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "myDynamicElement"))
    )
finally:
    driver.quit()

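Some commonly used wait conditions:

presence_of_element_located: an element has been loaded
presence_of_all_elements_located: all elements on the page that match the condition have been loaded
element_to_be_clickable: an element can be clicked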
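
Conceptually, an explicit wait polls a condition until it returns a truthy value or the timeout expires. The pure-Python sketch below mimics that behaviour to make the mechanism visible. It is an illustration, not Selenium's actual implementation, and `wait_until`, its parameters and `WaitTimeout` are names invented here.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never became truthy in time."""


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)


# Demo: the "element" only appears on the third check
attempts = {"count": 0}


def element_appears():
    attempts["count"] += 1
    return "element" if attempts["count"] >= 3 else None


print(wait_until(element_appears, timeout=2, poll_interval=0.01))
```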

Switching pages

from selenium import webdriver

driver_path = r"D:\chromedriver\chromedriver.exe"
driver = webdriver.Chrome(executable_path=driver_path)
driver.get("https://baidu.com/")
# Open a new page
driver.execute_script("window.open('http://www.douban.com/')")
# Switch to the new page
driver.switch_to.window(driver.window_handles[1])
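
The open-then-switch steps can be wrapped in a small helper. The sketch below relies only on `execute_script`, `window_handles` and `switch_to.window`, the same driver members used above. The helper name `open_in_new_window` is made up here, and instead of assuming the new handle sits at index 1 it looks for whichever handle was not there before. It also assumes the new handle is available as soon as `window.open` returns.

```python
def open_in_new_window(driver, url):
    """Open url in a new window and switch the driver to it."""
    before = set(driver.window_handles)
    # arguments[0] is the first extra argument passed to execute_script
    driver.execute_script("window.open(arguments[0])", url)
    new_handles = [h for h in driver.window_handles if h not in before]
    if not new_handles:
        raise RuntimeError("no new window was opened")
    driver.switch_to.window(new_handles[0])
    return new_handles[0]
```

With a live driver this would be called as `open_in_new_window(driver, "http://www.douban.com/")`.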

Setting an IP proxy

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--proxy-server=http://110.52.235.241:9999")
driver_path = r"D:\chromedriver\chromedriver.exe"
driver = webdriver.Chrome(executable_path=driver_path,options=options)
driver.get("http://www.ip138.com/")

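Image captcha recognition and `tesseract`

Installing and configuring `tesseract`

Official tesseract repository on GitHub

Setting system variables

On Windows, add the directory containing tesseract.exe to the PATH environment variable. One more environment variable is needed: the path to the trained data files. Add TESSDATA_PREFIX=C:\path_to_tesseractdata\teseractdata to the environment variables.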
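
Whether both settings took effect can be checked from Python with the standard library alone. A minimal sketch; the helper name `check_tesseract_setup` and the shape of its return value are choices made here for illustration.

```python
import os
import shutil


def check_tesseract_setup(environ=None):
    """Report where tesseract is found and what TESSDATA_PREFIX is set to."""
    environ = os.environ if environ is None else environ
    return {
        # None means tesseract was not found on the given PATH
        "executable": shutil.which("tesseract", path=environ.get("PATH", "")),
        # None means the variable is not set
        "tessdata_prefix": environ.get("TESSDATA_PREFIX"),
    }


print(check_tesseract_setup())
```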

Using it from cmd

tesseract image_path output_path

The output path needs no extension; .txt is used by default.
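
The same command can be launched from a script with `subprocess`. A minimal sketch, assuming `tesseract` is on PATH. The function names are made up here, and the optional `-l` language flag goes beyond the command shown above.

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang=None):
    """Return the argument list for: tesseract IMAGE OUTPUTBASE [-l LANG]."""
    command = ["tesseract", image_path, output_base]
    if lang:
        command += ["-l", lang]
    return command


def run_tesseract(image_path, output_base, lang=None):
    """Run tesseract and return the path of the text file it writes."""
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    return output_base + ".txt"


print(build_tesseract_command("a.png", "result", lang="eng"))
```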

Using it from Python code

import pytesseract
from PIL import Image
# Path to the tesseract executable
pytesseract.pytesseract.tesseract_cmd = r"D:\tesseract\tesseract.exe"
# Open the image file with PIL
image = Image.open(r"C:\Users\DELL\Desktop\a.png")
# image_to_string() converts the image to a string
text = pytesseract.image_to_string(image)
print(text)
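
OCR output for a captcha often needs a little cleanup before it can be submitted, since it may contain whitespace or stray punctuation. A minimal sketch, assuming the captcha consists only of ASCII letters and digits; that assumption and the helper name `clean_captcha_text` are illustrative and not part of pytesseract.

```python
import re


def clean_captcha_text(text):
    """Keep only ASCII letters and digits from raw OCR output."""
    return re.sub(r"[^0-9A-Za-z]", "", text)


print(clean_captcha_text(" a B-3 x\n"))
```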
