Python + Selenium: Scraping Word Document Text from Baidu Wenku

阿新 • Published: 2018-12-08

# -*- coding:utf-8 -*-

import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

# Run Chrome headless and present a desktop user agent.
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36")

driver = webdriver.Chrome(options=chrome_options)
driver.maximize_window()

url = input("Enter the document URL: ")
driver.get(url)

error_str = ""

try:
    # The counter text carries a '/' that is stripped before converting to int.
    page_num = driver.find_element(By.XPATH, "//span[@class='page-count']").text

    # Scroll the banner into view, then click the "more" button to expand the document.
    find_button = driver.find_element(By.XPATH, "//div[@class='doc-banner-text']")
    driver.execute_script("arguments[0].scrollIntoView();", find_button)
    button = driver.find_element(By.XPATH, "//span[@class='moreBtn goBtn']")
    button.click()

    for i in range(1, int(page_num.strip('/')) + 1):
        # Scroll each page into view and pause briefly so its text layer can render.
        page = driver.find_element(By.XPATH, "//div[@data-page-no='{}']".format(i))
        driver.execute_script("arguments[0].scrollIntoView();", page)
        time.sleep(0.3)
        print(driver.find_elements(By.XPATH, "//div[@data-page-no='{}']//div[@class='reader-txt-layer']".format(i))[-1].text)

except NoSuchElementException:
    # find_elements returns an empty list rather than raising if the notice is absent.
    notice = driver.find_elements(By.XPATH, "//div[@class='doc-bottom-text']")
    # Compared verbatim against the on-page notice ("The preview has ended; to continue reading or download").
    if notice and notice[0].text == "試讀已結束，如需繼續閱讀或下載":
        error_str = "\n------------------------------------------------------------------\n\n" \
                    "Baidu Wenku says the free preview has ended, so the full text cannot be scraped. Try again later.\n\n" \
                    "------------------------------------------------------------------"

finally:
    print(error_str)
    # Always close the browser, even when scraping fails.
    driver.quit()
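
The script pauses a fixed 0.3 seconds after scrolling to each page, which may be too short on a slow connection and needlessly long on a fast one. The sketch below shows one possible alternative: poll until the text actually appears, up to a timeout. The names `parse_page_count` and `wait_for_text` are hypothetical helpers introduced here for illustration; they are not part of the original script or of Selenium. Selenium's own `WebDriverWait` covers the same ground; plain Python is used here only so the sketch runs without a browser.

```python
import time


def parse_page_count(text):
    """Turn counter text such as '/12' into the integer 12."""
    return int(text.strip().strip('/'))


def wait_for_text(get_text, timeout=5.0, interval=0.1):
    """Poll get_text() until it returns non-empty text or the timeout expires.

    get_text is any zero-argument callable. Exceptions it raises are treated
    as "not ready yet". Returns the text, or "" if nothing appeared in time.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            text = get_text()
        except Exception:
            # e.g. IndexError when the text layer has not been created yet
            text = ""
        if text:
            return text
        if time.monotonic() >= deadline:
            return ""
        time.sleep(interval)


# Hypothetical use inside the page loop, replacing time.sleep(0.3):
#   xpath = "//div[@data-page-no='{}']//div[@class='reader-txt-layer']".format(i)
#   print(wait_for_text(lambda: driver.find_elements(By.XPATH, xpath)[-1].text))
```

With this approach a page that renders quickly is printed right away, while a slow page gets up to `timeout` seconds before the helper gives up and returns an empty string.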
