[Crawler] Scraping data with python selenium

阿新 • Published: 2019-01-28

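Recently my company needed some data scraped, so I updated and reorganized our crawler, borrowing from earlier code.
The crawler is split into the following parts.
0. File reader
File reading actually lives in the same module as the file writer in part 4.
Here is a brief look at reading an xls file:

    import xlrd

    def deal_xls_col(name, sheet_name):
        """Return the first column of the named sheet as a list."""
        body = xlrd.open_workbook(name)
        try:
            sh = body.sheet_by_name(sheet_name)
        except xlrd.XLRDError:
            # The workbook has no sheet with that name.
            print("ERROR")
            return []
        return sh.col_values(0)

This reads a single column of the xls file.
The return value is a list.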

1. Main scheduler
This is where the control logic lives: it runs parts 0, 2, 3 and 4 in order.
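
The original post does not show the scheduler itself, so the following is only a minimal sketch of how the parts could be chained. The function name `run_crawler` and its four callable parameters are assumptions made for illustration, not the author's code.

```python
def run_crawler(read_urls, download, parse, save):
    """Run the parts in order: read (0), download (2), parse (3), save (4).

    All four arguments are hypothetical callables supplied by the caller.
    Returns the number of rows that were saved.
    """
    saved = 0
    for url in read_urls():
        html = download(url)
        if html is None:
            # Download failed; skip this URL and carry on with the rest.
            continue
        row = parse(html)
        saved += 1
        # Row 0 is left free for a header row, so data starts at row 1.
        save(saved, row)
    return saved
```

Passing the parts in as callables keeps the scheduler independent of how each part is implemented, so a reader, downloader, parser or writer can be swapped without touching the loop.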

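2. Page downloader
The page downloader simulates a browser visiting the target URL.
A simple example:

    from urllib.request import urlopen  # Python 3; on Python 2 use urllib2.urlopen

    class HtmlDownloader(object):
        def download(self, url):
            if url is None:
                return None
            response = urlopen(url, timeout=300)
            # Treat anything other than 200 OK as a failed download.
            if response.getcode() != 200:
                return None
            return response.read()

This example only fetches the URL; it does not handle cookies or other access restrictions (add that yourself if you need it).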
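
As one possible way to add the cookie handling mentioned above, here is a sketch built on the standard library's `http.cookiejar` and `urllib.request`. The class name `CookieDownloader`, the `USER_AGENT` value and the 30-second default timeout are assumptions for illustration; real sites may need more than this.

```python
import http.cookiejar
import urllib.request

# Assumed header value; use whatever identifies your crawler.
USER_AGENT = "Mozilla/5.0 (compatible; example-crawler)"


class CookieDownloader(object):
    """Downloader that keeps cookies between requests (hypothetical name)."""

    def __init__(self, timeout=30):
        self.timeout = timeout
        self.cookies = http.cookiejar.CookieJar()
        # HTTPCookieProcessor stores cookies from responses in the jar and
        # sends them back on later requests made through this opener.
        self.opener = urllib.request.build_opener(
            urllib.request.HTTPCookieProcessor(self.cookies)
        )
        self.opener.addheaders = [("User-Agent", USER_AGENT)]

    def download(self, url):
        if url is None:
            return None
        try:
            response = self.opener.open(url, timeout=self.timeout)
        except OSError:
            # URLError and HTTPError are both subclasses of OSError.
            return None
        with response:
            return response.read()
```

Reusing one `CookieDownloader` instance for a whole crawl lets a session cookie set by the first response accompany the later requests.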

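3. Page parser
The page parser processes the HTML source returned by the downloader. With selenium, for example:

    from selenium.webdriver.common.by import By

    # driver is an already-created selenium WebDriver with the page loaded.
    company_info_text = driver.find_element(By.CLASS_NAME, 'company_info_text')
    # A class-name lookup accepts a single class only, so an element
    # carrying several classes is matched with a CSS selector instead.
    company_text = driver.find_element(By.CSS_SELECTOR, '.row.b-c-white.company-content')

In other words, you simply use selenium's lookup methods to pull out the data you need.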

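4. File writer
Taking xls as an example:

    import xlrd
    import xlwt
    from xlutils.copy import copy

    def creat_xls_6(xls_name):
        # Bold red font for the header row.
        headerStyle = xlwt.easyxf('font: color-index red, bold on')
        wb = xlwt.Workbook()
        ws = wb.add_sheet(xls_name)
        ws.write(0, 0, "name", headerStyle)
        ws.write(0, 1, "oper_name", headerStyle)
        ws.write(0, 2, "start_date", headerStyle)
        ws.write(0, 3, "xfsSearchStatus", headerStyle)
        wb.save(xls_name)

This creates the xls sheet.

    def insert_xls_6(xls_name, id, name, oper_name, start_date, xfsSearchStatus):
        # xlrd workbooks are read-only, so copy into a writable xlwt workbook.
        # formatting_info=True keeps the existing cell styles in the copy.
        oldWb = xlrd.open_workbook(xls_name, formatting_info=True)
        newWb = copy(oldWb)
        newWs = newWb.get_sheet(0)
        # id is the zero-based index of the row to write.
        newWs.write(id, 0, name)
        newWs.write(id, 1, oper_name)
        newWs.write(id, 2, start_date)
        newWs.write(id, 3, xfsSearchStatus)
        newWb.save(xls_name)

This inserts a row of data into the sheet.
There is no deep secret here; just wrap things up in your own functions.
The example above is still not the best version, because it has to be edited for every new use. It would be better to pass in a data structure in place of those individual variables, so that the same code can create and fill sheets for any kind of data.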
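
A rough sketch of that idea follows: the column names and the row values are passed in as lists rather than being hard-coded. The helper names `write_row` and `write_table` are hypothetical, and the `sheet` argument is assumed to be any object with a `write(row, col, value[, style])` method, which is how the xlwt worksheets above are called.

```python
def write_row(sheet, row_index, values, style=None):
    """Write one row of values, one value per column, starting at column 0.

    sheet is any object with a write(row, col, value[, style]) method.
    """
    for col, value in enumerate(values):
        if style is None:
            sheet.write(row_index, col, value)
        else:
            sheet.write(row_index, col, value, style)


def write_table(sheet, headers, rows, header_style=None):
    """Write a header row, then the data rows; return the number of data rows."""
    write_row(sheet, 0, headers, header_style)
    count = 0
    for count, row in enumerate(rows, start=1):
        write_row(sheet, count, row)
    return count
```

With xlwt, this might be called as `write_table(wb.add_sheet('data'), ['name', 'oper_name', 'start_date', 'xfsSearchStatus'], rows, headerStyle)` followed by `wb.save(xls_name)`, so a table with different columns needs only a different list.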

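One more thing: some sites restrict scraping, and although most sites are tolerant, that does not mean you may scrape recklessly and without limit. It is best to schedule crawls at night or ...
Also, do not cause the target site unnecessary 'harm'.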
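
One simple way to go easy on a site is to enforce a minimum pause between requests. The sketch below is an assumption about how that could look, not something from the original post; the `Throttle` class, its `min_interval` parameter and the injectable `clock` and `sleep` arguments are all hypothetical.

```python
import time


class Throttle(object):
    """Enforce a minimum delay between consecutive requests (hypothetical helper)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Sleep just long enough to respect min_interval; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        # Record when the request is actually allowed to go out.
        self._last = now + delay
        return delay
```

Calling `throttle.wait()` right before each download keeps requests at least `min_interval` seconds apart, whatever the speed of the rest of the crawler.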

Crawling is not hard; go carefully and make it count!

Evening of 2016.11.10
