Scraping flight ticket information from Tongcheng Travel (ly.com) with Python + Selenium

阿新 • Published: 2018-12-25

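I recently used Python + Selenium to scrape flight ticket information from Tongcheng Travel (ly.com).

The main code is below. By simulating a user's actions in the browser, it retrieves the HTML of the flight list; the required fields can then be extracted from that HTML with XPath, regular expressions (re), or similar.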
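Before the full listing, the extraction step can be illustrated on its own. This is a minimal sketch under stated assumptions: `SAMPLE_ROW` is a made-up, well-formed fragment (not real markup from the site), `extract_fields` is a hypothetical helper, and the standard library's `xml.etree.ElementTree` stands in for lxml so the example has no third-party dependencies. Real page HTML is often not well-formed XML, which is one reason the listing uses lxml's `etree.HTML` instead.

```python
import re
import xml.etree.ElementTree as ET

# Made-up, well-formed fragment standing in for one flight row; not real site markup.
SAMPLE_ROW = (
    "<table><tbody><tr>"
    "<td><div>ExampleAir ZZ1234</div><div><a>A320</a></div></td>"
    "<td><div>07:30</div><div>Example Airport T1</div></td>"
    "</tr></tbody></table>"
)


def extract_fields(fragment):
    """Pull a few fields out of one row using path lookups plus a regex."""
    root = ET.fromstring(fragment)  # the root element is <table>
    airline = root.findtext("./tbody/tr/td[1]/div[1]", default="")
    dep_time = root.findtext("./tbody/tr/td[2]/div[1]", default="")
    # Two letters followed by digits: the same pattern the listing uses.
    match = re.search(r"[a-zA-Z]{2}\d+", airline)
    return {
        "airline": airline,
        "flight_number": match.group(0) if match else "",
        "dep_time": dep_time,
    }


print(extract_fields(SAMPLE_ROW))
```

The path expressions find the cell text by position, and the regular expression then isolates the flight number inside the airline text.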

from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from spider.comm.spider_communal import is_same_month
from selenium.webdriver.chrome.options import Options
import platform
import time
import re
from lxml import etree

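'''
Scrape flight ticket information from Tongcheng Travel with the Selenium browser-automation tool
Target URL: https://www.ly.com
author: liu-yanlin
Environment: python3.6.1
pip install selenium==3.13.0
pip install lxml==4.2.1
Chrome driver download: https://pan.baidu.com/s/1564mrLmlT7vPdLBntm8hlQ  extraction code: fq33
'''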
class LySpider():

    '''
    @:param date_str     query date, e.g. "2019-01-02"
    @:param start_city   departure city to query
    @:param arrive_city  arrival city to query
    '''
    def __init__(self,date_str,start_city,arrive_city):
        self.date_str=date_str
        self.start_city=start_city
        self.arrive_city=arrive_city
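        # On Windows (local testing) a visible browser window can be used by switching to the
        # commented-out line below; as written, Chrome runs headless on every platform, which is
        # what a Linux deployment needs.
        sys_str = platform.system()
        if sys_str == "Windows":
            # self.driver = webdriver.Chrome()
            options = Options()
            options.add_argument('--headless')
            self.driver = webdriver.Chrome(chrome_options=options)
        else:
            # Linux and any other platform; `else` ensures self.driver is always set
            options = Options()
            options.add_argument('--headless')
            self.driver = webdriver.Chrome(chrome_options=options)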

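    '''
    Drive Chrome through Selenium to simulate a user entering the cities and date, clicking submit,
    and collecting the HTML of the query results
    '''
    def get_query_results(self):

        # Implicit and explicit waits are combined here on the assumption that the longest wait is the
        # larger of the two; note that the Selenium documentation warns that mixing them can cause
        # unpredictable wait times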
        self.driver.implicitly_wait(10)
        self.driver.get('https://www.ly.com/FlightQuery.aspx')
        locator = (By.ID, 'txtAirplaneCity1')
        try:
            # Explicit wait
            WebDriverWait(self.driver, 20, 0.5).until(EC.presence_of_element_located(locator))
            # Get the departure-city input, clear it, type the city name, then simulate pressing Enter
            txtAirplaneCity1 = self.driver.find_elements_by_id("txtAirplaneCity1")[0]
            txtAirplaneCity1.clear()
            txtAirplaneCity1.send_keys(self.start_city)
            txtAirplaneCity1.send_keys(Keys.ENTER)
            # Get the arrival-city input, clear it, type the city name, then simulate pressing Enter
            txtAirplaneCity2 = self.driver.find_elements_by_id("txtAirplaneCity2")[0]
            txtAirplaneCity2.clear()
            txtAirplaneCity2.send_keys(self.arrive_city)
            txtAirplaneCity2.send_keys(Keys.ENTER)

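            # If the query date is in the current month, use the first div of the calendar widget,
            # otherwise the second: div[1] is the current month, div[2] is the next month
            if is_same_month(self.date_str):
                # Locate the day cells in the calendar widget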
                element_calendar = self.driver.find_elements_by_xpath(
                    "/html/body/div[17]/div/div[1]/div[1]/div/table/tbody/tr/td/span")
                for item in element_calendar:
                    if item.text == str(int(self.date_str.split("-")[2])):
                        item.click()
            else:
                element_calendar = self.driver.find_elements_by_xpath(
                    "/html/body/div[17]/div/div[1]/div[2]/div/table/tbody/tr/td/span")
                for item in element_calendar:
                    if item.text == str(int(self.date_str.split("-")[2])):
                        item.click()
            # Locate the search button and simulate a click to submit
            airplaneSubmit = self.driver.find_elements_by_id("airplaneSubmit")[0]
            airplaneSubmit.click()
            # After an explicit wait, locate the flight-results div and get the HTML inside it
            locator_content = (By.ID, 'allFlightListDom_1')
            WebDriverWait(self.driver, 20, 0.5).until(EC.presence_of_element_located(locator_content))
            flight_list_html=self.get_flight_list_dom()
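            # Result to return
            data_list=[]
            '''
            Check whether flight_list_html contains any flight entries. If it does, collect their HTML directly;
            otherwise retry up to 10 times in a for loop, pausing one second after each failed attempt (in practice,
            network latency or slow loading can mean the content has not arrived yet).
            '''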
            if flight_list_html:
                for item in flight_list_html:
                    data_list.append(item.get_attribute('innerHTML'))
            else:
                for x in range(10):
                    flight_list_html = self.get_flight_list_dom()
                    if flight_list_html:
                        for item in flight_list_html:
                            data_list.append(item.get_attribute('innerHTML'))
                        break
                    time.sleep(1)
            return data_list

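        except Exception as ex:
            print(ex)
        finally:
            # quit() ends the whole browser session and the chromedriver process;
            # close() would only close the current window
            self.driver.quit()
    '''
    Locate the flight-results div and get the HTML inside it
    '''
    def get_flight_list_dom(self):
        # Locate the flight rows inside the results div
        # The page uses three different class combinations for flight rows, so collect all three and return them as one concatenated list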
        flight_list_html_n=self.driver.find_elements_by_xpath('//div[@class="clearfix flightList"]//div[@class="flist_box"]')
        flight_list_html_top=self.driver.find_elements_by_xpath('//div[@class="clearfix flightList"]//div[@class="flist_box f_m_top flist_boxat"]')
        flight_list_html_boxbot = self.driver.find_elements_by_xpath('//div[@class="clearfix flightList"]//div[@class="flist_box flist_boxbot"]')
        return flight_list_html_n+flight_list_html_top+flight_list_html_boxbot

    '''
    Extract data
    @:param respone  the list returned by get_query_results()
    '''
    def extract(self,respone):
        try:
            data_list=[]
            for item in respone:
                data = {}
                html = etree.HTML(item)
                # Airline
                airline = html.xpath('/html/body/table/tbody/tr/td[1]/div[1]/text()')
                data["airline"] = airline[0] if airline else ""
                # Flight number (taken from the airline text; data["airline"] avoids an IndexError when
                # the XPath matched nothing). Note: this pattern misses carrier codes containing a digit, e.g. 3U
                flight_number = re.findall(r"[a-zA-Z]{2}\d+", data["airline"])
                data["flight_number"] = flight_number[0] if flight_number else ""
                # Departure time
                dep_time = html.xpath('/html/body/table/tbody/tr/td[2]/div[1]/text()')
                data["dep_time"] = dep_time[0] if dep_time else ""
                # Departure airport
                dep_airport = html.xpath('/html/body/table/tbody/tr/td[2]/div[2]/text()')
                data["dep_airport"] = dep_airport[0] if dep_airport else ""
                # Aircraft type
                aircraft_type = html.xpath('/html/body/table/tbody/tr/td[1]/div[2]/a/text()')
                data["aircraft_type"] = aircraft_type[0] if aircraft_type else ""
                # Arrival time
                arr_time = html.xpath('/html/body/table/tbody/tr/td[4]/div[1]/text()')
                data["arr_time"] = arr_time[0] if arr_time else ""
                # Arrival airport
                arr_airport = html.xpath('/html/body/table/tbody/tr/td[4]/div[2]/text()')
                data["arr_airport"] = arr_airport[0] if arr_airport else ""
                # Price
                price = html.xpath('/html/body/table/tbody/tr/td[8]/div[1]/span[1]/em[1]/text()')
                data["price"] = price[0] if price else ""
                data_list.append(data)
            return data_list
        except Exception as ex:
            print(ex)
            return None
    '''
    Save data
    @:param data  the data to save; defaults to what extract() returns
    '''
    def save(self,data=None):
        try:
            # The data is to be saved to Kafka below (left unimplemented in this listing)
            if data:
                pass
            else:
                results=self.extract(self.get_query_results())
        except Exception as ex:
            pass

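if __name__ == "__main__":

    ly_spider=LySpider("2019-01-02","成都","北京")
    res=ly_spider.get_query_results()
    # get_query_results() and extract() return None when an exception was caught, so fall back to empty lists
    data_list=ly_spider.extract(res or [])
    for item in data_list or []:
        print(item)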
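The listing imports `is_same_month` from `spider.comm.spider_communal`, a project-local module that is not shown. A minimal sketch of what such a helper might look like, assuming it takes a `YYYY-MM-DD` string and reports whether that date falls in the current calendar month; the `today` parameter is an addition for testing and is not part of the original call.

```python
from datetime import date, datetime


def is_same_month(date_str, today=None):
    """Return True if date_str (YYYY-MM-DD) is in the same year and month as today.

    Hypothetical stand-in for spider.comm.spider_communal.is_same_month;
    the real helper is not shown in the article and may behave differently.
    """
    today = today or date.today()
    target = datetime.strptime(date_str, "%Y-%m-%d").date()
    return (target.year, target.month) == (today.year, today.month)


# Fixed reference dates so the result does not depend on when this runs.
print(is_same_month("2019-01-02", today=date(2019, 1, 15)))   # same month
print(is_same_month("2019-01-02", today=date(2018, 12, 25)))  # next month
```

Comparing the year as well as the month keeps a January date from matching the January of a different year.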

Screenshot of the run (image not included):
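The retry logic in `get_query_results()` (check once, then poll with a one-second pause) could be factored into a small helper. The sketch below is one possible shape, not the author's code: `poll_until` and its parameters are hypothetical names, and `attempts` counts every call including the first, whereas the listing makes one initial call plus up to ten retries.

```python
import time


def poll_until(fetch, attempts=10, delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-empty result or attempts run out.

    `sleep` is injectable so the helper can be tested without real pauses.
    """
    for attempt in range(attempts):
        result = fetch()
        if result:
            return result
        if attempt < attempts - 1:
            sleep(delay)  # pause only between attempts, not after the last one
    return []


# Demo with a fake fetch that succeeds on its third call.
calls = []

def fake_fetch():
    calls.append(1)
    return ["row"] if len(calls) >= 3 else []

print(poll_until(fake_fetch, delay=0))
print(len(calls))
```

Inside the spider, a call along the lines of `poll_until(self.get_flight_list_dom)` would stand in for the `if`/`else` block and its inner loop.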
