
Scraping Mafengwo with scrapy + selenium: a hands-on project

  • When I started learning to crawl dynamic pages with selenium, I wanted a hands-on project, so I set out to scrape every tourist attraction in Chongqing from the Mafengwo travel site. I expected it to be fairly easy, but plenty of problems came up along the way: rewriting the downloader middleware, attaching cookies, selenium still returning the previous page's data after loading the next page, choosing an element-locating strategy, jumping to detail pages, selenium action chains, and so on. It was a grind, but I stuck with it and finished the project in three days. Below is a record of the problems I hit and how I solved them.
    • Dynamic page loading
      • First determine whether the page's data is loaded dynamically: right-click the page, view its source, and press Ctrl+F to search for the data you want to scrape. If the data is not in the source, the page is dynamically loaded and needs selenium; in that case, initialize a webdriver object in the spider's __init__ constructor. A quick programmatic check is sketched at the end of this section.
      • from selenium import webdriver
        from selenium.webdriver import ChromeOptions  # evade automation detection
        from selenium.webdriver.common.by import By

        # configure the driver
        option = ChromeOptions()
        option.add_argument('--ignore-certificate-errors')
        option.add_argument('--ignore-ssl-errors')
        option.add_experimental_option('excludeSwitches', ['enable-automation', 'enable-logging'])
        bro = webdriver.Chrome(executable_path=r'path to chromedriver.exe', options=option)  # right-click chromedriver.exe and Copy Path
        bro.get(url)
        # locate an element (By.<strategy> stands for LINK_TEXT, CLASS_NAME, etc.)
        bro.find_element(By.<strategy>, value)
        # e.g. bro.find_element(By.LINK_TEXT, 'text')
        # perform an action
        bro.find_element(By.<strategy>, value).click()
        # close the browser
        bro.quit()
        
            def __init__(self):
                # set cookies to keep the account from being banned; this only works when COOKIES_ENABLED = True in settings and process_request is rewritten in the downloader middleware to attach the cookie
                self.headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.82 Safari/537.36',
                              'Cookie': 'SECKEY_ABVK=hUAgUzjagDt7tRAoeBixuHARq3o5gtYSbMcKcAkM2Ho%3D; BMAP_SECKEY=adS1Ht6D0s1kWECRhDaf4vSf6OhvVYklxDSAiZ_3W0fIGZJ8rWr9TbzVPPYVaIW5ObgotD3EzPQrdL2XdiXldciYniNJWqvUHZ8Wk_ri0IuuKOY9h0aB4i09OHC30d-kbWCSrrEQe40grf1Gj9izw6SGB5cmzIjIenxaZzpq8lmEDDU5Kvl7gAMUQauc7TUC; mfw_uuid=62357780-7ac5-d4dc-9a8e-5a02aa298353; _r=baidu; _rp=a%3A2%3A%7Bs%3A1%3A%22p%22%3Bs%3A18%3A%22www.baidu.com%2Flink%22%3Bs%3A1%3A%22t%22%3Bi%3A1647671168%3B%7D; oad_n=a%3A5%3A%7Bs%3A5%3A%22refer%22%3Bs%3A21%3A%22https%3A%2F%2Fwww.baidu.com%22%3Bs%3A2%3A%22hp%22%3Bs%3A13%3A%22www.baidu.com%22%3Bs%3A3%3A%22oid%22%3Bi%3A1026%3Bs%3A2%3A%22dm%22%3Bs%3A15%3A%22www.mafengwo.cn%22%3Bs%3A2%3A%22ft%22%3Bs%3A19%3A%222022-03-19+14%3A26%3A08%22%3B%7D; __jsluid_h=01784e2b1c452421aa25034fbbde3ed9; __mfwothchid=referrer%7Cwww.baidu.com; __omc_chl=; __mfwc=referrer%7Cwww.baidu.com; uva=s%3A307%3A%22a%3A4%3A%7Bs%3A13%3A%22host_pre_time%22%3Bs%3A10%3A%222022-03-19%22%3Bs%3A2%3A%22lt%22%3Bi%3A1647671169%3Bs%3A10%3A%22last_refer%22%3Bs%3A180%3A%22https%3A%2F%2Fwww.baidu.com%2Flink%3Furl%3DKZtwUSmw3x4cyZcTJdfrzYa8Pr4pEgDbvJU1Pv7yOxRPpeRIeoKj_rydoZuVdCf0_IXBx40vQyB-xiuXsf_AyQ1y3t3mO4En4c5USvOZ_ya%26wd%3D%26eqid%3Df0f5303d000ee843000000036235777b%22%3Bs%3A5%3A%22rhost%22%3Bs%3A13%3A%22www.baidu.com%22%3B%7D%22%3B; __mfwurd=a%3A3%3A%7Bs%3A6%3A%22f_time%22%3Bi%3A1647671169%3Bs%3A9%3A%22f_rdomain%22%3Bs%3A13%3A%22www.baidu.com%22%3Bs%3A6%3A%22f_host%22%3Bs%3A3%3A%22www%22%3B%7D; __mfwuuid=62357780-7ac5-d4dc-9a8e-5a02aa298353; UM_distinctid=17fa0dad264324-08a77f4b1aeaeb-9771539-e1000-17fa0dad26543a; __omc_r=; PHPSESSID=cdbbvncvrd5rqepai636p6pos7; Hm_lvt_8288b2ed37e5bc9b4c9f7008798d2de0=1647743622,1647851451,1647908302,1648001811; bottom_ad_status=0; __jsl_clearance=1648007349.914|0|yoFqmnWY6O7Msv1j5KemUKE3POE%3D; __mfwa=1647671168836.14813.17.1648001810109.1648007353609; CNZZDATA30065558=cnzz_eid%3D2058254581-1647670157-null%26ntime%3D1648003704; __mfwb=b20cf490195f.2.direct; __mfwlv=1648009103; __mfwvn=13; __mfwlt=1648009103; Hm_lpvt_8288b2ed37e5bc9b4c9f7008798d2de0=1648009104; ariaDefaultTheme=undefined'}
        
                option = ChromeOptions()
                option.add_argument('--ignore-certificate-errors')
                option.add_argument('--ignore-ssl-errors')
                option.add_experimental_option('excludeSwitches', ['enable-automation', 'enable-logging'])
                self.bro = webdriver.Chrome(executable_path=r'E:/爬蟲/vocation/vocation/spiders/chromedriver.exe',options=option)
                self.item = VocationItem()

         

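      • As promised above, a quick programmatic check for dynamic loading: fetch the raw HTML with requests and look for the target text. This is a minimal sketch; the url and keyword are placeholders to be replaced with your own target page and a string you can see in the rendered page.

        import requests

        # fetch the page the way scrapy would, without running any JavaScript
        url = 'https://www.mafengwo.cn/...'  # placeholder: the list page you target
        keyword = 'some attraction title'    # placeholder: text visible in the rendered page
        raw_html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).text
        if keyword in raw_html:
            print('keyword found in the raw HTML -> static, plain scrapy is enough')
        else:
            print('keyword missing from the raw HTML -> dynamically loaded, use selenium')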
    • Jumping to detail pages
      • This project scrapes each attraction's name, description, phone, suggested visit duration, tickets, opening hours, and transport. Every field except the name lives on a detail page reached by clicking the title on the list page, so a callback function is needed. The list page is dynamically loaded but the detail pages are not, so the downloader middleware must be rewritten to handle the two kinds of response differently.
        • On the list page, grab each attraction's title and detail-page href, then call the callback to scrape the detail page once each href is in hand. Remember to add wait time, otherwise you will operate on the page before its data has loaded and get errors.
        • Avoid locating elements by xpath here: the page refreshes after every pagination click and some tags change with it, raising errors such as selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element. Even without an error, you can end up scraping the first page's data after the page has updated. Instead, locate by CLASS_NAME to get a WebElement object and work with its attributes, as sketched below.
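        • A minimal sketch of this idea using explicit waits, assuming bro is the webdriver from the constructor above and the scenic-list / pg-next class names used by the spider below. Re-locating the element after the refresh avoids both stale-element errors and the previous-page data problem:

          from selenium.webdriver.common.by import By
          from selenium.webdriver.support.ui import WebDriverWait
          from selenium.webdriver.support import expected_conditions as EC

          wait = WebDriverWait(bro, 10)
          # wait until the list container exists instead of sleeping a fixed time
          listing = wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'scenic-list')))
          print(listing.text)  # data of the current page

          bro.find_element(By.CLASS_NAME, 'pg-next').click()
          # staleness_of fires once the old element detaches, i.e. the page really refreshed
          wait.until(EC.staleness_of(listing))
          # re-locate to get the new page's element, not the previous page's data
          listing = bro.find_element(By.CLASS_NAME, 'scenic-list')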
        • Attributes of a WebElement object
          # import modules
          from selenium import webdriver
          from selenium.webdriver.common.by import By
          # create a browser object
          driver = webdriver.Firefox()
          # visit the url
          url = "https://www.douban.com/"
          driver.get(url)
          # locate the douban search box by name
          elem = driver.find_element(By.NAME, "q")
          # print the elem object
          print(elem)
          # print the tag name
          print(elem.tag_name)
          # print elem.parent (actually the WebDriver instance, see below)
          print(elem.parent)
          # print the value of the element's type attribute
          print(elem.get_attribute('type'))
          # locate the douban 7.0 text by xpath
          elem_1 = driver.find_element(By.XPATH, '//p[@class="app-title"]')
          # print the element's text content
          print(elem_1.text)
          # quit the browser
          driver.quit()
          
          tag_name returns the element's tag name
          parent, despite the name, returns the WebDriver instance the element was found from, not its parent DOM node (see the sketch below for reaching the actual parent)
          get_attribute('type') returns the value of the named attribute; the attribute name in the brackets can be changed
          text returns the element's text content
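          Since parent does not return the DOM parent, here is a minimal sketch of reaching the real parent node via the XPath parent axis, reusing elem from the snippet above:

          from selenium.webdriver.common.by import By

          # '..' selects the enclosing DOM node of elem
          parent_node = elem.find_element(By.XPATH, '..')
          print(parent_node.tag_name)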
        • Operations on a WebElement object
          from selenium import webdriver
          from selenium.webdriver.common.by import By
          import time

          def test_start_selenium():
              # open the browser driver and load baidu
              driver = webdriver.Firefox()
              url = "https://www.baidu.com/"
              driver.get(url)
              input_el = driver.find_element(By.ID, "kw")
              time.sleep(3)
              # type text into the search box
              input_el.send_keys("老友記")
              # click the search button
              input_e2 = driver.find_element(By.XPATH, '//input[@type="submit"]')
              input_e2.click()
              time.sleep(3)
              # clear the text typed into the box
              input_el.clear()
              time.sleep(3)
              input_el.send_keys("西遊記")
              time.sleep(3)
              # submit the form
              input_el.submit()
              driver.quit()

          test_start_selenium()
          
          send_keys(""): type text
          click(): click the element
          clear(): clear the input
          submit(): submit the form
           
      • Writing the callback function
        • A 521 error in the callback (<521.......>HTTP status code is not  handled or not allowed) means the request was sent without cookies, or with expired ones, so a fresh cookie is needed. To get it, open the devtools on the target page (F12, or Fn+F12 on some laptops), go to Network -> Doc -> Cookie, and copy the value into the headers in the __init__ constructor.
        • def parse(self, response):
              # requires: import scrapy; from time import sleep; from selenium.webdriver.common.by import By
              # 20 list pages, 15 attractions per page
              for i in range(20):
                  title_list = str(self.bro.find_element(By.CLASS_NAME, 'scenic-list').text)
                  title = title_list.split('\n')
                  for j in range(15):
                      self.item['title'] = title[j]
                      detail_url = self.bro.find_element(By.LINK_TEXT, title[j]).get_attribute('href')
                      # the item also has to travel with the request, via meta
                      yield scrapy.Request(str(detail_url), callback=self.detail_parse, meta=self.item, headers=self.headers)
                  # only page forward after all 15 titles on this page are collected
                  self.bro.find_element(By.CLASS_NAME, 'pg-next').click()
                  sleep(2)
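        • One pitfall worth guarding against: scrapy runs callbacks asynchronously, so a single shared item mutated inside the loop can be overwritten before detail_parse reads it. A minimal sketch of passing an independent snapshot instead (detail_parse would then read response.meta['item']):

          import copy

          # each request carries its own copy, so later loop iterations
          # cannot overwrite the title before the callback runs
          yield scrapy.Request(str(detail_url), callback=self.detail_parse,
                               meta={'item': copy.deepcopy(self.item)},
                               headers=self.headers)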

           

    • Rewriting the downloader middleware
      • In this project the list page is dynamically loaded, so its response must be intercepted, rendered by the browser so the data is actually in the page, and only then returned. This means rewriting process_response in the downloader middleware; process_request is also rewritten there to attach the cookie to every request.
        • After rewriting the middleware, update and uncomment the corresponding settings:
          COOKIES_ENABLED = True
          

          DOWNLOADER_MIDDLEWARES = {
             'vocation.middlewares.VocationDownloaderMiddleware': 543,
          }
          

        class VocationDownloaderMiddleware(object):
            # requires: from time import sleep; from scrapy.http import HtmlResponse
            def process_request(self, request, spider):
                Cookie='SECKEY_ABVK=hUAgUzjagDt7tRAoeBixuHARq3o5gtYSbMcKcAkM2Ho%3D; BMAP_SECKEY=adS1Ht6D0s1kWECRhDaf4vSf6OhvVYklxDSAiZ_3W0fIGZJ8rWr9TbzVPPYVaIW5ObgotD3EzPQrdL2XdiXldciYniNJWqvUHZ8Wk_ri0IuuKOY9h0aB4i09OHC30d-kbWCSrrEQe40grf1Gj9izw6SGB5cmzIjIenxaZzpq8lmEDDU5Kvl7gAMUQauc7TUC; mfw_uuid=62357780-7ac5-d4dc-9a8e-5a02aa298353; _r=baidu; _rp=a%3A2%3A%7Bs%3A1%3A%22p%22%3Bs%3A18%3A%22www.baidu.com%2Flink%22%3Bs%3A1%3A%22t%22%3Bi%3A1647671168%3B%7D; oad_n=a%3A5%3A%7Bs%3A5%3A%22refer%22%3Bs%3A21%3A%22https%3A%2F%2Fwww.baidu.com%22%3Bs%3A2%3A%22hp%22%3Bs%3A13%3A%22www.baidu.com%22%3Bs%3A3%3A%22oid%22%3Bi%3A1026%3Bs%3A2%3A%22dm%22%3Bs%3A15%3A%22www.mafengwo.cn%22%3Bs%3A2%3A%22ft%22%3Bs%3A19%3A%222022-03-19+14%3A26%3A08%22%3B%7D; __jsluid_h=01784e2b1c452421aa25034fbbde3ed9; __mfwothchid=referrer%7Cwww.baidu.com; __omc_chl=; __mfwc=referrer%7Cwww.baidu.com; uva=s%3A307%3A%22a%3A4%3A%7Bs%3A13%3A%22host_pre_time%22%3Bs%3A10%3A%222022-03-19%22%3Bs%3A2%3A%22lt%22%3Bi%3A1647671169%3Bs%3A10%3A%22last_refer%22%3Bs%3A180%3A%22https%3A%2F%2Fwww.baidu.com%2Flink%3Furl%3DKZtwUSmw3x4cyZcTJdfrzYa8Pr4pEgDbvJU1Pv7yOxRPpeRIeoKj_rydoZuVdCf0_IXBx40vQyB-xiuXsf_AyQ1y3t3mO4En4c5USvOZ_ya%26wd%3D%26eqid%3Df0f5303d000ee843000000036235777b%22%3Bs%3A5%3A%22rhost%22%3Bs%3A13%3A%22www.baidu.com%22%3B%7D%22%3B; __mfwurd=a%3A3%3A%7Bs%3A6%3A%22f_time%22%3Bi%3A1647671169%3Bs%3A9%3A%22f_rdomain%22%3Bs%3A13%3A%22www.baidu.com%22%3Bs%3A6%3A%22f_host%22%3Bs%3A3%3A%22www%22%3B%7D; __mfwuuid=62357780-7ac5-d4dc-9a8e-5a02aa298353; UM_distinctid=17fa0dad264324-08a77f4b1aeaeb-9771539-e1000-17fa0dad26543a; __omc_r=; PHPSESSID=cdbbvncvrd5rqepai636p6pos7; Hm_lvt_8288b2ed37e5bc9b4c9f7008798d2de0=1647743622,1647851451,1647908302,1648001811; bottom_ad_status=0; __jsl_clearance=1648007349.914|0|yoFqmnWY6O7Msv1j5KemUKE3POE%3D; __mfwa=1647671168836.14813.17.1648001810109.1648007353609; CNZZDATA30065558=cnzz_eid%3D2058254581-1647670157-null%26ntime%3D1648003704; __mfwb=b20cf490195f.2.direct; __mfwlv=1648009103; __mfwvn=13; __mfwlt=1648009103; Hm_lpvt_8288b2ed37e5bc9b4c9f7008798d2de0=1648009104; ariaDefaultTheme=undefined'
                # split on '=' only once so cookie values containing '=' survive intact
                cookies = {i.split('=', 1)[0]: i.split('=', 1)[1] for i in Cookie.split('; ')}
                request.cookies = cookies
                return None


            def process_response(self, request, response, spider):
                bro = spider.bro
                # only the start urls (the dynamic list pages) need browser rendering
                if request.url in spider.start_urls:
                    bro.get(request.url)
                    sleep(2)
                    page_text = bro.page_source
                    # wrap the rendered source in a new response for the spider
                    new_response = HtmlResponse(url=request.url, body=page_text, encoding='utf-8', request=request)
                    return new_response
                else:
                    # static detail pages pass through untouched
                    return response

        

         

    • Writing the detail-page handler
      • Some fields may be empty, which shifts the scraped values into the wrong fields, so prefer locating elements by attribute values (class names and the like) rather than by position; a label-driven alternative is sketched after the code below.

        def detail_parse(self, response):
            item = response.meta  # the item passed through meta; title is already filled in
            item['introduction'] = response.xpath('/html/body/div[2]/div[3]/div[2]/div[1]/text()').get().strip()
            item['phone'] = response.xpath('/html/body/div[2]/div[3]/div[2]/ul/li[@class="tel"]/div[2]/text()').get()
            item['time'] = response.xpath('/html/body/div[2]/div[3]/div[2]/ul/li[@class="item-time"]/div[@class="content"]/text()').get()  # prefer locating by attribute values over position
            item['traffic'] = response.xpath('/html/body/div[2]/div[3]/div[2]/dl[1]/dd/text()').get()
            item['ticket'] = response.xpath('/html/body/div[2]/div[3]/div[2]/dl[2]/dd/div/text()').get()
            item['open_time'] = response.xpath('/html/body/div[2]/div[3]/div[2]/dl[3]/dd/text()').get()
            yield item
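        As mentioned above, a label-driven alternative that cannot shift when a field is empty is to find each <dl> by its <dt> caption instead of by position. A minimal sketch; the captions are assumptions about the page layout implied by the XPaths above:

        def field_by_label(response, label):
            # select the <dl> whose <dt> caption contains the label,
            # then take the text of its <dd>; '' if the block is missing
            return response.xpath(
                f'//dl[dt[contains(text(), "{label}")]]/dd//text()'
            ).get(default='').strip()

        item['traffic'] = field_by_label(response, '交通')
        item['ticket'] = field_by_label(response, '門票')
        item['open_time'] = field_by_label(response, '開放時間')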

         

    • Closing the browser

        def close(self, spider):
            # shut down the shared browser when the spider finishes
            self.bro.quit()
        

         

  • That wraps up the whole project. Looking back, the problems do not seem like much, but when you hit one and cannot find a solution anywhere online, it really is crushing.