Scraping Vehicle Configuration Data from Autohome (汽車之家)

阿新 • Published: 2021-01-21

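I. Requirements

Fetch the configuration data for every model of a given brand and save it to Excel.

The rough workflow:

1. Get the brand id: brand_id

2. Use the brand id to get the series ids: series_id

3. Fetch each series' configuration page

4. Parse the configuration page (the hardest step; it reuses code shared by other developers, credited at the end)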
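
Steps 1 and 2 both come down to pulling the first run of digits out of each link's href in the left-hand menu. A minimal sketch of that idea using only the standard library, assuming hrefs shaped like `/price/brand-33.html`; the sample HTML and the helper name `extract_ids` are illustrative assumptions, not captured from the live site:

```python
import re
from html.parser import HTMLParser


class _LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> tag in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_ids(html):
    """Map the first number found in each link's href to the link text."""
    parser = _LinkCollector()
    parser.feed(html)
    ids = {}
    for href, text in parser.links:
        match = re.search(r"\d+", href)
        if match:  # skip links that carry no numeric id
            ids[match.group(0)] = text
    return ids


# Hypothetical menu fragment; the real response may be structured differently.
sample = (
    "<ul><li><a href='/price/brand-33.html'>Audi(56)</a></li>"
    "<li><a href='/price/brand-15.html'>BMW(48)</a></li>"
    "<li><a href='#'>no id here</a></li></ul>"
)
print(extract_ids(sample))
```

Unlike indexing `re.findall(...)[0]` directly, this skips links without a number instead of raising an IndexError.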

II. Code

The script below ran successfully when tested.

import requests
import json
import xlwt
from bs4 import BeautifulSoup
import re
from urllib import parse
from selenium import webdriver


class Car_home_config(object):

    def __init__(self):
        self.session = requests.Session()
        self.params = None
        self.brand_dict = {}
        self.series_dict = {}
        self.brand_name = None

    def get_header(self):
        self.headers = {
            "authority": "car.autohome.com.cn",
            "method": "GET",
            "path": "/AsLeftMenu/As_LeftListNew.ashx?%s" % parse.urlencode(self.params),
            "scheme": "https",
            "accept": "*/*",
            "accept-encoding": "gzip, deflate, br",
            "accept-language": "zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7",
            "cache-control": "no-cache",
            "pragma": "no-cache",
            "sec-ch-ua": "Google Chrome;v=87,Not;A Brand;v=99,Chromium;v=87",
            "sec-ch-ua-mobile": "?0",
            "sec-fetch-dest": "document",
            "sec-fetch-mode": "navigate",
            "sec-fetch-site": "none",
            "sec-fetch-user": "?1",
            "upgrade-insecure-requests": "1",
            "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36"
        }

    # Fetch the ids of all brands
    def get_brand_id(self):
        self.params = {
            "typeId": "1",
            "brandId": "0",
            "fctId": "0",
            "seriesId": "0"
        }
        self.get_header()
        url = r"https://car.autohome.com.cn/AsLeftMenu/As_LeftListNew.ashx"
        res = self.session.get(url=url, headers=self.headers, params=self.params)
        res.encoding = res.apparent_encoding
        html = res.text
        # print(html)
        soup = BeautifulSoup(html, 'lxml')
        ul_list = soup.find_all("ul")
        for ul in ul_list:
            li_list = ul.find_all("li")
            for li in li_list:
                a_href = li.find("a").attrs.get('href')
                a_text = li.find("a").text
                # print(a_href)
                # print(a_text)
                brand_id = re.findall(r"\d+", a_href)[0]  # first run of digits in the href
                self.brand_dict[brand_id] = a_text
        return self.brand_dict

    def get_AsLeftMenu(self):
        url = r"https://car.autohome.com.cn/AsLeftMenu/As_LeftListNew.ashx"
        res = self.session.get(url=url, headers=self.headers, params=self.params)
        res.encoding = res.apparent_encoding
        html = res.text
        soup = BeautifulSoup(html, 'lxml')
        dd_list = soup.find_all("dd")
        for dd in dd_list:
            a_list = dd.find_all("a")
            for a in a_list:
                a_href = a.attrs.get('href')
                a_text = a.text
                print(a_href)
                print(a_text)
                series_id = re.findall(r"\d+", a_href)[0]  # first run of digits in the href
                self.series_dict[series_id] = a_text

    # Fetch the series ids for one brand (or for every brand if brand_name is empty)
    def get_series_id(self):
        self.get_brand_id()
        if self.brand_name:
            for k, v in self.brand_dict.items():
                if self.brand_name in v:
                    self.params = {
                        "typeId": "1",
                        "brandId": k,
                        "fctId": "0",
                        "seriesId": "0"
                    }
                    self.get_header()
                    self.get_AsLeftMenu()
                    return self.series_dict
        else:
            for k, v in self.brand_dict.items():
                self.params = {
                    "typeId": "1",
                    "brandId": k,
                    "fctId": "0",
                    "seriesId": "0"
                }
                self.get_header()
                self.get_AsLeftMenu()
            return self.series_dict

    # Fetch the configuration page for one series
    def get_config_content(self, series_id):
        res = self.session.get(r"https://car.autohome.com.cn/config/series/{}.html".format(series_id), verify=False,
                           headers={
                               "authority": "car.autohome.com.cn",
                               "method": "GET",
                               "path": "/config/series/{}.html".format(series_id),
                               "scheme": "https",
                               "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
                               "accept-encoding": "gzip, deflate, br",
                               "accept-language": "zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7",
                               "cache-control": "no-cache",
                               "referer": "https://www.autohome.com.cn/",
                               "sec-ch-ua": "Google Chrome;v=87,Not;A Brand;v=99,Chromium;v=87",
                               "sec-ch-ua-mobile": "?0",
                               "sec-fetch-dest": "document",
                               "sec-fetch-mode": "navigate",
                               "sec-fetch-site": "same-site",
                               "sec-fetch-user": "?1",
                               "upgrade-insecure-requests": "1",
                               "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36"})
        html = res.content.decode("utf-8")
        return html

    def car_info(self, html):
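        config = re.search("var config = (.*?)};", html)  # vehicle parameters
        option = re.search("var option = (.*?)};", html)  # active and passive safety equipment
        bag = re.search("var bag = (.*?)};", html)  # optional packages
        # Concatenate the three JavaScript assignments into one string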
        car_info = ""
        if config and option and bag:
            car_info = car_info + config.group(0) + option.group(0) + bag.group(0)
        return car_info

    def write_html(self, js_list, car_info):
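        # Stub DOM for running the page's JS. Cracking this part was the most troublesome
        # and time-consuming step; it is based on code shared by other developers online.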
        DOM = ("var rules = '2';"
               "var document = {};"
               "function getRules(){return rules}"
               "document.createElement = function() {"
               "      return {"
               "              sheet: {"
               "                      insertRule: function(rule, i) {"
               "                              if (rules.length == 0) {"
               "                                      rules = rule;"
               "                              } else {"
               "                                      rules = rules + '#' + rule;"
               "                              }"
               "                      }"
               "              }"
               "      }"
               "};"
               "document.querySelectorAll = function() {"
               "      return {};"           "};"
               "document.head = {};"
               "document.head.appendChild = function() {};"

               "var window = {};"
               "window.decodeURIComponent = decodeURIComponent;")

        # Append the JS snippets extracted from the page to the DOM stub
        for item in js_list:
            DOM = DOM + item
        html_type = "<html><meta http-equiv='Content-Type' content='text/html; charset=utf-8' /><head></head><body>    <script type='text/javascript'>"
        # Assemble a web page that can be executed in a browser
        js = html_type + DOM + " document.write(rules)</script></body></html>"
        # Mode "w" overwrites any original.html left over from a previous run
        with open("original.html", "w", encoding="utf-8") as f:
            f.write(js)
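        driver = None
        text = ""
        try:
            driver = webdriver.PhantomJS(
                executable_path=r"phantomjs.exe")
            driver.get("original.html")
            # Read the body text, which holds the CSS rules written by the JS
            text = driver.find_element_by_tag_name('body').text
        except Exception as e:
            print(e)
        finally:
            # Only shut the browser down if it actually started
            if driver is not None:
                driver.quit()
        if not text:
            return None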
        # Find every span tag in the vehicle data
        span_list = re.findall("<span(.*?)></span>", car_info)  # car_info is the string assembled in car_info()
        # Replace each span with the text that its CSS class maps to
        for span in span_list:
            # Extract the class name, e.g. hs_kw7_optionZl from <span class='hs_kw7_optionZl'></span>
            info = re.search("'(.*?)'", span)
            if info:
                # Builds the pattern  hs_kw7_optionZl::before { content:(.*?)}
                class_info = re.escape(info.group(1)) + "::before { content:(.*?)}"
                content = re.search(class_info, text).group(1)  # the quoted text for that class, e.g. "油耗" (fuel consumption)
                car_info = car_info.replace("<span class='" + info.group(1) + "'></span>",
                                            re.search("\"(.*?)\"", content).group(1))
        return car_info

    def save(self, car_info, car_name, save_path):
        # Persist the data: parse the three JSON objects and write them to a spreadsheet
        car_item = {}
        config = re.search("var config = (.*?);", car_info).group(1)
        option = re.search("var option = (.*?);var", car_info).group(1)
        bag = re.search("var bag = (.*?);", car_info).group(1)
        config_re = json.loads(config)
        option_re = json.loads(option)
        bag_re = json.loads(bag)
        config_item =[]
        option_item = []
        for i in config_re['result']['paramtypeitems']:
            config_item+=i['paramitems']
        for i in option_re['result']['configtypeitems']:
            option_item+=i['configitems']
        # bag_item = bag_re['result']['bagtypeitems'][0]['bagitems']
        for car in config_item:
            car_item[car['name']] = []
            for value in car['valueitems']:
                car_item[car['name']].append(value['value'])
        for car in option_item:
            car_item[car['name']] = []
            for value in car['valueitems']:
                car_item[car['name']].append(value['value'])
        # for car in bag_item[0]['valueitems']:
        #     car_item[car['name']] = []
        #     car_item[car['name']].append(car['bagid'])
        #     car_item[car['name']].append(car['pricedesc'])
        #     car_item[car['name']].append(car['description'])
        # Build the spreadsheet
        workbook = xlwt.Workbook(encoding='utf-8')  # create the workbook
        worksheet = workbook.add_sheet('汽車之家')  # create a sheet
        cols = 0
        start_row = 0
        for co in car_item:
            worksheet.write(start_row, cols, co)  # row 0 holds the parameter names
            cols = cols + 1
        # Number of trims. The site serves Simplified Chinese, so the key is
        # assumed to be '车型名称' (model name) rather than its Traditional form.
        end_row_num = start_row + len(car_item['车型名称'])
        for row in range(start_row, end_row_num):
            col_num = 0  # column index
            row += 1
            for col in car_item:
                try:
                    con = str(car_item[col][row - 1])
                except IndexError:
                    con = ""  # this parameter has no value for this trim
                worksheet.write(row, col_num, con)
                col_num = col_num + 1
        workbook.save('{}/{}.xls'.format(save_path, car_name))

    # Look up series configurations; leave brand_name empty to fetch every brand
    def check(self, brand_name, save_path="./"):
        self.brand_name = brand_name
        self.get_series_id()
        for series_id, car_name in self.series_dict.items():
            print(series_id, car_name)
            html = self.get_config_content(series_id)
            car_info = self.car_info(html)
            js_list = re.findall(r'(\(function\([a-zA-Z]{2}.*?_\).*?\(document\);)', html)
            car_info = self.write_html(js_list, car_info)
            if car_info:
                self.save(car_info, car_name, save_path)


car = Car_home_config()
car.check("奥迪")  # Audi; written in Simplified Chinese to match the brand names the site returns
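
The decoding done in `write_html` and the flattening done in `save` can be tried without a browser or network access. A minimal sketch, assuming the browser output has already been reduced to rules of the form `.class::before { content:"text" }` joined by `#`; the sample rules, the sample JSON and the helper names `parse_rules`, `decode_spans` and `to_columns` are made up for illustration:

```python
import json
import re


def parse_rules(rules_text):
    """Build {class_name: text} from '.cls::before { content:"text" }' rules."""
    pattern = r'\.([\w-]+)::before\s*\{\s*content:\s*"(.*?)"'
    return dict(re.findall(pattern, rules_text))


def decode_spans(car_info, rules_text):
    """Replace each <span class='cls'></span> placeholder with its text."""
    mapping = parse_rules(rules_text)

    def substitute(match):
        # Leave the placeholder untouched when no rule matches its class.
        return mapping.get(match.group(1), match.group(0))

    return re.sub(r"<span class='([\w-]+)'></span>", substitute, car_info)


def to_columns(config_json):
    """Flatten paramtypeitems into {parameter name: [value per trim]}."""
    columns = {}
    for group in config_json["result"]["paramtypeitems"]:
        for item in group["paramitems"]:
            columns[item["name"]] = [v["value"] for v in item["valueitems"]]
    return columns


# Hypothetical inputs that mimic the shapes the script works with.
rules = (
    '.hs_kw0_configAb::before { content:"Fuel" }#'
    '.hs_kw1_configAb::before { content:"type" }'
)
raw = (
    '{"result": {"paramtypeitems": [{"paramitems": ['
    '{"name": "<span class=\'hs_kw0_configAb\'></span> '
    '<span class=\'hs_kw1_configAb\'></span>", '
    '"valueitems": [{"value": "Petrol"}, {"value": "Diesel"}]}]}]}}'
)
decoded = decode_spans(raw, rules)
print(to_columns(json.loads(decoded)))
```

Building the class-to-text mapping once and substituting in a single `re.sub` pass avoids running a fresh regex search over the whole rule text for every span, which is what the loop in `write_html` does.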

phantomjs.exe download: https://phantomjs.org/download.html
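
PhantomJS is no longer maintained, and Selenium 4 removed `webdriver.PhantomJS`; later 4.x releases also dropped `find_element_by_tag_name`. On a current Selenium install, one possible replacement is headless Chrome. A minimal sketch, assuming Selenium 4 and a local Chrome are available; the helper names `to_file_url` and `render_body_text` are illustrative and not part of the original script:

```python
from pathlib import Path


def to_file_url(path):
    """Turn a local path into the absolute file:// URL a browser expects."""
    return Path(path).resolve().as_uri()


def render_body_text(path):
    """Open a local HTML file in headless Chrome and return its body text."""
    # Imported here so the helper above works even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(to_file_url(path))
        return driver.find_element(By.TAG_NAME, "body").text
    finally:
        driver.quit()  # always release the browser process


print(to_file_url("original.html"))
```

With this in place, the PhantomJS block in `write_html` could call `render_body_text("original.html")` instead; whether Chrome renders the generated rules identically has not been verified here.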

Thanks to the following authors:
https://www.cnblogs.com/kangz/p/10011348.html
https://www.cnblogs.com/pontoon/p/10459471.html
