爬取汽車之家北京二手車資訊
阿新 • • 發佈:2018-12-07
爬取汽車之家北京二手車資訊
經測試,該網站:https://www.che168.com/beijing/list/ 反爬機制較低,僅需要偽造請求頭並設定爬取速率,但是100頁之後需要登入,登入之後再爬要慎重,一不小心就會永久封號。爬取的資料以各種型別存放,下面展示儲存到mysql資料庫中:
程式碼解析:
程式原始碼自提Github:https://github.com/H-Ang/carsSpider
爬蟲主程式
# Autohome (che168.com) crawler for Beijing second-hand car listings.
import requests
from lxml import etree
from data_save import *
import time


class Car_second():
    """Plain record holding one second-hand car listing."""
    name = ''          # listing title
    gonglishu = ''     # mileage text (公里數)
    brought_year = ''  # year of first registration
    location = ''      # city/district of the listing
    img_url = ''       # thumbnail image URL
    price = ''         # price text, e.g. "12.8" + unit suffix


def getInfors(url, i):
    """Fetch one listing page and return its cars as a list of Car_second.

    url -- full page URL to fetch
    i   -- 1-based page number, used only for progress logging
    """
    print("Page %d is saving." % i)
    # Forged browser headers: the site rejects the default requests UA.
    headers = {
        "Cache-Control": "no-cache",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36",
        "Referer": "https://www.che168.com/beijing/list/",
    }
    # timeout added so a stalled connection cannot hang the whole crawl;
    # raise_for_status fails loudly on 4xx/5xx instead of parsing an error page.
    response = requests.get(url=url, headers=headers, timeout=10)
    response.raise_for_status()
    ob_xml = etree.HTML(response.text)
    # Listing anchors; advert tiles carry class "adv-img" and are excluded.
    infos = ob_xml.xpath('//*[@id="viewlist_ul"]//li[not(contains(@class,"adv-img"))]/a')
    secondCars = []
    for info in infos:
        # Lazy-loaded images keep the real URL in @src2; fall back to @src.
        # (Evaluate the xpath once instead of twice as before.)
        src2 = info.xpath('.//img/@src2')
        img = src2[0] if src2 else info.xpath('.//img/@src')[0]
        name = info.xpath('.//h4/text()')[0]
        price = info.xpath('.//span[@class="price"]/text()')[0] + info.xpath('.//em/text()')[0]
        # Detail line is "mileage/year/location" — split into the 3 fields.
        myl = info.xpath('.//p/text()')[0].split('/')
        secondCar = Car_second()
        secondCar.name = name
        secondCar.img_url = img
        secondCar.gonglishu = myl[0]
        secondCar.brought_year = myl[1]
        secondCar.location = myl[2]
        secondCar.price = price
        secondCars.append(secondCar)
    return secondCars


if __name__ == '__main__':
    url = 'https://www.che168.com/beijing/a0_0msdgscncgpi1ltocsp{}exx0/'
    # Pages beyond 100 require login; crawling while logged in risks a ban.
    for i in range(1, 101):
        car_infors = getInfors(url.format(i), i)
        time.sleep(0.95)  # throttle to stay under the site's rate limit
        # savdFile(car_infors)
        saveMysql(car_infors)
儲存資料
def savdFile(datas):
    """Append every car record in *datas* to a UTF-8 text file.

    Each record is written as "name/mileage/year/location\nprice圖片地址:url"
    followed by a blank line. (Name keeps the original "savd" typo because
    callers already use it.)
    """
    # NOTE(review): hardcoded Windows path — presumably the author's data
    # drive; make this configurable for real use.
    with open('J:\DATAs\北京市二手車(汽車之家)\data.txt', 'a+', encoding='utf-8') as f:
        for data in datas:
            writeCont = (data.name + "/" + data.gonglishu + "/"
                         + data.brought_year + "/" + data.location + "\n"
                         + data.price + "圖片地址:" + data.img_url)
            f.write(writeCont + '\n\n')
    print('儲存完成。')


# --- Persist records to MySQL via the SQLAlchemy ORM ---
from sqlalchemy import Column, create_engine, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()


class Car(Base):
    """ORM model for one second-hand car row in table ``second_cars``."""
    __tablename__ = "second_cars"
    id = Column(Integer, primary_key=True, autoincrement=True, nullable=False)
    carName = Column(String(100))
    gonglishu = Column(String(20))
    brought_year = Column(String(10))
    location = Column(String(10))
    image_url = Column(String(200))
    price = Column(String(10))


def saveMysql(datas):
    """Insert every record in *datas* into MySQL in a single transaction.

    Creates the table on first use via ``Base.metadata.create_all``.
    """
    # SECURITY NOTE(review): credentials are hardcoded in the URL, and the
    # "[email protected]" fragment looks like it was mangled by the blog's
    # email-obfuscation — verify the real user:password@host before running.
    connect = create_engine(
        "mysql+pymysql://root:[email protected]:3306/second_cars",
        encoding='utf-8', echo=True)
    Base.metadata.create_all(connect)
    DBsession = sessionmaker(bind=connect)
    session = DBsession()
    try:
        for data in datas:
            session.add(Car(
                carName=data.name,
                gonglishu=data.gonglishu,
                brought_year=data.brought_year,
                price=data.price,
                location=data.location,
                image_url=data.img_url,
            ))
        session.commit()
    finally:
        # Always release the connection, even if the commit raises.
        session.close()
反思
儲存到mysql資料庫時,建立新物件並傳參時有點複雜,我曾經記得有種很簡單明瞭的方法,現在怎麼也想不起來,望指教哈。