[Car Reputation Analysis] 3. Crawling Car Review Data

阿新 • Published: 2019-02-14

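Environment

Ubuntu 16.04
Python 3.5

Framework

Scrapy

Goal

This project analyzes car reputation (word of mouth). The first step is to crawl review data for different car models.

The review data is crawled from 58che (58車, www.58che.com), following the site's car-model categories.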

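Crawl workflow

1. Get the link to each car model from the brand page (the original post marks one example model with a red box in a screenshot).
2. Open that link, scrape the overall score shown on the model page (also marked with a red box in the original screenshot), and write it to a file.
3. After writing the overall score, build the link to the model's user review page.

   Append list_s1_p1.html to the link obtained in step 1 to form the review page link.

   [Note] This is the link to the first page only; step 5 covers what to do when there are further pages.
4. Scrape the data on the review page, such as the user ID, score, and review text.
5. If the review page has a next page, keep scraping the reviews on the next page.

   [Method] Check whether the page contains a "next page" element; if it does, call back the method that parses the review page.
6. Save the scraped data to a file.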
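
The link-building step above can be sketched in plain Python. This is only an illustration: the helper name `review_page_url` and the example model URL are hypothetical, and the `page` parameter assumes that later pages follow the same `list_s1_p<N>.html` pattern. The post only confirms the first page, and the spider itself follows the "next page" link instead of guessing URLs.

```python
def review_page_url(model_url, page=1):
    """Build a review-page URL from a car model URL.

    Assumption: pages are named list_s1_p<N>.html; the article only
    confirms N = 1.
    """
    # Normalise the trailing slash so that both 'http://host/x' and
    # 'http://host/x/' produce the same result.
    base = model_url.rstrip('/') + '/'
    return '{}list_s1_p{}.html'.format(base, page)


# Hypothetical model URL, for illustration only.
first_page = review_page_url('http://www.58che.com/1234/')
print(first_page)
```

Normalising the trailing slash is the main point here: plain string concatenation only works when the model link already ends with `/`.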

Detailed steps

Create a new project

First create the project directory:

cd /home/t/dataset/
mkdir carSpider

Then create the project:

scrapy startproject carSpider

Edit `items.py`

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class CarspiderItem(scrapy.Item):
    file = scrapy.Field()       # output file name (the car model name)
    car = scrapy.Field()        # car model
    score = scrapy.Field()      # overall score
    u_id = scrapy.Field()       # user ID
    u_score = scrapy.Field()    # user's score
    u_merit = scrapy.Field()    # pros mentioned in the review
    u_demerit = scrapy.Field()  # cons mentioned in the review
    u_summary = scrapy.Field()  # review summary
    u_flower = scrapy.Field()   # number of "flowers" the review received
    u_brick = scrapy.Field()    # number of "bricks" the review received

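Write `carSpider.py`

import scrapy
from carSpider.items import CarspiderItem

baseDir = '/home/t/dataset/carRemark/'  # output directory; it must exist before the crawl starts
startUrl = 'http://www.58che.com/brand.html'


class CarSpider(scrapy.Spider):

    name = 'spider'  # spider name, used by `scrapy crawl spider`
    start_urls = [startUrl]  # class attribute, so __init__ does not need to be overridden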
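    # First-level parse method: the brand page
    def parse(self, response):
        # Locate the car-model links
        subclasses = response.css('body > div.fltop > div.marcenter > div > div > div.r > ul > li > dl > dt > a')
        for subclass in subclasses:
            subclass_name = subclass.xpath('text()').extract_first()  # car model name
            subclass_link = subclass.xpath('@href').extract_first()   # car model link
            if not subclass_name or not subclass_link:
                continue  # skip anchors that have no text or no href
            # Call the next-level parse method and pass the model name along as the output file name
            yield scrapy.Request(url=response.urljoin(subclass_link), callback=self.parse_car_subclass, meta={'file': subclass_name})

    # Second-level parse method: a car model page
    def parse_car_subclass(self, response):
        infos = response.css('#line1 > div.cars_line2.l > div.dianpings > div.d_div1.clearfix > font')  # overall score element
        for info in infos:
            score = info.xpath('text()').extract_first()  # text of the overall score element
            file = response.meta['file']  # model name passed in by the previous Request
            self.writeScore(file, score)  # write the overall score to the file
            link = response.url + 'list_s1_p1.html'  # first page of user reviews
            # Call the next-level parse method, again passing the model name as the file name
            yield scrapy.Request(url=link, callback=self.parse_remark, meta={'file': file})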

    # Third-level parse method: a user review page
    def parse_remark(self, response):
        # Locate the review elements, one <dl> per review
        infos = response.css('body > div.newbox > div > div.xgo_cars_w760.l > div.xgo_dianping_infos.mb10 > div.xgo_cars_dianping > div > dl')
        for info in infos:
            uid = info.xpath('dd[1]/strong/a/text()')[0].extract()  # user ID
            score = info.xpath('dd[1]/div/div/@style')[0].extract()  # star rating as a style string
            score = self.getScore(score)  # convert the star rating to a five-point score

            # If a "pros" label exists, its next sibling holds the pros text; otherwise use ''
            try:
                node = info.xpath('dd[2]/div/div[contains(@class,"l redc00")]')[0]
                merit = node.xpath('following-sibling::*[1]/text()')[0].extract()
            except IndexError:
                merit = ''

            # If a "cons" label exists, its next sibling holds the cons text; otherwise use ''
            try:
                node = info.xpath('dd[2]/div/div[contains(@class,"l hei666")]')[0]
                demerit = node.xpath('following-sibling::*[1]/text()')[0].extract()
            except IndexError:
                demerit = ''

            # If a "summary" label exists, its next sibling holds the summary text; otherwise use ''
            # NOTE: contains(@class,"l") also matches the "l redc00" and "l hei666" labels,
            # so this may pick up the pros or cons label when one is present.
            try:
                node = info.xpath('dd[2]/div/div[contains(@class,"l")]')[0]
                summary = node.xpath('following-sibling::*[1]/text()')[0].extract()
            except IndexError:
                summary = ''

            flower = info.xpath('dd[2]/div[contains(@class,"apply")]/a[3]/span/text()')[0].extract()  # number of flowers
            brick = info.xpath('dd[2]/div[contains(@class,"apply")]/a[4]/span/text()')[0].extract()  # number of bricks

            # Build the Item
            item = CarspiderItem()
            item['file'] = response.meta['file']
            item['u_id'] = uid
            item['u_score'] = score
            item['u_merit'] = merit
            item['u_demerit'] = demerit
            item['u_summary'] = summary
            item['u_flower'] = flower
            item['u_brick'] = brick

            # Hand the Item over to the pipeline
            yield item

        # If the page has a "next page" element, call parse_remark again to scrape the next page of reviews
        next_pages = response.css('body > div.newbox > div > div.xgo_cars_w760.l > div.xgo_dianping_infos.mb10 > div.xgo_cars_dianping > div > div > a.next')
        for next_page in next_pages:
            next_page_link = next_page.xpath('@href').extract_first()
            if next_page_link:
                # The href is a site-relative path, so prepend the site root
                next_page_link = 'http://www.58che.com' + next_page_link
                file = response.meta['file']
                yield scrapy.Request(url=next_page_link, callback=self.parse_remark, meta={'file': file})


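    # Write the overall score to the model's file.
    # The score is written as a plain text line, so the file mixes score lines with the JSON lines added by the pipeline.
    def writeScore(self, file, score):
        if score is None:
            return  # no score text was found on the page
        with open(baseDir + file + '.json', 'a+', encoding='utf-8') as f:
            f.write(score + '\n')

    # Convert a star-rating style string to a five-point score (a dict used like a switch statement)
    def getScore(self, text):
        # The style looks like `width:100%`; keep the part after the `:`
        text = text.split(':')[1].strip(' ;')
        return {
            '100%': 5,
            '80%': 4,
            '60%': 3,
            '40%': 2,
            '20%': 1,
            '0%': 0
        }.get(text)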

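[Explanation]

        # Locate the review elements, one <dl> per review
        infos = response.css('body > div.newbox > div > div.xgo_cars_w760.l > div.xgo_dianping_infos.mb10 > div.xgo_cars_dianping > div > dl')

This selector matches each review on the review page as a whole element (highlighted in a screenshot in the original post).

        for info in infos:
            uid = info.xpath('dd[1]/strong/a/text()')[0].extract()  # user ID
            score = info.xpath('dd[1]/div/div/@style')[0].extract()  # star rating as a style string
            score = self.getScore(score)  # convert the star rating to a five-point score

uid is read from the user-name link inside each review (highlighted in a screenshot in the original post).

score is read from the style attribute of the star-rating element (also highlighted in the original post). Its value looks like width:80%, so it has to be converted to a five-point score with getScore().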
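
As an alternative to the lookup table, the same conversion could be written with a regular expression. The sketch below is not the author's code: `width_to_score` is a hypothetical helper, and it assumes that every 20% of width corresponds to one star, which matches the table in getScore(). Unlike the table, it tolerates extra whitespace or a trailing semicolon and returns fractional values for widths such as 90%.

```python
import re


def width_to_score(style):
    """Convert a style string such as 'width:80%' to a five-point score.

    Returns None when no percentage width can be found.
    """
    match = re.search(r'width\s*:\s*(\d+(?:\.\d+)?)\s*%', style or '')
    if match is None:
        return None
    # Assumption: 20% of width equals one star.
    return float(match.group(1)) / 20


print(width_to_score('width:80%'))
print(width_to_score('width: 60%;'))
print(width_to_score('no width here'))
```

Returning None for unparseable input mirrors what `dict.get` does in the original method, so callers can treat both versions the same way.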


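# If a "pros" label exists, its next sibling holds the pros text; otherwise use ''
try:
    node = info.xpath('dd[2]/div/div[contains(@class,"l redc00")]')[0]
    merit = node.xpath('following-sibling::*[1]/text()')[0].extract()
except IndexError:
    merit = ''

First check whether the "pros" label element exists (red box in the original screenshot). If it does, read the content of its next sibling node (blue box in the original screenshot); if it does not, use an empty string.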
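
The "label, then next sibling, else empty string" pattern can be tried out without Scrapy. The sketch below uses the standard library's `xml.etree.ElementTree` on a made-up, simplified fragment; the markup, the class `r`, and the helper `text_after_label` are assumptions for illustration, and the real 58che page differs. In Scrapy itself, a similar fallback can be expressed with `extract_first(default='')`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup for one review body; the real page differs.
SAMPLE = """
<div>
  <div class="l redc00">Pros:</div><div class="r">Roomy and quiet</div>
  <div class="l hei666">Cons:</div><div class="r">Thirsty engine</div>
</div>
"""


def text_after_label(container, label_class):
    """Return the text of the element right after the label, or ''."""
    children = list(container)
    for index, child in enumerate(children):
        # A label counts only if it has a following sibling to read from.
        if child.get('class') == label_class and index + 1 < len(children):
            return (children[index + 1].text or '').strip()
    return ''


root = ET.fromstring(SAMPLE)
print(text_after_label(root, 'l redc00'))
print(repr(text_after_label(root, 'l missing')))
```

Matching the class string exactly, rather than with a substring test, avoids accidentally picking up a different label whose class merely contains the same letters.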


# If the page has a "next page" element, call parse_remark again to scrape the next page of reviews
next_pages = response.css('body > div.newbox > div > div.xgo_cars_w760.l > div.xgo_dianping_infos.mb10 > div.xgo_cars_dianping > div > div > a.next')
for next_page in next_pages:
    next_page_link = next_page.xpath('@href').extract_first()
    if next_page_link:
        # The href is a site-relative path, so prepend the site root
        next_page_link = 'http://www.58che.com' + next_page_link
        file = response.meta['file']
        yield scrapy.Request(url=next_page_link, callback=self.parse_remark, meta={'file': file})

After parsing the content above, check whether the review page is paginated by looking for the "next page" element (red box in the original screenshot). If it exists, take the link from that element (orange box in the original screenshot).

Then call back parse_remark to parse the next page of reviews.
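
Prepending the site root works as long as every "next page" href is a site-relative path. A slightly more defensive option is to resolve the href against the current page URL, which is what Scrapy's `response.urljoin()` does. The sketch below shows the idea with the standard library only; `next_page_url` and the example URLs are hypothetical.

```python
from urllib.parse import urljoin


def next_page_url(current_url, href):
    """Resolve the href of a "next page" link against the current page URL."""
    if not href:
        return None  # no next page
    return urljoin(current_url, href)


# Hypothetical URLs, for illustration only.
current = 'http://www.58che.com/1234/list_s1_p1.html'
print(next_page_url(current, '/1234/list_s1_p2.html'))  # site-relative href
print(next_page_url(current, 'list_s1_p2.html'))        # page-relative href
print(next_page_url(current, None))                     # last page
```

Both kinds of relative href resolve to the same absolute URL, and an href that is already absolute is returned unchanged.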


Edit `pipelines.py`

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

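import json
import codecs

baseDir = '/home/t/dataset/carRemark/'  # same output directory as in the spider


class CarspiderPipeline(object):
    def process_item(self, item, spider):
        print(item['file'])  # progress output: the model currently being written
        # Append each review as one JSON object per line to <baseDir><car model>.json
        with codecs.open(baseDir + item['file'] + '.json', 'a+', encoding='utf-8') as f:
            # ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes
            line = json.dumps(dict(item), ensure_ascii=False) + '\n'
            f.write(line)

        return item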
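
Because writeScore() in the spider and this pipeline append to the same file, each output file mixes plain score lines with JSON lines. A reader therefore has to tell the two apart. The following is a minimal sketch of one way to do that; `read_model_file` is a hypothetical helper, and the demo file content is made up.

```python
import json
import os
import tempfile


def read_model_file(path):
    """Split one output file into raw score lines and parsed review dicts."""
    scores, reviews = [], []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if line.startswith('{'):
                reviews.append(json.loads(line))  # a review written by the pipeline
            else:
                scores.append(line)               # a score written by writeScore()
    return scores, reviews


# Demo with made-up content in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, 'demo.json')
    with open(sample, 'w', encoding='utf-8') as f:
        f.write('4.5\n')
        f.write(json.dumps({'u_id': 'user1', 'u_score': 4}, ensure_ascii=False) + '\n')
    scores, reviews = read_model_file(sample)

print(scores, reviews)
```

Writing the overall score to a separate file, or into the `score` field of each item, would make the output valid JSON Lines throughout; the sketch simply works with the format as the project produces it.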

Edit `settings.py`

# -*- coding: utf-8 -*-

# Scrapy settings for carSpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'carSpider'

SPIDER_MODULES = ['carSpider.spiders']
NEWSPIDER_MODULE = 'carSpider.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'carSpider (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'carSpider.middlewares.CarspiderSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'carSpider.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'carSpider.pipelines.CarspiderPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
HTTPCACHE_ENABLED = False
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

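[Explanation]

ROBOTSTXT_OBEY = False

Change the original True to False, so that Scrapy does not check the site's robots.txt before crawling.

ITEM_PIPELINES = {
    'carSpider.pipelines.CarspiderPipeline': 300,
}

Uncomment this block to register the pipeline; otherwise the pipeline is never called.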
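
The generated settings file leaves its throttling options commented out. If the crawl turns out to put too much load on the site, a few of them could be enabled. The fragment below is an optional sketch and not part of the original project; the setting names are standard Scrapy settings, but the values are illustrative assumptions.

```python
# Optional additions to settings.py (illustrative values, not from the original project)
DOWNLOAD_DELAY = 1                  # seconds to wait between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # cap on parallel requests per domain
AUTOTHROTTLE_ENABLED = True         # let Scrapy adapt the delay to server latency
AUTOTHROTTLE_START_DELAY = 1        # initial delay used by AutoThrottle
AUTOTHROTTLE_MAX_DELAY = 30         # upper bound for the delay under high latency
```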

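Run the crawler

Create a new file named entrypoint.py in the project root directory:

# Equivalent to running `scrapy crawl spider` from the project root
from scrapy.cmdline import execute
execute(['scrapy', 'crawl', 'spider'])

Project source code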
