Scraping Maoyan Reviews of "Avengers: Endgame" with Scrapy

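I. Analysis

First, a brief look at Scrapy's basic flow:

  1. The engine takes a link (URL) from the scheduler for the next fetch.
  2. The engine wraps the URL in a request (Request) and passes it to the downloader.
  3. The downloader fetches the resource and wraps it in a response (Response).
  4. The spider parses the Response.
  5. If an item (Item) is parsed out, it is handed to the item pipeline for further processing.
  6. If a link (URL) is parsed out, the URL is handed back to the scheduler to wait for fetching.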
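
The six steps above can be sketched without Scrapy at all. The following is only a toy model of the loop, not Scrapy's actual implementation: `crawl`, `PAGES`, and the stand-in downloader and parser are hypothetical names made up for illustration.

```python
from collections import deque


def crawl(start_urls, download, parse, pipeline):
    """Toy engine loop: scheduler -> downloader -> spider -> pipeline."""
    scheduler = deque(start_urls)          # URLs waiting to be fetched
    seen = set(start_urls)
    while scheduler:
        url = scheduler.popleft()          # steps 1-2: take the next URL
        response = download(url)           # step 3: fetch it
        for result in parse(response):     # step 4: the spider parses the response
            if isinstance(result, dict):
                pipeline(result)           # step 5: items go to the pipeline
            elif result not in seen:
                seen.add(result)
                scheduler.append(result)   # step 6: new URLs go back to the scheduler


# Stand-in "website" (made-up data) so the loop runs without network access.
PAGES = {
    "page1": ["page2", {"comment": "a"}],
    "page2": [{"comment": "b"}],
}
collected = []
crawl(["page1"], download=PAGES.get, parse=iter, pipeline=collected.append)
print(collected)
```

In real Scrapy the engine, scheduler, and downloader run asynchronously, but the division of responsibilities is the same.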

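I found this endpoint online: http://m.maoyan.com/mmdb/comments/movie/248172.json?_v_=yes&offset=0&startTime=2019-02-05%2020:28:22. You can keep offset at 0 and change the value of startTime to fetch more comments: the time of the last comment on each page becomes the new startTime, and the URL is rebuilt and requested again. (In startTime=2019-02-05%2020:28:22, the %20 is a URL-encoded space.)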
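
A minimal sketch of this paging scheme, using only the standard library. The helper names `build_url` and `next_start_time` are made up for illustration; stepping back one second mirrors what the spider below does.

```python
from datetime import datetime, timedelta
from urllib.parse import quote

BASE_URL = "http://m.maoyan.com/mmdb/comments/movie/248172.json"
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def build_url(start_time):
    """Build the comment URL; quote() turns the space into %20."""
    return f"{BASE_URL}?_v_=yes&offset=0&startTime={quote(start_time, safe=':-')}"


def next_start_time(last_comment_time):
    """Return the timestamp one second before the last comment on a page."""
    previous = datetime.strptime(last_comment_time, TIME_FORMAT) - timedelta(seconds=1)
    return previous.strftime(TIME_FORMAT)


print(build_url("2019-02-05 20:28:22"))
print(next_start_time("2019-04-25 00:00:00"))
```

Going through `datetime` instead of editing the string by hand keeps rollovers across minutes, hours, and days correct.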

II. Main code

items.py

import scrapy


class MaoyanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    city = scrapy.Field()  # city
    content = scrapy.Field()  # comment text
    user_id = scrapy.Field()  # user id
    nick_name = scrapy.Field()  # nickname
    score = scrapy.Field()  # rating
    time = scrapy.Field()  # comment time
    user_level = scrapy.Field()  # user level

 comment.py

import scrapy
import random
from scrapy.http import Request
import datetime
import json
from maoyan.items import MaoyanItem

class CommentSpider(scrapy.Spider):
    name = 'comment'
    allowed_domains = ['maoyan.com']
    # Pool of User-Agent strings; one is chosen at random when the class is defined
    uapools = [
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
        'Opera/9.80 (Windows NT 6.1; U; zh-cn) Presto/2.9.168 Version/11.50',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; InfoPath.3)',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB7.0)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)',
        'Mozilla/5.0 (Windows; U; Windows NT 6.1; ) AppleWebKit/534.12 (KHTML, like Gecko) Maxthon/3.0 Safari/534.12',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)',
        'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.33 Safari/534.3 SE 2.X MetaSr 1.0',
        'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E)',
        'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.41 Safari/535.1 QQBrowser/6.9.11079.201',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E) QQBrowser/6.9.11079.201',
        'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)',
        'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0'
    ]
    thisua = random.choice(uapools)
    header = {'User-Agent': thisua}
    current_time = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    # The next line overrides the current time with a fixed starting point;
    # remove it to start crawling from "now"
    current_time = '2019-04-24 18:50:22'
    end_time = '2019-04-24 00:05:00'  # film release time; crawling stops here
    url = 'http://m.maoyan.com/mmdb/comments/movie/248172.json?_v_=yes&offset=0&startTime=' + current_time.replace(' ', '%20')

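    def start_requests(self):
        current_t = str(self.current_time)
        # Timestamps share one fixed-width format, so string comparison orders them correctly
        if current_t > self.end_time:
            try:
                yield Request(self.url, headers=self.header, callback=self.parse)
            except Exception as error:
                print('Request 1 failed-----' + str(error))
        else:
            print('All relevant comments have been collected')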

    def parse(self, response):
        data = response.body.decode('utf-8', 'ignore')
        json_data = json.loads(data)['cmts']
        required = ('cityName', 'nickName', 'userId', 'content', 'score', 'startTime', 'userLevel')
        count = 0
        for item1 in json_data:
            # Keep only comments that carry every field we need
            if all(key in item1 for key in required):
                try:
                    # Create a fresh item per comment so yielded items never share state
                    item = MaoyanItem()
                    item['city'] = item1['cityName']
                    item['content'] = item1['content']
                    item['user_id'] = item1['userId']
                    item['nick_name'] = item1['nickName']
                    item['score'] = item1['score']
                    item['time'] = item1['startTime']
                    item['user_level'] = item1['userLevel']
                    yield item
                    count += 1
                    if count >= 15:
                        # After 15 comments, request the next page starting
                        # one second before the last comment seen
                        temp_time = item['time']
                        current_t = datetime.datetime.strptime(temp_time, '%Y-%m-%d %H:%M:%S') + datetime.timedelta(
                            seconds=-1)
                        current_t = str(current_t)
                        if current_t > self.end_time:
                            url1 = 'http://m.maoyan.com/mmdb/comments/movie/248172.json?_v_=yes&offset=0&startTime=' + current_t.replace(
                                ' ', '%20')
                            yield Request(url1, headers=self.header, callback=self.parse)
                        else:
                            print('All relevant comments have been collected')
                except Exception as error:
                    print('Error extracting fields 1-----' + str(error))
            else:
                print('Incomplete record, filtered out')
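
The field extraction in `parse` can also be written as a plain function, which makes it easy to try out without running a crawl. This is a sketch only: `FIELD_MAP`, `extract_comments`, and the sample payload are made up for illustration, and the payload merely imitates the `cmts` structure that the spider reads.

```python
import json

# Item field name -> key in the JSON returned by the comment endpoint
FIELD_MAP = {
    "city": "cityName",
    "content": "content",
    "user_id": "userId",
    "nick_name": "nickName",
    "score": "score",
    "time": "startTime",
    "user_level": "userLevel",
}


def extract_comments(body):
    """Yield one dict per complete comment; skip comments missing any field."""
    for raw in json.loads(body).get("cmts", []):
        if all(key in raw for key in FIELD_MAP.values()):
            yield {field: raw[key] for field, key in FIELD_MAP.items()}


# Made-up sample: one complete comment and one that lacks most fields.
SAMPLE = json.dumps({"cmts": [
    {"cityName": "合肥", "content": "好看", "userId": 1, "nickName": "a",
     "score": 5, "startTime": "2019-04-24 18:50:22", "userLevel": 2},
    {"content": "no city"},
]})
records = list(extract_comments(SAMPLE))
print(records)
```

Keeping the mapping in one dictionary means a renamed or added field only has to be changed in one place.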

  pipelines.py

import pandas as pd


class MaoyanPipeline(object):
    def process_item(self, item, spider):
        dict_info = {'city': item['city'], 'content': item['content'], 'user_id': item['user_id'],
                     'nick_name': item['nick_name'],
                     'score': item['score'], 'time': item['time'], 'user_level': item['user_level']}
        try:
            # Build a one-row table; index=[0] is required because every value is a scalar
            data = pd.DataFrame(dict_info, index=[0])
            # Append mode; utf_8_sig keeps the Chinese text readable when the CSV is opened in Excel
            data.to_csv('G:\\info.csv', header=False, index=True, mode='a',
                        encoding='utf_8_sig')
        except Exception as error:
            print('Error writing to file-------->>>' + str(error))
        else:
            print(dict_info['content'] + '---------->>>written to file')
        # Return the item so that any later pipeline still receives it
        return item
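
Building a DataFrame and reopening the file for every item works, but it is fairly heavy. A lighter alternative, sketched here under the assumption that a plain CSV is all that is needed, opens the file once per crawl with the standard `csv` module. The class name `CsvAppendPipeline` and the default path `info.csv` are made up for illustration; `open_spider`, `process_item`, and `close_spider` are the hooks Scrapy calls on a pipeline.

```python
import csv

FIELDS = ["city", "content", "user_id", "nick_name", "score", "time", "user_level"]


class CsvAppendPipeline:
    def __init__(self, path="info.csv"):
        self.path = path

    def open_spider(self, spider):
        # Called once when the spider starts: open the file in append mode
        self.file = open(self.path, "a", newline="", encoding="utf_8_sig")
        self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        # Write one row per item, in a fixed column order
        self.writer.writerow([item[name] for name in FIELDS])
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.file.close()
```

Unlike the pandas version, this sketch does not write a leading index column. Whichever pipeline is used, it only runs once it is listed in `ITEM_PIPELINES` in the project's settings.py.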

  The finished crawl produced about 12 MB of data, roughly 65,000 records.

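III. Data visualization

1. Main code

Modules used: pandas for data processing, matplotlib for plotting, jieba for Chinese word segmentation, wordcloud for the word cloud, and the map packages (echarts-countries-pypkg, echarts-china-provinces-pypkg, echarts-china-cities-pypkg).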

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import pandas as pd
from collections import Counter
from pyecharts import Geo, Bar, Scatter  # pyecharts 0.5.x import style
import jieba
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS
import time

# Notes on the parameters used in the audience map
# attr: label names (places)
# value: numeric values
# visual_range: range of the visual map
# symbol_size: size of the scatter points
# visual_text_color: label color
# is_visualmap: whether to map values to color (darker means more)
# maptype: map type

# Read the csv file (source data for every chart except the word cloud)
def read_csv(filename, titles):
    comments = pd.read_csv(filename, names = titles, low_memory = False)

    return comments

# Source data for the word cloud (a smaller file)
def read_csv1(filename1, titles):
    comments = pd.read_csv(filename1, names = titles, low_memory = False)

    return comments

# Audience distribution across China
def draw_map(comments):
    attr = comments['city_name'].fillna('zero_token') # replace missing values with 'zero_token'
    data = Counter(attr).most_common(300) # count each city and keep the 300 most frequent
    # print(data)
    data = [(city, count) for city, count in data if city != 'zero_token'] # drop the placeholder entry, if present
    geo = Geo('"Avengers 4" audience distribution across China', 'Data source: Mr.W', title_color = '#fff', title_pos = 'center', width = 1000, height = 600, background_color = '#404a59')
    attr, value = geo.cast(data) # data looks like [('合肥', 229), ('大連', 112)]
    geo.add('', attr, value, visual_range = [0, 4500], maptype = 'china', visual_text_color = '#fff', symbol_size = 10, is_visualmap = True)
    geo.render('G:\\影評\\觀眾地域分布-地理坐標圖.html')
    print('Audience distribution map finished')


# Audience ranking by city
def draw_bar(comments):
    data_top20 = Counter(comments['city_name']).most_common(20) # top 20 cities
    bar = Bar('"Avengers 4" audience ranking by city', 'Data source: Mr.W', title_pos = 'center', width = 1200, height = 600)
    attr, value = bar.cast(data_top20)
    bar.add('', attr, value, is_visualmap = True, visual_range = [0, 4500], visual_text_color = '#fff', is_more_utils = True, is_label_show = True)
    bar.render('G:\\影評\\觀眾地域排行榜單-柱狀圖.html')
    print('Audience ranking by city finished')


# Number of comments per date
# The date format must be normalized first, otherwise sorting by date comes out wrong
def draw_data_bar(comments):
    time1 = comments['time']
    time_data = []
    for t in time1:
        if pd.notnull(t) and 'time' not in t: # skip missing values and header rows
            date1 = t.replace('/', '-')
            date2 = date1.split(' ')[0]
            current_time_tuple = time.strptime(date2, '%Y-%m-%d') # parse the date string into a time tuple
            date = time.strftime('%Y-%m-%d', current_time_tuple) # format it back as a zero-padded string
            time_data.append(date)
    data = Counter(time_data).most_common() # list of (date string, count) tuples
    data = sorted(data, key = lambda pair : pair[0]) # sort by the date string in each tuple
    bar = Bar('"Avengers 4" number of comments per date', 'Data source: Mr.W', title_pos = 'center', width = 1200, height = 600)
    attr, value = bar.cast(data) # attr holds the dates, value the matching counts
    bar.add('', attr, value, is_visualmap = True, visual_range = [0, 3500], visual_text_color = '#fff', is_more_utils = True, is_label_show = True)
    bar.render('G:\\影評\\觀眾評論日期-柱狀圖.html')
    print('Number of comments per date finished')


# Number of comments per hour of day
# The hour must be converted to an integer, otherwise sorting comes out wrong
def draw_time_bar(comments):
    times = comments['time'] # named "times" to avoid shadowing the time module
    time_data = []
    for t in times:
        if pd.notnull(t) and ':' in t:
            clock = t.split(' ')[1]
            hour = clock.split(':')[0]
            time_data.append(int(hour))
    data = sorted(Counter(time_data).items()) # (hour, count) tuples sorted by hour
    bar = Bar('"Avengers 4" number of comments per hour', 'Data source: Mr.W', title_pos = 'center', width = 1200, height = 600)
    attr, value = bar.cast(data)
    bar.add('', attr, value, is_visualmap = True, visual_range = [0, 3500], visual_text_color = '#fff', is_more_utils = True, is_label_show = True)
    bar.render('G:\\影評\\觀眾評論時間-柱狀圖.html')
    print('Number of comments per hour finished')


# Word cloud, built from part of the data; with the full data set it raises MemoryError (64-bit Python does not)
def draw_word_cloud(comments):
    data = comments['comment']
    comment_data = []
    print('The data set is fairly large, so word segmentation is slow; please be patient')
    for item in data:
        if pd.notnull(item):
            comment_data.append(str(item))
    # Join the comments into one text; str(list) would add brackets and quotes to the input
    comment_after_split = jieba.cut(' '.join(comment_data), cut_all = False)
    words = ' '.join(comment_after_split)
    stopwords = STOPWORDS.copy()
    stopwords.update({'電影', '非常', '這個', '那個', '因為', '沒有', '所以', '如果', '演員', '這麽', '那麽', '最後', '就是', '不過', '一個', '感覺', '這部', '雖然', '不是', '真的', '覺得', '還是', '但是'})
    # font_path must point to a font that contains Chinese glyphs
    wc = WordCloud(width = 800, height = 600, background_color = '#000000', font_path = 'simfang', scale = 5, stopwords = stopwords, max_font_size = 200)
    wc.generate_from_text(words)
    plt.imshow(wc)
    plt.axis('off')
    plt.savefig('G:\\影評\\WordCloud.png')
    plt.show()

# Audience rating distribution
def draw_score_bar(comments):
    data_score = Counter(comments['score']).most_common()
    score_list = [item for item in data_score if item[0] != 'score'] # skip header rows
    data = sorted(score_list, key = lambda x : x[0])
    bar = Bar('"Avengers 4" audience rating distribution', 'Data source: Mr.W', title_pos = 'center', width = 1200, height = 600)
    attr, value = bar.cast(data)
    bar.add('', attr, value, is_visualmap = True, visual_range = [0, 4500], visual_text_color = '#fff', is_more_utils = True, is_label_show = True)
    bar.render('G:\\影評\\觀眾評分排行榜單-柱狀圖.html')
    print('Audience rating distribution finished')


# Audience user-level distribution
def draw_user_level_bar(comments):
    data_level = Counter(comments['user_level']).most_common()
    level_list = [item for item in data_level if item[0] != 'user_level'] # skip header rows
    data = sorted(level_list, key = lambda x : x[0])
    bar = Bar('"Avengers 4" audience user-level distribution', 'Data source: Mr.W', title_pos = 'center', width = 1200, height = 600)
    attr, value = bar.cast(data)
    # is_more_utils = True adds more utility buttons to the chart
    bar.add('', attr, value, is_visualmap = True, visual_range = [0, 4500], visual_text_color = '#fff', is_more_utils = True, is_label_show = True)
    bar.render('G:\\影評\\觀眾用戶等級排行榜單-柱狀圖.html')
    print('Audience user-level distribution finished')


if __name__ == '__main__':
    filename = 'G:\\info.csv'
    filename2 = 'G:\\info.csv'
    titles = ['city_name', 'comment', 'user_id', 'nick_name', 'score', 'time', 'user_level']
    comments = read_csv(filename, titles)
    comments2 = read_csv1(filename2, titles)
    draw_map(comments)
    draw_bar(comments)
    draw_data_bar(comments)
    draw_time_bar(comments)
    draw_word_cloud(comments2)
    draw_score_bar(comments)
    draw_user_level_bar(comments)
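
The counting behind the per-hour chart can be checked separately from the plotting. Below is a small standard-library sketch; the function name `comments_per_hour` and the sample timestamps are made up for illustration. Parsing with `datetime` yields the hour as an integer directly, so the sorted result is already in numeric order.

```python
from collections import Counter
from datetime import datetime


def comments_per_hour(timestamps):
    """Count comments per hour of day; skip missing or malformed values."""
    hours = Counter()
    for text in timestamps:
        try:
            hours[datetime.strptime(text, "%Y-%m-%d %H:%M:%S").hour] += 1
        except (TypeError, ValueError):
            continue  # None, NaN, header rows, or unexpected formats
    return sorted(hours.items())


sample = ["2019-04-24 09:05:00", "2019-04-24 21:10:00", "2019-04-25 21:00:00", None, "time"]
print(comments_per_hour(sample))
```

The returned list of `(hour, count)` tuples has the same shape as the data passed to `bar.cast` above.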

 

 2. Results and analysis

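01. Audience distribution (geo-coordinate map)

The nationwide heat map shows that viewers are concentrated in central, southern, eastern, and northeastern China, and provincial capitals stand out in particular (red marks the highest counts). This broadly matches actual economic, cultural, and spending levels. (P.S. Tickets for Avengers 4 were a bit pricey.)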

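02. "Avengers 4" audience ranking by city

First-tier cities such as Beijing, Shanghai, Guangzhou, and Shenzhen have many fans and strong spending power, so their viewer counts are very high.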

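03. "Avengers 4" audience rating distribution

Users who gave a full score make up roughly 70% of the total, which suggests that audiences came away satisfied and that the film is very watchable.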

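04. "Avengers 4" number of comments per date

Three days have passed since the release on the 24th, and the 25th had the most viewers. Perhaps people found premiere-day tickets a little expensive, haha.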

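05. "Avengers 4" number of comments per hour

The chart shows that comments cluster between 16:00 and 23:00. The film runs about three hours, so shifting the comment times back by the running time gives roughly when people started watching. It appears that most people went to the cinema after lunch (around 13:00) or after dinner (around 19:00), and that more people watched in the evening.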
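
The shift from comment time to viewing time is simple modular arithmetic. Here is a rough sketch, with the running time left as a parameter; `estimate_start_hours` and the counts are made-up illustrations, not figures from the crawl.

```python
def estimate_start_hours(comment_counts, runtime_hours):
    """Shift each comment hour back by the running time, wrapping around midnight."""
    shifted = {}
    for hour, count in comment_counts.items():
        start = (hour - runtime_hours) % 24
        shifted[start] = shifted.get(start, 0) + count
    return dict(sorted(shifted.items()))


counts = {1: 5, 16: 120, 22: 300}  # hypothetical comments per hour
print(estimate_start_hours(counts, 2))
print(estimate_start_hours(counts, 3))
```

This assumes viewers comment right after the film ends, so the result is only a rough estimate.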

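06. "Avengers 4" audience user-level distribution

Users at levels 0, 5, and 6 are almost absent, and the number of users drops sharply as the level rises. New users may be mostly young people who are interested in sci-fi films and therefore comment more, while long-time users may lean toward realistic dramas and comment less.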

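07. "Avengers 4" word cloud

The word cloud shows words such as "好看" (good), "可以" (decent), "完美" (perfect), "精彩" (brilliant), and "情懷" (nostalgia), so the film seems to have gone down well. Next come "鋼鐵俠" (Iron Man), "美隊" (Captain America), and "滅霸" (Thanos), which suggests that these characters carry the main storylines discussed in the reviews.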
