
Scraping Maoyan Movie Comments with Scrapy

Target: link

1. Finding the comment API

Switch the browser mode from PC to mobile (using device emulation in the browser's developer tools), then watch the network requests on the comments page.


2. Analyzing the API URL

First URL: http://m.maoyan.com/mmdb/comments/movie/1216446.json?_v_=yes&offset=0&startTime=0
Second URL: http://m.maoyan.com/mmdb/comments/movie/1216446.json?_v_=yes&offset=15&startTime=2018-10-11%2015%3A19%3A05
Third URL: http://m.maoyan.com/mmdb/comments/movie/1216446.json?_v_=yes&offset=30&startTime=2018-10-11%2015%3A19%3A05

The URL pattern

When offset=0, startTime is also 0; after that, offset grows by 15 on each request while startTime stays fixed at a specific timestamp.

Look at the first (non-hot) comment, posted "3 minutes ago": the startTime in the URL is 2018-10-11 15:19:05, and my computer's clock read 15:22:04. So startTime is simply the time of the latest comment.

Constructing the API URL

  • 1216446: the movie ID
  • offset: the pagination offset
  • startTime: the time of the latest comment

We fetch the time of the latest comment, fix it as a constant, and then increase the offset by 15 on each request to construct the full sequence of requests, as sketched below.
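As a quick illustration, here is a minimal sketch of how such a URL can be built (the helper name build_comment_url is mine, not part of the crawler below):

from urllib.parse import quote

BASE_URL = 'http://m.maoyan.com/mmdb/comments/movie/{}.json?_v_=yes&offset={}&startTime={}'

def build_comment_url(movie_id, offset, start_time):
    # quote() percent-encodes the space and colons in the timestamp,
    # matching the URLs captured above
    return BASE_URL.format(movie_id, offset, quote(start_time))

print(build_comment_url(1216446, 15, '2018-10-11 15:19:05'))
# -> http://m.maoyan.com/mmdb/comments/movie/1216446.json?_v_=yes&offset=15&startTime=2018-10-11%2015%3A19%3A05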

Analyzing the JSON fields

  • cmts: regular comments, 15 per request, since the offset step is 15
  • hcmts: 10 hot comments
  • total: total number of comments
{
"approve": 3913,
"approved": false,
"assistAwardInfo": {
"avatar": "",
"celebrityId": 0,
"celebrityName": "",
"rank": 0,
"title": ""
},
"authInfo": "",
"avatarurl": "https://img.meituan.net/avatar/7e9e9348115c451276afffda986929b311657.jpg",
"cityName": "深圳",
"content": "腦洞很大,有創意,笑點十足又有淚點,十分感動,十分推薦。懷著看喜劇電影去看的,最後哭了個稀里譁。確實值得一看,很多場景讓我回憶青春,片尾的舊照片更是讓我想起了小時候。",
"filmView": false,
"gender": 1,
"id": 1035829945,
"isMajor": false,
"juryLevel": 0,
"majorType": 0,
"movieId": 1216446,
"nick": "lxz367738371",
"nickName": "發白的牛仔褲",
"oppose": 0,
"pro": false,
"reply": 94,
"score": 5,
"spoiler": 0,
"startTime": "2018-08-17 03:30:37",
"supportComment": true,
"supportLike": true,
"sureViewed": 0,
"tagList": {},
"time": "2018-08-17 03:30",
"userId": 1326662323,
"userLevel": 2,
"videoDuration": 0,
"vipInfo": "",
"vipType": 0
}
  • cityName: the commenter's city
  • content: the comment text
  • gender: gender
  • id: the comment's ID
  • nickName: the commenter's nickname
  • userLevel: the commenter's Maoyan level
  • score: the rating (out of 5)
  • time: the comment time
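To double-check these fields outside of Scrapy, a quick standalone sketch with the requests library (assuming the endpoint still behaves as captured above; the mobile User-Agent mirrors the one configured in settings later):

import requests

url = ('http://m.maoyan.com/mmdb/comments/movie/1216446.json'
       '?_v_=yes&offset=0&startTime=0')
headers = {
    'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) '
                  'AppleWebKit/604.1.38 (KHTML, like Gecko) '
                  'Version/11.0 Mobile/15A372 Safari/604.1'
}

data = requests.get(url, headers=headers).json()
print(data['total'])          # total number of comments
for cmt in data['cmts']:      # 15 regular comments per page
    print(cmt['startTime'], cmt['nickName'], cmt['score'], cmt['cityName'])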

3. The Scrapy code

The spider file

Construct the initial request URL:

import json, re
from datetime import datetime
from urllib.parse import quote
import scrapy
from scrapy import Request
from maoyan.items import MaoyanItem
from maoyan.settings import MOVIE_ID  # assumption: MOVIE_ID is defined in settings.py (see below)

class Movie1Spider(scrapy.Spider):
    name = 'movie1'
    allowed_domains = ['m.maoyan.com']
    base_url = 'http://m.maoyan.com/mmdb/comments/movie/{}.json?_v_=yes&offset={}&startTime={}'

    def start_requests(self):
        # Use the current time, percent-encoded, as the initial startTime
        time_now = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        url = self.base_url.format(MOVIE_ID, 0, quote(time_now))
        yield Request(url=url)

We then pull the fields we need out of the JSON data.
Note that a single fixed startTime cannot be used indefinitely: with one fixed time the API only returns data up to offset=1005, after which nothing comes back. Instead we keep moving startTime backwards; once we have crawled back past the first comment, going further would return comments from around the movie's release date.
So the code's termination condition is: a comment's time is later than the request time in the URL.

    def parse(self, response):
        last_time = re.search(r'startTime=(.*)', response.url).group(1)  # time from the URL
        data = json.loads(response.text)
        cmts = data.get('cmts')
        comment_time = None
        for cmt in cmts:
            maoyan_item = MaoyanItem()
            maoyan_item['id'] = cmt.get('id')
            maoyan_item['nickname'] = cmt.get('nickName')
            maoyan_item['gender'] = cmt.get('gender')
            maoyan_item['cityname'] = cmt.get('cityName')
            maoyan_item['content'] = cmt.get('content')
            maoyan_item['score'] = cmt.get('score')
            comment_time = cmt.get('startTime')
            maoyan_item['time'] = comment_time
            maoyan_item['userlevel'] = cmt.get('userLevel')
            if quote(comment_time) > last_time:  # comment is later than the time in the URL: stop
                break
            yield maoyan_item
        if comment_time and quote(comment_time) < last_time:  # last comment is earlier than the URL time
            url = self.base_url.format(MOVIE_ID, 15, quote(comment_time))  # use the last comment's time
            yield Request(url=url, meta={'next_time': comment_time})

The Item file

import scrapy
from scrapy import Field

class MaoyanItem(scrapy.Item):
    table = 'movie'  # target MySQL table
    id = Field()  # comment ID
    nickname = Field()  # nickname
    gender = Field()  # gender
    cityname = Field()  # city name
    content = Field()  # comment text
    score = Field()  # rating
    time = Field()  # comment time
    userlevel = Field()  # commenter level

The pipelines file

Save the scraped data to a MySQL database:

import pymysql

class MaoyanPipeline(object):
    def __init__(self, host, databases, user, password, port):
        self.host = host
        self.databases = databases
        self.user = user
        self.password = password
        self.port = port

    @classmethod
    def from_crawler(cls, crawler):
        # Read the MySQL connection parameters from settings.py
        return cls(
            host=crawler.settings.get('MYSQL_HOST'),
            databases=crawler.settings.get('MYSQL_DATABASES'),
            user=crawler.settings.get('MYSQL_USER'),
            password=crawler.settings.get('MYSQL_PASSWORD'),
            port=crawler.settings.get('MYSQL_PORT'),
        )

    def open_spider(self, spider):
        self.db = pymysql.connect(host=self.host, user=self.user,
                                  password=self.password, database=self.databases,
                                  charset='utf8', port=self.port)
        self.cursor = self.db.cursor()

    def process_item(self, item, spider):
        # Build the INSERT statement dynamically from the item's fields
        data = dict(item)
        keys = ','.join(data.keys())
        values = ','.join(['%s'] * len(data))
        sql = 'insert into %s (%s) values (%s)' % (item.table, keys, values)
        self.cursor.execute(sql, tuple(data.values()))
        self.db.commit()
        return item

    def close_spider(self, spider):
        self.db.close()
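The pipeline assumes the movie table already exists. A minimal sketch of a matching schema (the column types here are my own guesses based on the fields above, not taken from the original project):

import pymysql

db = pymysql.connect(host='localhost', user='root', password='',
                     database='movie', charset='utf8')
cursor = db.cursor()
cursor.execute("""
    CREATE TABLE IF NOT EXISTS movie (
        id BIGINT,             -- comment ID
        nickname VARCHAR(255),
        gender TINYINT,
        cityname VARCHAR(64),
        content TEXT,
        score FLOAT,
        time VARCHAR(32),
        userlevel INT
    ) DEFAULT CHARSET = utf8
""")
db.close()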

The settings file

Keep the download delay reasonably low, add request headers, configure the MySQL connection, and enable the pipeline:

BOT_NAME = 'maoyan'
SPIDER_MODULES = ['maoyan.spiders']
NEWSPIDER_MODULE = 'maoyan.spiders'
DEFAULT_REQUEST_HEADERS = {
    'Referer': 'http://m.maoyan.com/movie/1216446/comments?_v_=yes',
    'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) '
                  'Version/11.0 Mobile/15A372 Safari/604.1'
}
ITEM_PIPELINES = {
    'maoyan.pipelines.MaoyanPipeline': 300,
}
MYSQL_HOST = ''
MYSQL_DATABASES = 'movie'
MYSQL_PORT = 3306  # your MySQL port (3306 is the default)
MYSQL_USER = 'root'
MYSQL_PASSWORD = ''

DOWNLOAD_DELAY = 0.1  # delay between download requests
MOVIE_ID = '1216446'  # movie ID

4. Crawl results

5. Scrapy-Redis

Since there are so many comments, crawling them in a distributed fashion is faster.

Modifying the spider file

  • First, import RedisSpider: from scrapy_redis.spiders import RedisSpider
  • Change the spider's parent class from Spider to RedisSpider
  • Since the start URL is now read from the Redis database, remove start_urls and add redis_key
from scrapy_redis.spiders import RedisSpider

class Movie1Spider(RedisSpider):
    name = 'movie1'
    allowed_domains = ['m.maoyan.com']
    base_url = 'http://m.maoyan.com/mmdb/comments/movie/{}.json?_v_=yes&offset={}&startTime={}'
    redis_key = 'movie1:start_urls'

    # start_requests is no longer needed: the start URL comes from Redis
    # def start_requests(self):
    #     time_now = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    #     url = self.base_url.format(MOVIE_ID, 0, quote(time_now))
    #     yield Request(url=url)

Modifying the settings file

  • REDIS_URL: the Redis connection parameters
  • SCHEDULER: use the scrapy-redis scheduler
  • DUPEFILTER_CLASS: use the scrapy-redis duplicate filter
  • SCHEDULER_PERSIST: persist the Redis queue instead of clearing it, so the crawl can resume after an interruption
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
REDIS_URL = 'redis://:password@host:6379'
SCHEDULER_PERSIST = True
MYSQL_HOST = 'host'  # your MySQL host
MYSQL_DATABASES = 'movie'
MYSQL_PORT = 62782
MYSQL_USER = 'root'
MYSQL_PASSWORD = ''
DOWNLOAD_DELAY = 0.1  # delay between download requests
MOVIE_ID = '1216446'  # movie ID

After starting the program, we connect to the Redis database and push a start URL to test that everything works on a single machine.

127.0.0.1:6379> lpush movie1:start_urls http://m.maoyan.com/mmdb/comments/movie/1216446.json?_v_=yes&offset=0&startTime=2018-10-11%2018%3A14%3A17
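Equivalently, the start URL can be pushed from Python with the redis-py package (a sketch; fill in your own host and password):

import redis
from urllib.parse import quote

r = redis.Redis(host='127.0.0.1', port=6379)  # add password=... if required
start_url = ('http://m.maoyan.com/mmdb/comments/movie/1216446.json'
             '?_v_=yes&offset=0&startTime=' + quote('2018-10-11 18:14:17'))
r.lpush('movie1:start_urls', start_url)  # the key must match redis_key in the spider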

Distributed deployment

Use Gerapy for batch deployment.