Scraping Maoyan Movie Comments with Scrapy
Target: link
1. Finding the Comment API
In the browser DevTools, switch the device mode from PC to mobile.
2. Analyzing the API URL
First URL: http://m.maoyan.com/mmdb/comments/movie/1216446.json?_v_=yes&offset=0&startTime=0
Second URL: http://m.maoyan.com/mmdb/comments/movie/1216446.json?_v_=yes&offset=15&startTime=2018-10-11%2015%3A19%3A05
Third URL: http://m.maoyan.com/mmdb/comments/movie/1216446.json?_v_=yes&offset=30&startTime=2018-10-11%2015%3A19%3A05
The URL pattern
When offset is 0, startTime is also 0; after that, offset increases by 15 on each request and startTime becomes a fixed timestamp. Look at the first (non-hot) comment, posted "3 minutes ago": startTime is 2018-10-11 15:19:05 and my computer's clock reads 15:22:04, so startTime is simply the time of the latest comment.
Constructing the request URL
- 1216446: the movie id
- offset: the pagination offset
- startTime: the time of the latest comment

We take the time of the latest comment, fix it as a constant, and then increase offset by 15 on each request to construct the next URL.
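The construction described above can be sketched with a small helper (the movie id and base URL are the ones used throughout this article; `quote` handles the URL-encoding of the timestamp):

```python
from urllib.parse import quote

BASE_URL = 'http://m.maoyan.com/mmdb/comments/movie/{}.json?_v_=yes&offset={}&startTime={}'
MOVIE_ID = '1216446'

def build_url(offset, start_time):
    # startTime must be URL-encoded: ' ' -> %20, ':' -> %3A
    return BASE_URL.format(MOVIE_ID, offset, quote(start_time))

# Reproduces the second URL shown above
print(build_url(15, '2018-10-11 15:19:05'))
```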
Analyzing the JSON response
- cmts: regular comments, 15 per request (the offset step is 15)
- hcmts: 10 hot comments
- total: total number of comments
```json
{
    "approve": 3913,
    "approved": false,
    "assistAwardInfo": {
        "avatar": "",
        "celebrityId": 0,
        "celebrityName": "",
        "rank": 0,
        "title": ""
    },
    "authInfo": "",
    "avatarurl": "https://img.meituan.net/avatar/7e9e9348115c451276afffda986929b311657.jpg",
    "cityName": "深圳",
    "content": "腦洞很大,有創意,笑點十足又有淚點,十分感動,十分推薦。懷著看喜劇電影去看的,最後哭了個稀里譁。確實值得一看,很多場景讓我回憶青春,片尾的舊照片更是讓我想起了小時候。",
    "filmView": false,
    "gender": 1,
    "id": 1035829945,
    "isMajor": false,
    "juryLevel": 0,
    "majorType": 0,
    "movieId": 1216446,
    "nick": "lxz367738371",
    "nickName": "發白的牛仔褲",
    "oppose": 0,
    "pro": false,
    "reply": 94,
    "score": 5,
    "spoiler": 0,
    "startTime": "2018-08-17 03:30:37",
    "supportComment": true,
    "supportLike": true,
    "sureViewed": 0,
    "tagList": {},
    "time": "2018-08-17 03:30",
    "userId": 1326662323,
    "userLevel": 2,
    "videoDuration": 0,
    "vipInfo": "",
    "vipType": 0
}
```
- cityname: the commenter's city
- content: the comment text
- gender: gender
- id: the commenter's id
- nickname: the commenter's nickname
- userlevel: the commenter's Maoyan level
- score: rating (out of 5)
- time: comment time
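As a quick illustration, this is how the fields listed above map from one raw `cmts` entry (sample values abridged from the JSON excerpt above; note the camelCase keys in the response versus the lowercase field names we store):

```python
# One raw comment object, trimmed to the keys we actually keep
cmt = {
    'id': 1035829945,
    'nickName': '發白的牛仔褲',
    'gender': 1,
    'cityName': '深圳',
    'content': '腦洞很大,有創意,笑點十足又有淚點...',
    'score': 5,
    'startTime': '2018-08-17 03:30:37',
    'userLevel': 2,
}

# Map response keys (camelCase) to our lowercase field names
item = {
    'id': cmt.get('id'),
    'nickname': cmt.get('nickName'),
    'gender': cmt.get('gender'),
    'cityname': cmt.get('cityName'),
    'content': cmt.get('content'),
    'score': cmt.get('score'),
    'time': cmt.get('startTime'),
    'userlevel': cmt.get('userLevel'),
}
print(item['nickname'], item['score'])
```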
3. Scrapy Code
The spider file
Construct the initial request URL:
```python
import re
import json
from datetime import datetime
from urllib.parse import quote

import scrapy
from scrapy import Request

from maoyan.items import MaoyanItem
from maoyan.settings import MOVIE_ID  # MOVIE_ID is defined in settings.py


class Movie1Spider(scrapy.Spider):
    name = 'movie1'
    allowed_domains = ['m.maoyan.com']
    base_url = 'http://m.maoyan.com/mmdb/comments/movie/{}.json?_v_=yes&offset={}&startTime={}'

    def start_requests(self):
        time_now = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        url = self.base_url.format(MOVIE_ID, 0, quote(time_now))
        yield Request(url=url)
```
Next, extract the parameters from the JSON data.
Note that a single fixed startTime cannot be crawled indefinitely: with one fixed time you can only reach offset=1005, after which no more data is returned. Also, once you reach the first comment and keep crawling backwards, you get comments from around the film's release date. So the termination condition is: a comment's time is greater than the request time in the URL.
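That termination check boils down to comparing URL-encoded timestamps as strings. Because the '%Y-%m-%d %H:%M:%S' format is fixed-width and the encoding is deterministic, lexicographic order matches chronological order; a minimal sketch:

```python
from urllib.parse import quote

# startTime taken from the request URL (already URL-encoded there)
last_time = quote('2018-10-11 15:19:05')

newer = quote('2018-10-11 15:20:00')
older = quote('2018-10-11 15:00:00')

print(newer > last_time)  # True  -> comment is newer than the URL time: stop
print(older < last_time)  # True  -> still older: use it as the next startTime
```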
```python
    def parse(self, response):
        last_time = re.search(r'startTime=(.*)', response.url).group(1)  # time taken from the URL
        data = json.loads(response.text)
        cmts = data.get('cmts')
        for cmt in cmts:
            maoyan_item = MaoyanItem()
            maoyan_item['id'] = cmt.get('id')
            maoyan_item['nickname'] = cmt.get('nickName')
            maoyan_item['gender'] = cmt.get('gender')
            maoyan_item['cityname'] = cmt.get('cityName')
            maoyan_item['content'] = cmt.get('content')
            maoyan_item['score'] = cmt.get('score')
            time = cmt.get('startTime')
            maoyan_item['time'] = time
            maoyan_item['userlevel'] = cmt.get('userLevel')
            if quote(time) > last_time:  # the comment is newer than the time in the URL
                break
            yield maoyan_item
        if quote(time) < last_time:  # the last comment is older than the time in the URL
            url = self.base_url.format(MOVIE_ID, 15, quote(time))  # use the time of the last comment
            yield Request(url=url, meta={'next_time': time})
```
Item檔案
class MaoyanItem(scrapy.Item):
# define the fields for your item here like:
# name = scrapy.Field()
table = 'movie'
id = Field() # ID
nickname = Field() # 名稱
gender = Field() # 性別
cityname = Field() # 城市名稱
content = Field() # 評論內容
score = Field() # 評分
time = Field() # 評論時間
userlevel = Field() # 評論者等級
The pipelines file
Save the data to a MySQL database:
```python
import pymysql


class MaoyanPipeline(object):
    def __init__(self, host, databases, user, password, port):
        self.host = host
        self.databases = databases
        self.user = user
        self.password = password
        self.port = port

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            host=crawler.settings.get('MYSQL_HOST'),
            databases=crawler.settings.get('MYSQL_DATABASES'),
            user=crawler.settings.get('MYSQL_USER'),
            password=crawler.settings.get('MYSQL_PASSWORD'),
            port=crawler.settings.get('MYSQL_PORT'),
        )

    def open_spider(self, spider):
        try:
            self.db = pymysql.connect(self.host, self.user, self.password, self.databases,
                                      charset='utf8', port=self.port)
            self.db.ping()
        except Exception:
            self.db = pymysql.connect(self.host, self.user, self.password, self.databases,
                                      charset='utf8', port=self.port)
        self.cursor = self.db.cursor()

    def process_item(self, item, spider):
        data = dict(item)
        keys = ','.join(data.keys())
        values = ','.join(['%s'] * len(data))
        sql = 'insert into %s (%s) values (%s)' % (item.table, keys, values)
        self.cursor.execute(sql, tuple(data.values()))
        self.db.commit()
        return item

    def close_spider(self, spider):
        self.db.close()
```
The settings file
Lower the download delay appropriately, add default request Headers, configure the MySQL connection, and enable the pipeline.
```python
BOT_NAME = 'maoyan'

SPIDER_MODULES = ['maoyan.spiders']
NEWSPIDER_MODULE = 'maoyan.spiders'

DEFAULT_REQUEST_HEADERS = {
    'Referer': 'http://m.maoyan.com/movie/1216446/comments?_v_=yes',
    'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) '
                  'Version/11.0 Mobile/15A372 Safari/604.1'
}

ITEM_PIPELINES = {
    'maoyan.pipelines.MaoyanPipeline': 300,
}

MYSQL_HOST = ''
MYSQL_DATABASES = 'movie'
MYSQL_PORT =
MYSQL_USER = 'root'
MYSQL_PASSWORD = ''

DOWNLOAD_DELAY = 0.1  # delay between download requests
MOVIE_ID = '1216446'  # movie ID
```
4. Crawl Results
5. Scrapy-Redis
Since there are so many comments, distributed crawling will be faster.
Modify the spider file
- First, import RedisSpider: from scrapy_redis.spiders import RedisSpider
- Change the spider's parent class from Spider to RedisSpider
- Since the start URLs are now read from the redis database, remove start_urls and add a redis_key.
```python
from scrapy_redis.spiders import RedisSpider


class Movie1Spider(RedisSpider):
    name = 'movie1'
    allowed_domains = ['m.maoyan.com']
    base_url = 'http://m.maoyan.com/mmdb/comments/movie/{}.json?_v_=yes&offset={}&startTime={}'
    redis_key = 'movie1:start_urls'

    # def start_requests(self):
    #     time_now = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    #     url = self.base_url.format(MOVIE_ID, 0, quote(time_now))
    #     yield Request(url=url)
```
Modify the settings file
- REDIS_URL: the redis connection parameters
- SCHEDULER: the scrapy-redis scheduler
- DUPEFILTER_CLASS: the scrapy-redis duplicate filter
- SCHEDULER_PERSIST: do not clear the redis queue, so an interrupted crawl can resume
```python
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
REDIS_URL = 'redis://:password@IP:6379'
SCHEDULER_PERSIST = True

MYSQL_HOST = 'host address'
MYSQL_DATABASES = 'movie'
MYSQL_PORT = 62782
MYSQL_USER = 'root'
MYSQL_PASSWORD = ''

DOWNLOAD_DELAY = 0.1  # delay between download requests
MOVIE_ID = '1216446'  # movie ID
```
After starting the crawler, connect to the Redis database and push a start URL to verify that it works on a single machine (the key must match the spider's redis_key):
127.0.0.1:6379> lpush movie1:start_urls http://m.maoyan.com/mmdb/comments/movie/1216446.json?_v_=yes&offset=0&startTime=2018-10-11%2018%3A14%3A17
Distributed deployment
Use Gerapy for batch deployment.