Scraping Douban reviews of Vanguard (急先鋒) with Python: why is Jackie Chan's film rated so low?
阿新 • Published: 2020-10-07
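On the Douban Movies home page, Vanguard stands out for its surprisingly low score, far behind the other National Day holiday releases. The only other film rated about as low is Mulan, starring Liu Yifei. Having grown up on Jackie Chan's films and Liu Yifei's sword-fairy roles, I was not convinced, so let's see what the reviews actually say.
First, use Python to scrape the 500-plus reviews written by people who had seen the film. The remaining 200-plus were posted without watching it, so I skip them.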
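jianping = response.css('.main-bd a:nth-of-type(1)::text').extract()
jianping1 = []
for jianping2 in jianping:
    r = re.findall(r'[^ \n]', jianping2)  # keep every character except spaces and newlines
    r = ''.join(r)  # join the characters back into one string
    jianping1.append(r)
while '' in jianping1:
    jianping1.remove('')  # drop entries that were only whitespace
while '展開' in jianping1:
    jianping1.remove('展開')  # drop the "expand" link text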
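The cleanup above can also be read as one small function. This is a minimal sketch, not the author's code: the name `clean_texts` and its `drop` parameter are made up for illustration, and it assumes the input is a plain list of strings such as the one `extract()` returns.

```python
import re


def clean_texts(texts, drop=('展開',)):
    """Strip spaces and newlines from each string, then skip empties and unwanted markers.

    `clean_texts` and `drop` are hypothetical names used only for this sketch.
    """
    cleaned = []
    for text in texts:
        # Same effect as ''.join(re.findall(r'[^ \n]', text)) in the snippet above
        compact = re.sub(r'[ \n]', '', text)
        if compact and compact not in drop:
            cleaned.append(compact)
    return cleaned
```

Calling `clean_texts(response.css('.main-bd a:nth-of-type(1)::text').extract())` would then give the same list as `jianping1`, without the repeated `while ... remove` passes.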
Then turn the text into a word cloud to see what the reviews are talking about.
# Extra stop words: the film title and generic "review" terms
listOfFileName.append('急先鋒')
listOfFileName.append('急先鋒觀後感')
listOfFileName.append('急先鋒影評')
listOfFileName.append('影評')
listOfFileName.append('觀後感')
sw = listOfFileName
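The post does not show the word-cloud step itself, so here is a minimal sketch of the part it depends on: filtering stop words and counting what is left. It assumes the review text has already been segmented into tokens (for Chinese, a segmenter such as jieba would normally do that; it is not shown here), and the names `STOP_WORDS` and `word_frequencies` are hypothetical.

```python
from collections import Counter

# Stop words copied from the snippet above; extend as needed.
STOP_WORDS = {'急先鋒', '急先鋒觀後感', '急先鋒影評', '影評', '觀後感'}


def word_frequencies(tokens, stop_words=STOP_WORDS):
    """Count tokens, ignoring empty strings and stop words.

    Assumes `tokens` is an iterable of already-segmented words.
    """
    return Counter(t for t in tokens if t and t not in stop_words)
```

The resulting counter is the kind of word-to-count mapping that a word-cloud library can draw from; with the `wordcloud` package, for example, it could be passed to `WordCloud.generate_from_frequencies`, given a font that supports Chinese.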
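Drop the filler reviews that only muddy the waters.
Reading through the reviews, most people still found the film quite enjoyable. The complaints are that Jackie Chan has aged, there are few fight scenes left, and a lot of young heartthrobs have been added.
Next, let's look at why the rating is so low.
The page shows 15,570 entries, but Douban only lets us fetch 100 of them, which is far too few to learn anything, so I scrape all of the film's full reviews instead.
Most of them are five-star reviews, but Douban hides them.
Douban's explanation:
Never mind, let's scrape them and analyze them ourselves.
Almost all of them give full stars, yet that still does not save the film from a 4.9 score. Tomorrow I will scrape the other National Day hits to see just how good they are to be rated so high.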
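A tally like the one described here could be sketched as follows. The mapping from rating labels to stars is an assumption (it reflects the labels Douban normally puts in the `title` attribute, and the post does not list them), and `summarize_ratings` is a hypothetical helper. The plain average it returns is not Douban's published score, which is computed by Douban's own method.

```python
from collections import Counter

# Assumed mapping from Douban's rating labels to star counts; not stated in the post.
STARS = {'力荐': 5, '推荐': 4, '还行': 3, '较差': 2, '很差': 1}


def summarize_ratings(labels):
    """Return (stars -> count, mean stars) for the labels that are recognized."""
    counts = Counter(STARS[label] for label in labels if label in STARS)
    total = sum(counts.values())
    mean = sum(stars * n for stars, n in counts.items()) / total if total else None
    return counts, mean
```

Feeding it the lines of the scraped ratings file, for example `summarize_ratings(line.strip() for line in open('評分.txt', encoding='utf-8'))`, gives the distribution of stars among the scraped reviews.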
Finally, the source code:
import csv
import re

import scrapy


class Xianfeng(scrapy.Spider):
    name = 'xianfeng'

    def start_requests(self):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                          '(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
        }
        # Review list pages, 20 reviews per page
        urls = ['https://movie.douban.com/subject/27195078/reviews?start=%d' % page
                for page in range(0, 720, 20)]
        for url in urls:
            yield scrapy.Request(url, headers=headers, callback=self.parse)

    def parse(self, response):
        # Reviewer names
        mingzi = response.css('.review-list a:nth-of-type(2)::text').extract()
        mingzi1 = []
        for mingzi2 in mingzi:
            d = re.findall(r'[^ \n]', mingzi2)  # keep every character except spaces and newlines
            d = ''.join(d)  # join the characters back into one string
            mingzi1.append(d)
        while '' in mingzi1:
            mingzi1.remove('')

        # Star ratings (the title attribute of the rating span) and review times
        pingfen1 = response.css('.review-list span:nth-of-type(1)::attr(title)').extract()
        shijian1 = response.css('.review-list span:nth-of-type(2)::text').extract()

        # Review summaries
        jianping = response.css('.main-bd a:nth-of-type(1)::text').extract()
        jianping1 = []
        for jianping2 in jianping:
            r = re.findall(r'[^ \n]', jianping2)  # keep every character except spaces and newlines
            r = ''.join(r)  # join the characters back into one string
            jianping1.append(r)
        while '' in jianping1:
            jianping1.remove('')
        while '展開' in jianping1:
            jianping1.remove('展開')  # drop the "expand" link text

        # Full review text (disabled):
        # duanping = response.xpath('//*[@id ="link-report"]/div/p//text()').extract()

        print('---------write------------------')
        # Append the ratings, one per line
        with open('評分.txt', 'a+', encoding='utf-8') as f:
            for pingfen2 in pingfen1:
                f.write(pingfen2 + '\n')

        # Append the review summaries, one per line
        with open('簡評.txt', 'a+', encoding='utf-8') as f1:
            for jianping2 in jianping1:
                f1.write(jianping2 + '\n')

        # Append one CSV row per review; zip stops at the shortest list,
        # so a page with a missing field cannot raise IndexError
        with open('xianfeng.csv', 'a', encoding='utf-8-sig', newline='') as f:
            writer = csv.writer(f)
            for row in zip(mingzi1, pingfen1, shijian1, jianping1):
                writer.writerow(row)