Python crawler: log in to Douban and scrape Douban movie short reviews
阿新 · Published 2019-01-25
The goal of this crawler is to scrape Douban movie short reviews together with their ratings (1 to 5 stars), which can then be used for sentiment classification. Without logging in, only the first few pages of a movie's short reviews are visible, so the crawler has to log in to Douban first.
The Douban login part is adapted from code I found online; I no longer remember the source.
# -*- coding: utf-8 -*-
# Python 2 script: log in to Douban, then crawl the short reviews of one movie.
from bs4 import BeautifulSoup
import re
from time import sleep
import random
import urllib
import requests

# Login form posted to Douban (fill in your own account).
url = "https://douban.com/accounts/login"
formData = {
    "redir": "https://www.douban.com",
    "form_email": "**************",
    "form_password": "**************",
    "login": u'登入',
    'source': 'None',
}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.87 Safari/537.36",
    "Referer": "https://douban.com/accounts/login",
    "Host": "accounts.douban.com",
    "Connection": "Keep-Alive",
    "Content-Type": "application/x-www-form-urlencoded"
}
s = requests.session()

# Uncomment this block once Douban starts asking for a captcha:
# it downloads the captcha image and prompts for the answer.
# r_ = s.post(url, data=formData, headers=headers)
# a = r_.text
# soup_ = BeautifulSoup(a, "html.parser")
# captchaAddr = soup_.find('img', id='captcha_image')['src']
# reCaptchaID = r'<input type="hidden" name="captcha-id" value="(.*?)"/'
# captchaID = re.findall(reCaptchaID, a)
# urllib.urlretrieve(captchaAddr, "captcha.jpg")
#
# captcha = raw_input('please input the captcha:')
# formData['captcha-solution'] = captcha
# formData['captcha-id'] = captchaID

r_ = s.post(url, data=formData, headers=headers)
page_ = r_.text
# print page_
# co = r_.cookies

"""---------------------------------------------------------------------------------"""
number = 0


def process_h3(soup, fp):
    """Write the rating and text of every review on the page to fp;
    reviews without a star rating are skipped."""
    global number
    h3s = soup.findAll("h3")
    for i in h3s:
        aa = i.span.next_siblings
        bb = aa.next().next()
        number += 1
        if number % 100 == 0:
            print number
        if len(bb) == 4:  # four children only when the review has a star rating
            # the rating class looks like "allstar40": the slice keeps the "4"
            fp.write(bb[2].attrs["class"][0][-2:-1])
            fp.write(" ")
            cc = i.next_siblings
            cc.next()
            dd = cc.next().get_text().strip()
            ee = dd.replace('\n', " ")
            fp.write(ee.encode("utf8"))
            # print ee.encode("utf8")
            fp.write('\n')


def find_next(soup):
    """Return the URL of the next page of reviews, or None on the last page."""
    line = soup.findAll("a", {"class": "next"})
    if len(line) == 0:
        return None
    else:
        href = line[0].attrs["href"]
        return target + href


"""---------------------------------------------------------------------------------"""
target = "https://movie.douban.com/subject/25944714/comments"
"""------------------------------------------------------------------------ params"""
movie = s.get(target)  # , cookies=co)
page_movie = movie.text
# print page_movie
# print movie.status_code
soupMovie = BeautifulSoup(page_movie)
numb_ = soupMovie.findAll("ul", {"class": "fleft"})
print "total:", re.findall('(\d+)', numb_[0].text)[0]
movieName = soupMovie.find("title").get_text()[:-3]
print movieName

with open(movieName + ".txt", 'w') as fp:
    process_h3(soupMovie, fp)
    while True:
        # random pause between pages: Gaussian(mean 9 s, std 2 s), at least 2.1 s
        inter = random.gauss(9, 2)
        time = inter if inter > 2.1 else 2.1
        sleep(time)
        next_ = find_next(soupMovie)
        if next_ is None:
            break
        try:
            soupMovie = BeautifulSoup(s.get(next_, timeout=10).text)
            process_h3(soupMovie, fp)
        except:
            # retry once after a long pause; give up after two consecutive failures
            sleep(100)
            try:
                soupMovie = BeautifulSoup(s.get(next_, timeout=10).text)
                process_h3(soupMovie, fp)
            except:
                break
At first Douban lets you log in without a captcha, but after a number of logins it starts requiring one. When that happens, uncomment the following block: the program downloads the captcha image, then prompts you to type in what the image shows.

# r_ = s.post(url, data=formData, headers=headers)
# a = r_.text
# soup_ = BeautifulSoup(a, "html.parser")
# captchaAddr = soup_.find('img', id='captcha_image')['src']
# reCaptchaID = r'<input type="hidden" name="captcha-id" value="(.*?)"/'
# captchaID = re.findall(reCaptchaID, a)
# urllib.urlretrieve(captchaAddr, "captcha.jpg")
# captcha = raw_input('please input the captcha:')
# formData['captcha-solution'] = captcha
# formData['captcha-id'] = captchaID

Each run of the crawler scrapes all short reviews of a single movie; you have to specify the URL of the first page of that movie's short reviews, as in the snippet below.
"""---------------------------------------------------------------------------------"""
target = "https://movie.douban.com/subject/25944714/comments"
"""------------------------------------------------------------------------ params"""
On startup the crawler prints the total number of short reviews and the movie title, and it prints a progress message after every 100 reviews. The results are saved to a .txt file named after the movie. (Note: reviews that have text but no star rating are skipped.)
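For reference, the rating is recovered from the CSS class of the rating element: judging from the [-2:-1] slice in the script, the class ends with the star count followed by a zero, e.g. "allstar40" for 4 stars. A standalone sketch of that extraction (the helper name is mine, not from the post):

import re

def extract_rating(class_names):
    """Return the 1-5 star rating encoded in a class such as 'allstar40', or None."""
    for name in class_names:
        m = re.match(r"allstar([1-5])0$", name)
        if m:
            return int(m.group(1))
    return None  # no rating class: the crawler above skips such reviews

print(extract_rating(["allstar40", "rating"]))  # 4
print(extract_rating(["comment-time"]))         # None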
Starting from the specified first page, the program finds the link to the next page after each page it scrapes and follows it, so it does not actually have to start from page one; any page will do. After finishing a page it waits for a while, with the wait drawn from a Gaussian distribution with mean 9 and standard deviation 2 (seconds), so that the crawling pattern looks less regular.
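The same delay, pulled out as a small helper (a sketch; the function name is mine, the numbers are the ones used in the script):

import random
from time import sleep

def polite_sleep(mean=9.0, std=2.0, minimum=2.1):
    """Sleep for a Gaussian-distributed interval, clamped to a lower bound."""
    wait = max(random.gauss(mean, std), minimum)
    sleep(wait)
    return wait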
For many movies the last few pages of short reviews cannot be accessed, so the program stops after two consecutive failed requests.
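The stop condition, written out on its own (a sketch around a hypothetical fetch_page(url) callable that raises on failure, mirroring the nested try/except in the script):

from time import sleep

def fetch_with_one_retry(fetch_page, url):
    """Try a page twice; two consecutive failures mean the crawl should stop."""
    try:
        return fetch_page(url)
    except Exception:
        sleep(100)              # same long back-off as the script above
        try:
            return fetch_page(url)
        except Exception:
            return None         # caller breaks out of the crawl loop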
The scraped data ends up as one review per line: the star rating (1 to 5), a space, then the review text.
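To feed the file into the sentiment-classification task mentioned at the top, it can be read back like this (a sketch; the function name is mine, and the file layout is taken from how the script writes each line):

import io

def load_reviews(path):
    """Return (rating, text) pairs from a file written by the crawler above."""
    samples = []
    with io.open(path, encoding="utf8") as fp:
        for line in fp:
            rating, _, text = line.partition(" ")
            if rating.isdigit() and text.strip():
                samples.append((int(rating), text.strip()))
    return samples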
Please credit the source when reposting. If you run into problems, leave a comment.