Douban: Notes on Scraping a Doulist
阿新 • Published: 2020-08-07
I recently organized the entries from the 2020 Douban movie desk calendar into a Douban doulist (list), but doulists don't support convenient filtering or querying, so I decided to scrape the data and filter it myself. This note is the result; the code is written in Python 3.
Implementation
The crawler itself is very simple, built on the requests_html package for Python 3. Open the doulist page and inspect its HTML: each movie's information sits inside a div[@class='bd doulist-subject']. We just need to pull out the title, rating, genre, cast, director, year, and so on.
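As a rough offline sketch of that node extraction (using lxml in place of requests_html, and a made-up HTML fragment rather than the real page markup):

```python
from lxml import html as lhtml  # lxml stands in for requests_html's XPath here

# Made-up fragment mimicking one doulist entry (only the class names
# are taken from the real page; the content is sample data):
fragment = '''
<div class="bd doulist-subject">
  <div class="title"><a>肖申克的救赎</a></div>
  <div class="rating"><span class="rating_nums">9.7</span></div>
  <div class="abstract">导演: 弗兰克·德拉邦特</div>
</div>
'''

tree = lhtml.fromstring(fragment)
# Same XPath expressions the crawler uses against the live page:
title = tree.xpath("//div[@class='bd doulist-subject']//div[@class='title']/a")[0].text
rate = float(tree.xpath("//div[@class='rating']/span[@class='rating_nums']")[0].text)
print(title, rate)
```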
The slightly tricky part is that the director, cast, genre, and production country/region all live in one div[@class='abstract'], separated only by line breaks. So we strip the newlines and then slice out each field separately.
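A minimal sketch of that slicing step, run against a hypothetical abstract string (the sample text and the ASCII colons are assumptions, not scraped data):

```python
import re

# Hypothetical abstract text after stripping newlines (sample data, not scraped):
regstr = "导演: 弗兰克·德拉邦特 主演: 蒂姆·罗宾斯 类型: 剧情/犯罪 制片国家/地区: 美国 年份: 1994"

# Each field sits between two labels, so a non-greedy group captures it.
director = re.findall(r'导演:(.*?)主演', regstr)[0].strip()
starring = re.findall(r'主演:(.*?)类型', regstr)[0].strip()
movietype = re.findall(r'类型:(.*?)制片国家/地区', regstr)[0].strip()
state = re.findall(r'制片国家/地区:(.*?)年份', regstr)[0].strip()
created = int(regstr.split('年份:')[-1])  # the year is the last field
print(director, movietype, created)
```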
After scraping, I chose to insert the data into MySQL. That part is also straightforward, so I won't describe it in detail. The complete code follows:
MySQL:
CREATE TABLE movie(
    mid int PRIMARY KEY AUTO_INCREMENT,
    title varchar(200),
    rate decimal(5,1),
    type varchar(200),
    director varchar(200),
    starring varchar(200),
    state varchar(200),
    created int
) AUTO_INCREMENT = 1 DEFAULT CHARSET = utf8mb4;  -- utf8mb4 so Chinese titles store correctly
Python:
import math
import re

import pymysql
import pyodbc
from requests_html import HTMLSession


# Alternative writer for SQL Server via pyodbc (not used in __main__ below).
class MSSQL:
    def __init__(self):
        self.server = 'mssqlserver'
        self.database = 'douban'
        self.username = 'admin'
        self.password = 'password'
        self.driver = '{ODBC Driver 13 for SQL Server}'

    def connect(self):
        connection = pyodbc.connect(
            'DRIVER=' + self.driver + ';SERVER=' + self.server +
            ';PORT=1433;DATABASE=' + self.database +
            ';UID=' + self.username + ';PWD=' + self.password)
        return connection

    def execquery(self, sqltext):
        connection = self.connect()
        cursor = connection.cursor()
        cursordata = cursor.execute(sqltext)
        return cursordata

    def execscalar(self, sqltext):
        connection = self.connect()
        cursor = connection.cursor()
        cursor.execute(sqltext)
        connection.commit()

    def insert_douban_movie(self, title, rate, director, starring,
                            movietype, countrystate, releasetime):
        sqltext = ("insert into douban_movie(title,rate,director,starring,"
                   "movietype,countrystate,releasetime) values(?,?,?,?,?,?,?)")
        connection = self.connect()
        cursor = connection.cursor()
        cursor.execute(sqltext, title, rate, director, starring,
                       movietype, countrystate, releasetime)
        connection.commit()


class MYSQL:
    def __init__(self):
        self.server = 'mysqlinstance'
        self.database = 'douban'
        self.username = 'username'
        self.password = 'password'

    def connect(self):
        # Keyword arguments: recent pymysql versions no longer accept
        # the old positional form connect(host, user, password, db).
        conn = pymysql.connect(host=self.server, user=self.username,
                               password=self.password, database=self.database,
                               charset='utf8mb4')
        return conn

    def insert_movie(self, title, rate, movietype, director, starring, state, created):
        conn = self.connect()
        cursor = conn.cursor()
        try:
            cursor.execute(
                "INSERT INTO movie(title,rate,type,director,starring,state,created) "
                "VALUES(%s,%s,%s,%s,%s,%s,%s);",
                (title, rate, movietype, director, starring, state, created))
            conn.commit()
        except Exception as e:
            conn.rollback()
            print("insert error:{error}".format(error=e))
        finally:
            cursor.close()
            conn.close()

    def test(self):
        db = self.connect()
        cursor = db.cursor()
        cursor.execute("SELECT VERSION()")
        data = cursor.fetchone()
        print("Database version : %s " % data)
        db.close()


class HtmlCrawler:
    def __init__(self):
        self.session = HTMLSession()

    def get_doulist(self, doulist_url):
        r = self.session.get(doulist_url)
        page_size = 25
        # The header shows the total entry count in parentheses, e.g. "(123)".
        total_number = int(r.html.xpath(
            "//div[@class='doulist-filter']/a/span")[0].text
            .replace('(', '').replace(')', ''))
        total_page = math.ceil(total_number / page_size)
        for i in range(total_page):
            self.get_movies(doulist_url + '/?start=' + str(i * page_size))

    def get_movies(self, doulist_url):
        r = self.session.get(doulist_url)
        movies_title = r.html.xpath(
            "//div[@class='bd doulist-subject']//div[@class='title']/a")
        movies_rate = r.html.xpath(
            "//div[@class='bd doulist-subject']//div[@class='rating']"
            "/span[@class='rating_nums']")
        movies_abstract = r.html.xpath(
            "//div[@class='bd doulist-subject']//div[@class='abstract']")
        for i in range(len(movies_title)):
            regstr = movies_abstract[i].text.strip().replace('\n', '')
            # Field labels as they appear (in simplified Chinese) on the page.
            director = re.findall(r'导演:(.*?)主演', regstr)[0].strip()
            starring = re.findall(r'主演:(.*?)类型', regstr)[0].strip()
            movietype = re.findall(r'类型:(.*?)制片国家/地区', regstr)[0].strip()
            state = re.findall(r'制片国家/地区:(.*?)年份', regstr)[0].strip()
            created = int(regstr.split('年份:')[-1])
            MYSQL().insert_movie(movies_title[i].text, float(movies_rate[i].text),
                                 movietype, director, starring, state, created)


if __name__ == "__main__":
    url = 'https://www.douban.com/doulist/122330446'
    HtmlCrawler().get_doulist(url)
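The paging arithmetic in get_doulist can be checked in isolation; a minimal sketch with a made-up total of 143 entries:

```python
import math

page_size = 25
total_number = 143  # hypothetical entry count parsed from the doulist header
total_page = math.ceil(total_number / page_size)

# Douban doulists page with a ?start= offset, 25 entries per page.
base = 'https://www.douban.com/doulist/122330446'
urls = [base + '/?start=' + str(i * page_size) for i in range(total_page)]
print(total_page, urls[0], urls[-1])
```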
Statistics and Analysis
I originally planned to just write a few ad-hoc queries and call it done, but that didn't suit my lazy habits either, so I loaded the data into Power BI and built a simple query page instead. The finished result looks like this:
The search box at the top is a Text Filter. I had also planned to add slicers (Hierarchy Slicer), but that would have meant splitting and deduplicating the cast, director, region, and genre fields in the source data first, which seemed like too much trouble, so I skipped it.
Gripes
QR scanning on the Xiaomi Mi Mix 3 is terrible. While using Douban's scanner to log movie info, the camera would hang after ten or so scans, and only a phone reboot would fix it. An iPhone 7 and a Huawei Mate 30 both worked fine, so it's clearly not Douban's problem.