Scraping movie information from Douban
阿新 • Published: 2018-04-30
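I am just starting out with Python web scraping, so I tried scraping the information for a single movie on Douban myself. Here is the source code: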
import re

import requests
from bs4 import BeautifulSoup


class movie:
    def __init__(self):
        self.url = "https://movie.douban.com/subject/25933890/?tag=%E7%83%AD%E9%97%A8&from=gaia_video"
        # Browser-like User-Agent so the request is not rejected outright.
        self.head = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
        }

    def getpag(self):
        # Headers must be passed by keyword; as a second positional
        # argument, requests would treat them as query parameters.
        req = requests.get(self.url, headers=self.head)
        html = req.content
        html = html.decode('utf-8')
        return html

    def gettit(self, page):
        # Extract the title and the rating from the raw HTML with regular expressions.
        title = r'<span property="v:itemreviewed">(.+?)</span>'
        power = r'<strong class="ll rating_num" property="v:average">(.+?)</strong>'
        tit = re.findall(title, page)
        powe = re.findall(power, page)
        tit = str(tit)
        print(tit, '\n')
        print("Douban rating:", powe, '\n')

    def getinfo(self, page):
        # Print the text of every <div class="info"> block.
        soup = BeautifulSoup(page, "lxml")
        infor = soup.find_all('div', 'info')
        for info in infor:
            print(info.get_text())

    def getping(self, page):
        # Print each short comment: the author name first, then the comment text.
        soup = BeautifulSoup(page, "lxml")
        ping = soup.find_all('div', 'comment')
        for pin in ping:
            pname = pin.find_all('span', class_='comment-info')
            for pnam in pname:
                pn = pnam.find_all('a')
                for p in pn:
                    print(p.get_text())
            arg = pin.find_all('p')
            for ar in arg:
                print(ar.get_text())

    def start(self):
        page = self.getpag()
        self.gettit(page)
        self.getinfo(page)
        self.getping(page)


movie().start()
The scrape succeeded.
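I used the BeautifulSoup library. It parses HTML and organizes it by tag, and it can also read each tag's attributes. You can look up the details yourself; for web scraping it is very powerful.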
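As a small illustration of looking tags up by name and reading their attributes, here is a minimal sketch. The HTML fragment, the sample title, the rating, and the user links are all made up for the example and are not Douban's real page. It also uses Python's built-in `html.parser` backend rather than `lxml`, so that nothing beyond `bs4` needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, loosely modelled on the tags the script above looks for.
SAMPLE_HTML = """
<div id="content">
  <span property="v:itemreviewed">Sample Movie</span>
  <strong class="ll rating_num" property="v:average">8.5</strong>
  <a href="/people/alice/" class="user">alice</a>
  <a href="/people/bob/" class="user">bob</a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns the first matching tag; attrs filters on arbitrary attributes.
title_tag = soup.find("span", attrs={"property": "v:itemreviewed"})
title = title_tag.get_text()

# class is a multi-valued attribute, so tag["class"] is a list of class names.
rating_tag = soup.find("strong", class_="rating_num")
rating_classes = rating_tag["class"]

# find_all() collects every tag with the given name; tag["href"] reads one attribute.
links = [a["href"] for a in soup.find_all("a")]

print(title, rating_tag.get_text(), rating_classes, links)
```

Compared with the regular expressions in `gettit`, this kind of lookup does not depend on the exact order of attributes inside the tag, which may make it a little more tolerant of small markup changes.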
The idea behind my code is to use BeautifulSoup together with for loops, searching down one layer at a time until I reach the data I want.
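The layer-by-layer search can be sketched on its own as follows. The markup below is a hypothetical stand-in for a comment list (the class names follow the ones used in the script, but the authors and comment text are invented), and `extract_comments` is an illustrative helper name, not part of the original code.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the comment layout the script searches for.
SAMPLE_HTML = """
<div class="comment">
  <span class="comment-info"><a href="#">alice</a></span>
  <p>Great film.</p>
</div>
<div class="comment">
  <span class="comment-info"><a href="#">bob</a></span>
  <p>Too long.</p>
</div>
"""


def extract_comments(html):
    """Return (author, text) pairs by searching one layer at a time."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    # Layer 1: every comment container on the page.
    for block in soup.find_all("div", class_="comment"):
        # Layer 2: the info span inside this container only.
        for info in block.find_all("span", class_="comment-info"):
            # Layer 3: the link that holds the author name.
            for link in info.find_all("a"):
                author = link.get_text(strip=True)
                # The comment body is the <p> inside the same container.
                for para in block.find_all("p"):
                    results.append((author, para.get_text(strip=True)))
    return results


comments = extract_comments(SAMPLE_HTML)
for author, text in comments:
    print(author, ":", text)
```

Calling `find_all` on a tag rather than on the whole `soup` limits each search to that tag's descendants, which is what keeps every author paired with the comment from the same container.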