Douban Movie Search Crawler: A Small Scraper Suitable for Beginners

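I have been learning Python recently. I had already written a simple image scraper in Python, and today I wanted to try scraping movie information from Douban, which led to the program below: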

# coding: utf-8
# Python 2 script. Output is encoded as GBK for a Chinese-locale Windows CMD console.
import re
import urllib
from bs4 import BeautifulSoup

DOUBAN_SEARCH_URL = "http://movie.douban.com/subject_search?search_text="

def movieSearch(movieName):
    # quote() percent-encodes spaces and non-ASCII bytes in the movie name.
    data = urllib.urlopen(DOUBAN_SEARCH_URL + urllib.quote(movieName)).read()
    # The first result link is taken to be the movie's detail page.
    r = re.findall(r'<a class="nbg" href=(.*?) onclick', data)
    if not r:
        print "No results found"
        return
    realy_url = r[0].replace('"', '')
    movieData = urllib.urlopen(realy_url).read()
    # Name the parser explicitly so bs4 does not have to guess one.
    soup = BeautifulSoup(movieData, "html.parser")
    movieSummary = soup.find_all("span", {'property': 'v:summary'})
    # Earlier regex attempt at the summary, replaced by BeautifulSoup above:
    #movieSummaryText = re.findall(r'<span property="v:summary" class="">(\W*.*\W*.*?)</span>',movieData)
    movie = re.findall(r'name="title" value="(.*?)"', movieData)
    people = re.findall(r'name="desc" value="(.*?)"', movieData)
    imdb = re.findall(r'</span> <a href="(.*?)" target=', movieData)
    # Match any runtime value instead of the hard-coded content="109".
    runtime = re.findall(r'<span property="v:runtime" content="\d+">(.*?)</span>', movieData)
    print "IMDb link"
    print imdb[0]
    print "Douban link"
    print realy_url
    print '*' * 80
    # movieData is UTF-8 bytes; re-encode as GBK so the console can display it.
    print movie[0].decode('utf-8').encode('gbk', 'replace')
    print people[0].decode('utf-8').encode('gbk', 'replace')
    if runtime:
        print runtime[0].decode('utf-8').encode('gbk', 'replace')
    print "Summary"
    print '*' * 80
    # get_text() returns the text without the surrounding <span> markup.
    print movieSummary[0].get_text().strip().encode('gbk', 'replace')

if __name__ == '__main__':
    while True:
        arg = raw_input("Choose an option:\n1: Search for a movie\n2: Quit\n")
        if arg == '1':
            movieName = raw_input("Enter a movie name: ").strip()
            print "Searching..."
            movieSearch(movieName)
        else:
            print "Exiting"
            break
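
A movie name containing spaces or Chinese characters needs to be percent-encoded before it is appended to the search URL. Here is a minimal sketch of that step, written for Python 3 as an assumption (the program above is Python 2, where the corresponding function is `urllib.quote`). The helper name `build_search_url` is made up for illustration, and whether Douban expects UTF-8 in the query string is also an assumption.

```python
from urllib.parse import quote

# Search endpoint taken from the program above.
SEARCH_URL = "http://movie.douban.com/subject_search?search_text="


def build_search_url(movie_name):
    """Return the search URL with the movie name percent-encoded as UTF-8."""
    # quote() encodes the text as UTF-8 by default and escapes unsafe characters.
    return SEARCH_URL + quote(movie_name.strip())


if __name__ == "__main__":
    print(build_search_url("The Matrix"))
```

Keeping URL construction in its own small function also makes it easy to check without touching the network.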


While debugging, I ran into two annoying problems:

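1. BeautifulSoup's encoding did not match the CMD console's encoding. Everything BeautifulSoup returns from the page is Unicode, but the CMD console I was using does not display Unicode and only handles GBK, so showing Chinese text became a real problem. I fell back on the clumsiest approach and converted the encoding by hand. I have not found a better way yet, so pointers are welcome.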

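2. When writing the regular expressions, I did not know how to match <br /> tags, especially when the page text is split across several lines and paragraphs. How should text like that be matched?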
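
One possible way to approach both problems is sketched below, not as a definitive answer. It assumes Python 3 and uses only the standard library (the program above is Python 2). The sample HTML and the names `summary_paragraphs` and `console_safe` are made up for illustration. For the second problem, `re.DOTALL` lets `.` match newlines so a group can span several lines, and a separate pattern splits the text on `<br>` variants. For the first problem, text is encoded with `errors="replace"` so characters the console encoding cannot represent become `?` instead of raising an error.

```python
import re
import sys

# Hypothetical sample of the markup in question: a summary span whose text
# is spread over several lines and separated by <br> tags.
SAMPLE_HTML = (
    '<span property="v:summary" class="">\n'
    '    First paragraph of the plot.\n'
    '    <br />\n'
    '    Second paragraph of the plot.<br>Third.\n'
    '</span>'
)

# re.DOTALL makes "." match newlines, so the non-greedy group can span lines.
SUMMARY_RE = re.compile(r'<span property="v:summary"[^>]*>(.*?)</span>', re.DOTALL)
# Matches <br>, <br/> and <br /> in any letter case.
BR_RE = re.compile(r'<br\s*/?>', re.IGNORECASE)


def summary_paragraphs(html):
    """Return the summary text as a list of paragraphs, or [] if it is absent."""
    match = SUMMARY_RE.search(html)
    if match is None:
        return []
    parts = BR_RE.split(match.group(1))
    # Collapse runs of whitespace and drop pieces that are empty.
    return [" ".join(part.split()) for part in parts if part.strip()]


def console_safe(text, encoding=None):
    """Replace characters that the given (or console) encoding cannot represent."""
    encoding = encoding or sys.stdout.encoding or "utf-8"
    return text.encode(encoding, errors="replace").decode(encoding)


if __name__ == "__main__":
    for paragraph in summary_paragraphs(SAMPLE_HTML):
        print(console_safe(paragraph))
```

If BeautifulSoup is already in use, calling `get_text()` on the tag found by `find_all` is another way to get the text without matching `<br>` by hand.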