Python crawler (scraping high-resolution images from Fengniao): empty pages and error handling
阿新 • Published: 2019-02-04
# Python 2 code
__author__ = 'AllenMinD'
import requests, urllib, os
from bs4 import BeautifulSoup

ans = 1  # counter for downloaded pictures
for page in range(0, 43):
    flag = 1  # 1 if the page has a picture, 0 if it is empty
    if page < 10:
        url = 'http://bbs.fengniao.com/forum/pic/slide_101_8903443_8017670' + str(page) + '.html'
    else:
        url = 'http://bbs.fengniao.com/forum/pic/slide_101_8903443_801767' + str(page) + '.html'
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'lxml')
    file_name = ''
    download_link = []
    for pic_tag in soup.find_all('a'):
        if pic_tag.get('href') == '/forum/8903443.html':
            file_name = pic_tag.get('title')
        if pic_tag.get('class') == ['pictureDownload']:
            if pic_tag.get('href') == '':  # this page has no picture
                flag = 0
                break
            else:
                download_link.append(pic_tag.get('href'))
    if flag == 0:  # empty page, move on to the next one
        continue
    folder_path = 'D:/spider_things/2016.4.8/' + file_name + '/'
    if not os.path.exists(folder_path):
        os.makedirs(folder_path)
    for item in download_link:
        try:
            urllib.urlretrieve(item, folder_path + str(ans) + '.jpg')
            print 'you have downloaded', ans, 'pic(s)'
            ans = ans + 1
        except urllib.ContentTooShortError, e:  # the download was cut short, skip it
            continue
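This time I am again scraping images from Fengniao (蜂鳥網), but I ran into two new problems along the way:
1. Empty pages:
In some Fengniao galleries the image page links are not numbered consecutively, so some URLs in the range lead to a page with no image. An if statement is needed to skip those pages: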
if pic_tag.get('href') == '':  # if this page is None
    flag = 0
    break
.....
if flag == 0:  # this page is None
    continue
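The empty-page check can also be pulled into a small helper. The following is a minimal Python 3 sketch, not the original script: it swaps BeautifulSoup for the standard library's html.parser so that it runs without extra packages, and the names DownloadLinkParser and extract_download_links are hypothetical. It assumes, as the script does, that the download anchor carries the class pictureDownload and that an empty href marks a page without an image.

```python
from html.parser import HTMLParser


class DownloadLinkParser(HTMLParser):
    """Collects href values of <a class="pictureDownload"> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.empty_page = False

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attr_map = dict(attrs)
        # Mirror the script: the class list must be exactly ['pictureDownload'].
        if (attr_map.get('class') or '').split() != ['pictureDownload']:
            return
        href = attr_map.get('href') or ''
        if href == '':
            # An empty download link means this page has no image.
            self.empty_page = True
        else:
            self.links.append(href)


def extract_download_links(html):
    """Return the list of download links, or None if the page is empty."""
    parser = DownloadLinkParser()
    parser.feed(html)
    if parser.empty_page:
        return None
    return parser.links
```

In the main loop, a None result would play the role of flag == 0: the page is skipped with continue.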
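2. Error handling:
While downloading the images, some downloads failed with urllib.ContentTooShortError. urllib.urlretrieve raises this error when it receives less data than the server's Content-Length header announced, that is, when the download was cut short; in my case this seemed to happen with the larger images.
The fix is to wrap the download in try...except and skip the image that failed.
The form of try...except (in the Python 2 syntax used here) is: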
try:
    ......
except ErrorType (e.g. urllib.ContentTooShortError), e:
    ......
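For readers on Python 3, the same pattern looks slightly different: urlretrieve lives in urllib.request, ContentTooShortError in urllib.error, and the except clause uses `as` instead of a comma. Below is a minimal sketch under those assumptions. The download_all helper and its retrieve parameter are hypothetical and not part of the original script; retrieve defaults to urllib.request.urlretrieve and exists only so that the function can be tried without network access.

```python
import os
import urllib.error
import urllib.request


def download_all(links, folder_path, retrieve=urllib.request.urlretrieve):
    """Save each link as 1.jpg, 2.jpg, ... and skip downloads that are cut short.

    `retrieve` is an illustrative hook, not part of the original script.
    Returns the number of files saved.
    """
    os.makedirs(folder_path, exist_ok=True)
    saved = 0
    for link in links:
        target = os.path.join(folder_path, str(saved + 1) + '.jpg')
        try:
            retrieve(link, target)
        except urllib.error.ContentTooShortError as e:
            # Fewer bytes arrived than Content-Length announced; skip this image.
            print('skipped', link, '-', e)
            continue
        saved += 1
        print('you have downloaded', saved, 'pic(s)')
    return saved
```

As in the original loop, the counter only advances after a successful download, so a skipped image does not leave a gap in the file numbering.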