Python crawler (scraping Gamersky wallpapers): fixing corrupted images
By 阿新 · Published: 2019-01-26
```python
# Python 2 script (note the print statements and urllib.urlretrieve)
__author__ = 'AllenMinD'

import requests
import urllib
from bs4 import BeautifulSoup

ans = 1  # running counter used to name the saved files
for page in range(1, 12):
    if page == 1:
        url = 'http://www.gamersky.com/ent/201603/730123.shtml'
    else:
        url = 'http://www.gamersky.com/ent/201603/730123_' + str(page) + '.shtml'

    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.109 Safari/537.36',
        'Cookie': 'BIDUPSID=5B700B9ED7BFDE99E48407F4C10FABAA; BAIDUID=05F28292EA8DA5A589737ACF26DD1B31:FG=1; PSTM=1456985091; BDUSS=1hczlEYmxKckJPbU9CRDE0R1hQcWtOOWJIQ2JQY1BRckQ2OW9kdWNnfmhTUjVYQVFBQUFBJCQAAAAAAAAAAAEAAABHG40~AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAOG89lbhvPZWaU'
    }

    source_code = requests.get(url, headers=header)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'lxml')
    # print soup.select('p > a')

    download_link = []
    for pic_tag in soup.select('p > a'):
        # print pic_tag.get('href')[52:]
        download_link.append(pic_tag.get('href')[52:])  # ********* key point *********

    folder_path = "D:/spider_things/2016.4.4/bizhi/"
    for item in download_link:
        urllib.urlretrieve(item, folder_path + str(ans) + '.jpg')
        print 'You have downloaded', ans, 'picture(s)!~'
        ans = ans + 1
```
A few days ago I wanted to scrape the Gamersky wallpapers, but when I first tried scraping them the usual way, the downloaded images came out corrupted.
So I searched online for the cause and got a hint from a website (click to open the link): my initial guess was that the image links were wrong.
Then I looked at the link the crawler had originally been downloading images from:
http://www.gamersky.com/showimage/id_gamersky.shtml?http://img1.gamersky.com/image2016/03/20160319_hc_44_10/gamersky_005origin_009_201631919596C4.jpg
But the actual link to the original image is:
http://img1.gamersky.com/image2016/03/20160319_hc_44_10/gamersky_005origin_009_201631919596C4.jpg
Finally, the problem was found: the original link carries an extra prefix, "http://www.gamersky.com/showimage/id_gamersky.shtml?".
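The slice index 52 used in the crawler is exactly the length of this unwanted wrapper prefix "http://www.gamersky.com/showimage/id_gamersky.shtml?", which is easy to verify:

```python
# The viewer-page prefix that precedes the real image URL
prefix = 'http://www.gamersky.com/showimage/id_gamersky.shtml?'
print(len(prefix))  # -> 52

# A wrapped link as found in the page's <p><a href="..."> tags
wrapped = (prefix +
           'http://img1.gamersky.com/image2016/03/20160319_hc_44_10/'
           'gamersky_005origin_009_201631919596C4.jpg')
print(wrapped[52:])  # slicing off the prefix leaves the real image URL
```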
So a simple slice recovers the correct link:
pic_tag.get('href')[52:]
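Hard-coding the index 52 works here, but it silently breaks if the wrapper URL ever changes length. As an alternative sketch (not in the original post; `real_image_url` is a hypothetical helper name), one could split on the `?` instead:

```python
def real_image_url(href):
    """Return the image URL embedded after '?' in a Gamersky viewer link.

    Falls back to the original href when no '?' is present.
    """
    _, sep, tail = href.partition('?')
    return tail if sep else href

wrapped = ('http://www.gamersky.com/showimage/id_gamersky.shtml?'
           'http://img1.gamersky.com/image2016/03/20160319_hc_44_10/'
           'gamersky_005origin_009_201631919596C4.jpg')
print(real_image_url(wrapped))
# -> http://img1.gamersky.com/image2016/03/20160319_hc_44_10/gamersky_005origin_009_201631919596C4.jpg
```

`str.partition` splits only on the first `?`, so any query characters inside the embedded image URL would be preserved.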
Problem solved at last! ~o(∩_∩)o