
Python crawler (scraping Gamersky wallpapers): fixing the corrupted-image problem

__author__ = 'AllenMinD'
import os
import requests
from urllib.request import urlretrieve  # Python 3 (on Python 2 this was urllib.urlretrieve)
from bs4 import BeautifulSoup

ans = 1  # running counter used to name the saved pictures

for page in range(1, 12):
    if page == 1:
        url = 'http://www.gamersky.com/ent/201603/730123.shtml'
    else:
        url = 'http://www.gamersky.com/ent/201603/730123_' + str(page) + '.shtml'
    # the original code also sent a Baidu cookie, which gamersky.com does not need
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.109 Safari/537.36'}
    source_code = requests.get(url, headers=header)
    plain_text = source_code.text

    soup = BeautifulSoup(plain_text, 'lxml')

    download_link = []
    for pic_tag in soup.select('p > a'):
        # strip the 52-character viewer-page prefix to get the real image URL
        download_link.append(pic_tag.get('href')[52:])  # ********* key step *********

    folder_path = 'D:/spider_things/2016.4.4/bizhi/'
    os.makedirs(folder_path, exist_ok=True)  # make sure the target folder exists

    for item in download_link:
        urlretrieve(item, folder_path + str(ans) + '.jpg')
        print('You have downloaded', ans, 'picture(s)!~')
        ans = ans + 1

A few days ago I wanted to scrape the Gamersky wallpapers, but when I crawled them the usual way, the downloaded images came out corrupted:


So I searched online for the cause and got a hint from a post on another site (link): my initial guess was that the image links being used were wrong.

I then looked at the link the crawler was originally downloading images from:

http://www.gamersky.com/showimage/id_gamersky.shtml?http://img1.gamersky.com/image2016/03/20160319_hc_44_10/gamersky_005origin_009_201631919596C4.jpg

But the actual link to the full-size original image is:

http://img1.gamersky.com/image2016/03/20160319_hc_44_10/gamersky_005origin_009_201631919596C4.jpg


Finally, the problem was found: the scraped link has the extra viewer-page prefix "http://www.gamersky.com/showimage/id_gamersky.shtml?" prepended to the real image URL.
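A quick sanity check (plain Python, just measuring the prefix string above) shows where the magic number 52 in the crawler's slice comes from:

```python
# The viewer-page prefix that gets prepended to every image link:
prefix = 'http://www.gamersky.com/showimage/id_gamersky.shtml?'

# Its length is exactly 52 characters, which is why the crawler uses
# pic_tag.get('href')[52:] to cut it off.
print(len(prefix))  # → 52
```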

So a simple slice recovers the correct link:

pic_tag.get('href')[52:]

The problem is finally solved! ~o(∩_∩)o
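As an aside, the hard-coded [52:] slice only works as long as the viewer prefix keeps exactly that length. A sketch of a slightly more robust variant (using the example URL from this post) splits on the first '?' instead:

```python
href = ('http://www.gamersky.com/showimage/id_gamersky.shtml?'
        'http://img1.gamersky.com/image2016/03/20160319_hc_44_10/'
        'gamersky_005origin_009_201631919596C4.jpg')

# str.partition('?') returns (before, '?', after); the real image URL is
# everything after the first '?', independent of the prefix length.
real_link = href.partition('?')[2]

# Gives the same result as the 52-character slice on this URL.
assert real_link == href[52:]
print(real_link)
# → http://img1.gamersky.com/image2016/03/20160319_hc_44_10/gamersky_005origin_009_201631919596C4.jpg
```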