Automate the Boring Stuff with Python (7): Unsplash Batch Downloader

阿新 • Published: 2019-02-03


First published on my personal blog: http://rhinoc.top/post/python_7.html

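Program Description

V1.0
Enter a keyword to search for images, simulate scrolling down the page to load more of them, and once the page has finished loading, collect the image links and download the images to a specified folder.

V2.0
Fixes two serious problems in V1.0. Images are now fetched by following second-level links, so there is no longer any need to wait for the first-level page to finish loading.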

Concepts Covered

selenium
bs4
iter_content()
javascript
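
Of the items above, iter_content() is the one that does the actual saving: it is a method on a requests response that yields the body as a sequence of byte chunks. The sketch below shows only the writing side. The helper name save_chunks and the fake chunk list are assumptions made for illustration, so that the example runs without any network access; in the real script the iterable would be something like picRes.iter_content(100000).

```python
import os
import tempfile


def save_chunks(chunks, path):
    """Write an iterable of byte chunks to path and return the bytes written."""
    written = 0
    with open(path, 'wb') as image_file:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                written += image_file.write(chunk)
    return written


# Stand-in for response.iter_content(100000): any iterable of bytes works.
fake_chunks = [b'\x89PNG', b'', b'rest-of-file']
target = os.path.join(tempfile.mkdtemp(), 'demo.png')
size = save_chunks(fake_chunks, target)
print(size, os.path.getsize(target))
```

Note that requests.get() reads the whole body into memory by default, so chunked writing only saves memory when the request is made with stream=True.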

Problems in V1.0 & How V2.0 Solves Them

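1. Getting the image links requires loading the page first, and downloading the images then amounts to loading them a second time, which wastes bandwidth and time.

2. Even with plenty of loading time, the page never loads completely, so more than a thousand images are captured but fewer than ten of them are actually valid.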
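
The gap between "captured" and "valid" can be measured directly by counting how many img tags carry a src attribute at all. The sketch below is one way to do that under stated assumptions: it uses the standard library's html.parser instead of bs4 so that it runs without third-party packages, and the class name ImgCounter and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class ImgCounter(HTMLParser):
    """Count <img> tags, and how many of them have a non-empty src."""

    def __init__(self):
        super().__init__()
        self.total = 0
        self.with_src = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            self.total += 1
            if dict(attrs).get('src'):
                self.with_src += 1


# Hypothetical markup: three image placeholders, only one of them loaded.
sample_html = '''
<div><img class="_2zEKz" src="https://example.com/a.jpg?w=100" alt="a"></div>
<div><img class="_2zEKz" alt="b"></div>
<div><img class="_2zEKz"></div>
'''
counter = ImgCounter()
counter.feed(sample_html)
print('Captured:', counter.total, 'Valid:', counter.with_src)
```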

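For an image crawler, failing to actually fetch images is about as embarrassing as it gets. So, faced with V1.0's many shortcomings, my first priority was the second problem, with optimization coming afterwards. Fortunately, just as problems tend to arrive together, solving the second one happened to solve the first as well.

First, puzzled that image links were missing even though the page had been loading for so long, I saved the loaded page to disk to take a look.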

with open(os.path.join('pics', 'saved.html'), 'wb') as page:
    page.write(soup.encode('utf-8'))

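Opening saved.html, I found that although the image links had not loaded, the second-level links that the images point to were plainly there.

Opening one of those second-level links led, unsurprisingly, to exactly the image I had been failing to get.
So the natural thought was: if the image link cannot be obtained from the first-level page, why not follow the second-level link to get the image? And since the page behind a second-level link contains only one image, the chance of that image loading successfully should be much higher.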
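
The first half of that idea, collecting the second-level links from the saved first-level page, can be sketched as follows. This is a minimal illustration rather than the script's actual code: it uses the standard library's html.parser and urljoin in place of bs4 so that it runs on its own, the class name DetailLinkParser is made up, and the /photos/... paths in the sample HTML are hypothetical. Only the _2Mc8_ class name comes from the V2.0 code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = 'https://unsplash.com'


class DetailLinkParser(HTMLParser):
    """Collect absolute URLs of <a> tags that carry a given CSS class."""

    def __init__(self, link_class):
        super().__init__()
        self.link_class = link_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attrs = dict(attrs)
        classes = (attrs.get('class') or '').split()
        href = attrs.get('href')
        if href and self.link_class in classes:
            # hrefs on the page are relative, so resolve them against the site root
            self.links.append(urljoin(BASE_URL, href))


# Hypothetical first-level markup: two photo links and one unrelated link.
sample_html = '''
<a class="_2Mc8_" href="/photos/abc123"><img alt="one"></a>
<a class="nav" href="/explore">Explore</a>
<a class="_2Mc8_ extra" href="/photos/def456"><img></a>
'''
parser = DetailLinkParser('_2Mc8_')
parser.feed(sample_html)
print(parser.links)
```

Each collected URL can then be requested on its own, and the single image on that page looked up.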

And with that, problems solved.

Version 1 Code

#! python3

import os, bs4, requests, re, time
from selenium.webdriver.chrome.options import Options
from selenium import webdriver

def scroll_down(driver, times):
    # Scroll to the bottom `times` times, pausing so lazy-loaded images can appear.
    print('Loading', end='')
    for i in range(times):
        print('.', end='')
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(60)
    print('')

os.makedirs('pics', exist_ok=True)

keyword = input()
url = 'https://unsplash.com/search/photos/' + keyword

chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
print('Setting up the driver')
# executable_path is how Selenium 3 locates chromedriver.
driver = webdriver.Chrome(options=chrome_options, executable_path='/usr/local/bin/chromedriver')
driver.get(url)
print('Driver prepared')

scroll_down(driver, 3)
soup = bs4.BeautifulSoup(driver.page_source, 'lxml')
driver.quit()
# Save the rendered page for inspection.
with open(os.path.join('pics', 'saved.html'), 'wb') as page:
    page.write(soup.encode('utf-8'))
# The class names below are specific to Unsplash's markup at the time of writing.
link = soup.find_all('img', {'class': '_2zEKz'})
amount = str(soup.find('p', {'class': '_1u88E _1iWCF _27Bp2'}))
rr = re.compile(r'.*>(.*?)free.*')
amount = rr.findall(amount)[0].strip()
amount = int(amount.replace(',', ''))
print('Total: ' + str(amount))
print('Captured: ' + str(len(link)))

count = 0
for piece in link:
    picLk = piece.get('src')
    if picLk is None:
        continue  # the image never loaded, so there is nothing to download
    count += 1
    picLk = picLk.split('?')[0]  # drop the query string
    alt = piece.get('alt')
    # Fall back to the last URL segment when the image has no alt text.
    filename = (alt if alt else os.path.basename(picLk)) + '.png'
    print('Download from ' + picLk + ': ' + filename)
    picRes = requests.get(picLk)
    with open(os.path.join('pics', filename), 'wb') as imageFile:
        for chunk in picRes.iter_content(100000):
            imageFile.write(chunk)

print('Valid: ' + str(count))
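
scroll_down waits a fixed amount of time for a fixed number of scrolls. An alternative worth considering is to keep scrolling until the page height stops growing. The sketch below is an assumption-laden illustration, not part of the original script: scroll_until_stable, max_rounds, and pause are names chosen here, and FakeDriver is a stand-in that imitates only the execute_script method of a Selenium driver so that the example runs without a browser.

```python
import time


def scroll_until_stable(driver, max_rounds=10, pause=3.0):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls that made the page taller.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    grown = 0
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazy-loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded
        last_height = new_height
        grown += 1
    return grown


class FakeDriver:
    """Stand-in for a Selenium driver: the page grows twice, then stops."""

    def __init__(self):
        self.heights = [1000, 2000, 3000, 3000]
        self.index = 0

    def execute_script(self, script):
        if script.startswith("return"):
            return self.heights[self.index]
        # A scroll request: pretend more content loaded, if any is left.
        self.index = min(self.index + 1, len(self.heights) - 1)


rounds = scroll_until_stable(FakeDriver(), pause=0)
print(rounds)
```

With a real driver, the same function would be called as scroll_until_stable(driver) in place of scroll_down(driver, 3).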

Version 2 Code

#! python3

import os, bs4, requests, re, time
from selenium.webdriver.chrome.options import Options
from selenium import webdriver

def scroll_down(driver, times):
    # Scroll to the bottom `times` times; only the links need to load, so a short pause is enough.
    print('Loading', end='')
    for i in range(times):
        print('.', end='')
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(3)
    print('')

os.makedirs('pics', exist_ok=True)

keyword = input()
url = 'https://unsplash.com/search/photos/' + keyword

chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
print('Setting up the driver')
# executable_path is how Selenium 3 locates chromedriver.
driver = webdriver.Chrome(options=chrome_options, executable_path='/usr/local/bin/chromedriver')
driver.get(url)
print('Driver prepared')

scroll_down(driver, 3)
soup = bs4.BeautifulSoup(driver.page_source, 'lxml')
driver.quit()
# Each <a> with this class points to a photo's own (second-level) page.
link = soup.find_all('a', {'class': '_2Mc8_'})
amount = str(soup.find('p', {'class': '_1u88E _1iWCF _27Bp2'}))
rr = re.compile(r'.*>(.*?)free.*')
amount = rr.findall(amount)[0].strip()
amount = int(amount.replace(',', ''))
print('Total: ' + str(amount))
print('Captured: ' + str(len(link)))

# Browser-like headers for the plain requests calls.
headers = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
}

count = 0
for piece in link:
    deepLk = 'https://unsplash.com' + str(piece.get('href'))
    deepRes = requests.get(deepLk, headers=headers)
    deepSoup = bs4.BeautifulSoup(deepRes.text, 'html.parser')
    pic = deepSoup.find('img', {'class': '_2zEKz'})
    picLk = pic.get('src') if pic is not None else None
    if picLk is None:
        print('download failed')
        continue  # skip this photo instead of failing on the missing image
    count += 1
    picLk = picLk.split('?')[0]  # drop the query string
    alt = pic.get('alt')
    # Fall back to the last URL segment when the image has no alt text.
    filename = (alt if alt else os.path.basename(picLk)) + '.png'
    print('Download from ' + deepLk + ': ' + filename)
    picRes = requests.get(picLk)
    with open(os.path.join('pics', filename), 'wb') as imageFile:
        for chunk in picRes.iter_content(100000):
            imageFile.write(chunk)

print('Valid: ' + str(count))
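
Both versions build the filename from the image's alt text, which can contain characters that are not allowed in file names (a slash, for instance, would be read as a directory separator). The sketch below shows one possible way to guard against that. The helper names strip_query and safe_filename, the 100-character cap, and the sample URLs are all assumptions made for illustration; the .png extension simply mirrors the scripts above.

```python
import os
import re
from urllib.parse import urlsplit


def strip_query(url):
    """Return url without its query string or fragment."""
    parts = urlsplit(url)
    return parts._replace(query='', fragment='').geturl()


def safe_filename(alt, url, ext='.png'):
    """Build a filename from alt text, falling back to the URL's last path segment."""
    # Collapse characters that are unsafe in file names, and whitespace, into one space.
    name = re.sub(r'[\\/:*?"<>|\s]+', ' ', alt or '').strip()
    if not name:
        name = os.path.basename(urlsplit(url).path)
    return name[:100] + ext  # cap the length (an arbitrary limit)


# Hypothetical image URL, shaped like one with a resize query string attached.
url = 'https://images.example.com/photo-123?w=500&q=60'
print(strip_query(url))
print(safe_filename('sunset / beach: view', url))
print(safe_filename(None, url))
```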
