Automate the Boring Stuff with Python (7): Unsplash Batch Downloader
阿新 • Published: 2019-02-03
First published on my personal blog: http://rhinoc.top/post/python_7.html
Program Description
V1.0
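Enter a keyword to search for images, simulate scrolling down the page to load more of them, and once the page has finished loading, collect the image links and download the images to a specified folder.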
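The scrolling step can also stop on its own once the page stops growing, rather than after a fixed number of scrolls. The following is only a sketch of that variant, not the code used in this post: the name `scroll_until_stable` and its `pause` and `max_rounds` parameters are hypothetical, and it assumes `driver` is a Selenium WebDriver (anything with an `execute_script` method works).

```python
import time


def scroll_until_stable(driver, pause=3.0, max_rounds=20):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls performed. `pause` and `max_rounds`
    are illustrative defaults, not values taken from the post.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was appended, so stop early
        last_height = new_height
    return rounds
```

The `max_rounds` cap matters on a search page with endless results, where the height may keep growing for as long as you scroll.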
V2.0
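Fixes two serious problems in V1.0. The program now follows the second-level links to get the images, so it no longer has to wait for the first-level page to finish loading.
Topics Covered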
- selenium
- bs4
- iter_content()
- javascript
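Of the topics above, `iter_content()` is the one that keeps a large image from being held in memory all at once: the response body is read in chunks and each chunk is written straight to disk. Here is a small sketch of that pattern. The helper name `save_chunks` is hypothetical, and it accepts any iterable of byte chunks so that it can be tried without a network connection.

```python
def save_chunks(chunks, path):
    """Write an iterable of byte chunks to `path`; return the bytes written."""
    written = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                written += len(chunk)
    return written


# With requests, the chunks would come from a response, for example:
#   res = requests.get(url, stream=True)   # stream=True defers reading the body
#   res.raise_for_status()
#   save_chunks(res.iter_content(100000), "pics/photo.png")
```

Using `with` also guarantees the file is closed if a write fails midway.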
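Problems in V1.0 & How V2.0 Fixes Them
- Getting the image links requires loading the page first, so downloading the images effectively loads them a second time, wasting bandwidth and time.
- Even with ample loading time, the page never loads completely: more than a thousand images are captured, but fewer than ten of them are actually valid.
For an image crawler, failing to fetch images is about as embarrassing as it gets. So, faced with V1.0's many shortcomings, my first priority was the second problem, with optimization coming after. Fortunately, just as problems tend to arrive together, solving the second problem happened to solve the first one as well.
First, puzzled that the image links could not be found even though the page had been loading for so long, I saved the loaded page to take a look.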
# Save the rendered page so it can be inspected in a browser or editor.
with open(os.path.join('pics', 'saved.html'), 'wb') as page:
    page.write(soup.encode('utf-8'))
After opening saved.html, I found that although the image links had not loaded, the second-level links that the images point to were plainly there.
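The same check can be done by counting instead of reading the saved file by eye. The sketch below uses the standard library's `html.parser` rather than bs4, purely so that it runs without extra packages. The class name `LinkAudit`, the function `audit`, and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser


class LinkAudit(HTMLParser):
    """Count <img> tags with and without a src, and collect <a> hrefs."""

    def __init__(self):
        super().__init__()
        self.img_with_src = 0
        self.img_without_src = 0
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            if attrs.get("src"):
                self.img_with_src += 1
            else:
                self.img_without_src += 1
        elif tag == "a" and attrs.get("href"):
            self.hrefs.append(attrs["href"])


def audit(html):
    """Return (images with src, images without src, list of hrefs)."""
    parser = LinkAudit()
    parser.feed(html)
    return parser.img_with_src, parser.img_without_src, parser.hrefs


# Made-up markup: one image that never loaded, one that did.
sample = (
    '<a href="/photos/abc"><img alt="a cat"></a>'
    '<a href="/photos/def"><img src="https://example.com/x.jpg?w=100"></a>'
)
print(audit(sample))
```

If the first number is much smaller than the number of hrefs, the page is in the state described here: links present, images missing.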
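Opening a second-level link showed, as expected, exactly the image I had been unable to get.
Naturally, I wondered: if the image link cannot be obtained from the first-level page, could I get the image by following the second-level link instead? And since the page behind a second-level link contains only one image, the chance of that image loading successfully is bound to be high.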
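Two small URL operations sit behind this idea: turning a relative `href` into a full second-level URL, and dropping the query string from an image `src`. Below is a sketch using `urllib.parse` from the standard library. The function names and the example URLs are hypothetical, and this is an alternative to string concatenation and a regex, not the approach used in this post.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

BASE = "https://unsplash.com"


def detail_url(href, base=BASE):
    """Resolve a (possibly relative) href against the site root."""
    return urljoin(base, href)


def strip_query(url):
    """Return the URL without its query string or fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


print(detail_url("/photos/abc"))
print(strip_query("https://example.com/photo-123?w=1000&fit=max"))
```

Unlike a regex that looks for a literal `?`, `strip_query` also returns a sensible result when the URL has no query string at all.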
And just like that, problems solved.
Version 1 Code
#! python3
# Unsplash batch downloader, V1.0: scroll the search page, then download every <img> found on it.
import os, bs4, requests, re, time
from selenium.webdriver.chrome.options import Options
from selenium import webdriver

def scroll_down(driver, times):
    # Scroll to the bottom `times` times, pausing after each scroll so images can load.
    print('Loading', end='')
    for i in range(times):
        print('.', end='')
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(60)
    print('')

os.makedirs('pics', exist_ok=True)
keyword = input()
url = 'https://unsplash.com/search/photos/' + keyword
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
print('Setting up the driver')
# executable_path is the Selenium 3 style; newer Selenium releases take a Service object instead.
driver = webdriver.Chrome(options=chrome_options, executable_path='/usr/local/bin/chromedriver')
driver.get(url)
print('Driver prepared')
scroll_down(driver, 3)
soup = bs4.BeautifulSoup(driver.page_source, 'lxml')
driver.quit()  # the page source is all we need from the browser

# Save the rendered page for inspection.
with open(os.path.join('pics', 'saved.html'), 'wb') as page:
    page.write(soup.encode('utf-8'))

link = soup.find_all('img', {'class': '_2zEKz'})
# The total is the number that precedes the word "free" in this paragraph.
amount = str(soup.find('p', {'class': '_1u88E _1iWCF _27Bp2'}))
rr = re.compile(r'.*>(.*?)free.*')
amount = rr.findall(amount)[0].strip()
amount = int(amount.replace(',', ''))
print('Total: ' + str(amount))
print('Captured: ' + str(len(link)))

count = 0
rr = re.compile(r'(.*?)\?')  # everything before the query string
for piece in link:
    picLk = str(piece.get('src'))
    filename = str(piece.get('alt')) + '.png'
    if picLk != 'None':
        count += 1
        picLk = rr.findall(picLk)[0]
        print('Download from ' + picLk + ': ' + filename)
        picRes = requests.get(picLk)
        # Fall back to the last URL segment when the image has no alt text.
        if filename == 'None.png':
            imageFile = open(os.path.join('pics', os.path.basename(picLk) + '.png'), 'wb')
        else:
            imageFile = open(os.path.join('pics', filename), 'wb')
        for chunk in picRes.iter_content(100000):
            imageFile.write(chunk)
        imageFile.close()
print('Valid:' + str(count))
Version 2 Code
#! python3
# Unsplash batch downloader, V2.0: collect the second-level links, then fetch the image from each one.
import os, bs4, requests, re, time
from selenium.webdriver.chrome.options import Options
from selenium import webdriver

def scroll_down(driver, times):
    # Scroll to the bottom `times` times; a short pause is enough because only the links are needed.
    print('Loading', end='')
    for i in range(times):
        print('.', end='')
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(3)
    print('')

os.makedirs('pics', exist_ok=True)
keyword = input()
url = 'https://unsplash.com/search/photos/' + keyword
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
print('Setting up the driver')
# executable_path is the Selenium 3 style; newer Selenium releases take a Service object instead.
driver = webdriver.Chrome(options=chrome_options, executable_path='/usr/local/bin/chromedriver')
driver.get(url)
print('Driver prepared')
scroll_down(driver, 3)
soup = bs4.BeautifulSoup(driver.page_source, 'lxml')
driver.quit()  # the page source is all we need from the browser

link = soup.find_all('a', {'class': '_2Mc8_'})
# The total is the number that precedes the word "free" in this paragraph.
amount = str(soup.find('p', {'class': '_1u88E _1iWCF _27Bp2'}))
rr = re.compile(r'.*>(.*?)free.*')
amount = rr.findall(amount)[0].strip()
amount = int(amount.replace(',', ''))
print('Total: ' + str(amount))
print('Captured: ' + str(len(link)))

headers = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Origin': 'http://www.zimuzu.tv',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    'Content-Type': 'application/x-www-form-urlencoded',
}
count = 0
rr = re.compile(r'(.*?)\?')  # everything before the query string
for piece in link:
    # Each anchor points to a second-level page that holds a single image.
    deepLk = 'https://unsplash.com' + str(piece.get('href'))
    deepRes = requests.get(deepLk, headers=headers)
    deepSoup = bs4.BeautifulSoup(deepRes.text, 'html.parser')
    pic = deepSoup.find('img', {'class': '_2zEKz'})
    try:
        picLk = str(pic.get('src'))
    except AttributeError:
        # No matching <img> on the page: skip it instead of failing on the next line.
        print('download failed')
        continue
    filename = str(pic.get('alt')) + '.png'
    if picLk != 'None':
        count += 1
        picLk = rr.findall(picLk)[0]
        print('Download from ' + deepLk + ': ' + filename)
        picRes = requests.get(picLk)
        # Fall back to the last URL segment when the image has no alt text.
        if filename == 'None.png':
            imageFile = open(os.path.join('pics', os.path.basename(picLk) + '.png'), 'wb')
        else:
            imageFile = open(os.path.join('pics', filename), 'wb')
        for chunk in picRes.iter_content(100000):
            imageFile.write(chunk)
        imageFile.close()
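One thing to watch for in both versions: the file name comes straight from the image's alt text, which may contain characters such as `/` or `:` that cannot be used in a file name. The sketch below shows one possible way to clean it up first. The helper `safe_filename`, its parameters, and the set of replaced characters are my own assumptions, not part of the original program.

```python
import re


def safe_filename(alt, fallback, ext=".png", max_len=100):
    """Turn alt text into a file name that is safe to create.

    `fallback` is used when the alt text is missing or nothing usable remains.
    """
    name = (alt or "").strip()
    # Replace path separators, control characters and characters invalid on Windows.
    name = re.sub(r'[\\/:*?"<>|\x00-\x1f]+', "_", name)
    name = name.strip(" ._")[:max_len]
    return (name or fallback) + ext


print(safe_filename("cat / dog: portrait", "abc"))
print(safe_filename(None, "abc"))
```

In the loop, `fallback` could be `os.path.basename(picLk)`, which is what the code above already uses when the alt text is missing.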