Scraping the Visual China (視覺中國) Website with Python

阿新 • Published: 2019-01-05

Scraping the Visual China designer community site with Python
Reading time: about 5 minutes

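Copyright matters more and more, especially this year. A few days ago I saw a message in one of my groups: some companies specifically claim compensation from people who used certain images without knowing they were protected. So a word of advice: before you use an image, check whether doing so infringes anyone's copyright.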

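With that in mind, let's learn how to crawl the design images on the Visual China site and download them in bulk with about 100 lines of Python. All images are used for learning only, not for commercial purposes.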

First, analyze the page

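Enter the search term "World Cup" (世界杯) and press Enter, then open the browser's developer tools and find the request. The page source does not contain what we want: for example, the caption 「當端午遇上世界盃」 shown on one of the images is nowhere in the HTML. The data is therefore most likely delivered via ajax. The XHR tab of the developer tools does show such a request, and its Preview pane contains a series of JSON records. Looking through them, everything we need, such as the image URL and the image title, can be read from this JSON.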

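With the analysis done, we can start writing the crawler.
First, fetch the JSON data returned by the ajax request. The urlencode() function from Python's urllib library turns a dict of parameters into a URL query string.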

For example, with page=1 the parameters encode as follows, which gives the full URL of the ajax request:

http://www.shijue.me/community/search?size=20&orderby=rating&key=%E4%B8%96%E7%95%8C%E6%9D%AF&type=json&page=1&license=-1
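The query-string step can be checked on its own, without sending a request. The following is a minimal sketch under a few assumptions: the helper name `build_search_url` and the constant `BASE_URL` are illustrative and not part of the article, and the keyword is written in simplified characters (世界杯), which is what `%E4%B8%96%E7%95%8C%E6%9D%AF` in the URL above decodes to.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = 'http://www.shijue.me/community/search?'  # illustrative constant


def build_search_url(page, key='世界杯'):
    """Build the search URL for one result page (hypothetical helper)."""
    parameters = {
        'key': key,
        'type': 'json',
        'page': page,
        'size': '20',
        'license': -1,
        'orderby': 'rating',
    }
    # urlencode() percent-encodes non-ASCII values as UTF-8
    # and joins the key=value pairs with '&'.
    return BASE_URL + urlencode(parameters)


url = build_search_url(1)
print(url)
# The parameter order can differ from the URL shown above,
# but the parsed query is the same set of key/value pairs.
print(parse_qs(urlsplit(url).query))
```

Parameter order in a query string normally does not matter to the server, so comparing the parsed query is a more reliable check than comparing the raw strings.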
import requests
from urllib.parse import urlencode

def get_content(page):
    parameters = {
        'key': '世界杯',       # search keyword ("World Cup"), as encoded in the URL above
        'type': 'json',
        'page': page,
        'size': '20',         # size: number of image groups per page
        'license': -1,
        'orderby': 'rating',  # orderby: sort by rating
    }
    # Build the GET query string with urlencode(), request the URL with requests,
    # and return the parsed JSON body if the status code is 200.
    url = 'http://www.shijue.me/community/search?' + urlencode(parameters)
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError:
        return None
    return None
Once we have the JSON, parse it to get each item's title and image URL, then save the image locally. Each file is named after the MD5 hash of its content, which keeps the downloaded file names unique.

def get_images(json):
    # Yield the title and image URL of every item in dataArray.
    if json.get('dataArray'):
        for item in json.get('dataArray'):
            title = item.get('title')
            image = item.get('url')
            yield {
                'image': image,
                'title': title
            }
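To see what the generator produces without touching the network, it can be run on a hand-written payload. This is only a sketch: the field names `dataArray`, `title`, `url` and `uploaderName` come from the article, while the sample values and the `example.com` addresses are made up for illustration.

```python
def get_images(payload):
    """Yield {'image', 'title'} for each entry in the response's dataArray."""
    for item in payload.get('dataArray') or []:
        yield {'image': item.get('url'), 'title': item.get('title')}


# Hand-written stand-in for one response; the values are invented.
sample = {
    'dataArray': [
        {'title': 'poster-a', 'url': 'http://example.com/a.jpg', 'uploaderName': 'alice'},
        {'title': 'poster-b', 'url': 'http://example.com/b.jpg', 'uploaderName': 'bob'},
    ]
}

records = list(get_images(sample))
print(records)

# A response without dataArray simply yields nothing.
print(list(get_images({})))
```

Because the function is a generator, nothing is parsed until the caller iterates over it, which suits a loop that downloads one image at a time.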
import os
from hashlib import md5

'''
save_image() saves one image. item is one of the dicts yielded above.
A folder named after the item's title is created, then the image is requested
and its binary data is written to a file. The file name is the MD5 hash of the
content, which avoids saving duplicates.
'''
def save_image(item):
    if not os.path.exists(item.get('title')):
        os.mkdir(item.get('title'))
    try:
        response = requests.get(item.get('image'))
        if response.status_code == 200:
            file_path = '{0}/{1}.{2}'.format(item.get('title'), md5(response.content).hexdigest(), 'jpg')
            if not os.path.exists(file_path):
                with open(file_path, 'wb') as f:
                    f.write(response.content)
            else:
                print('Already downloaded', file_path)
    except requests.ConnectionError:
        print('Failed to save image')
After running it, the downloaded images appear in the local folders.
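The de-duplication effect of naming files by their MD5 hash can be shown with plain bytes instead of real downloads. A minimal sketch, assuming a hypothetical helper `save_bytes` and a temporary folder; neither is part of the article's code.

```python
import os
import tempfile
from hashlib import md5


def save_bytes(folder, content, ext='jpg'):
    """Write content to folder/<md5>.<ext>; return (path, written)."""
    os.makedirs(folder, exist_ok=True)
    name = '{0}.{1}'.format(md5(content).hexdigest(), ext)
    file_path = os.path.join(folder, name)
    if os.path.exists(file_path):
        # Identical content hashes to the same name, so it is skipped.
        return file_path, False
    with open(file_path, 'wb') as f:
        f.write(content)
    return file_path, True


folder = os.path.join(tempfile.mkdtemp(), 'demo-title')
first = save_bytes(folder, b'same image bytes')
second = save_bytes(folder, b'same image bytes')       # duplicate content
third = save_bytes(folder, b'different image bytes')

print(first[1], second[1], third[1])
print(sorted(os.listdir(folder)))
```

The second call writes nothing because its content is byte-for-byte identical to the first, so only two files end up in the folder.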

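As an improvement, store the data in a MongoDB database.

MongoDB is a non-relational database well suited to storing documents, so the image information is stored there.

Set up the MongoDB configuration: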

from pymongo import MongoClient

client = MongoClient()
db = client['picture']
collection = db['picture']
max_page = 10
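When storing to MongoDB we can keep a little more information, such as the image's author, creation date, and title. For convenience we collect only these fields for now, and the image URL is of course included. The function that extracts the data is rewritten accordingly.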

def get_monmgodb(json):
    # Parse the JSON and collect author, createDate, title and image
    # as the records to store in MongoDB.
    if json.get('dataArray'):
        # Loop over the items and read each image's information.
        for item in json.get('dataArray'):
            author = item.get('uploaderName')
            title = item.get('title')
            image = item.get('url')
            createDate = item.get('createDate')
            # Build a dict from the result; the function is a generator.
            yield {
                'author': author,
                'title': title,
                'image': image,
                'createDate': createDate
            }
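Finally, insert each dict into the database with insert_one() (the older insert() method was deprecated in PyMongo 3 and removed in PyMongo 4). When needed, the data can be read back from the database and the images downloaded at that point, instead of keeping everything on disk.

def save_mongdb(result):
    collection.insert_one(result)
    print('Saved to MongoDB')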
After running it, open a MongoDB client to see that all the data has been stored in the database.
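Running the crawler twice inserts every record twice. One possible variant, which is not what the article does, is to upsert on the image URL so that a repeated run updates the stored record instead of duplicating it. A minimal sketch, assuming a hypothetical helper `save_unique` that receives the collection as an argument:

```python
def save_unique(collection, record):
    """Insert record, or update the stored copy with the same image URL.

    Returns True when a new document was inserted.
    """
    result = collection.update_one(
        {'image': record['image']},  # match on the image URL
        {'$set': record},            # write all fields of the record
        upsert=True,                 # insert when nothing matches
    )
    # upserted_id is None when an existing document was matched.
    return result.upserted_id is not None


# Usage with a real database (requires a running MongoDB server):
# from pymongo import MongoClient
# collection = MongoClient()['picture']['picture']
# save_unique(collection, {'title': 't', 'image': 'http://example.com/a.jpg'})
```

Passing the collection in as a parameter, rather than using the module-level `collection`, also makes the function easy to exercise against a stand-in object.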

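That is the whole process of crawling a page by analyzing its ajax requests. If you spot anything missing or wrong, corrections are welcome. The complete code follows.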

import os
from hashlib import md5
from multiprocessing import Pool
from urllib.parse import urlencode

import requests
from pymongo import MongoClient

'''
Author: starichat
Date: 2018.7.13
Description: analyze the Visual China designer community site and crawl the results of its ajax requests
'''

# Crawl the site content: fetch the JSON data returned by the ajax request

# MongoDB configuration
client = MongoClient()
db = client['picture']
collection = db['picture']
max_page = 10

def get_content(page):
    parameters = {
        'key': '世界杯',  # search keyword ("World Cup")
        'type': 'json',
        'page': page,
        'size': '20',
        'license': -1,
        'orderby': 'rating',
    }
    # Build the GET query string with urlencode(), request the URL with requests,
    # and return the parsed JSON body if the status code is 200.
    url = 'http://www.shijue.me/community/search?' + urlencode(parameters)
    print(url)
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError:
        return None
    return None

'''
save_image() saves one image. item is one of the dicts yielded by get_monmgodb().
A folder named after the item's title is created, then the image is requested
and its binary data is written to a file. The file name is the MD5 hash of the
content, which avoids saving duplicates.
'''
def save_image(item):
    if not os.path.exists(item.get('title')):
        os.mkdir(item.get('title'))
    try:
        response = requests.get(item.get('image'))
        if response.status_code == 200:
            file_path = '{0}/{1}.{2}'.format(item.get('title'), md5(response.content).hexdigest(), 'jpg')
            if not os.path.exists(file_path):
                with open(file_path, 'wb') as f:
                    f.write(response.content)
            else:
                print('Already Downloaded', file_path)
    except requests.ConnectionError:
        print('Failed to save Image')

def get_monmgodb(json):
    # Parse the JSON and collect author, createDate, title and image
    # as the records to store in MongoDB.
    if json.get('dataArray'):
        # Loop over the items and read each image's information.
        for item in json.get('dataArray'):
            author = item.get('uploaderName')
            title = item.get('title')
            image = item.get('url')
            createDate = item.get('createDate')
            # Build a dict from the result; the function is a generator.
            yield {
                'author': author,
                'title': title,
                'image': image,
                'createDate': createDate
            }

def save_mongdb(result):
    # insert_one() replaces insert(), which was removed in PyMongo 4.
    collection.insert_one(result)
    print('Saved to MongoDB')

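def main(page):
    json = get_content(page)
    print(json)
    if not json:
        return  # the request failed or returned no data
    for result in get_monmgodb(json):
        try:
            save_mongdb(result)
            save_image(result)
        except Exception as e:
            # Skip this item and carry on with the next one.
            print('Failed to process item:', e)

GROUP_START = 1
GROUP_END = 2  # crawl pages GROUP_START to GROUP_END; raise GROUP_END to crawl more pages

if __name__ == '__main__':
    pool = Pool()
    groups = list(range(GROUP_START, GROUP_END + 1))
    pool.map(main, groups)
    pool.close()
    pool.join()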
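The way `map()` hands page numbers to workers can be tried without any downloading. The sketch below swaps in `multiprocessing.pool.ThreadPool`, which offers the same `map()` interface as `Pool` but uses threads, a reasonable fit for I/O-bound work such as fetching images. The function `crawl_page` is a placeholder standing in for `main(page)`, and the page range is arbitrary.

```python
from multiprocessing.pool import ThreadPool


def crawl_page(page):
    """Placeholder for main(page): returns a label instead of downloading."""
    return 'page {0} done'.format(page)


GROUP_START = 1
GROUP_END = 4
pages = list(range(GROUP_START, GROUP_END + 1))

# map() blocks until every page has been processed and returns the
# results in the same order as the input list.
with ThreadPool(4) as pool:
    results = pool.map(crawl_page, pages)

print(results)
```

With the process-based `Pool` used in the listing above, the `if __name__ == '__main__':` guard matters because worker processes may import the module again on startup.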
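All the code has been uploaded to GitHub; if you need it, follow the link in the original post.

If you found this useful, feel free to share it and leave a comment.

One more friendly reminder: when you use images in an article, mind their copyright.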

Follow along for more; there is plenty to gain.
WeChat official account: starichat
