Scraping 千庫網 (588ku) with Python
阿新 • Published: 2020-09-14
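URL: https://i588ku.com/beijing/0-0-default-0-8-0-0-0-0-1/
The thumbnails on the listing page are watermarked,
but the watermark is gone once you click through to an image's detail page.
First, test whether the site has any anti-scraping measures: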
import requests
from bs4 import BeautifulSoup
import os
html = requests.get('https://i588ku.com/beijing/0-0-default-0-8-0-0-0-0-1/')
print(html.text)
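The response is a 404 page. Adding a User-Agent header is enough to fix that.
Each image sits inside a div whose class attribute looks like fl marony-item bglist_5993476. That is three classes, but the last one carries a per-image ID that differs for every item, so the selector leaves it out.
With that, we can extract the URLs inside.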
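To see why the selector names only the two shared classes, here is a minimal offline sketch. The HTML fragment is a made-up stand-in for the listing page (its inner structure is an assumption, and the real markup may differ), and it uses the built-in `html.parser` so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the listing page; the real markup may differ.
SAMPLE_HTML = """
<div class="fl marony-item bglist_5993476">
  <div><a href="//i588ku.com/ycbeijing/5993476.html">one</a></div>
</div>
<div class="fl marony-item bglist_5991004">
  <div><a href="//i588ku.com/ycbeijing/5991004.html">two</a></div>
</div>
<div class="fl other-item">
  <div><a href="//i588ku.com/unrelated.html">skip</a></div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# "div.fl.marony-item" matches any div that carries BOTH classes,
# whatever extra classes (such as bglist_<id>) it also has.
links = [a["href"] for a in soup.select("div.fl.marony-item div a")]
print(links)
```

The third div is skipped because it lacks the marony-item class. The same selector is applied to the live page next.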
import requests
from bs4 import BeautifulSoup
import os

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
}
html = requests.get('https://i588ku.com/beijing/0-0-default-0-8-0-0-0-0-1/', headers=headers)
soup = BeautifulSoup(html.text, 'lxml')
Urlimags = soup.select('div.fl.marony-item div a')
for Urlimag in Urlimags:
    print(Urlimag['href'])
The output is:
//i588ku.com/ycbeijing/5993476.html
//i588ku.com/comnew/vip/
//i588ku.com/ycbeijing/5991004.html
//i588ku.com/comnew/vip/
//i588ku.com/ycbeijing/5990729.html
//i588ku.com/comnew/vip/
//i588ku.com/ycbeijing/5991308.html
//i588ku.com/comnew/vip/
//i588ku.com/ycbeijing/5990409.html
//i588ku.com/comnew/vip/
//i588ku.com/ycbeijing/5989982.html
//i588ku.com/comnew/vip/
//i588ku.com/ycbeijing/5978978.html
//i588ku.com/comnew/vip/
//i588ku.com/ycbeijing/5993625.html
//i588ku.com/comnew/vip/
//i588ku.com/ycbeijing/5990728.html
//i588ku.com/comnew/vip/
//i588ku.com/ycbeijing/5951314.html
//i588ku.com/comnew/vip/
//i588ku.com/ycbeijing/5992353.html
//i588ku.com/comnew/vip/
//i588ku.com/ycbeijing/5993626.html
//i588ku.com/comnew/vip/
//i588ku.com/ycbeijing/5992302.html
//i588ku.com/comnew/vip/
//i588ku.com/ycbeijing/5820069.html
//i588ku.com/comnew/vip/
//i588ku.com/ycbeijing/5804406.html
//i588ku.com/comnew/vip/
//i588ku.com/ycbeijing/5960482.html
//i588ku.com/comnew/vip/
//i588ku.com/ycbeijing/5881533.html
//i588ku.com/comnew/vip/
//i588ku.com/ycbeijing/5986104.html
//i588ku.com/comnew/vip/
//i588ku.com/ycbeijing/5956726.html
//i588ku.com/comnew/vip/
//i588ku.com/ycbeijing/5986063.html
//i588ku.com/comnew/vip/
//i588ku.com/ycbeijing/5978787.html
//i588ku.com/comnew/vip/
//i588ku.com/ycbeijing/5954475.html
//i588ku.com/comnew/vip/
//i588ku.com/ycbeijing/5959200.html
//i588ku.com/comnew/vip/
//i588ku.com/ycbeijing/5973667.html
//i588ku.com/comnew/vip/
//i588ku.com/ycbeijing/5850381.html
//i588ku.com/comnew/vip/
//i588ku.com/ycbeijing/5898111.html
//i588ku.com/comnew/vip/
//i588ku.com/ycbeijing/5924657.html
//i588ku.com/comnew/vip/
//i588ku.com/ycbeijing/5975496.html
//i588ku.com/comnew/vip/
//i588ku.com/ycbeijing/5928655.html
//i588ku.com/comnew/vip/
//i588ku.com/ycbeijing/5963925.html
//i588ku.com/comnew/vip/
The /vip links are ads, so filter them out:
for Urlimag in Urlimags:
    # Skip the /vip/ ad links
    if 'vip' in Urlimag['href']:
        continue
    print('http:'+Urlimag['href'])
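The hrefs are protocol-relative (they start with `//`), which is why a scheme gets prepended by hand. As an alternative sketch, the standard library's `urllib.parse.urljoin` can resolve them against the page URL so the scheme is inherited automatically. The helper name `detail_urls` and the `sample` list are made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://i588ku.com/beijing/0-0-default-0-8-0-0-0-0-1/"

def detail_urls(hrefs, base_url=BASE_URL):
    """Drop the /vip/ ad links and turn the rest into absolute URLs."""
    urls = []
    for href in hrefs:
        if "vip" in href:
            continue  # ad link, not an image page
        # A "//host/path" href inherits the scheme of base_url.
        urls.append(urljoin(base_url, href))
    return urls

# Hypothetical input: three hrefs copied from the output shown earlier.
sample = [
    "//i588ku.com/ycbeijing/5993476.html",
    "//i588ku.com/comnew/vip/",
    "//i588ku.com/ycbeijing/5991004.html",
]
for url in detail_urls(sample):
    print(url)
```

Because the base URL is https, the resolved links are https as well.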
Then use os to create a local folder and save the images into it:
import requests
from bs4 import BeautifulSoup
import os

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
}
# Create the output folder once, before the loop
os.makedirs('千圖網圖片', exist_ok=True)

html = requests.get('https://i588ku.com/beijing/0-0-default-0-8-0-0-0-0-1/', headers=headers)
soup = BeautifulSoup(html.text, 'lxml')
Urlimags = soup.select('div.fl.marony-item div a')
for Urlimag in Urlimags:
    # Skip the /vip/ ad links
    if 'vip' in Urlimag['href']:
        continue
    # Open the detail page, which shows the image without a watermark
    imgurl = requests.get('http:' + Urlimag['href'], headers=headers)
    imgsoup = BeautifulSoup(imgurl.text, 'lxml')
    imgdatas = imgsoup.select_one('.img-box img')
    if imgdatas is None:
        continue  # no image found on this detail page
    title = imgdatas['alt']
    print('No watermark:', 'https:' + imgdatas['src'])
    # Use the image's alt text as the file name
    with open('千圖網圖片/{}.jpg'.format(title), 'wb') as f:
        f.write(requests.get('https:' + imgdatas['src'], headers=headers).content)
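One thing to watch: the file name comes straight from the image's alt text, which may contain characters such as `/` or `?` that are not valid in a path. The following is a minimal sketch of cleaning the title first. The helpers `safe_filename` and `target_path` are hypothetical names, not part of the original script.

```python
import os
import re

def safe_filename(title, ext=".jpg", fallback="untitled"):
    """Replace characters that Windows or POSIX file names reject."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]+', "_", title).strip(" ._")
    # Fall back to a fixed name if nothing usable is left.
    return (cleaned or fallback) + ext

def target_path(title, folder="千圖網圖片"):
    """Build the path the image would be written to."""
    return os.path.join(folder, safe_filename(title))

print(safe_filename("a/b: c?"))
print(target_path("sample title"))
```

In the download loop, `open(target_path(title), 'wb')` could then take the place of the formatted path string.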
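Next we want to download multiple pages, so first look at how the URL changes from page to page (only the trailing number differs):
Page 1: https://i588ku.com/beijing/0-0-default-0-8-0-0-0-0-1/
Page 2: https://i588ku.com/beijing/0-0-default-0-8-0-0-0-0-2/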
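The pattern can be captured in a small helper, sketched below. It assumes the trailing number keeps incrementing by one beyond page 2, which the two sample URLs suggest but do not prove. `PAGE_URL` and `page_urls` are names made up for this example.

```python
# Template taken from the two URLs above; only the last number varies.
PAGE_URL = "https://i588ku.com/beijing/0-0-default-0-8-0-0-0-0-{}/"

def page_urls(first=1, last=10):
    """Return the listing URLs for pages first..last, inclusive."""
    return [PAGE_URL.format(n) for n in range(first, last + 1)]

urls = page_urls(1, 3)
for url in urls:
    print(url)
```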
import requests
from bs4 import BeautifulSoup
import os

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
}
# Create the output folder once, before the loops
os.makedirs('千圖網圖片', exist_ok=True)

# Pages 1 through 10
for i in range(1, 11):
    print('Downloading page {}'.format(i))
    html = requests.get('https://i588ku.com/beijing/0-0-default-0-8-0-0-0-0-{}/'.format(i), headers=headers)
    soup = BeautifulSoup(html.text, 'lxml')
    Urlimags = soup.select('div.fl.marony-item div a')
    for Urlimag in Urlimags:
        # Skip the /vip/ ad links
        if 'vip' in Urlimag['href']:
            continue
        # Open the detail page, which shows the image without a watermark
        imgurl = requests.get('http:' + Urlimag['href'], headers=headers)
        imgsoup = BeautifulSoup(imgurl.text, 'lxml')
        imgdatas = imgsoup.select_one('.img-box img')
        if imgdatas is None:
            continue  # no image found on this detail page
        title = imgdatas['alt']
        print('No watermark:', 'https:' + imgdatas['src'])
        # Use the image's alt text as the file name
        with open('千圖網圖片/{}.jpg'.format(title), 'wb') as f:
            f.write(requests.get('https:' + imgdatas['src'], headers=headers).content)
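A loop over ten pages makes a few hundred requests, and any one of them can fail partway through. The sketch below shows one way to retry a failed request after a short pause. It is an optional addition rather than part of the original script, and `fetch_with_retry` and `flaky` are hypothetical names. The fetch function is passed in as a parameter so the example runs without network access.

```python
import time

def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url); on an exception, wait `delay` seconds and retry."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as exc:  # broad on purpose for this sketch
            last_error = exc
            if attempt < retries:
                time.sleep(delay)
    # Every attempt failed: re-raise the last error.
    raise last_error

# Stand-in for a network call: fails twice, then succeeds.
calls = {"n": 0}

def flaky(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok: " + url

result = fetch_with_retry(flaky, "https://example.com/", retries=3, delay=0)
print(result, "after", calls["n"], "attempts")
```

With requests, the fetch argument could be something like `lambda u: requests.get(u, headers=headers, timeout=10)`.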