Web Scraping: Practical Examples with the Requests Library
By 阿新 • Published: 2018-12-13
1. Scraping a JD.com product page
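import requests

url = "https://item.jd.com/100000400014.html"
try:
    r = requests.get(url)
    r.raise_for_status()                # raise an exception for 4xx/5xx responses
    r.encoding = r.apparent_encoding    # decode using the encoding detected from the body
    print(r.text)
except requests.RequestException:
    print("ERROR!")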
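The same fetch pattern recurs in every example in this post, so it could be wrapped in a small helper. The following is only a sketch of one possible shape: the name `fetch_text` and the 30-second timeout are assumptions made for illustration, not part of the original code.

```python
import requests


def fetch_text(url, timeout=30):
    """Return the decoded body of `url`, or None if the request fails.

    `fetch_text` and its default timeout are illustrative choices.
    """
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()              # turn 4xx/5xx responses into an exception
        r.encoding = r.apparent_encoding  # decode using the detected encoding
        return r.text
    except requests.RequestException:
        # Covers connection errors, timeouts, malformed URLs and HTTP error codes.
        return None
```

Calling `fetch_text("https://item.jd.com/100000400014.html")` would then return the page text, or `None` if anything went wrong along the way.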
2. Scraping an Amazon product page
Run the following code:
import requests

url = "https://www.amazon.cn/dp/B07DFFDR1V/ref=sr_1_5?ie=UTF8&qid=1538651583&sr=8-5"
r = requests.get(url)
r.encoding = r.apparent_encoding
print(r.text)
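The result is not the page source we wanted. r.status_code returns 503, and r.request.headers shows the headers that were sent. With the defaults they look like {'User-Agent': 'python-requests/<version>', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}, so the User-Agent identifies the client as a script, which the site appears to reject. In this situation we need to override the headers parameter: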
import requests

url = "https://www.amazon.cn/dp/B07DFFDR1V/ref=sr_1_5?ie=UTF8&qid=1538651583&sr=8-5"
header = {'user-agent': 'Mozilla/5.0'}   # present the request as coming from a browser
try:
    r = requests.get(url, headers=header)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text)
except requests.RequestException:
    print("ERROR!")
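To check which headers a request would carry without contacting the server, the request can be prepared but not sent. The snippet below is a minimal sketch using `requests.Session.prepare_request` and `requests.utils.default_user_agent`; the variable names are illustrative.

```python
import requests

url = "https://www.amazon.cn/dp/B07DFFDR1V/ref=sr_1_5?ie=UTF8&qid=1538651583&sr=8-5"

# The User-Agent that requests sends when none is supplied.
default_ua = requests.utils.default_user_agent()

session = requests.Session()
# Build the request exactly as it would be sent, but do not send it.
prepared = session.prepare_request(
    requests.Request("GET", url, headers={"user-agent": "Mozilla/5.0"})
)

print(default_ua)                      # has the form python-requests/<version>
print(prepared.headers["User-Agent"])  # header lookup is case-insensitive
```

The second value printed is the overridden User-Agent, which confirms that the custom header replaces the library default rather than being sent alongside it.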
3. Submitting a search keyword to Baidu
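Baidu's keyword search interface has the form http://www.baidu.com/s?wd=keyword, so we only need to construct a URL with that structure to run a keyword search:
>>> import requests
>>> kw = {'wd': 'Python'}
>>> r = requests.get("http://www.baidu.com/s", params=kw)
>>> # the URL that was actually requested
>>> r.request.url
'http://www.baidu.com/s?wd=Python'
>>> # length of the returned page text
>>> len(r.text)
254672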
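The way `params` is turned into a query string can also be inspected offline by preparing the request instead of sending it. This is a small sketch; the function name `build_search_url` is a hypothetical helper introduced here.

```python
import requests


def build_search_url(keyword):
    """Return the Baidu search URL for `keyword` without sending a request."""
    req = requests.Request("GET", "http://www.baidu.com/s", params={"wd": keyword})
    return req.prepare().url


print(build_search_url("Python"))  # http://www.baidu.com/s?wd=Python
```

Letting Requests build the query string also takes care of percent-encoding, so non-ASCII keywords do not need to be escaped by hand.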
4. Downloading and saving an image from the web
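An image link is an ordinary URL whose last path segment is the file name, so that segment can be reused as the local file name.
A minimal version:
>>> import requests
>>> url = "http://pic107.huitu.com/res/20180709/520738_20180709220906214050_1.jpg"
>>> path = "D:/" + url.split('/')[-1]
>>> r = requests.get(url)
>>> r.status_code
200
>>> with open(path, "wb") as f:
...     f.write(r.content)
...
86931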
Full code:
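import os
import requests

url = "http://pic107.huitu.com/res/20180709/520738_20180709220906214050_1.jpg"
root = "D:/pics/"
path = root + url.split("/")[-1]    # reuse the last URL segment as the file name
try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url)
        r.raise_for_status()
        # r.content is bytes, so the file must be opened in binary write mode
        with open(path, "wb") as f:
            f.write(r.content)
        print("File saved successfully")
    else:
        print("File already exists")
except (requests.RequestException, OSError):
    print("Download failed")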
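For larger files, or URLs that carry a query string, a slightly more defensive variant may be useful. The sketch below is one way to do it, under assumptions: the helper names `filename_from_url` and `download`, the 8192-byte chunk size and the 30-second timeout are all illustrative choices, not part of the original code.

```python
import os
from urllib.parse import urlsplit

import requests


def filename_from_url(url):
    """Take the last path segment of `url`, ignoring any query string."""
    return os.path.basename(urlsplit(url).path)


def download(url, root):
    """Stream `url` into directory `root` and return the saved path."""
    os.makedirs(root, exist_ok=True)   # also creates missing parent directories
    path = os.path.join(root, filename_from_url(url))
    if os.path.exists(path):
        return path                    # already downloaded, nothing to do
    with requests.get(url, stream=True, timeout=30) as r:
        r.raise_for_status()
        with open(path, "wb") as f:
            # Write the body in chunks instead of holding it all in memory.
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
    return path
```

With `stream=True` the body is fetched only as `iter_content` is consumed, so memory use stays roughly constant regardless of the file size.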
5. Automatic IP address location lookup
import requests

ip = input("your ip:")
url = "http://www.ip138.com/ips138.asp?ip="
try:
    r = requests.get(url + ip)          # the IP is passed as the "ip" query parameter
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text)
except requests.RequestException:
    print("ERROR!")
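Because the address comes straight from user input, it may be worth validating it before building the query. The following sketch uses the standard-library `ipaddress` module and passes the value through `params` instead of string concatenation; `LOOKUP_URL` and `build_lookup_url` are hypothetical names introduced for this example.

```python
import ipaddress

import requests

LOOKUP_URL = "http://www.ip138.com/ips138.asp"


def build_lookup_url(ip):
    """Validate `ip` and return the query URL; raises ValueError if invalid."""
    ipaddress.ip_address(ip)   # raises ValueError for malformed addresses
    req = requests.Request("GET", LOOKUP_URL, params={"ip": ip})
    return req.prepare().url


print(build_lookup_url("202.204.80.112"))
# http://www.ip138.com/ips138.asp?ip=202.204.80.112
```

The returned URL can then be passed to `requests.get` exactly as in the code above.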