
Fixing 403 Errors in Python Crawlers

When writing a crawler, first check the status code of the page you want to crawl:

import urllib.request

print(urllib.request.urlopen(url).getcode())

200 — normal access
301 — redirect
404 — page not found
403 — forbidden (the server blocks rapid, repeated requests from a single User-Agent)
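One wrinkle worth knowing: in Python 3, `urllib.request.urlopen` raises `urllib.error.HTTPError` for 4xx/5xx responses, so a bare `getcode()` call never actually sees a 403. A minimal sketch of a status check that handles this (the helper name `check_status` is mine, not from the original post):

```python
import urllib.request
import urllib.error

def check_status(url, timeout=10):
    """Return the HTTP status code for url, even for error responses.

    urlopen() raises HTTPError for 4xx/5xx, so we catch the exception
    and read the code from it instead of letting it propagate.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.getcode()
    except urllib.error.HTTPError as e:
        return e.code
```

With this helper, a blocked page simply returns `403` instead of crashing the crawler.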

**How to fix a 403**

```python
import random
import urllib.request

# Pool of real browser User-Agent strings to rotate through.
my_headers = [
    "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14",
    "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Win64; x64; Trident/6.0)",
]

def get_content(url):
    """Fetch url with a randomly chosen User-Agent and return the body."""
    random_header = random.choice(my_headers)
    req = urllib.request.Request(url)
    req.add_header("User-Agent", random_header)
    req.add_header("Host", "e-hentai.org")
    req.add_header("Referer", "https://e-hentai.org/lofi")
    content = urllib.request.urlopen(req).read()
    return content
```

The idea is to pick a User-Agent at random from a pool on every request, so the server cannot tell that the requests all come from the same client. That is usually enough to get past the 403.
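Since the 403 is triggered by *rapid* repeated requests, rotating the User-Agent combines well with a random pause between requests. A minimal sketch under that assumption (the names `USER_AGENTS` and `polite_get` are illustrative, not from the original post):

```python
import random
import time
import urllib.request

# Illustrative pool; in practice reuse the my_headers list from above.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0",
]

def polite_get(url, min_delay=1.0, max_delay=3.0):
    """Sleep a random interval, then fetch url with a random User-Agent."""
    time.sleep(random.uniform(min_delay, max_delay))
    req = urllib.request.Request(
        url, headers={"User-Agent": random.choice(USER_AGENTS)}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

Jittering the delay (rather than sleeping a fixed amount) makes the traffic pattern look less mechanical to simple rate limiters.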