Learning by doing: integrating an IP proxy into Scrapy (using Abuyun as the example)
阿新 • Published: 2019-01-01
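1. Preface
One of my projects needs to crawl the securities association's website, which blocks IPs. To finish the crawl, I had to make Scrapy switch IPs automatically.
Before this, I had tried the third-party library scrapy-proxys together with the Zhima IP (芝麻ip) proxy API. It did not work, possibly because I never got the code tuned properly. (I will test it again when I get the chance.)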
2. Abuyun examples
Abuyun's official documentation provides sample code for both plain Python and Scrapy.
Python 3 example
from urllib import request

# Target page to fetch
targetUrl = "http://test.abuyun.com/proxy.php"

# Proxy server
proxyHost = "http-dyn.abuyun.com"
proxyPort = "9020"

# Proxy tunnel credentials
proxyUser = "H01234567890123D"
proxyPass = "0123456789012345"

proxyMeta = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
    "host": proxyHost,
    "port": proxyPort,
    "user": proxyUser,
    "pass": proxyPass,
}

proxy_handler = request.ProxyHandler({
    "http": proxyMeta,
    "https": proxyMeta,
})

#auth = request.HTTPBasicAuthHandler()
#opener = request.build_opener(proxy_handler, auth, request.HTTPHandler)
opener = request.build_opener(proxy_handler)

request.install_opener(opener)
resp = request.urlopen(targetUrl).read()

print(resp)
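The sample assembles the proxy URL with plain %-formatting, which works as long as the credentials contain no characters that are special inside a URL. As a minimal sketch, assuming you want to guard against that case, the helper below percent-encodes the user name and password first. `build_proxy_url` is a hypothetical name and is not part of the Abuyun sample.

```python
from urllib.parse import quote


def build_proxy_url(user, password, host, port, scheme="http"):
    """Return a proxy URL of the form scheme://user:password@host:port.

    Hypothetical helper: the credentials are percent-encoded so that
    characters such as '@' or ':' in them cannot break the URL structure.
    """
    return "%s://%s:%s@%s:%s" % (
        scheme,
        quote(user, safe=""),
        quote(password, safe=""),
        host,
        port,
    )


# Placeholder values copied from the sample above.
proxy_meta = build_proxy_url(
    "H01234567890123D", "0123456789012345", "http-dyn.abuyun.com", "9020"
)
```

For purely alphanumeric credentials like the placeholders, the result is identical to the `proxyMeta` string built in the sample.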
The code above is the plain-Python approach. Below is the Scrapy middleware version.
Scrapy middleware
import base64

# Proxy server
proxyServer = "http://http-dyn.abuyun.com:9020"

# Proxy tunnel credentials
proxyUser = "H01234567890123D"
proxyPass = "0123456789012345"

# for Python2
proxyAuth = "Basic " + base64.b64encode(proxyUser + ":" + proxyPass)

# for Python3
#proxyAuth = "Basic " + base64.urlsafe_b64encode(bytes((proxyUser + ":" + proxyPass), "ascii")).decode("utf8")

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta["proxy"] = proxyServer
        request.headers["Proxy-Authorization"] = proxyAuth
This code simply goes into the middlewares file of your Scrapy project.
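The `Proxy-Authorization` value in the sample is just the string `Basic ` followed by the Base64 encoding of `user:password`. The sketch below wraps that in one small function for Python 3. `basic_proxy_auth` is a hypothetical name, and the sketch uses the standard `base64.b64encode` rather than the URL-safe variant in the sample. The two produce identical output unless the encoded text contains `+` or `/`, and HTTP Basic authentication is defined on the standard alphabet.

```python
import base64


def basic_proxy_auth(user, password):
    """Build a 'Basic ...' value for the Proxy-Authorization header."""
    # Join the credentials as "user:password" and encode them to bytes.
    token = ("%s:%s" % (user, password)).encode("utf-8")
    # Standard Base64, then back to text so it can be used as a header value.
    return "Basic " + base64.b64encode(token).decode("ascii")


# Placeholder credentials copied from the sample above.
proxy_auth = basic_proxy_auth("H01234567890123D", "0123456789012345")
```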
3. Integration
Add the following class to the project's middlewares.py:
import base64

# Abuyun IP proxy configuration, including the account credentials
proxyServer = "http://http-dyn.abuyun.com:9020"
proxyUser = "HWFHQ5YP14Lxxx"
proxyPass = "CB8D0AD56EAxxx"

# for Python3
proxyAuth = "Basic " + base64.urlsafe_b64encode(bytes((proxyUser + ":" + proxyPass), "ascii")).decode("utf8")


class ABProxyMiddleware(object):
    """Route every request through the Abuyun proxy."""

    def process_request(self, request, spider):
        request.meta["proxy"] = proxyServer
        request.headers["Proxy-Authorization"] = proxyAuth
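The class above hardcodes the account credentials in middlewares.py. One alternative is to read them from the project settings through Scrapy's `from_crawler` hook, which keeps secrets out of the middleware code. The following is a minimal sketch under that assumption. The class name `SettingsProxyMiddleware` and the `ABUYUN_PROXY_*` setting names are hypothetical and would have to be defined in your own settings.py.

```python
import base64


class SettingsProxyMiddleware(object):
    """Hypothetical variant of ABProxyMiddleware configured from settings."""

    def __init__(self, server, user, password):
        self.server = server
        # Precompute the header value once instead of on every request.
        token = ("%s:%s" % (user, password)).encode("ascii")
        self.auth = "Basic " + base64.b64encode(token).decode("ascii")

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook to build the middleware.
        # The ABUYUN_PROXY_* names are assumptions, not Scrapy built-ins.
        settings = crawler.settings
        return cls(
            server=settings.get(
                "ABUYUN_PROXY_SERVER", "http://http-dyn.abuyun.com:9020"
            ),
            user=settings.get("ABUYUN_PROXY_USER"),
            password=settings.get("ABUYUN_PROXY_PASS"),
        )

    def process_request(self, request, spider):
        # Same behavior as the original class: set the proxy and its auth header.
        request.meta["proxy"] = self.server
        request.headers["Proxy-Authorization"] = self.auth
```

With this variant, settings.py would carry `ABUYUN_PROXY_USER` and `ABUYUN_PROXY_PASS`, and the middleware would be registered in `DOWNLOADER_MIDDLEWARES` under its own class path.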
Then enable the middleware in settings.py:
DOWNLOADER_MIDDLEWARES = {
    #'Securities.middlewares.SecuritiesDownloaderMiddleware': None,
    'Securities.middlewares.ABProxyMiddleware': 1,
}
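4. Notes
By default, Abuyun's dynamic IP plan allows 5 requests per second (you can pay for a higher limit). At the default limit, requests must be at least 0.2 seconds apart, so the crawler has to be throttled. Still in settings.py, add the following anywhere in the file:
# Enable throttling
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 0.2  # initial download delay, in seconds
DOWNLOAD_DELAY = 0.2  # delay between requests, in seconds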
Of course, if you pay for a higher request limit, throttling becomes less of a concern.
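The 0.2 second figure is simply 1 divided by the allowed requests per second, so the delay can be recomputed if the plan changes. The sketch below is one way to derive the throttle settings from a rate limit. `throttle_settings` is a hypothetical helper, and it assumes the crawl targets a single site, since Scrapy applies download delays per download slot (normally per domain). It also turns off `RANDOMIZE_DOWNLOAD_DELAY`, because with Scrapy's default of `True` the actual wait varies between 0.5 and 1.5 times `DOWNLOAD_DELAY` and can drop below the computed minimum.

```python
def throttle_settings(max_requests_per_second):
    """Return Scrapy settings that keep one download slot under the given rate."""
    # Minimum gap between two requests, in seconds (5 req/s -> 0.2 s).
    delay = 1.0 / max_requests_per_second
    return {
        "DOWNLOAD_DELAY": delay,
        # Keep the gap fixed instead of randomizing it around DOWNLOAD_DELAY.
        "RANDOMIZE_DOWNLOAD_DELAY": False,
        # AutoThrottle may raise the delay but does not go below DOWNLOAD_DELAY.
        "AUTOTHROTTLE_ENABLED": True,
        "AUTOTHROTTLE_START_DELAY": delay,
    }


# Default Abuyun limit mentioned above: 5 requests per second.
default_plan = throttle_settings(5)
```

The returned dictionary can be copied into settings.py by hand or assigned to a spider's `custom_settings` attribute.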
That completes the integration of Abuyun's dynamic proxy IPs into Scrapy. Happy crawling!