Python web scraping: configuring a proxy in Scrapy
Posted by 阿新 · 2019-02-09
http://www.pythontab.com/html/2014/pythonweb_0326/724.html
When scraping a website, the most common problem is that the site restricts access by IP and has anti-scraping measures in place. The best workaround is to rotate the IP addresses you crawl from, i.e. to go through proxies.
The following shows how to configure a proxy in Scrapy for crawling.
1. Create a new file named "middlewares.py" in your Scrapy project:
# Import the base64 library; we only need it if the proxy requires authentication
import base64

# Start your middleware class
class ProxyMiddleware(object):
    # Overwrite process_request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # Set up basic authentication for the proxy
        # (b64encode instead of the deprecated encodestring, which appends a stray newline)
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode()).decode()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
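The middleware above always sends traffic through one fixed proxy. Since the point stated at the top is IP rotation, here is a minimal sketch of a rotating variant; the PROXIES list, its entries, and the RandomProxyMiddleware name are illustrative placeholders, not part of the original article:

# middlewares.py -- rotating-proxy sketch (PROXIES entries are placeholders)
import base64
import random

PROXIES = [
    {'url': 'http://PROXY_IP_1:PORT', 'user_pass': 'USERNAME:PASSWORD'},
    {'url': 'http://PROXY_IP_2:PORT', 'user_pass': None},  # proxy without auth
]

class RandomProxyMiddleware(object):
    def process_request(self, request, spider):
        # Pick a different proxy at random for every request
        proxy = random.choice(PROXIES)
        request.meta['proxy'] = proxy['url']
        if proxy['user_pass']:
            # Same Basic auth scheme as in ProxyMiddleware above
            encoded = base64.b64encode(proxy['user_pass'].encode()).decode()
            request.headers['Proxy-Authorization'] = 'Basic ' + encoded

Register it in DOWNLOADER_MIDDLEWARES (step 2 below) exactly as you would ProxyMiddleware.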
2. Add the following to the project's settings file (./pythontab/settings.py):
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    'tab.middlewares.ProxyMiddleware': 100,
}
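Note: the middleware path above matches the Scrapy versions current when the article was written. From Scrapy 1.0 onward the scrapy.contrib namespace was removed, so on a modern install the equivalent configuration would look roughly like the sketch below ('tab' stands for the article's example project package and should be replaced with your own project name):

# settings.py on Scrapy >= 1.0 (assumes your project package is still called 'tab')
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'tab.middlewares.ProxyMiddleware': 100,
}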