Configuring a proxy in Scrapy

阿新 • Published: 2018-12-09

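I. IP pool

Proxy IPs can be obtained from sites such as:
Kuaidaili (快代理): https://www.kuaidaili.com/free/
Xici Proxy (西刺代理): http://www.xicidaili.com/
If you see an error like "A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond" or "No connection could be made because the target machine actively refused it", the proxy IP is the problem; just switch to another one. Many of the IPs listed on these sites turn out to be unusable. The log below, captured on a Chinese-locale Windows machine, shows the first error (10060) in its localized wording: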

2017-04-16 12:38:11 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://news.sina.com.cn/> (failed 1 times): TCP connection timed out: 10060: 由於連線方在一段時間後沒有正確答覆或連線的主機沒有反應，連線嘗試失敗。.
this is ip:182.241.58.70:51660
2017-04-16 12:38:32 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://news.sina.com.cn/> (failed 2 times): TCP connection timed out: 10060: 由於連線方在一段時間後沒有正確答覆或連線的主機沒有反應，連線嘗試失敗。.
this is ip:49.75.59.243:28549
2017-04-16 12:38:33 [scrapy.crawler] INFO: Received SIGINT, shutting down gracefully. Send again to force
2017-04-16 12:38:33 [scrapy.core.engine] INFO: Closing spider (shutdown)
2017-04-16 12:38:50 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-04-16 12:38:53 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET  TCP connection timed out: 10060: 由於連線方在一段時間後沒有正確答覆或連線的主機沒有反應，連線嘗試失敗。.
2017-04-16 12:38:54 [scrapy.core.scraper] ERROR: Error downloading <GET http://news.sina.com.cn/>  
Traceback (most recent call last):  
  File "f:\software\python36\lib\site-packages\twisted\internet\defer.py", line 1299, in _inlineCallbacks  
    result = result.throwExceptionIntoGenerator(g)  
  File "f:\software\python36\lib\site-packages\twisted\python\failure.py", line 393, in throwExceptionIntoGenerator  
    return g.throw(self.type, self.value, self.tb)  
  File "f:\software\python36\lib\site-packages\scrapy\core\downloader\middleware.py",
    defer.returnValue((yield download_func(request=request,spider=spider)))  
TCP connection timed out: 10060: 由於連線方在一段時間後沒有正確答覆或連線的主機沒有反應，連線嘗試失敗。.
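
Both errors quoted earlier (timeout and active refusal) happen at the TCP level, so a dead proxy can be detected before it is ever handed to Scrapy. The following is a minimal sketch using only the standard library; proxy_reachable is a hypothetical helper name, and a successful TCP connection only shows that the port is open, not that the host really works as an HTTP proxy.

```python
import socket


def proxy_reachable(addr, timeout=2.0):
    """Return True if a TCP connection to "host:port" opens within `timeout` seconds."""
    host, _, port = addr.rpartition(":")
    try:
        # Timeouts and refused connections are both reported as OSError subclasses.
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, calling proxy_reachable("182.241.58.70:51660") checks the first address from the log above.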

In Scrapy, the downloader middleware responsible for proxy settings is HttpProxyMiddleware. Its class is:

scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware

2. Edit the middleware file middlewares.py

# -*- coding: utf-8 -*-

# Define here the downloader middleware for your project
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html

import random

from myproxies.settings import IPPOOL


class MyproxiesSpiderMiddleware(object):
    """Downloader middleware that assigns a random proxy from IPPOOL to each request."""

    def __init__(self, ip=''):
        self.ip = ip

    def process_request(self, request, spider):
        # Pick one pool entry; each entry is a dict whose "ipaddr" value is "host:port".
        thisip = random.choice(IPPOOL)
        print("this is ip:" + thisip["ipaddr"])
        # Scrapy reads the proxy URL for this request from request.meta["proxy"].
        request.meta["proxy"] = "http://" + thisip["ipaddr"]
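
The middleware imports IPPOOL from the project's settings.py, but the setting itself is not shown in this post. A minimal sketch of what it might look like, assuming a list of dicts with an "ipaddr" key (the shape implied by thisip["ipaddr"]) and using the two addresses from the log as placeholder entries; pick_proxy_url is a hypothetical helper that mirrors what process_request does:

```python
import random

# Hypothetical contents for settings.py: one dict per proxy, "host:port" under "ipaddr".
IPPOOL = [
    {"ipaddr": "182.241.58.70:51660"},
    {"ipaddr": "49.75.59.243:28549"},
]


def pick_proxy_url(pool):
    """Build the value that the middleware stores in request.meta["proxy"]."""
    entry = random.choice(pool)
    return "http://" + entry["ipaddr"]


print(pick_proxy_url(IPPOOL))
```

Keeping each entry as a dict rather than a bare string leaves room for extra fields later, such as a protocol or a failure counter.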

3. Set DOWNLOADER_MIDDLEWARES in settings

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 543,
    'myproxies.middlewares.MyproxiesSpiderMiddleware': 125,
}

4. The spider file:


# -*- coding: utf-8 -*-
import scrapy


class ProxieSpider(scrapy.Spider):
    name = "proxie"
    allowed_domains = ["sina.com.cn"]
    start_urls = ['http://news.sina.com.cn/']

    def __init__(self, *args, **kwargs):
        # Let scrapy.Spider finish its own initialisation before adding attributes.
        super().__init__(*args, **kwargs)
        self.headers = {
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
            'Accept-Encoding': 'gzip, deflate',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
        }

    def parse(self, response):
        # Print the raw page body to confirm the request got through the proxy.
        print(response.body)

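II. Automatically updating the IP pool

Here is a class, in proxies.py, that fetches proxy IPs automatically. Run it to save the fetched IPs to a txt file: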


# -*- coding: utf-8 -*-
import random
from multiprocessing import Process, Queue

import requests
from bs4 import BeautifulSoup  # the 'lxml' parser used below requires the lxml package

class Proxies(object):
    """Scrape free proxies from xicidaili.com and verify which of them work."""

    def __init__(self, page=3):
        self.proxies = []
        self.verify_pro = []
        self.page = page
        self.headers = {
        'Accept': '*/*',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8'
        }
        self.get_proxies()
        self.get_proxies_nn()

    def get_proxies(self):
        page = random.randint(1,10)
        page_stop = page + self.page
        while page < page_stop:
            url = 'http://www.xicidaili.com/nt/%d' % page
            html = requests.get(url, headers=self.headers).content
            soup = BeautifulSoup(html, 'lxml')
            ip_list = soup.find(id='ip_list')
            for odd in ip_list.find_all(class_='odd'):
                protocol = odd.find_all('td')[5].get_text().lower()+'://'
                self.proxies.append(protocol + ':'.join([x.get_text() for x in odd.find_all('td')[1:3]]))
            page += 1

    def get_proxies_nn(self):
        page = random.randint(1,10)
        page_stop = page + self.page
        while page < page_stop:
            url = 'http://www.xicidaili.com/nn/%d' % page
            html = requests.get(url, headers=self.headers).content
            soup = BeautifulSoup(html, 'lxml')
            ip_list = soup.find(id='ip_list')
            for odd in ip_list.find_all(class_='odd'):
                protocol = odd.find_all('td')[5].get_text().lower() + '://'
                self.proxies.append(protocol + ':'.join([x.get_text() for x in odd.find_all('td')[1:3]]))
            page += 1

    def verify_proxies(self):
        # Proxies that have not been verified yet
        old_queue = Queue()
        # Proxies that passed verification
        new_queue = Queue()
        print('verify proxy........')
        works = []
        for _ in range(15):
            works.append(Process(target=self.verify_one_proxy, args=(old_queue, new_queue)))
        for work in works:
            work.start()
        for proxy in self.proxies:
            old_queue.put(proxy)
        # One sentinel (0) per worker tells that worker to stop
        for work in works:
            old_queue.put(0)
        for work in works:
            work.join()
        self.proxies = []
        while 1:
            try:
                self.proxies.append(new_queue.get(timeout=1))
            except Exception:
                # queue.Empty: nothing arrived within 1 second, so the queue is drained
                break
        print('verify_proxies done!')


    def verify_one_proxy(self, old_queue, new_queue):
        while 1:
            proxy = old_queue.get()
            if proxy == 0:
                break
            protocol = 'https' if 'https' in proxy else 'http'
            proxies = {protocol: proxy}
            # requests picks the proxy by the scheme of the target URL, so the
            # test URL must use the same scheme as the proxies key
            test_url = '%s://www.baidu.com' % protocol
            try:
                if requests.get(test_url, proxies=proxies, timeout=2).status_code == 200:
                    print('success %s' % proxy)
                    new_queue.put(proxy)
            except Exception:
                print('fail %s' % proxy)


if __name__ == '__main__':
    a = Proxies()
    a.verify_proxies()
    print(a.proxies)
    proxie = a.proxies
    # Append the verified proxies to proxies.txt, one per line
    with open('proxies.txt', 'a') as f:
        for proxy in proxie:
            f.write(proxy + '\n')
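
Because proxies.txt is opened in append mode, running proxies.py several times leaves duplicate lines in the file. A small sketch of a loader that drops blank lines and duplicates while keeping the original order; load_proxies is a hypothetical helper, not part of the script above:

```python
def load_proxies(path):
    """Return the unique, non-empty proxy URLs in `path`, in first-seen order."""
    seen = set()
    unique = []
    with open(path, "r") as f:
        for line in f:
            proxy = line.strip()
            if proxy and proxy not in seen:
                seen.add(proxy)
                unique.append(proxy)
    return unique
```

An alternative would be to open the file with mode 'w' so that each run overwrites the previous list, at the cost of losing proxies that were verified earlier.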

Change the contents of the proxy middleware file middlewares.py to the following:

import random
import time


class ProxyMiddleWare(object):
    """Downloader middleware that attaches a random proxy from proxies.txt to each request."""

    def process_request(self, request, spider):
        '''Attach a proxy to the request object.'''
        proxy = self.get_random_proxy()
        print("this is request ip:" + proxy)
        request.meta['proxy'] = proxy

    def process_response(self, request, response, spider):
        '''Handle the returned response.'''
        # If the response status is not 200, reschedule the request with a different proxy
        if response.status != 200:
            proxy = self.get_random_proxy()
            print("this is response ip:" + proxy)
            # Copy the request, attach the new proxy, and bypass the duplicate
            # filter so the scheduler does not drop the retry
            retry_request = request.copy()
            retry_request.meta['proxy'] = proxy
            retry_request.dont_filter = True
            return retry_request
        return response

    def get_random_proxy(self):
        '''Read a random proxy from the file.'''
        while 1:
            with open('G:\\Scrapy_work\\myproxies\\myproxies\\proxies.txt', 'r') as f:
                proxies = f.readlines()
            if proxies:
                break
            else:
                # The file is empty; wait for proxies.py to fill it
                time.sleep(1)
        proxy = random.choice(proxies).strip()
        return proxy
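
get_random_proxy reopens proxies.txt for every request. If that becomes a bottleneck, one option is to cache the list and reload it only when the file's modification time changes. The sketch below is an assumption about how that could look, not the author's code; ProxyFile and its methods are hypothetical names.

```python
import os
import random


class ProxyFile:
    """Cache the proxies in a text file and reload them when the file changes."""

    def __init__(self, path):
        self.path = path
        self._mtime = None
        self._proxies = []

    def proxies(self):
        mtime = os.path.getmtime(self.path)
        if mtime != self._mtime:
            # First read, or the file was rewritten since the last read: reload it.
            with open(self.path, "r") as f:
                self._proxies = [line.strip() for line in f if line.strip()]
            self._mtime = mtime
        return self._proxies

    def random_proxy(self):
        """Return a random proxy, or None if the file is currently empty."""
        pool = self.proxies()
        return random.choice(pool) if pool else None
```

A middleware could create one ProxyFile in its __init__ and call random_proxy() from process_request, handling the None case however suits the crawl (for example by sleeping and retrying, as the original loop does).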

Update the settings file:

DOWNLOADER_MIDDLEWARES = {
#    'myproxies.middlewares.MyCustomDownloaderMiddleware': 543,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
    'myproxies.middlewares.ProxyMiddleWare': 125,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': None,
}
