Building a proxy pool with Scrapy
阿新 • Published: 2018-11-05
I started by crawling xicidaili, but I hit it too fast and my IP got banned, so I can only keep crawling from home. Tomorrow I plan to persist the scraped IPs to Redis and automatically test and rate each proxy after crawling (a rough sketch of that part is at the end of this post).
proxypool.py
# -*- coding: utf-8 -*-
from scrapy import Request, Spider
from pyquery import PyQuery
from ..items import ProxyItem


class ProxySpider(Spider):
    name = 'proxy'
    # Domains of the free-proxy sites actually being crawled
    allowed_domains = ['www.kuaidaili.com', 'www.66ip.cn', 'www.ip3366.net']
    page = 1
    xicidaili_url = "http://www.xicidaili.com/nn/{page}"
    kuaidaili_url = "https://www.kuaidaili.com/free/inha/{page}/"
    _66daili_url = "http://www.66ip.cn/areaindex_{page}/1.html"
    ip3366_url = "http://www.ip3366.net/?stype=1&page={page}"

    def start_requests(self):
        # Kick off page 1 of each source with its own parse callback
        yield Request(url=self.kuaidaili_url.format(page=self.page), callback=self.kuaidaili_parse)
        yield Request(url=self._66daili_url.format(page=self.page), callback=self._66_daili_parse)
        yield Request(url=self.ip3366_url.format(page=self.page), callback=self.ip3366_parse)
        # xicidaili banned my IP, so this source is disabled for now
        # yield Request(url=self.xicidaili_url.format(page=1), callback=self.xicidaili_parse)

    def kuaidaili_parse(self, response):
        pq = PyQuery(response.text)
        proxies = pq.find("#list .table-bordered tbody").find("tr")
        for proxy in proxies.items():
            item = ProxyItem()  # fresh item per row, so earlier yields are not overwritten
            ip = proxy.find("td").eq(0).text()
            port = proxy.find("td").eq(1).text()
            item["proxy"] = ip + ":" + port
            print("Got proxy from %s -> IP: %s PORT: %s" % ("www.kuaidaili.com", ip, port))
            yield item
        # URL ends in ".../inha/<page>/", so the page number is the second-to-last segment
        now_page = int(response.url.split("/")[-2])
        next_page = now_page + 1
        if next_page <= 10:
            yield Request(url=self.kuaidaili_url.format(page=str(next_page)),
                          callback=self.kuaidaili_parse, dont_filter=True)

    def _66_daili_parse(self, response):
        pq = PyQuery(response.text)
        # :gt(0) skips the header row
        proxies = pq.find("#footer table tr:gt(0)")
        for proxy in proxies.items():
            item = ProxyItem()
            ip = proxy.find("td").eq(0).text()
            port = proxy.find("td").eq(1).text()
            item["proxy"] = ip + ":" + port
            print("Got proxy from %s -> IP: %s PORT: %s" % ("http://www.66ip.cn", ip, port))
            yield item
        # URL looks like ".../areaindex_<page>/1.html"
        now_page = int(response.url.split("/")[-2].split("_")[1])
        next_page = now_page + 1
        if next_page <= 34:
            yield Request(url=self._66daili_url.format(page=str(next_page)),
                          callback=self._66_daili_parse, dont_filter=True)

    def ip3366_parse(self, response):
        pq = PyQuery(response.text)
        proxies = pq.find("#list table tbody tr:gt(0)")
        for proxy in proxies.items():
            item = ProxyItem()
            ip = proxy.find("td").eq(0).text()
            port = proxy.find("td").eq(1).text()
            item["proxy"] = ip + ":" + port
            print("Got proxy from %s -> IP: %s PORT: %s" % ("www.ip3366.net", ip, port))
            yield item
        # URL looks like ".../?stype=1&page=<page>", so the page sits after the second "="
        now_page = int(response.url.split("=")[2])
        next_page = now_page + 1
        if next_page <= 10:
            yield Request(url=self.ip3366_url.format(page=str(next_page)),
                          callback=self.ip3366_parse, dont_filter=True)

    # def xicidaili_parse(self, response):
    #     print(response)
    #     page = int(response.url.split("/")[-1])
    #     pq = PyQuery(response.text)
    #     item = XicidailiItem()
    #     proxies = pq.find("#ip_list").find("tr")
    #     total_page = int(pq.find("#body .pagination a").eq(-2).text())
    #     self.page += 1
    #     for proxy in proxies.items():
    #         ip = proxy.find("td").eq(1).text()
    #         port = proxy.find("td").eq(2).text()
    #         item["proxy_info"] = ip + ":" + port
    #         # print(item["proxy_info"])
    #         yield item
    #     page += 1
    #     if page <= 10:
    #         yield Request(url=self.xicidaili_url.format(page=str(page)),
    #                       callback=self.xicidaili_parse, dont_filter=True)
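All three parse callbacks follow the same pattern: load the response HTML into PyQuery, select the table rows, and read the first two td cells as IP and port. As a standalone illustration, the same PyQuery calls can be tried against a hand-written snippet; the markup below only mimics the structure the kuaidaili callback expects, it is not the real page:

from pyquery import PyQuery

# Made-up HTML that mirrors the "#list .table-bordered tbody" structure
html = """
<div id="list">
  <table class="table table-bordered">
    <tbody>
      <tr><td>1.2.3.4</td><td>8080</td></tr>
      <tr><td>5.6.7.8</td><td>3128</td></tr>
    </tbody>
  </table>
</div>
"""

pq = PyQuery(html)
for row in pq.find("#list .table-bordered tbody").find("tr").items():
    ip = row.find("td").eq(0).text()    # first cell: IP address
    port = row.find("td").eq(1).text()  # second cell: port
    print(ip + ":" + port)              # -> 1.2.3.4:8080 and 5.6.7.8:3128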
items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
from scrapy import Item, Field


class ProxyItem(Item):
    # The spider stores the whole "ip:port" string in a single field
    proxy = Field()
settings.py
# -*- coding: utf-8 -*-

# Scrapy settings for ProxyPool project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'ProxyPool'

SPIDER_MODULES = ['ProxyPool.spiders']
NEWSPIDER_MODULE = 'ProxyPool.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'ProxyPool (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 4

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36"
}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'ProxyPool.middlewares.ProxypoolSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'ProxyPool.middlewares.ProxypoolDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
# ITEM_PIPELINES = {
#     'ProxyPool.pipelines.ProxyPipeline': 300,
# }

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 30
# The maximum download delay to be set in case of high latencies
AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
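The Redis persistence and proxy rating mentioned at the top are still on the to-do list. Below is a minimal sketch of what the item pipeline could look like, assuming redis-py is installed and a local Redis instance is running on the default port; the key name proxies:untested and the connection parameters are my own placeholders, not part of the project yet.

pipelines.py (sketch)
# -*- coding: utf-8 -*-
import redis


class ProxyPipeline(object):
    """Push every scraped "ip:port" string into a Redis set for persistence."""

    PROXY_KEY = "proxies:untested"  # placeholder key name, not from the original project

    def open_spider(self, spider):
        # Assumes a local Redis instance with the default port and no password
        self.client = redis.StrictRedis(host="localhost", port=6379, db=0,
                                        decode_responses=True)

    def process_item(self, item, spider):
        # A set de-duplicates proxies that show up on more than one source site
        self.client.sadd(self.PROXY_KEY, item["proxy"])
        return item

For this to run, the commented-out ITEM_PIPELINES block in settings.py has to be re-enabled ('ProxyPool.pipelines.ProxyPipeline': 300). The automatic rating step could then be a standalone script along the following lines; the test URL, timeout, and the "alive" key are again placeholders:

check_proxies.py (sketch)
import redis
import requests

client = redis.StrictRedis(host="localhost", port=6379, db=0, decode_responses=True)

for proxy in client.smembers("proxies:untested"):
    try:
        # Route a cheap request through the proxy; any 2xx response counts as alive
        resp = requests.get("http://httpbin.org/ip",
                            proxies={"http": "http://" + proxy},
                            timeout=10)
        if resp.ok:
            client.sadd("proxies:alive", proxy)
    except requests.RequestException:
        pass  # dead or too slow: leave it out of the alive set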