IP Proxies

阿新 • Published: 2021-01-21

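When writing a crawler, you will find that some sites have anti-scraping measures; in that case you need to route your requests through a proxy IP.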

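IP proxy basics

Common sites for checking your IP:
http://myip.ipip.net/
http://httpbin.org/get

Common proxy list sites:
Xila Free Proxy IP (西拉免費代理IP)
Kuaidaili (快代理)
89 Free Proxy (89免費代理)
Quanwang Proxy IP (全網代理IP)

Proxy configuration format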

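proxies = {
    # Keys are the scheme of the target URL; values are the proxy address.
    # A plain HTTP proxy is addressed with http:// under both keys.
    'http': 'http://113.238.142.208:3128',
    'https': 'http://113.238.142.208:3128',
}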

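Both an http and an https entry are set here because some pages are served over HTTP and others over HTTPS. requests picks the entry whose key matches the scheme of the target URL, so listing both lets the proxy be used for either kind of page. If you know which scheme the target page uses, a single entry is enough.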
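
As a small illustration of the format described above, the sketch below turns an 'ip:port' string into such a mapping. The helper name build_proxies and its proxy_scheme parameter are hypothetical, and the sketch assumes the proxy itself is a plain HTTP proxy, which is not guaranteed for every entry on a free list.

```python
def build_proxies(address, proxy_scheme="http"):
    """Turn 'ip:port' into a requests-style proxies mapping.

    The keys ('http', 'https') name the scheme of the target URL.
    The value is the URL of the proxy itself.
    """
    address = address.strip()
    if "://" in address:
        # The caller already supplied a full proxy URL; use it unchanged.
        proxy_url = address
    else:
        proxy_url = "{}://{}".format(proxy_scheme, address)
    return {"http": proxy_url, "https": proxy_url}


if __name__ == "__main__":
    print(build_proxies("113.238.142.208:3128"))
```

The returned dict can be passed directly as the proxies argument of requests.get.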

With and without a proxy

import requests

url = 'http://myip.ipip.net/'
proxies = {
    # A plain HTTP proxy is addressed with http:// under both keys.
    'http': 'http://113.238.142.208:3128',
    'https': 'http://113.238.142.208:3128',
}
try:
    response = requests.get(url, timeout=3)  # without a proxy
    # response = requests.get(url, proxies=proxies, timeout=3)  # with a proxy
    print(response.status_code)
    if response.status_code == 200:
        print(response.text)
except requests.RequestException as e:
    # Covers connection failures and timeouts alike.
    print(e.args)

Without a proxy: (screenshot not preserved)
With a proxy: (screenshot not preserved)

Building a pool with Xila Free Proxy IP

import requests
import random
import time
from lxml import etree
from fake_useragent import UserAgent
from multiprocessing.dummy import Pool  # thread-based pool, not processes
import os


class IpPool:
    def __init__(self):
        # Start each run with a fresh output file.
        if os.path.exists("ip_pool.txt"):
            os.remove("ip_pool.txt")
            print('Removed old ip_pool.txt')
        else:
            print("ip_pool.txt does not exist yet")
        # URL used to test whether a proxy works
        self.test_url = 'http://httpbin.org/get'
        # Paged URL of the proxy list to scrape
        self.url = 'http://www.xiladaili.com/gaoni/{}/'

        self.headers = {'User-Agent': UserAgent().random}
        # Every scraped 'ip:port' string; working ones are appended to ip_pool.txt
        self.ip_list = []

    def get_html(self, url):
        '''Fetch a page and return its HTML.'''
        html = requests.get(url=url, headers=self.headers, timeout=10).text
        return html

    def get_proxy(self, url):
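        '''Parse one list page and test the 'ip:port' entries it contains.'''
        html = self.get_html(url=url)
        elemt = etree.HTML(html)

        # The first table column is used as the 'ip:port' string.
        ip_ports_list = elemt.xpath('/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()')

        self.ip_list.extend(ip_ports_list)
        # Test only this page's entries; mapping over self.ip_list would
        # re-test every earlier page and write duplicates to ip_pool.txt.
        with Pool() as pool:
            pool.map(self.test_proxy, ip_ports_list)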

    def test_proxy(self, proxy):
        '''Check whether a proxy works.'''
        # proxies maps the target URL's scheme to the proxy URL:
        #   {'scheme': 'http://IP:port'}
        # A plain HTTP proxy is addressed with http:// under both keys.
        proxies = {
            'http': 'http://{}'.format(proxy),
            'https': 'http://{}'.format(proxy),
        }
        try:
            # timeout=3: an exception is raised if the proxy does not
            # respond within 3 seconds.
            resp = requests.get(url=self.test_url, proxies=proxies, headers=self.headers, timeout=3)
            if resp.status_code == 200:
                print('\n', proxy, '\033[31musable\033[0m')
                # Save working proxies to a text file for later use.
                with open('ip_pool.txt', 'a') as f:
                    f.write(proxy + '\n')
            else:
                print(proxy, 'unusable')
        except requests.RequestException:
            print(proxy, 'unusable')

    def crawl(self):
        '''Run the crawler.'''
        # List pages differ only in the page number:
        #   http://www.xiladaili.com/gaoni/1/
        #   http://www.xiladaili.com/gaoni/2/
        #   ...
        # The site lists too many free proxies to test them all,
        # so only the first 100 pages are fetched here.
        for i in range(1, 101):
            print('Fetching page {} of proxies'.format(i))
            # Build the full page URL.
            page_url = self.url.format(i)
            # Throttle requests if needed:
            # time.sleep(random.randint(1, 4))
            self.get_proxy(url=page_url)


if __name__ == '__main__':
    ip = IpPool()
    ip.crawl()
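
The script above only writes working proxies to ip_pool.txt. A minimal sketch of reading that file back and picking one entry at random might look like the following. The function names load_pool and pick_proxies are hypothetical, and the sketch assumes one 'ip:port' entry per line, as written by test_proxy.

```python
import random
from pathlib import Path


def load_pool(path="ip_pool.txt"):
    """Read 'ip:port' lines, dropping blanks and duplicates (order kept)."""
    file = Path(path)
    if not file.exists():
        return []
    seen = {}
    for line in file.read_text(encoding="utf-8").splitlines():
        entry = line.strip()
        if entry:
            seen.setdefault(entry, None)  # dict keys keep insertion order
    return list(seen)


def pick_proxies(pool):
    """Choose one entry at random and wrap it as a requests proxies dict."""
    if not pool:
        raise ValueError("proxy pool is empty")
    proxy_url = "http://" + random.choice(pool)
    return {"http": proxy_url, "https": proxy_url}


if __name__ == "__main__":
    pool = load_pool()
    if pool:
        print(pick_proxies(pool))
    else:
        print("ip_pool.txt is missing or empty")
```

Free proxies tend to stop working quickly, so a file produced some time ago may need to be re-tested before use.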

Building a proxy IP pool with Kuaidaili

import requests
import random
import time
from lxml import etree
from fake_useragent import UserAgent
from multiprocessing.dummy import Pool


class IpPool:
    def __init__(self):
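        # URL used to test whether a proxy works
        self.test_url = 'http://httpbin.org/get'
        # Paged URL of the proxy list to scrape
        # (note: this points at 89ip.cn, i.e. 89免費代理)
        self.url = 'https://www.89ip.cn/index_{}.html'

        self.headers = {'User-Agent': UserAgent().random}
        # Every scraped 'ip:port' string; working ones are appended to ip_pool.txt
        self.ip_list = []

    def get_html(self, url):
        '''Fetch a page and return its HTML.'''
        html = requests.get(url=url, headers=self.headers, timeout=10).text
        return html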

    def get_proxy(self, url):
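        '''Parse one list page: extract IPs and ports, then test them.'''
        html = self.get_html(url=url)
        elemt = etree.HTML(html)

        ips_list = elemt.xpath('//table/tbody/tr/td[1]/text()')
        ports_list = elemt.xpath('//table/tbody/tr/td[2]/text()')

        page_proxies = []
        for ip, port in zip(ips_list, ports_list):
            # Join IP and port, e.g. 175.44.109.195:9999
            page_proxies.append(ip.strip() + ":" + port.strip())

        self.ip_list.extend(page_proxies)
        # Test only this page's entries; mapping over self.ip_list would
        # re-test every earlier page and write duplicates to ip_pool.txt.
        with Pool() as pool:
            pool.map(self.test_proxy, page_proxies)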

    def test_proxy(self, proxy):
        '''Check whether a proxy works.'''
        # proxies maps the target URL's scheme to the proxy URL:
        #   {'scheme': 'http://IP:port'}
        # A plain HTTP proxy is addressed with http:// under both keys.
        proxies = {
            'http': 'http://{}'.format(proxy),
            'https': 'http://{}'.format(proxy),
        }
        try:
            # timeout=3: an exception is raised if the proxy does not
            # respond within 3 seconds.
            resp = requests.get(url=self.test_url, proxies=proxies, headers=self.headers, timeout=3)
            if resp.status_code == 200:
                print('\n', proxy, '\033[31musable\033[0m')
                # Save working proxies to a text file for later use.
                with open('ip_pool.txt', 'a') as f:
                    f.write(proxy + '\n')
            else:
                print(proxy, 'unusable')
        except requests.RequestException:
            print(proxy, 'unusable')

    def crawl(self):
        '''Run the crawler.'''
        # List pages differ only in the page number:
        #   https://www.89ip.cn/index_1.html
        #   https://www.89ip.cn/index_2.html
        #   ...
        # The site lists too many free proxies to test them all,
        # so only the first 100 pages are fetched here.
        for i in range(1, 101):
            print('Fetching page {} of proxies'.format(i))
            # Build the full page URL.
            page_url = self.url.format(i)
            # Throttle requests if needed:
            # time.sleep(random.randint(1, 4))
            self.get_proxy(url=page_url)


if __name__ == '__main__':
    ip = IpPool()
    ip.crawl()
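
Because the script above joins table cells into 'ip:port' strings, a stray header cell or an empty cell would produce a malformed entry that still costs a 3-second test request. One possible pre-filter, sketched here with the standard ipaddress module, is shown below. The function name is_valid_proxy is hypothetical, and the sketch assumes IPv4 entries only.

```python
import ipaddress


def is_valid_proxy(entry):
    """Return True if entry looks like 'IPv4:port' with a port in 1..65535."""
    host, sep, port = entry.strip().rpartition(":")
    # Require a colon and a port made of plain ASCII digits.
    if not sep or not (port.isascii() and port.isdigit()):
        return False
    try:
        ipaddress.IPv4Address(host)  # raises ValueError on a bad address
    except ValueError:
        return False
    return 1 <= int(port) <= 65535


if __name__ == "__main__":
    scraped = ["175.44.109.195:9999", "IP:PORT", "999.1.1.1:80"]
    print([p for p in scraped if is_valid_proxy(p)])
```

Applied to the list built in get_proxy, such a filter would run before pool.map so that only plausible entries are tested.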

Hands-on project: scraping Meituan merchant reviews through proxy IPs

import requests,time,random,csv
from fake_useragent import UserAgent
from lxml import etree

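starttime = time.time()  # record the start time

ips = []  # working proxies, each stored as a requests-style dict
for page_no in range(1, 2):
    print("Preparing proxy IPs, please wait...")
    headers = {
        "User-Agent": UserAgent().chrome  # random Chrome User-Agent string
    }
    ip_url = 'http://www.xiladaili.com/gaoni/{}/'.format(page_no)
    html = requests.get(url=ip_url, headers=headers).text
    selector = etree.HTML(html)
    ip_list = selector.xpath('/html/body/div/div[3]/div[2]/table/tbody/tr/td[1]/text()')
    for addr in ip_list:
        ip = "http://" + addr.strip()
        # The target URLs are https://, so the dict needs an 'https' key too;
        # with only 'http', the requests below would bypass the proxy.
        proxies = {"http": ip, "https": ip}
        try:
            # Request Baidu through the proxy; status 200 means it works.
            baidu = requests.get("https://www.baidu.com/", proxies=proxies, timeout=3)
        except requests.RequestException:
            continue  # dead or slow proxy, skip it
        if baidu.status_code == 200:
            ips.append(proxies)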

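# Create the CSV file and write the header row; the "utf-8-sig" encoding
# keeps Chinese text from being garbled.
fp = open('./美團_大學城.csv', 'a', newline='', encoding='utf-8-sig')  # "./" is the current directory, "a" appends
writer = csv.writer(fp)
writer.writerow(('User ID', 'User name', 'Average price', 'Comment', 'Merchant reply'))  # header row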

for page in range(0, 515, 10):  # offsets 0, 10, ..., 510
    url = "https://www.meituan.com/meishi/api/poi/getMerchantComment?uuid=2ff7056c-9d76-424c-b564-b7084f7e16e4&platform=1&partner=126&originUrl=https%3A%2F%2Fwww.meituan.com%2Fmeishi%2F193383554%2F&riskLevel=1&optimusCode=10&id=193383554&userId=&offset={}&pageSize=10&sortType=1".format(page)
    headers = {
        "User-Agent": UserAgent().chrome  # random Chrome User-Agent string
    }
    proxy = random.choice(ips)  # pick a random working proxy
    print(proxy)
    response = requests.get(url=url, headers=headers, proxies=proxy)  # send the request through the proxy
    for item in response.json()['data']['comments']:
        userId = item['userId']  # reviewer ID
        userName = item['userName']  # reviewer nickname
        avgPrice = item['avgPrice']  # average price
        comment = item['comment']  # review text
        merchantComment = item['merchantComment']  # merchant's reply
        data = (userId, userName, avgPrice, comment, merchantComment)
        writer.writerow(data)
fp.close()  # close the file
endtime = time.time()  # record the end time
sumTime = endtime - starttime  # total elapsed time
print("Total time: %s seconds" % sumTime)
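
The script above picks one random proxy per page and stops with an exception if that proxy has died in the meantime. A possible refinement, sketched below, retries with a different proxy when a request fails. The function fetch_with_rotation and its fetch parameter are hypothetical: fetch stands for any callable that takes (url, proxies) and either returns a response or raises, for example a thin wrapper around requests.get.

```python
import random


def fetch_with_rotation(url, proxy_pool, fetch, max_attempts=3):
    """Try up to max_attempts proxies, dropping each one that fails.

    fetch(url, proxies) is supplied by the caller; it returns a response
    or raises an exception.
    """
    candidates = list(proxy_pool)  # copy, so the caller's list is untouched
    last_error = None
    for _ in range(max_attempts):
        if not candidates:
            break
        proxies = random.choice(candidates)
        try:
            return fetch(url, proxies)
        except Exception as exc:  # narrow to requests.RequestException in real use
            last_error = exc
            candidates.remove(proxies)
    raise RuntimeError("no working proxy for {}".format(url)) from last_error


if __name__ == "__main__":
    def demo_fetch(url, proxies):
        # Stand-in for a real request: the first proxy always fails.
        if proxies["http"].endswith(":1"):
            raise ConnectionError("dead proxy")
        return "fetched {} via {}".format(url, proxies["http"])

    pool = [{"http": "http://10.0.0.1:1"}, {"http": "http://10.0.0.2:2"}]
    print(fetch_with_rotation("http://httpbin.org/get", pool, demo_fetch))
```

With requests, fetch could be defined as a function that calls requests.get(url, proxies=proxies, timeout=5), and the list ips built earlier could serve as proxy_pool.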

Reference: https://blog.csdn.net/ayouleyang/article/details/104453540
