Setting up a proxy IP pool for a Python crawler: Method 1

阿新 • Published: 2019-01-22

"""

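When scraping with Python you will often find that the target site has anti-scraping measures in place. Intensive, high-frequency scraping puts heavy load on the site's servers, so a single IP that keeps requesting the same pages is likely to get banned. The way around this is to send requests through proxy IPs and maintain a pool of them.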

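This post covers a free way to build a proxy IP pool.

Pros:

1. It is free.

Cons:

1. Free proxy IPs are unstable and need to be replaced often.

2. Many of the scraped IPs do not work, so the list needs regular filtering.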

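A suggestion:

This method is best suited to learning. For project or research work, see my other post, "Setting up a proxy IP pool for a Python crawler: Method 2", and buy stable proxy IPs.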

"""

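I. Main idea

1. Scrape IP addresses and port numbers from a proxy listing site and store them

2. Check whether each IP is usable

3. Format the IP addresses

4. Use the proxy IPs in requests to scrape the target site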

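II. Before you start

In Requests, a request is sent through a proxy like this:

import requests
requests.get(url, headers=headers, proxies=proxies)

Here proxies is a dictionary. For each IP it looks like this:

proxies = {
    'http': 'http://114.99.7.122:8752',
    'https': 'https://114.99.7.122:8752'
}

Note: the http and https keys do not refer to the type shown next to the IP on the proxy listing site. They refer to the scheme of the site that requests is accessing. If the site you are scraping is served over http, the http entry is used; if it is served over https, the https entry is used. When collecting IPs from the listing site, collect the http and https IPs separately as well.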
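
The mapping described above can be built from a bare ip:port string. The following is a minimal sketch, not the author's code: build_proxies and proxy_for are hypothetical helper names (they are not part of requests), and the sketch assumes the proxy itself speaks plain HTTP, so the same http:// proxy URL is used under both keys. proxy_for is a simplified view of which entry applies to a given target URL.

```python
from urllib.parse import urlsplit


def build_proxies(ip_port, proxy_scheme="http"):
    """Return a requests-style proxies dict for one 'ip:port' string.

    build_proxies is a hypothetical helper, not part of requests.
    proxy_scheme is how the client talks to the proxy itself (assumed
    to be plain HTTP here); the dict keys are the target site's scheme.
    """
    proxy_url = f"{proxy_scheme}://{ip_port}"
    return {"http": proxy_url, "https": proxy_url}


def proxy_for(target_url, proxies):
    """Return the proxy URL that applies to target_url, or None."""
    return proxies.get(urlsplit(target_url).scheme)


proxies = build_proxies("114.99.7.122:8752")
print(proxy_for("https://example.com/", proxies))
```

The resulting dictionary can be passed as the proxies argument of requests.get. Whether a given free proxy accepts https targets at all still has to be checked by trying it.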

III. Code

1. Set up the environment and import packages

# IP addresses are taken from a Chinese high-anonymity proxy listing site: http://www.xicidaili.com/nn/
# The IPs on the first page alone are enough for ordinary use

from bs4 import BeautifulSoup
import requests
import random

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'}

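2. Function to fetch a page's content

def getHTMLText(url, proxies):
    try:
        # Send the browser-like headers and give up after 10 seconds
        r = requests.get(url, headers=headers, proxies=proxies, timeout=10)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
    except requests.RequestException:
        # Covers connection errors, timeouts and HTTP error statuses
        return 0
    else:
        return r.text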

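3. Function to scrape the proxy list from the listing site, check availability, and return the IP list

def get_ip_list(url):
    # headers must be passed by keyword; the second positional argument is params
    web_data = requests.get(url, headers=headers, timeout=10)
    soup = BeautifulSoup(web_data.text, 'html.parser')
    ips = soup.find_all('tr')
    ip_list = []
    for i in range(1, len(ips)):  # start at 1 to skip the table header row
        tds = ips[i].find_all('td')
        if len(tds) > 2:
            ip_list.append(tds[1].text + ':' + tds[2].text)
    # Check each IP and keep only the ones that respond. (This is never fully
    # reliable: an IP you drop may only be temporarily down, and one that
    # passes may stop working after a single use.)
    usable = []  # build a new list instead of removing items while iterating
    for ip in ip_list:
        proxy_host = 'http://' + ip
        proxy_temp = {'http': proxy_host, 'https': proxy_host}
        try:
            requests.get(url, headers=headers, proxies=proxy_temp, timeout=5)
        except requests.RequestException:
            continue
        usable.append(ip)
    return usable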
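
Checking proxies one at a time is slow when most of them time out. As a possible variation, the checks can run in a thread pool. This is a minimal sketch under stated assumptions: filter_usable is a hypothetical helper name, and the check function is supplied by the caller (in practice it would send one request through the proxy with a short timeout and return False on any error). The fake_check function below is a stand-in for illustration only, so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def filter_usable(ip_list, check, max_workers=20):
    """Return the entries of ip_list for which check(ip) is truthy.

    filter_usable is a hypothetical helper. check is supplied by the
    caller and is run concurrently; input order is preserved.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(check, ip_list))
    return [ip for ip, ok in zip(ip_list, results) if ok]


# Stand-in checker for illustration only: pretends port-80 proxies work.
def fake_check(ip):
    return ip.endswith(":80")


print(filter_usable(["1.1.1.1:80", "2.2.2.2:8080", "3.3.3.3:80"], fake_check))
```

Because the function returns a new list rather than removing items from the one being iterated, no entries are skipped during the check.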

4. Pick a random IP from the pool

def get_random_ip(ip_list):
    proxy_list = []
    for ip in ip_list:
        proxy_list.append('http://' + ip)
    proxy_ip = random.choice(proxy_list)
    proxies = {'http': proxy_ip}
    return proxies
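
Since free proxies fail often, a single random pick may not be enough. One way to handle this, sketched here under stated assumptions, is to retry with another random IP and drop the ones that fail. fetch_with_rotation is a hypothetical helper name, and fetch is a caller-supplied function that takes (url, proxies) and raises an exception on failure; with requests it could wrap requests.get with a timeout. The fake_fetch function is a stand-in so the example runs offline.

```python
import random


def fetch_with_rotation(url, ip_list, fetch, max_tries=3):
    """Try up to max_tries randomly chosen proxies; return the first result.

    fetch(url, proxies) is supplied by the caller and must raise on
    failure. Failed IPs are removed from ip_list in place so they are
    not chosen again. Returns None if every attempt fails or the pool
    runs out.
    """
    for _ in range(max_tries):
        if not ip_list:
            break
        ip = random.choice(ip_list)
        proxies = {"http": "http://" + ip}
        try:
            return fetch(url, proxies)
        except Exception:
            ip_list.remove(ip)
    return None


# Stand-in for a real request: only one proxy "works".
def fake_fetch(url, proxies):
    if proxies["http"] != "http://3.3.3.3:80":
        raise ConnectionError("proxy unreachable")
    return "ok via " + proxies["http"]


pool = ["1.1.1.1:80", "2.2.2.2:80", "3.3.3.3:80"]
print(fetch_with_rotation("http://example.com/", pool, fake_fetch))
```

Removing failed IPs in place keeps the pool shrinking toward the ones that still respond, which matches the regular filtering mentioned among the drawbacks of free proxies.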

5. Using the proxy

if __name__ == '__main__':
    url = 'http://www.xicidaili.com/nn/'
    ip_list = get_ip_list(url)
    proxies = get_random_ip(ip_list)
    print(proxies)
