python3: urllib raises "urlopen error EOF occurred in violation of protocol (_ssl.c:841)"
阿新 • Published: 2019-09-01
The python3 source code:

import urllib.request

from bs4 import BeautifulSoup

# Download the page and strip the HTML tags, keeping only the text.
response = urllib.request.urlopen("http://php.net/")
html = response.read()
soup = BeautifulSoup(html, "html5lib")
text = soup.get_text(strip=True)
print(text)
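The code is simple: it fetches the text content of the http://php.net/ page, then uses the BeautifulSoup module to filter out the surplus HTML tags. The first run seemed to succeed, but every run after that hung for a while and then failed with: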
  File "C:\Python36\lib\urllib\request.py", line 504, in _call_chain
    result = func(*args)
  File "C:\Python36\lib\urllib\request.py", line 1361, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "C:\Python36\lib\urllib\request.py", line 1320, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error EOF occurred in violation of protocol (_ssl.c:841)>
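Yet the page opens fine in Google Chrome. Note that the traceback goes through https_open even though the URL starts with http://, which suggests the request was redirected to HTTPS and that the failure happens during the SSL/TLS handshake.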
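To check that a failure like this really is an SSL/TLS error rather than an ordinary network problem, the `reason` attribute of the `URLError` can be inspected. The following is only a minimal sketch: `explain_url_error` and `fetch` are hypothetical helper names that do not appear in the original code, and the labels they return are arbitrary.

```python
import ssl
import urllib.error
import urllib.request


def explain_url_error(err):
    """Return a short label for the underlying cause of a URLError.

    Hypothetical helper for illustration; the labels are arbitrary.
    """
    reason = err.reason
    # SSLEOFError is the exception behind "EOF occurred in violation of protocol".
    if isinstance(reason, ssl.SSLEOFError):
        return "tls-eof"
    if isinstance(reason, ssl.SSLError):
        return "tls-error"
    return "other"


def fetch(url, timeout=10):
    """Fetch a URL and report the kind of failure instead of raising."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as err:
        print("request failed:", explain_url_error(err), "-", err.reason)
        return None
```

Calling `fetch("http://php.net/")` would then print a label together with the original reason when the request fails.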
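One possible cause is that SSLv2 is disabled on the web server, while older Python libraries (for example under Python 2.x) try to connect with PROTOCOL_SSLv23 by default. In that case the SSL version used for the request has to be chosen explicitly.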
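With urllib itself, the protocol version can be constrained by passing an `ssl.SSLContext` to `urlopen`. This is a sketch of that alternative, not the approach used in this post: `make_tls_context` and `fetch_with_context` are hypothetical names, TLS 1.2 as the minimum version is an assumption, and `ssl.TLSVersion` requires Python 3.7 or later.

```python
import ssl
import urllib.request


def make_tls_context(minimum=ssl.TLSVersion.TLSv1_2):
    """Build a certificate-verifying context with a minimum TLS version.

    The TLS 1.2 default is an assumption made for this sketch.
    """
    context = ssl.create_default_context()
    context.minimum_version = minimum
    return context


def fetch_with_context(url, timeout=10):
    """Fetch a URL with urllib, using the context above for HTTPS."""
    context = make_tls_context()
    with urllib.request.urlopen(url, context=context, timeout=timeout) as response:
        return response.read()
```

Because the context comes from `ssl.create_default_context()`, certificate and hostname verification stay enabled; only the lowest acceptable protocol version changes.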
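With the Requests library, changing the SSL version used for HTTPS means subclassing the HTTPAdapter class and mounting the subclass on a Session object. For example, to force TLSv1, the new transport adapter looks like this: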
import ssl

from requests.adapters import HTTPAdapter
from requests.packages.urllib3.poolmanager import PoolManager


class MyAdapter(HTTPAdapter):
    def init_poolmanager(self, connections, maxsize, block=False):
        # Force TLSv1 for every connection created by this adapter.
        self.poolmanager = PoolManager(num_pools=connections,
                                       maxsize=maxsize,
                                       block=block,
                                       ssl_version=ssl.PROTOCOL_TLSv1)
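It can then be mounted on a Requests Session object:

import requests

s = requests.Session()
s.mount('https://', MyAdapter())
# Send the request through the session so that the mounted adapter is used;
# urllib.request.urlopen() would bypass the session entirely.
response = s.get("http://php.net/")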
Writing a general-purpose transport adapter is just as easy: its constructor accepts any SSL protocol constant from the ssl package and uses it.
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.poolmanager import PoolManager


class SSLAdapter(HTTPAdapter):
    '''An HTTPS Transport Adapter that uses an arbitrary SSL version.'''

    def __init__(self, ssl_version=None, **kwargs):
        self.ssl_version = ssl_version
        super(SSLAdapter, self).__init__(**kwargs)

    def init_poolmanager(self, connections, maxsize, block=False):
        self.poolmanager = PoolManager(num_pools=connections,
                                       maxsize=maxsize,
                                       block=block,
                                       ssl_version=self.ssl_version)
The failing code from above, after modification:

import ssl

import requests
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.poolmanager import PoolManager


class MyAdapter(HTTPAdapter):
    def init_poolmanager(self, connections, maxsize, block=False):
        # Force TLSv1 for every connection created by this adapter.
        self.poolmanager = PoolManager(num_pools=connections,
                                       maxsize=maxsize,
                                       block=block,
                                       ssl_version=ssl.PROTOCOL_TLSv1)


s = requests.Session()
s.mount('https://', MyAdapter())
# Fetch through the session so that the mounted adapter is used;
# urllib.request.urlopen() would bypass it.
response = s.get("http://php.net/")
html = response.content
soup = BeautifulSoup(html, "html5lib")
text = soup.get_text(strip=True)
print(text)
With this change, the text content of the page can be fetched normally.