[Python Network Programming] gevent httpclient and Web Page Encoding

阿新 • Published: 2019-01-28

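A while back I came across the geventhttpclient project, https://github.com/gwik/geventhttpclient. The official documentation says it is very fast because responses are parsed in C, so I have long wanted to use it in a project.

I have been wrestling with it for the past two days. Frankly, it is hard to use and its wrapping is thin. The biggest shortcomings are:

1. Redirects are not supported; you have to follow them yourself, which is tedious.

2. A newly created HTTPClient object can only send requests to a single domain.

That is rather painful, so I spent some time wrapping it. The wrapper solves both problems above and also adds automatic decoding of the page encoding. The code is as follows: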

#!/usr/bin/env python
#-*-encoding:UTF-8-*-

import re
from geventhttpclient.url import URL
from geventhttpclient.client import HTTPClient,HTTPClientPool

from urlparse import urljoin


#from core.common import urljoin

HEADERS = {
    'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20100101 Firefox/29.0'
}

DEFAULF_METHOD = "GET"
MAX_REDIRECT_TIME = 10
DEFAULT_PAGE_ENCODING = "utf8"

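class DifferDomainException(Exception):
    """
    Raised when a redirect points to a different host than the client's.
    geventhttpclient itself refuses such URLs; see geventhttpclient.client:
    'raise ValueError("Invalid host in URL")'
    """
    def __init__(self, uri):
        Exception.__init__(self, uri)  # keeps self.args a proper 1-tuple
        self.uri = uri

class MaxRedirectException(Exception):
    """Raised when the redirect limit is used up; carries the last response."""
    def __init__(self, response):
        Exception.__init__(self, response)
        self.response = response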


class HTTP(HTTPClient):
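    def request(self, request_uri, method=DEFAULF_METHOD, body=b"", headers=None, follow_redirect=True, redirects=MAX_REDIRECT_TIME):
        # A request that carries a body defaults to POST unless a method was given.
        if body and method == DEFAULF_METHOD:
            method = "POST"
        # Copy so neither the caller's dict nor a shared default is mutated.
        headers = dict(headers or {})
        h = [k.title() for k in headers.iterkeys()]
        # Fill in the default headers the caller did not set.
        headers.update(dict([(k, v) for k, v in HEADERS.iteritems() if k not in h]))
        response = super(HTTP, self).request(method, request_uri, body, headers)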
        if follow_redirect and response.status_code in (301,302,303,307) and response.method in ("GET","POST"):
            if redirects:
                location = response.get('location') or response.get('content-location') or response.get('uri')
                if location:
                    location = urljoin(request_uri,location)
                    if not location.startswith(self._base_url_string):
                        raise DifferDomainException(location)
                    return self.request(location, method, body, headers, follow_redirect,redirects-1)
            else:
                raise MaxRedirectException(response)
        return response

class HTTPPool(HTTPClientPool):
    def get_client(self, url):
        if not isinstance(url, URL):
            url = URL(url)
        client_key = url.host, url.port
        try: 
            return self.clients[client_key]
        except KeyError:
            client = HTTP.from_url(url, **self.client_args)
            self.clients[client_key] = client
            return client

_POLL = HTTPPool(network_timeout=100,connection_timeout=100)

META_CHARSET_REGEX = re.compile(r'(?si)<head>.*<meta http-equiv="?content-type"?[^>]+charset=(?P<result>[^">]+).*</head>')
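def decodePage(content, content_type):
    """Decode with the header charset, else the <meta> charset, else utf8."""
    httpCharset, metaCharset = None, None
    if content_type and "charset=" in content_type.lower():
        # Drop trailing parameters, whitespace and quotes after the charset.
        httpCharset = content_type.lower().split("charset=")[-1].split(";")[0].strip().strip('"\'')
    match = META_CHARSET_REGEX.search(content)
    if match:
        metaCharset = match.group('result').strip()
    # Debug output: header charset, then meta charset.
    print httpCharset, metaCharset
    # The header charset wins; the page's own declaration is only a fallback.
    charset = httpCharset or metaCharset or DEFAULT_PAGE_ENCODING

    return content.decode(charset).encode(DEFAULT_PAGE_ENCODING)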


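def request(request_uri, method=DEFAULF_METHOD, body=b"", headers=None, follow_redirect=True, auto_read=True):
    """Fetch a URL through the per-host pool, following cross-host redirects."""
    headers = dict(headers or {})  # never share or mutate a default dict
    client = _POLL.get_client(request_uri)
    response = None
    try:
        response = client.request(request_uri, method, body, headers, follow_redirect)
    except DifferDomainException, e:
        # The redirect left this client's host: retry through the pool,
        # which picks (or creates) the client for the new host.
        print "DifferDomainException:" + e.uri
        # auto_read=False so the body is read once, below, not in the nested call.
        response = request(e.uri, method, body, headers, follow_redirect, auto_read=False)
    except MaxRedirectException, e:
        print "max redirect"
        response = e.response  # return the last redirect response as-is
    except Exception, e:
        print str(e)

    if auto_read and response:
        with response:
            response.content = decodePage(response.read(), response.get('content-type'))
    return response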
   
        
def test():
    # print request("http://127.0.0.1/re.php",follow_redirect=False)
    # print request("http://127.0.0.1/re.php",follow_redirect=True).content
    r=request("http://www.baidu.com/",follow_redirect=False)
    #baidu utf8 utf8
    print r.content[:10]
    r=request("http://www.163.com/",follow_redirect=False)
    #163 gbk gb2312
    print r.content[:10]

if __name__ == "__main__":
    test()
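
The redirect branch of HTTP.request in the listing comes down to two steps: resolve the Location value against the URL that produced it, then decide whether the target still belongs to the same host so the same client can be reused. Here is a minimal sketch of just that logic, assuming Python 3's urllib.parse (the listing itself is Python 2 and imports urljoin from urlparse). The names origin and resolve_redirect are hypothetical helpers for illustration, not part of geventhttpclient, and the sketch compares parsed scheme, host and port instead of string prefixes.

```python
from urllib.parse import urljoin, urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Return (scheme, host, port) for a URL, filling in the default port."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return parts.scheme, (parts.hostname or "").lower(), port


def resolve_redirect(current_url, location):
    """Resolve a Location header value against the URL that produced it.

    Returns (absolute_url, same_origin). same_origin is False when the
    redirect has to be sent through a client bound to another host.
    """
    target = urljoin(current_url, location)
    return target, origin(target) == origin(current_url)


if __name__ == "__main__":
    # Relative redirect: stays on the same host.
    print(resolve_redirect("http://example.com/a/b.php", "../login"))
    # Absolute redirect to another host: needs a different client.
    print(resolve_redirect("http://example.com/a/b.php", "http://www.example.com/"))
```

When same_origin is False, the caller would hand the absolute URL back to the pool, which is the role DifferDomainException plays in the listing.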

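While testing page encodings I ran into some problems; see below:

Because the response headers arrive first, the usual assumption is that the content encoding is taken from the headers, and only if they give none do we look at the encoding declared in the page itself.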
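
That precedence rule can be written as a small pure function: use the charset parameter of Content-Type if there is one, otherwise a meta declaration near the top of the body, otherwise a default. This is a sketch under assumptions, not the decodePage from the listing: it targets Python 3, the names header_charset and pick_charset are made up for illustration, the 2 KB scan window is an arbitrary choice, and the regex also accepts the HTML5 `<meta charset=...>` form, which the regex in the listing does not match.

```python
import re
from email.message import Message

# Matches <meta charset="x"> and <meta http-equiv=... content="...; charset=x">.
META_CHARSET = re.compile(rb"""<meta[^>]+charset\s*=\s*["']?([\w.:-]+)""", re.I)


def header_charset(content_type):
    """Extract the charset parameter from a Content-Type header value."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    # Returns the charset lower-cased with quotes removed, or None.
    return msg.get_content_charset()


def pick_charset(content_type, body, default="utf-8"):
    """Header charset first, then <meta> in the first 2 KB, then the default."""
    charset = header_charset(content_type)
    if charset:
        return charset
    match = META_CHARSET.search(body[:2048])
    if match:
        return match.group(1).decode("ascii").lower()
    return default


if __name__ == "__main__":
    page = b'<head><meta http-equiv="Content-Type" content="text/html; charset=gb2312"></head>'
    print(pick_charset("text/html; charset=GBK", page))  # header wins
    print(pick_charset("text/html", page))               # falls back to <meta>
```

Parsing the header with the standard library's email.message.Message avoids hand-splitting on "charset=" and copes with quoted values and additional parameters.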

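Take NetEase (163.com): the header says gbk and the page says gb2312, yet decoding with gb2312 actually fails. I am baffled; can anyone explain why?

Decoding with the header's gbk works, which also suggests that the header encoding should take precedence. In principle the page's declaration tells the browser to render it as gb2312, but that clearly breaks, so how does the browser manage it?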
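
A likely explanation: GBK is a superset of GB2312, and pages labelled gb2312 often contain characters that exist only in GBK, so a strict GB2312 decoder rejects them. Browsers that follow the WHATWG Encoding Standard treat the label gb2312 as GBK, which is why the page renders correctly. The sketch below illustrates the idea under those assumptions. It is written for Python 3, and CHARSET_ALIASES, effective_codec and decode_page are hypothetical names, not part of the listing.

```python
import codecs

# Labels decoded with a superset codec, as browsers do for gb2312.
CHARSET_ALIASES = {"gb2312": "gbk", "gb_2312-80": "gbk"}


def effective_codec(label, default="utf-8"):
    """Map a declared charset label to the codec actually used for decoding."""
    name = (label or default).strip().strip("\"'").lower()
    name = CHARSET_ALIASES.get(name, name)
    try:
        codecs.lookup(name)
    except LookupError:
        return default  # unknown label: fall back instead of crashing
    return name


def decode_page(raw, label):
    """Decode raw bytes; undecodable bytes become U+FFFD instead of raising."""
    return raw.decode(effective_codec(label), errors="replace")


if __name__ == "__main__":
    # Traditional characters exist in GBK but not in GB2312.
    raw = "發財".encode("gbk")
    try:
        raw.decode("gb2312")
    except UnicodeDecodeError as exc:
        print("strict gb2312 fails:", exc.reason)
    print(decode_page(raw, "gb2312"))
```

Mapping the label before decoding keeps the header-first rule intact while also making pages that only declare gb2312 in their meta tag decodable.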

Then look at Sina, which puzzles me even more. Can anyone rescue me?
