
The urllib module


Fetching a web page:

    import urllib.request

    # urlopen() sends the request and returns an HTTPResponse object.
    html = urllib.request.urlopen("http://www.zzyzz.top/")
    # Request() only builds a request object; nothing is sent yet.
    html2 = urllib.request.Request("http://www.zzyzz.top/")
    print("html", html)
    print("html2", html2)

    Output:
        html <http.client.HTTPResponse object at 0x0395DFF0>
        html2 <urllib.request.Request object at 0x03613930>

Methods of the HTTPResponse object:

    geturl()
        Return the URL of the resource retrieved, commonly used to determine
        whether a redirect was followed. This is the URL of the page finally
        shown to the user, which is not necessarily the URL passed in, because
        a redirect may have occurred.

    info()
        Return the meta-information of the page, such as headers, in the form
        of an email.message_from_string() instance (see Quick Reference to
        HTTP Headers).

    getcode()
        Return the HTTP status code of the response.

Attributes and methods of the Request object:

    Request.full_url
        The original URL passed to the constructor. Request.full_url is a
        property with a setter, getter and deleter. Getting full_url returns
        the original request URL with the fragment, if it was present.
        This is the URL argument as given (as distinct from the geturl()
        method of the HTTPResponse object).

    Request.type
        The URI scheme: a string such as "http" or "https".

    Request.host
        The URI authority, typically a host, but may also contain a port
        separated by a colon. That is, the host name or IP address, possibly
        together with the port number.

    Request.origin_req_host
        The original host for the request, without port.

    Request.selector
        The URI path. If the Request uses a proxy, then selector will be the
        full URL that is passed to the proxy. That is, the path requested on
        the server, relative to the server root; for example, / is the
        server's root directory.

    Request.data
        The entity body for the request, or None if not specified.
        For example, the form data of a POST. In Python 3 the body must be
        bytes, not a dict:
            urllib.request.Request("http://www.zzyzz.top/", data)
            # data = urllib.parse.urlencode({"Hi": "Hello"}).encode("utf-8")

    Request.unverifiable
        Boolean, indicates whether the request is unverifiable as defined by
        RFC 2965.

    Request.method
        The HTTP request method to use. By default its value is None, which
        means that get_method() will do its normal computation of the method
        to be used. Its value can be set (thus overriding the default
        computation in get_method()) either by providing a default value by
        setting it at the class level in a Request subclass, or by passing a
        value in to the Request constructor via the method argument.

    Request.get_method()
        Return a string indicating the HTTP request method. If Request.method
        is not None, return its value; otherwise return "GET" if Request.data
        is None, or "POST" if it is not. This is only meaningful for HTTP
        requests.

    Request.add_header(key, val)
        Add another header to the request. Headers are currently ignored by
        all handlers except HTTP handlers, where they are added to the list
        of headers sent to the server. Note that there cannot be more than
        one header with the same name, and later calls will overwrite previous
        calls in case the key collides. Currently, this is no loss of HTTP
        functionality, since all headers which have meaning when used more
        than once have a (header-specific) way of gaining the same
        functionality using only one header.

    Request.add_unredirected_header(key, header)
        Add a header that will not be added to a redirected request.

    Request.has_header(header)
        Return whether the instance has the named header (checks both
        regular and unredirected).

    Request.remove_header(header)
        Remove named header from the request instance (both from regular
        and unredirected headers).

    Request.get_full_url()
        Return the URL given in the constructor. This in fact returns
        Request.full_url.

    Request.set_proxy(host, type)
        Prepare the request by connecting to a proxy server. The host and
        type will replace those of the instance, and the instance's selector
        will be the original URL given in the constructor.

    Request.get_header(header_name, default=None)
        Return the value of the given header. If the header is not present,
        return the default value.

    Request.header_items()
        Return a list of tuples (header_name, header_value) of the Request
        headers.

Example, fetching the HTML of a page:

    import urllib.request

    urlobj = urllib.request.Request("http://www.zzyzz.top/")
    with urllib.request.urlopen(urlobj) as FH:  # file-like object
        print(FH.read().decode("utf-8"))

Authentication:

    Accessing a URL that requires authentication returns an HTTP 401 error,
    which means the URL being accessed needs authentication.
    Authentication usually takes one of two forms:
    1. The browser shows a pop-up dialog asking the user for a user name and
       password. This is HTTP Basic authentication; the credentials are sent
       in the Authorization header of each request.
    2. Form-based authentication: the web page asks the user for a user name
       and password, and the credentials are sent to the server with a POST
       request for verification.

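The browser pop-up form of authentication can be made concrete with a small offline sketch. With HTTP Basic authentication the client sends an Authorization header holding "Basic " followed by the base64 encoding of "user:password". The helper name basic_auth_header and the credentials below are hypothetical and only for illustration; nothing is sent over the network.

```python
import base64
import urllib.request


def basic_auth_header(user: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header (hypothetical helper)."""
    raw = f"{user}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


# Placeholder credentials, assumed for this sketch only.
req = urllib.request.Request("http://www.zzyzz.top/")
req.add_header("Authorization", basic_auth_header("userid", "password"))

# The header is stored on the Request object; no request has been made yet.
print(req.get_header("Authorization"))
```

In practice HTTPBasicAuthHandler does this automatically after the server answers with 401, so setting the header by hand is mainly useful for seeing what goes over the wire.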
Basic HTTP Authentication (the browser pop-up form):

    import urllib.request

    userid = "userid"      # placeholder credentials
    password = "password"

    # A realm of None only acts as a catch-all with this password manager.
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    password_mgr.add_password(None, "http://www.zzyzz.top/", userid, password)
    # Create an OpenerDirector with support for Basic HTTP Authentication...
    auth_handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
    opener = urllib.request.build_opener(auth_handler)
    # ...and install it globally so it can be used with urlopen.
    urllib.request.install_opener(opener)
    html = urllib.request.urlopen("http://www.zzyzz.top/")
    print(html.read().decode("utf-8"))

Form-based authentication:

    On the server side this is usually handled as follows: the data in the
    form the user submits (POST) is validated; if validation succeeds the
    user is redirected to the authorized page, otherwise back to the login
    page to POST the credentials again.
    So this kind of authentication can simply be treated as an ordinary
    form POST.

    import urllib.parse
    import urllib.request

    # The POST body must be URL-encoded bytes, not a dict.
    data = urllib.parse.urlencode({"id": "userid", "pw": "password"}).encode("utf-8")
    urlobj = urllib.request.Request("http://www.zzyzz.top/", data)
    with urllib.request.urlopen(urlobj) as FH:  # file-like object
        print(FH.read().decode("utf-8"))

Error handling

Other protocols (besides HTTP)

Reference:
    https://docs.python.org/3/library/urllib.request.html#module-urllib.request
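To tie the Request attributes listed earlier to the form POST example, here is a minimal offline sketch that builds the request and inspects it without sending anything. The form field names "id" and "pw" and their values are the placeholders from the example, not a real login.

```python
import urllib.parse
import urllib.request

# Placeholder form fields taken from the example; not real credentials.
form = {"id": "userid", "pw": "password"}

# urlencode() builds "id=userid&pw=password"; encode() turns it into bytes,
# which is what Request expects for a POST body in Python 3.
body = urllib.parse.urlencode(form).encode("utf-8")
req = urllib.request.Request("http://www.zzyzz.top/", data=body)

print(req.get_method())  # POST, because data is not None
print(req.type)          # http
print(req.host)          # www.zzyzz.top
print(req.selector)      # /
print(req.data)          # b'id=userid&pw=password'
```

Building the Request first and passing it to urlopen() afterwards keeps the encoding step easy to check before any network traffic happens.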
