
Python--34 Web Crawlers


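How Python accesses the Internet

  URL + lib --> urllib

The general format of a URL is

  protocol://hostname[:port]/path/[;parameters][?query]#fragment

A URL consists of three parts

  The first part is the protocol: http, https, ftp, file, ed2k, ...

  The second part is the domain name or IP address of the server that hosts the resource (sometimes a port number must be included; each transfer protocol has a default port, e.g. 80 for http)

  The third part is the specific location of the resource, such as a directory or file name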
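
The URL format described above can be inspected with urllib.parse.urlparse. The following is a minimal sketch; the URL is a hypothetical one, chosen only so that every component of the general format is present.

```python
from urllib.parse import urlparse

# Hypothetical URL, made up so that each component of the format appears once.
url = "http://www.example.com:8080/docs/index.html;type=a?lang=en&page=2#intro"

parts = urlparse(url)

print(parts.scheme)    # the protocol
print(parts.hostname)  # the host name, without the port
print(parts.port)      # the port as an int, or None when the URL omits it
print(parts.path)      # the path to the resource
print(parts.params)    # the ;parameters of the last path segment
print(parts.query)     # the ?query string
print(parts.fragment)  # the #fragment
```

When the port is left out of the URL, parts.port is None and the client falls back to the protocol's default port.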

urllib contains four modules

  urllib.request for opening and reading URLs

  urllib.error containing the exceptions raised by urllib.request

  urllib.parse for parsing URLs

  urllib.robotparser for parsing robots.txt files
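
As a small illustration of urllib.robotparser, the sketch below feeds a robots.txt to RobotFileParser and asks whether two URLs may be fetched. The robots.txt content is hypothetical and given inline, so nothing is downloaded. The crawler name my-crawler and the example.com URLs are likewise assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, supplied inline so no network access is needed.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# "my-crawler" is a made-up user agent name; the rules above apply to every agent.
allowed_public = parser.can_fetch("my-crawler", "http://www.example.com/index.html")
allowed_private = parser.can_fetch("my-crawler", "http://www.example.com/private/data.html")

print(allowed_public)
print(allowed_private)
```

A real crawler would more likely call set_url() and read() to download the site's robots.txt before checking can_fetch().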

    urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False)

    Open the URL url, which can be either a string or a Request object.

>>> import urllib.request
>>> response = urllib.request.urlopen("http://www.weparts.net")
>>> html = response.read()
>>> print(html.decode("utf-8"))
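
The interactive example above assumes the request succeeds. One way to make it more robust is to catch the exceptions from urllib.error and pass a timeout. This is only a sketch: the helper name fetch_text and its default values are hypothetical, not part of urllib.

```python
import urllib.error
import urllib.request


def fetch_text(url, timeout=10, encoding="utf-8"):
    """Return the decoded body of url, or None if the request fails.

    fetch_text is a hypothetical helper written for this example.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode(encoding)
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        print("HTTP error:", err.code, err.reason)
    except urllib.error.URLError as err:
        # No usable response: unknown scheme, DNS failure, refused connection, etc.
        print("Failed to reach the server:", err.reason)
    return None
```

HTTPError is a subclass of URLError, so it has to be caught first. A call such as fetch_text("http://www.weparts.net") would then return either the page text or None.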

Hands-on practice

import urllib.request

# Version 1: pass the URL string directly to urlopen().
response = urllib.request.urlopen("http://placekitten.com/g/500/600")
cat_img = response.read()
with open("cat_500_600.jpg", "wb") as f:
    f.write(cat_img)

import urllib.request

# Version 2: wrap the URL in a Request object first, then open it.
req = urllib.request.Request("http://placekitten.com/g/500/600")
response = urllib.request.urlopen(req)
cat_img = response.read()
with open("cat_500_600.jpg", "wb") as f:
    f.write(cat_img)
>>> response.geturl()
'http://placekitten.com/g/500/600'
>>> response.geturl  # without the call parentheses, the REPL shows the bound method itself
<bound method HTTPResponse.geturl of <http.client.HTTPResponse object at 0x7fe88d136f60>>
>>> print(response.info())
Date: Thu, 14 Sep 2017 08:10:46 GMT
Content-Type: image/jpeg
Content-Length: 26590
Connection: close
Set-Cookie: __cfduid=dc52691cf479658e05d15824990dabeb11505376646; expires=Fri, 14-Sep-18 08:10:46 GMT; path=/; domain=.placekitten.com; HttpOnly
Accept-Ranges: bytes
X-Powered-By: PleskLin
Access-Control-Allow-Origin: *
Cache-Control: public
Expires: Thu, 31 Dec 2020 20:00:00 GMT
Server: cloudflare-nginx
CF-RAY: 39e1df2a94ee77a2-LAX
>>> response.getcode()
200

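The data argument must be in the standard application/x-www-form-urlencoded format. The urllib.parse.urlencode() function takes a mapping or a sequence of 2-tuples and returns a string in this format; encode that string to bytes before passing it as data.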
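
To make the format concrete, here is a minimal sketch of what urlencode() produces. The field names and values are hypothetical and chosen only for illustration.

```python
from urllib.parse import urlencode

# Hypothetical form fields, used only for illustration.
fields = {"q": "hello world", "lang": "zh-TW", "page": 2}

# urlencode() joins key=value pairs with "&" and escapes unsafe characters
# (a space becomes "+").
query_string = urlencode(fields)

# urlopen() expects bytes for its data argument, so encode the string.
post_body = query_string.encode("utf-8")

print(query_string)  # q=hello+world&lang=zh-TW&page=2
print(post_body)     # b'q=hello+world&lang=zh-TW&page=2'
```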

import json
import urllib.parse
import urllib.request

url = "http://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule&sessionFrom="

# Form fields for the POST request.
data = {}
data["i"] = "I love Junjie"
data["from"] = "AUTO"
data["to"] = "AUTO"
data["smartresult"] = "dict"
data["client"] = "fanyideskweb"
data["salt"] = "1505376958945"
data["sign"] = "86bb3d2294c81c8d6718e800f939bf45"
data["doctype"] = "json"
data["version"] = "2.1"
data["keyfrom"] = "fanyi.web"
data["action"] = "FY_BY_CLICKBUTTION"
data["typoResult"] = "true"

# urlencode() builds the form string; encode() turns it into bytes for urlopen().
data = urllib.parse.urlencode(data).encode("utf-8")

# Supplying data makes urlopen() send a POST request instead of a GET.
response = urllib.request.urlopen(url, data)
html = response.read().decode("utf-8")
print(html)

result = json.loads(html)  # the result is a dictionary

Hiding

urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

headers should be a dictionary

add_header()
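
The two ways of setting headers mentioned above can be sketched as follows. The target URL and the User-Agent string are hypothetical placeholders. The example only builds the Request objects and does not send them.

```python
import urllib.request

# Hypothetical target URL and browser-like User-Agent string; substitute real values.
url = "http://www.example.com/"
browser_agent = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/60.0 Safari/537.36"
)

# Option 1: pass a headers dictionary when constructing the Request.
req_a = urllib.request.Request(url, headers={"User-Agent": browser_agent})

# Option 2: build the Request first, then call add_header().
req_b = urllib.request.Request(url)
req_b.add_header("User-Agent", browser_agent)

# Request stores header names in capitalized form, so look them up as "User-agent".
print(req_a.get_header("User-agent"))
print(req_b.get_header("User-agent"))
```

Either Request can then be passed to urllib.request.urlopen() in place of the plain URL string.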
