Python crawler: urllib basics and examples
By 阿新 • Published 2018-05-30
urllib crawler basics. Environment: Python 3.5.2 with urllib3 1.22. Note that the examples below all use the standard-library urllib.request module; urllib3 1.22 is a separate third-party package installed alongside it.
Download and install
```shell
# Build and install Python 3.5.2 from source
wget https://www.python.org/ftp/python/3.5.2/Python-3.5.2.tgz
tar -zxf Python-3.5.2.tgz
cd Python-3.5.2/
./configure --prefix=/usr/local/python
make && make install

# Keep the system Python available as python275, then point /usr/bin/python at 3.5
mv /usr/bin/python /usr/bin/python275
ln -s /usr/local/python/bin/python3 /usr/bin/python

# Install urllib3 1.22 from source
wget https://files.pythonhosted.org/packages/ee/11/7c59620aceedcc1ef65e156cc5ce5a24ef87be4107c2b74458464e437a5d/urllib3-1.22.tar.gz
tar zxf urllib3-1.22.tar.gz
cd urllib3-1.22/
python setup.py install
```
Browser emulation examples
Adding headers, method 1: build_opener()

```python
import urllib.request

url = "http://www.baidu.com"
headers = ("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36")

# Attach the header to an opener, then fetch and save the page
opener = urllib.request.build_opener()
opener.addheaders = [headers]
data = opener.open(url).read()
fl = open("/home/urllib/test/1.html", "wb")
fl.write(data)
fl.close()
```
Adding headers, method 2: add_header()

```python
import urllib.request

url = "http://www.baidu.com"

# Set the header directly on the Request object
req = urllib.request.Request(url)
req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36")
data = urllib.request.urlopen(req).read()
fl = open("/home/urllib/test/2.html", "wb")
fl.write(data)
fl.close()
```
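As a quick offline check, you can confirm the header really was attached to the Request object. A minimal sketch (with a shortened User-Agent string for readability); note that urllib stores header names in capitalized form, so "User-Agent" is kept under the key "User-agent":

```python
import urllib.request

url = "http://www.baidu.com"
ua = "Mozilla/5.0 (Windows NT 6.1; Win64; x64)"  # shortened UA, for illustration only

req = urllib.request.Request(url)
req.add_header("User-Agent", ua)

# urllib normalizes the stored header name: "User-Agent" -> "User-agent"
print(req.has_header("User-agent"))  # whether the header is set
print(req.get_header("User-agent"))  # the stored value
```

This runs without any network access, which makes it handy for verifying header logic before sending real requests.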
Adding a timeout
Pass `timeout` (in seconds) to urlopen():

```python
import urllib.request

for i in range(1, 100):
    try:
        file = urllib.request.urlopen("http://www.baidu.com", timeout=1)
        data = file.read()
        print(len(data))
    except Exception as e:
        print("Exception occurred ----> " + str(e))
```
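To see the timeout actually fire without depending on an external site's response time, you can stand up a deliberately slow local HTTP server. This is a self-contained sketch (the SlowHandler server is an invented test helper, not part of the original post):

```python
import http.server
import socket
import threading
import time
import urllib.request

class SlowHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical test server that responds slower than the client timeout."""
    def do_GET(self):
        time.sleep(2)  # longer than the 0.5 s client timeout below
        try:
            self.send_response(200)
            self.end_headers()
        except OSError:
            pass  # client already gave up
    def log_message(self, *args):
        pass  # silence per-request logging

# Bind to port 0 so the OS picks a free port
server = http.server.HTTPServer(("127.0.0.1", 0), SlowHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

try:
    urllib.request.urlopen("http://127.0.0.1:%d/" % port, timeout=0.5)
    timed_out = False
except Exception as e:
    # a read timeout surfaces as socket.timeout (or a URLError wrapping it)
    timed_out = isinstance(e, socket.timeout) or "timed out" in str(e)

server.shutdown()
print(timed_out)
```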
HTTP GET request, part 1
GET request (note the original had a stray `-` in the `data=` line, fixed below):

```python
import urllib.request

keywd = "hello"
url = "http://www.baidu.com/s?wd=" + keywd
req = urllib.request.Request(url)
req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36")
data = urllib.request.urlopen(req).read()
fl = open("/home/urllib/test/3.html", "wb")
fl.write(data)
fl.close()
```
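Plain string concatenation only works for keywords that are already URL-safe. A more general way (an alternative sketch, not from the original post) is to let urllib.parse.urlencode build the query string:

```python
import urllib.parse

# urlencode handles escaping for us, for any number of parameters
params = {"wd": "hello"}
url = "http://www.baidu.com/s?" + urllib.parse.urlencode(params)
print(url)  # http://www.baidu.com/s?wd=hello
```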
HTTP GET request, part 2
GET request with URL encoding for a non-ASCII keyword (the stray `-` in the `data=` line is fixed here as well):

```python
import urllib.request

keywd = "中國"
url = "http://www.baidu.com/s?wd="
key_code = urllib.request.quote(keywd)  # percent-encode the keyword
url_all = url + key_code
req = urllib.request.Request(url_all)
req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36")
data = urllib.request.urlopen(req).read()
fl = open("/home/urllib/test/4.html", "wb")
fl.write(data)
fl.close()
```
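To see what quote() actually produces, here is an offline sketch: it percent-encodes the keyword's UTF-8 bytes, and unquote() reverses the transformation:

```python
import urllib.parse

keywd = "中國"
key_code = urllib.parse.quote(keywd)  # encodes the UTF-8 bytes of each character
print(key_code)                        # %E4%B8%AD%E5%9C%8B
print(urllib.parse.unquote(key_code))  # round-trips back to the original string
```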
HTTP POST request
POST request (the original called `red.add_header`, a typo for `req.add_header`, fixed below):

```python
import urllib.request
import urllib.parse

url = "http://www.baidu.com/mypost/"
# The POST body must be bytes, hence the .encode('utf-8')
postdata = urllib.parse.urlencode({
    "user": "testname",
    "passwd": "123456"
}).encode('utf-8')
req = urllib.request.Request(url, postdata)
req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36")
data = urllib.request.urlopen(req).read()
fl = open("/home/urllib/test/5.html", "wb")
fl.write(data)
fl.close()
```
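The body-encoding step can be inspected offline. A small sketch showing the bytes urlencode() produces, and parse_qs() recovering the original fields:

```python
import urllib.parse

# Same encoding step as the POST example above
postdata = urllib.parse.urlencode({
    "user": "testname",
    "passwd": "123456"
}).encode("utf-8")
print(postdata)  # b'user=testname&passwd=123456'

# parse_qs decodes a query string back into a dict of lists
parsed = urllib.parse.parse_qs(postdata.decode("utf-8"))
print(parsed)  # {'user': ['testname'], 'passwd': ['123456']}
```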
Using a proxy server
```python
import urllib.request

def use_proxy(proxy_addr, url):
    proxy = urllib.request.ProxyHandler({'http': proxy_addr})
    opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
    urllib.request.install_opener(opener)
    data = urllib.request.urlopen(url).read().decode('utf-8')
    return data

proxy_addr = "201.25.210.23:7623"
url = "http://www.baidu.com"
data = use_proxy(proxy_addr, url)
# data is already a decoded str, so open the file in text mode
# (the original opened it "wb", which would fail on a str)
fl = open("/home/urllib/test/6.html", "w")
fl.write(data)
fl.close()
```
Enabling DebugLog
```python
import urllib.request

url = "http://www.baidu.com"
httpd = urllib.request.HTTPHandler(debuglevel=1)
httpsd = urllib.request.HTTPSHandler(debuglevel=1)
# Build the opener from the two debug handlers
# (the original passed `opener` to build_opener before it existed)
opener = urllib.request.build_opener(httpd, httpsd)
urllib.request.install_opener(opener)
data = urllib.request.urlopen(url).read()  # .read() was missing in the original
fl = open("/home/urllib/test/7.html", "wb")
fl.write(data)
fl.close()
```
URLError exception handling
Catching URLError:

```python
import urllib.request
import urllib.error

try:
    urllib.request.urlopen("http://blog.csdn.net")
except urllib.error.URLError as e:
    print(e.reason)
```

Catching HTTPError:

```python
import urllib.request
import urllib.error

try:
    urllib.request.urlopen("http://blog.csdn.net")
except urllib.error.HTTPError as e:
    print(e.code)
    print(e.reason)
```

Combined (HTTPError must come first, since it is a subclass of URLError):

```python
import urllib.request
import urllib.error

try:
    urllib.request.urlopen("http://blog.csdn.net")
except urllib.error.HTTPError as e:
    print(e.code)
    print(e.reason)
except urllib.error.URLError as e:
    print(e.reason)
```

Recommended approach:

```python
import urllib.request
import urllib.error

try:
    urllib.request.urlopen("http://blog.csdn.net")
except urllib.error.URLError as e:
    if hasattr(e, "code"):
        print(e.code)
    if hasattr(e, "reason"):
        print(e.reason)
```
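The hasattr() pattern can be exercised without any network access: a file:// URL pointing at a path that does not exist raises a plain URLError, which carries a `reason` but no `code` (the path below is invented for the demo):

```python
import urllib.error
import urllib.request

caught = False
has_reason = False
has_code = False

try:
    # hypothetical nonexistent path, triggers URLError with no network I/O
    urllib.request.urlopen("file:///nonexistent_path_for_demo")
except urllib.error.URLError as e:
    caught = True
    has_reason = hasattr(e, "reason")  # True for any URLError
    has_code = hasattr(e, "code")      # True only for HTTPError

print(caught, has_reason, has_code)
```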
These examples are for reference only.