Web Crawlers: Common Python Modules requests, urllib, and re

阿新 • Published: 2018-11-10

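I. Steps of a crawler

    1. Send a request: imitate a browser by sending an HTTP request.

    2. Receive the response content.

    3. Parse the content (extract the parts that are useful to you) with one of:

        a. Regular expressions

        b. The BeautifulSoup module

        c. The pyquery module

        d. The selenium module

    4. Store the data in one of:

        a. Text files (txt, csv, etc.)

        b. A database (MySQL)

        c. Redis or MongoDB (the most commonly used)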
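The four steps above can be sketched end to end with only the standard library. This is a minimal illustration, not a full crawler: SAMPLE_HTML is a hypothetical page standing in for steps 1 and 2 (so nothing is fetched), and parse_links and save_csv are names invented for this sketch.

```python
import csv
import io
import re

# Hypothetical HTML standing in for a fetched page (steps 1 and 2).
SAMPLE_HTML = """
<ul>
  <li><a href="/item/1">First item</a></li>
  <li><a href="/item/2">Second item</a></li>
</ul>
"""

def parse_links(html):
    """Step 3: pull (href, text) pairs out of the HTML with a regular expression."""
    return re.findall(r'<a href="(.*?)">(.*?)</a>', html, re.S)

def save_csv(rows, fileobj):
    """Step 4: write the rows as CSV to any file-like object."""
    writer = csv.writer(fileobj)
    writer.writerow(["href", "text"])
    writer.writerows(rows)

rows = parse_links(SAMPLE_HTML)
buffer = io.StringIO()          # an in-memory file; a real crawler would open a .csv file
save_csv(rows, buffer)
print(buffer.getvalue())
```

Swapping the sample string for a real response body and the StringIO buffer for an opened file turns the sketch into the pipeline described in the list.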

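II. Using Jupyter

    2.1 Why use it: Jupyter lets you run a cell once and keep its result in memory, so later Python statements can reuse that result as often as needed. This avoids re-requesting the page every time you debug; crawling the same page too frequently can get your IP banned by the site.

    2.2 Steps:

        a. Install: pip install jupyter (Python 3 must already be installed)

        b. Run: jupyter notebook; the interface opens automatically in the browser

        c. On the right, choose New --> Python 3 to create a new Python notebook

    2.3 Shortcut

        Shift + Enter: run the selected cell; the result stays in memory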
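The benefit described in 2.1 (request once, reuse many times) can also be imitated in plain Python with a small in-memory cache. The sketch below is only an illustration: fetch_once and fake_fetcher are hypothetical names, and the fetcher is a stub so that no network access is needed.

```python
_cache = {}

def fetch_once(url, fetcher):
    """Return the cached body for url, calling fetcher(url) only on the first request."""
    if url not in _cache:
        _cache[url] = fetcher(url)
    return _cache[url]

calls = []

def fake_fetcher(url):
    """Stand-in for a real download; records every call so repeats are visible."""
    calls.append(url)
    return "<html>stub for %s</html>" % url

first = fetch_once("http://www.example.com", fake_fetcher)
second = fetch_once("http://www.example.com", fake_fetcher)
print(len(calls))   # the stub was called only once
```

In a notebook the same effect comes for free: the cell that downloads the page is run once, and later cells work on the variable it left behind.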

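III. Request modules: urllib

    3.1 About urllib

Python's standard library provides urllib and related modules for making HTTP requests, but the API is clumsy: even the simplest task can take a lot of work, sometimes including overriding methods. It is not recommended for everyday use; this section is only a brief overview.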

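    3.2 Basic usage

# Option 1: pass the URL straight to urlopen
import urllib.request

f = urllib.request.urlopen('http://www.baidu.com')
result = f.read().decode('utf-8')
print(result)


# Option 2: build a Request object first
import urllib.request

req = urllib.request.Request('http://www.baidu.com')
response = urllib.request.urlopen(req)
result = response.read().decode('utf-8')
print(result)

Note: if you must use urllib, option 2 is the better choice. req is a Request object, and request headers can be defined on it, so the request can be made to look as if it came from a browser, as in the next example.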

    3.3 Custom request headers

import urllib.request

req = urllib.request.Request('http://www.example.com')

# Add a custom header: the first argument is the header name (key), the second is its value
req.add_header("User-Agent", "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:39.0) Gecko/20100101 Firefox/39.0")

f = urllib.request.urlopen(req)
result = f.read().decode('utf-8')

# The fake_useragent module can generate random User-Agent strings, which helps to some extent against a site's anti-crawler checks
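Returning to the Request object from 3.3: headers can also be passed as a dict when the object is created, and the object can be inspected before anything is sent. The sketch below makes no network call. One detail worth knowing is that urllib stores header names capitalized, so the key is looked up as "User-agent".

```python
import urllib.request

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:39.0) Gecko/20100101 Firefox/39.0",
}

# Build the request without sending it.
req = urllib.request.Request("http://www.example.com", headers=headers)

print(req.get_method())                 # GET, because no data was attached
print(req.has_header("User-agent"))     # urllib normalizes the name with str.capitalize()
print(req.get_header("User-agent"))
```

Inspecting the request this way is a cheap check that the header a site will see is the one you intended.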

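    3.4 Using fake_useragent

# 1. Install: pip install fake_useragent

# 2. Basic usage
from fake_useragent import UserAgent

ua = UserAgent()
print(ua.chrome)   # a random Chrome User-Agent string

# Commonly used attributes (which browsers are available may depend on the installed version)
ua.chrome    # random Chrome User-Agent string
ua.ie        # random Internet Explorer User-Agent string
ua.firefox   # random Firefox User-Agent string
ua.random    # random User-Agent string from any browser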
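When installing fake_useragent is not an option, the same idea can be approximated with the standard library: keep a short list of User-Agent strings and pick one at random per request. This is a stand-in technique, not how fake_useragent works internally; the strings in USER_AGENTS and the helper name random_headers are illustrative assumptions.

```python
import random

# Illustrative User-Agent strings; a real list should be kept up to date.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:39.0) Gecko/20100101 Firefox/39.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Safari/605.1.15",
]

def random_headers():
    """Return a headers dict with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}

headers = random_headers()
print(headers["User-Agent"])
```

The resulting dict can be handed to either urllib.request.Request(url, headers=...) or requests.get(url, headers=...).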

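IV. Request modules: requests

    4.1 About requests

Requests is an HTTP library written in Python and released under the Apache2 license. It wraps Python's built-in modules in a much higher-level API, which makes network requests far more pleasant to write, and it covers most of the HTTP operations a browser performs (it does not execute JavaScript).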

    4.2 Installing requests

pip3 install requests

    4.3 Basic usage

import requests

r = requests.get('http://www.example.com')
print(type(r))
print(r.status_code)   # status code returned by the server
print(r.encoding)      # encoding requests uses to decode r.text
print(r.text)          # response body as a str

    4.4 GET requests

# 1. Without parameters
import requests

res = requests.get('http://www.example.com')

print(res.url)     # the requested URL
print(res.text)    # the content returned by the server


# 2. With parameters
import requests

payload = {'k1': 'v1', 'k2': 'v2'}   # must be a dict, not a list
res = requests.get('http://httpbin.org/get', params=payload)

print(res.url)
print(res.text)

# 3. Parsing JSON
import requests
import json

response = requests.get('http://httpbin.org/get')
print(type(response.text))         # the raw body is a str
print(response.json())             # decode the JSON body into Python objects
print(json.loads(response.text))   # the same result via the json module
print(type(response.json()))       # a dict for this endpoint


# 4. Adding headers
import requests
from fake_useragent import UserAgent

ua = UserAgent()

# Custom request headers
headers = {
    'User-Agent': ua.chrome
}
response = requests.get('http://www.zhihui.com', headers=headers)
print(response.text)
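To see what params=payload does to the URL without sending anything, the standard library can produce an equivalent query string. The sketch below uses urllib.parse rather than requests itself, and the JSON body is a hypothetical, trimmed-down example of what an echo endpoint such as httpbin.org/get might return.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

payload = {"k1": "v1", "k2": "v2"}

# The dict is serialized into the query string that follows the "?".
url = "http://httpbin.org/get?" + urlencode(payload)
print(url)   # http://httpbin.org/get?k1=v1&k2=v2

# Going the other way: recover the parameters from a URL.
query = parse_qs(urlsplit(url).query)
print(query)

# Hypothetical response body; json.loads turns the str into a dict.
body = '{"args": {"k1": "v1", "k2": "v2"}}'
data = json.loads(body)
print(type(body).__name__, type(data).__name__)
```

This mirrors the difference between response.text (a str) and response.json() (decoded Python objects) in example 3.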

    4.5 POST requests

# 1. Basic POST

import requests

# A dict passed as data is form-encoded (Content-Type: application/x-www-form-urlencoded):
payload = {'k1': 'v1', 'k2': 'v2'}
res = requests.post('http://httpbin.org/post', data=payload)

print(res.text)
print(type(res.headers), res.headers)
print(type(res.cookies), res.cookies)
print(type(res.url), res.url)
print(type(res.history), res.history)


# 2. Sending headers and data
import json
import requests

url = 'http://httpbin.org/post'
payload = {'some': 'data'}
headers = {'content-type': 'application/json'}

# When the Content-Type is application/json, serialize the payload to a JSON string yourself:
res = requests.post(url, data=json.dumps(payload), headers=headers)

print(res.text)
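The two POST examples differ only in how the body is serialized. The sketch below shows both encodings side by side using the standard library, so the bytes that would go on the wire are visible without making a request.

```python
import json
from urllib.parse import urlencode

payload = {"k1": "v1", "k2": "v2"}

# Form encoding, matching Content-Type: application/x-www-form-urlencoded
form_body = urlencode(payload)
print(form_body)   # k1=v1&k2=v2

# JSON encoding, matching Content-Type: application/json
json_body = json.dumps(payload)
print(json_body)   # {"k1": "v1", "k2": "v2"}
```

A server decides how to parse the body from the Content-Type header, which is why the second example sets that header explicitly.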

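    4.6 Differences between GET and POST requests

A GET request normally carries its parameters only in params (encoded into the URL's query string) and has no use for data, whereas a POST request can use both: params for the query string and data for the request body.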
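Put differently, params ends up in the URL and data ends up in the body. The helper below is a simplified sketch of that layout, not the way requests is implemented; build_request is a name invented for this illustration.

```python
from urllib.parse import urlencode

def build_request(url, params=None, data=None):
    """Return (url_with_query, body_bytes), mimicking where an HTTP client puts each argument."""
    if params:
        url = url + "?" + urlencode(params)
    body = urlencode(data).encode("ascii") if data else None
    return url, body

payload = {"k1": "v1", "k2": "v2"}

get_url, get_body = build_request("http://httpbin.org/get", params=payload)
post_url, post_body = build_request("http://httpbin.org/post", data=payload)

print(get_url, get_body)     # parameters are visible in the URL, no body
print(post_url, post_body)   # URL unchanged, parameters travel in the body
```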

    4.7 HTTP status codes

100: continue
101: switching_protocols
102: processing
103: checkpoint
122: uri_too_long, request_uri_too_long

200: ok, okay, all_ok, all_okay, all_good, \o/, ✓
201: created
202: accepted
203: non_authoritative_info, non_authoritative_information
204: no_content
205: reset_content, reset
206: partial_content, partial
207: multi_status, multiple_status, multi_stati, multiple_stati
208: already_reported
226: im_used

# Redirection
300: multiple_choices
301: moved_permanently, moved, \o-
302: found
303: see_other, other
304: not_modified
305: use_proxy
306: switch_proxy
307: temporary_redirect, temporary_moved, temporary
308: permanent_redirect, resume_incomplete, resume   # resume_incomplete and resume are to be removed in 3.0

# Client Error
400: bad_request, bad
401: unauthorized
402: payment_required, payment
403: forbidden
404: not_found, -o-
405: method_not_allowed, not_allowed
406: not_acceptable
407: proxy_authentication_required, proxy_auth, proxy_authentication
408: request_timeout, timeout
409: conflict
410: gone
411: length_required
412: precondition_failed, precondition
413: request_entity_too_large
414: request_uri_too_large
415: unsupported_media_type, unsupported_media, media_type
416: requested_range_not_satisfiable, requested_range, range_not_satisfiable
417: expectation_failed
418: im_a_teapot, teapot, i_am_a_teapot
421: misdirected_request
422: unprocessable_entity, unprocessable
423: locked
424: failed_dependency, dependency
425: unordered_collection, unordered
426: upgrade_required, upgrade
428: precondition_required, precondition
429: too_many_requests, too_many
431: header_fields_too_large, fields_too_large
444: no_response, none
449: retry_with, retry
450: blocked_by_windows_parental_controls, parental_controls
451: unavailable_for_legal_reasons, legal_reasons
499: client_closed_request

# Server Error
500: internal_server_error, server_error, /o\, ✗
501: not_implemented
502: bad_gateway
503: service_unavailable, unavailable
504: gateway_timeout
505: http_version_not_supported, http_version
506: variant_also_negotiates
507: insufficient_storage
509: bandwidth_limit_exceeded, bandwidth
510: not_extended
511: network_authentication_required, network_auth, network_authentication

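The names listed above are the aliases requests exposes on requests.codes (for example, requests.codes.ok is 200). For a quick lookup without requests, the standard library's http.HTTPStatus covers the standardized codes. The sketch below classifies a code by its first digit; describe is a hypothetical helper, and non-standard codes such as 444 fall back to "Unknown".

```python
from http import HTTPStatus

CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}

def describe(code):
    """Return (reason phrase, class name) for an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unknown"          # not a code the standard library knows
    return phrase, CLASSES.get(code // 100, "unknown")

for code in (200, 404, 444):
    print(code, describe(code))
```

In a crawler, the class is often what matters: retry on a server error, give up on a client error, follow a redirection.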

    4.8 Getting cookies

# Using a session
import requests

s = requests.Session()
# Set a cookie; httpbin expects /cookies/set/{name}/{value} ('sessioncookie' is just an example name)
s.get('http://www.httpbin.org/cookies/set/sessioncookie/123456789')
res = s.get('http://www.httpbin.org/cookies')   # read the cookies back
print(res.text)   # print the cookies

httpbin.org sets cookies through the endpoint shown above.


# Reading cookies from a response
import requests

response = requests.get('http://www.baidu.com')
# print(response.cookies)

for key, value in response.cookies.items():
    print(key + '=' + value)   # join as key=value
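Behind response.cookies is a Set-Cookie response header. The sketch below parses such a header with the standard library's http.cookies module to show the same key=value view offline; the header value in raw is a hypothetical example.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie header value, as a server might send it.
raw = "sessioncookie=123456789; Path=/"

jar = SimpleCookie()
jar.load(raw)

# Each entry is a Morsel: .value holds the cookie value, attributes such as path are looked up by key.
pairs = [key + "=" + morsel.value for key, morsel in jar.items()]
print(pairs)
print(jar["sessioncookie"]["path"])
```

A requests.Session does this bookkeeping automatically and sends the stored cookies back on later requests, which is what the session example relies on.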

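    4.9 SSL settings

# Skipping certificate verification
import requests
from requests.packages import urllib3

urllib3.disable_warnings()   # silence the warning issued when verify=False is used
res = requests.get('https://www.12306.cn', verify=False)   # verify only matters for https URLs
print(res.status_code)


# Client certificate authentication
import requests

res = requests.get('https://www.12306.cn', cert=('/path/server.crt', '/path/key'))
print(res.status_code)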
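For comparison, this is roughly what verify=True and verify=False correspond to in standard-library terms. The sketch only builds ssl contexts and makes no connection; it is an analogy for what certificate verification means, not a description of how requests is implemented.

```python
import ssl

# Default context: certificates are verified and the hostname is checked.
strict = ssl.create_default_context()
print(strict.verify_mode == ssl.CERT_REQUIRED, strict.check_hostname)

# Relaxed context: no verification at all, which is only suitable for testing.
relaxed = ssl.create_default_context()
relaxed.check_hostname = False        # must be turned off before verify_mode can be CERT_NONE
relaxed.verify_mode = ssl.CERT_NONE
print(relaxed.verify_mode == ssl.CERT_NONE, relaxed.check_hostname)
```

With urllib, such a context can be passed as urllib.request.urlopen(url, context=relaxed). Skipping verification exposes the connection to man-in-the-middle attacks, so it is best limited to sites whose certificate problems are already known.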

    4.10 Proxy settings

import requests

proxies = {
    "http": "http://127.0.0.1:9746",
    "https": "https://127.0.0.1:9924"
}

res = requests.get("http://www.taobao.com", proxies=proxies)
print(res.status_code)


# Proxy that requires a password
# (part of this URL was obscured in the source; the general form is scheme://user:password@host:port)
import requests

proxies = {
    "https": "https://user:[email protected]:9924"
}

res = requests.get("http://www.taobao.com", proxies=proxies)
print(res.status_code)
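Two details of the proxies dict are easy to miss: the key is matched against the scheme of the URL being requested, and credentials are embedded in the proxy URL itself. The sketch below illustrates both with the standard library. pick_proxy is a deliberately simplified, hypothetical lookup (requests supports more specific keys as well), and the credentials are made up.

```python
from urllib.parse import urlsplit

def pick_proxy(url, proxies):
    """Simplified lookup: return the proxy registered for the URL's scheme, or None."""
    return proxies.get(urlsplit(url).scheme)

proxies = {"https": "https://user:secret@127.0.0.1:9924"}   # hypothetical credentials

print(pick_proxy("http://www.taobao.com", proxies))    # None: there is no "http" entry
print(pick_proxy("https://www.taobao.com", proxies))

# The parts of an authenticated proxy URL.
parts = urlsplit(proxies["https"])
print(parts.username, parts.password, parts.hostname, parts.port)
```

By the same logic, a dict that only has an "https" entry would not be applied to an http:// URL, so a "http" entry is needed as well when plain HTTP pages are requested.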

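    4.11 Timeouts and exception handling

import requests
from requests.exceptions import ConnectTimeout, ReadTimeout

try:
    res = requests.get('http://httpbin.org/get', timeout=0.5)
except (ConnectTimeout, ReadTimeout):
    # ConnectTimeout: connecting took too long; ReadTimeout: the server was too slow to respond
    print('Timeout')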
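A timeout is often transient, so crawlers commonly retry a few times before giving up. The sketch below is one possible retry wrapper, written against the built-in TimeoutError so that it runs without a network; with_retries and flaky are hypothetical names. With requests, the exceptions argument would be (requests.exceptions.Timeout,), the base class of both timeout errors.

```python
import time

def with_retries(func, attempts=3, delay=0.0, exceptions=(TimeoutError,)):
    """Call func() up to `attempts` times, sleeping a little longer after each failure."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions as err:
            last_error = err
            if attempt < attempts:
                time.sleep(delay * attempt)   # linear backoff: delay, 2*delay, ...
    raise last_error

state = {"calls": 0}

def flaky():
    """Simulated request that times out twice and then succeeds."""
    state["calls"] += 1
    if state["calls"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"

result = with_retries(flaky)
print(result, state["calls"])
```

A non-zero delay is kinder to the target site and lowers the chance of the IP ban mentioned in section 2.1.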

    4.12 Example: checking whether a QQ account is online

import urllib.request
import requests
from xml.etree import ElementTree as ET

# Send the HTTP request with the built-in urllib module
r = urllib.request.urlopen('http://www.webxml.com.cn/webservices/qqOnlineWebService.asmx/qqCheckOnline?qqCode=3455306**')
result = r.read().decode('utf-8')


# Send the same request with the third-party requests module
r = requests.get('http://www.webxml.com.cn/webservices/qqOnlineWebService.asmx/qqCheckOnline?qqCode=3455306**')
result = r.text

# Parse the XML response
node = ET.XML(result)

# Read the result
if node.text == 'Y':
    print('Online')
else:
    print('Offline')
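The parsing step can be tried without calling the web service. The sketch below feeds ElementTree a hypothetical response; the element name and namespace in SAMPLE are assumptions about what the service returns, and only the 'Y' check comes from the example above.

```python
from xml.etree import ElementTree as ET

# Hypothetical response body: a single element whose text is the status flag.
SAMPLE = '<string xmlns="http://WebXml.com.cn/">Y</string>'

node = ET.XML(SAMPLE)          # parse the string into an Element
print(node.tag)                # the namespace is folded into the tag name
print(node.text)

status = 'Online' if node.text == 'Y' else 'Offline'
print(status)
```

ET.XML returns the root element directly, so node.text is the text of the outermost tag; deeper documents would need node.find(...) or iteration.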

V. Parsing with the re module

    5.1 How to use the re module

See: http://www.cnblogs.com/lisenlin/articles/8797892.html#1

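    5.2 A simple crawler example

import requests
import re
from fake_useragent import UserAgent

def get_page(url):
    """Fetch url and return its HTML, or None if the request fails."""
    ua = UserAgent()
    headers = {
        'User-Agent': ua.chrome,
    }
    try:
        # The request itself can raise, so it belongs inside the try block
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except requests.RequestException as e:
        print(e)
        return None

def get_movie(html):
    """Return a list of 3-tuples captured from three consecutive <p> tags."""
    # re.S lets '.' match newlines, so a single match can span several lines of HTML
    pattern = '<p.*?><a.*?>(.*?)</a></p>.*?<p.*?>(.*?)</p>.*?<p.*?>(.*?)</p>'
    items = re.findall(pattern, html, re.S)
    return items

def write_file(items):
    """Write the captured fields of every movie to movie.txt."""
    # The with statement closes the file even if a write fails
    with open('movie.txt', 'w', encoding='utf8') as file_movie:
        for movie in items:
            # movie[0] is the text of the <a> link in the first <p> tag
            file_movie.write('Movie rank: ' + movie[0] + '\r\n')
            file_movie.write('Starring: ' + movie[1].strip() + '\r\n')
            file_movie.write('Release date: ' + movie[2] + '\r\n\r\n')
    print('File written successfully...')

def main(url):
    html = get_page(url)
    if html is None:
        print('Failed to fetch the page')
        return
    items = get_movie(html)
    write_file(items)

if __name__ == '__main__':
    url = "http://maoyan.com/board/4"
    main(url)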
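The regular expression in get_movie can be checked offline before pointing the crawler at a live site. The sketch below runs the same pattern against a hypothetical HTML fragment; the markup, class names, and movie data in SAMPLE are made up for illustration. It also shows why the re.S flag is needed.

```python
import re

# Hypothetical fragment shaped like one entry of a movie ranking page.
SAMPLE = '''
<dd>
  <p class="name"><a href="/films/1203">Movie A</a></p>
  <p class="star">
      Starring: Actor One, Actor Two
  </p>
  <p class="releasetime">Release date: 1993-01-01</p>
</dd>
'''

PATTERN = r'<p.*?><a.*?>(.*?)</a></p>.*?<p.*?>(.*?)</p>.*?<p.*?>(.*?)</p>'

# With re.S, '.' also matches newlines, so the match can cross line breaks.
with_flag = [tuple(field.strip() for field in match)
             for match in re.findall(PATTERN, SAMPLE, re.S)]
print(with_flag)

# Without re.S, '.*?' cannot cross the newline between the <p> tags, so nothing matches.
without_flag = re.findall(PATTERN, SAMPLE)
print(without_flag)
```

The non-greedy .*? keeps each capture as short as possible; a greedy .* would swallow everything up to the last closing tag on the page.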
