Introduction to Web Crawlers and the Request Modules urllib and requests

阿新 • Published: 2021-01-02

Tags: Python crawler, python, crawler


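1. Introduction to crawlers

What is a crawler?

In short, a crawler is a program that imitates a browser and carries out web page operations in place of a person.

Why do we need crawlers?

To provide a data source. For example, a search engine first crawls websites for information and then assembles it into the results page shown to the user.
To collect data for data analysis.
To supply data for artificial intelligence (smart homes, autonomous driving, voice assistants, smart navigation, face recognition, and so on).

How do companies obtain data?

From data the company already owns
From data purchased from third-party platforms (Baidu Index, Datatang, and others)
From data collected by crawlers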

2. A Python crawler works in roughly three steps: fetching the data, extracting the data, and analyzing the data.
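
The three steps can be sketched with the standard library alone. This is a minimal illustration, not a full crawler: the HTML sample, the `title` class name, and the helper names `TitleParser`, `extract_titles`, and `count_words` are all assumptions made for the example, and the fetch step is replaced by a hard-coded string so the code runs offline.

```python
from collections import Counter
from html.parser import HTMLParser

# Step 1 (fetch): a hard-coded page stands in for the downloaded HTML,
# so the sketch runs without network access.
SAMPLE_HTML = """
<html><body>
  <a class="title" href="/p/1">python crawler basics</a>
  <a class="title" href="/p/2">urllib tutorial</a>
  <a class="title" href="/p/3">python requests tutorial</a>
</body></html>
"""


class TitleParser(HTMLParser):
    """Step 2 (extract): collect the text of every <a class="title"> tag."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "a" and ("class", "title") in attrs:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data.strip())


def extract_titles(html):
    parser = TitleParser()
    parser.feed(html)
    return parser.titles


def count_words(titles):
    """Step 3 (analyze): count how often each word appears in the titles."""
    return Counter(word for title in titles for word in title.split())


titles = extract_titles(SAMPLE_HTML)
word_counts = count_words(titles)
print(titles)
print(word_counts["python"], word_counts["tutorial"])
```

In a real crawler the sample string would be replaced by the body returned from urllib or requests, both covered in the sections that follow.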

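3. A few concepts about network requests

GET: the query parameters appear in the URL.
POST: the query parameters are sent in the request body and do not appear in the URL.
URL: the address of the resource being requested.
User-Agent: identifies the client's operating system, browser, and so on, so that the server can return an HTML page suited to that client. For example:

User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36

Referer: indicates which URL the current request came from. Servers often check it as an anti-crawling measure.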
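
The difference between GET and POST can be seen without sending anything, by building the two request objects and inspecting them. This is a small sketch; the endpoint `https://example.com/search` and the parameter values are placeholders chosen for illustration.

```python
import urllib.parse
import urllib.request

# Placeholder endpoint and parameters, used only for illustration.
url = "https://example.com/search"
params = {"kw": "python", "pn": 50}
encoded = urllib.parse.urlencode(params)  # "kw=python&pn=50"

# GET: the parameters are appended to the URL after "?".
get_req = urllib.request.Request(url + "?" + encoded)

# POST: the parameters travel in the request body as bytes; the URL stays bare.
post_req = urllib.request.Request(url, data=encoded.encode("utf-8"))

# Nothing is sent here; we only look at what each request would carry.
print(get_req.get_method(), get_req.full_url, get_req.data)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

With urllib, supplying `data` is what switches the request method from GET to POST.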

4. urllib

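urllib is a built-in module, while requests is a third-party module.
Python 2: urllib2 and urllib
Python 3: urllib and urllib2 were merged into a single urllib package.
There is also a third-party urllib3 module, which I have not used much.
urllib.request.urlopen() has no parameter for custom headers. If a request needs custom headers, build a request object with urllib.request.Request() and pass that object to urlopen():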

req = urllib.request.Request(url, headers=headers)
res = urllib.request.urlopen(req)
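
A slightly fuller sketch of the same pattern follows. The helper names `build_request` and `fetch`, the shortened header values, and the `example.com` URLs are assumptions made for illustration. The `fetch` helper wraps `urlopen()` in a `with` block so the connection is closed after the body is read.

```python
import urllib.request

# Hypothetical header values and URL, chosen for illustration.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64)",
    "Referer": "https://example.com/",
}


def build_request(url):
    """Return a Request carrying the custom headers; nothing is sent yet."""
    return urllib.request.Request(url, headers=headers)


def fetch(url, timeout=10):
    """Send the request and return the response body as bytes."""
    # The with-block closes the connection once the body has been read.
    with urllib.request.urlopen(build_request(url), timeout=timeout) as res:
        return res.read()


req = build_request("https://example.com/page")
# Request normalizes header names with str.capitalize(), so "User-Agent"
# is stored under the key "User-agent".
print(req.get_header("User-agent"))
print(req.get_header("Referer"))
```

Calling `fetch(url)` on a real address needs network access, so it is not invoked above.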

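If the URL of a GET request contains Chinese characters, they are percent-encoded: each byte of the character's UTF-8 encoding is written as % followed by two hexadecimal digits.
urllib.parse provides two functions for this:

urllib.parse.urlencode(dict)
urllib.parse.quote(str)

urllib.parse.urlencode({"kw": "海贼王"})

Output: kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B

urllib.parse.quote("海贼王")

Output: %E6%B5%B7%E8%B4%BC%E7%8E%8B

URL of page 2 of the 海贼王 (One Piece) Tieba forum:

https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=50

Note: each Chinese character here occupies three bytes in UTF-8, so it becomes three %XX groups. For example: 海 = %E6%B5%B7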
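
The encoding rules above can be checked directly. The following sketch uses only functions from urllib.parse; `unquote()` and `parse_qs()` are the inverse operations, shown here as a sanity check.

```python
from urllib.parse import parse_qs, quote, unquote, urlencode

word = "海贼王"

# Each character is three bytes in UTF-8, and each byte becomes one %XX group.
print("海".encode("utf-8"))      # b'\xe6\xb5\xb7'
print(quote("海"))               # %E6%B5%B7

# quote() encodes a single string; unquote() reverses it.
encoded = quote(word)
print(encoded)                   # %E6%B5%B7%E8%B4%BC%E7%8E%8B
print(unquote(encoded) == word)  # True

# urlencode() builds a whole query string from a dict, encoding each value.
query = urlencode({"kw": word, "ie": "utf-8", "pn": 50})
print("https://tieba.baidu.com/f?" + query)

# parse_qs() goes the other way, from a query string back to a dict of lists.
print(parse_qs(query)["kw"] == [word])  # True
```

Use `quote()` when you are formatting a single value into a URL yourself, and `urlencode()` when you have several parameters in a dict.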

5. requests

requests.get() accepts custom headers directly:
res = requests.get(url, headers=headers)
type(res.content) is <class 'bytes'>, and type(res.text) is <class 'str'>.
For a POST request, the query parameters do not appear in the URL:

requests.post(url, data=data, headers=headers)

data must contain the query parameters.
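
To see where requests puts the parameters, a request can be prepared without being sent. This is a sketch that assumes the requests package is installed; the POST endpoint `https://example.com/search` and its form fields are placeholders, while the GET example reuses the Tieba URL from the previous section.

```python
import requests

# GET: params= is percent-encoded and appended to the URL as a query string.
get_req = requests.Request(
    "GET", "https://tieba.baidu.com/f", params={"kw": "海贼王", "pn": 50}
).prepare()

# POST: data= is form-encoded into the request body; the URL stays bare.
# Placeholder endpoint and fields, used only for illustration.
post_req = requests.Request(
    "POST", "https://example.com/search", data={"kw": "python", "pn": "50"}
).prepare()

# prepare() builds the request but does not send it.
print(get_req.method, get_req.url, get_req.body)
print(post_req.method, post_req.url, post_req.body)
print(post_req.headers["Content-Type"])
```

Because requests encodes `params` itself, there is no need to call `urllib.parse.quote()` on the keyword first.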

6. Download an image from the web and save it locally

urllib implementation

import urllib.request
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"}
url = "https://ss1.bdstatic.com/70cFvXSh_Q1YnxGkpoWK1HF6hhy/it/u=4001356234,2763706243&fm=26&gp=0.jpg"

req = urllib.request.Request(url, headers=headers)
res = urllib.request.urlopen(req)
with open("tupian.jpg", 'wb') as pic:
    pic.write(res.read())
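
For larger files, reading the whole body with `res.read()` holds everything in memory at once. One possible variant, sketched below, streams the response to disk in chunks with `shutil.copyfileobj()`. The helper name `download`, the shortened default User-Agent, and the chunk size are assumptions for illustration, not part of the original example.

```python
import shutil
import urllib.request

DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0"}  # shortened placeholder value


def download(url, dest_path, headers=None, chunk_size=64 * 1024):
    """Stream the body at `url` into `dest_path` and return the bytes written."""
    req = urllib.request.Request(url, headers=headers or DEFAULT_HEADERS)
    # Both context managers guarantee the response and the file are closed.
    with urllib.request.urlopen(req) as res, open(dest_path, "wb") as out:
        # copyfileobj reads and writes one chunk at a time, so a large
        # file never has to fit in memory all at once.
        shutil.copyfileobj(res, out, chunk_size)
        return out.tell()


# Usage (needs network access), with `url` and `headers` as defined above:
# download(url, "tupian.jpg", headers=headers)
```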

requests implementation

import requests
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"}
url = "https://ss2.bdstatic.com/70cFvnSh_Q1YnxGkpoWK1HF6hhy/it/u=3323914398,641435642&fm=26&gp=0.jpg"

res = requests.get(url, headers=headers)
with open("dahua.jpg", "wb") as pic:
    pic.write(res.content)

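7. Crawl Baidu Tieba pages and save them locally

Basic idea: build the URL of each page of the forum, then save each page's content locally.

1. Enter the name of the Tieba forum to query
2. Enter the first page to crawl
3. Enter the last page to crawl
4. Crawl the pages and save them locally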
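
The key detail is how a page number maps to the `pn` parameter. Judging from the page 2 URL shown earlier (`pn=50`), `pn` appears to advance by 50 per page, so page n corresponds to `pn=(n-1)*50`. A minimal sketch of the URL-building step, with the hypothetical helper names `page_url` and `page_urls`:

```python
from urllib.parse import urlencode

BASE_URL = "https://tieba.baidu.com/f"
POSTS_PER_PAGE = 50  # assumption inferred from the page 2 URL (pn=50)


def page_url(keyword, page):
    """Return the URL of the 1-based `page` of the forum named `keyword`."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    offset = (page - 1) * POSTS_PER_PAGE
    query = urlencode({"kw": keyword, "ie": "utf-8", "pn": offset})
    return f"{BASE_URL}?{query}"


def page_urls(keyword, first_page, last_page):
    """Return the URLs for every page from first_page to last_page inclusive."""
    return [page_url(keyword, p) for p in range(first_page, last_page + 1)]


for u in page_urls("海贼王", 1, 3):
    print(u)
```

Keeping URL construction in its own function makes it easy to check the offsets before any request is sent.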

urllib implementation

import urllib.parse
import urllib.request

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"}

kw_input = input("pls input keyword: ")
start_pn = int(input("Enter the first page: "))
end_pn = int(input("Enter the last page: "))

# Percent-encode the keyword so it can be placed safely in the URL
kw_input = urllib.parse.quote(kw_input)

# Page 2: https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=50
# Page 3: https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=100

base_url = "https://tieba.baidu.com/f?kw={kw_input}&pn={page}"

for page in range(start_pn, end_pn+1):

    url = base_url.format(kw_input=kw_input, page=(page-1)*50)
    req = urllib.request.Request(url, headers=headers)
    res = urllib.request.urlopen(req)
    with open(f"第{page}頁.html", 'w', encoding='utf-8', newline='') as f:
        content = res.read().decode('utf-8')
        f.write(content)

requests implementation

import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"}

kw_input = input("pls input keyword: ")
start_pn = int(input("Enter the first page: "))
end_pn = int(input("Enter the last page: "))

base_url = "https://tieba.baidu.com/f?kw={kw_input}&pn={page}"

for page in range(start_pn, end_pn+1):

    url = base_url.format(kw_input=kw_input, page=(page-1)*50)
    res = requests.get(url, headers=headers)
    res.encoding = 'utf-8'
    print(res.status_code)
    with open(f"第{page}頁.html", 'w', encoding='utf-8', newline='') as f:
        f.write(res.text)
