Chapter 7: Advanced Operations with the requests Module

阿新 • Published: 2022-05-17

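Simulated login:
    - Lets us scrape user information that is only available to a logged-in user.
Requirement: simulate a login to Renren (renren.com).
    - Clicking the login button sends a POST request.
    - The POST request carries the login information entered beforehand (username, password, captcha, ...).
    - Captcha: it changes on every request.

# Coding workflow:
# 1. Recognize the captcha: obtain the text from the captcha image
# 2. Send the POST request (prepare the request parameters)
# 3. Persist the response data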

from CodeClass import YDMHttp
import requests
from lxml import etree
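# Wrap captcha-image recognition in a function
def getCodeText(imgPath, codeType):
    # Regular-user username
    username = 'bobo328410948'

    # Regular-user password
    password = 'bobo328410948'

    # Software ID, required for developer revenue sharing. Get it from "My Software" in the developer console.
    appid = 6003

    # Software key, required for developer revenue sharing. Get it from "My Software" in the developer console.
    appkey = '1f4b564483ae5c907a1d34f8e2f2776c'

    # Image file: path of the captcha image to be recognized
    filename = imgPath

    # Captcha type, e.g. 1004 means 4 alphanumeric characters. Types are priced differently.
    # Fill it in accurately, otherwise the recognition rate suffers. All types: http://www.yundama.com/price.html
    codetype = codeType

    # Timeout in seconds
    timeout = 20
    result = None
    # Sanity check
    if username == 'username':
        print('Please set the parameters before testing')
    else:
        # Initialize
        yundama = YDMHttp(username, password, appid, appkey)

        # Log in to Yundama
        uid = yundama.login()
        print('uid: %s' % uid)

        # Check the balance
        balance = yundama.balance()
        print('balance: %s' % balance)

        # Start recognition: image path, captcha type ID, timeout (seconds); returns the recognition result
        cid, result = yundama.decode(filename, codetype, timeout)
        print('cid: %s, result: %s' % (cid, result))
    return result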


# 1. Fetch and recognize the captcha image
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
}
url = 'http://www.renren.com/SysHome.do'  # page that contains the captcha
page_text = requests.get(url=url, headers=headers).text  # send the request and get the response as text
tree = etree.HTML(page_text)
code_img_src = tree.xpath('//*[@id="verifyPic_login"]/@src')[0]
code_img_data = requests.get(url=code_img_src, headers=headers).content  # download the captcha image
with open('./code.jpg', 'wb') as fp:
    fp.write(code_img_data)  # save it locally

# Recognize the captcha image with the sample code provided by Yundama
result = getCodeText('code.jpg', 1000)
print(result)
# 2. Send the POST request (simulated login)
login_url = 'http://www.renren.com/ajaxLogin/login?1=1&uniqueTimestamp=2019431046983'
data = {
    'email': '[email protected]',
    'icode': result,
    'origURL': 'http://www.renren.com/home',
    'domain': 'renren.com',
    'key_id': '1',
    'captcha_type': 'web_login',
    'password': '06768edabba49f5f6b762240b311ae5bfa4bcce70627231dd1f08b9c7c6f4375',
    'rkey': '1028219f2897941c98abdc0839a729df',
    'f':'https%3A%2F%2Fwww.baidu.com%2Flink%3Furl%3Dgds6TUs9Q1ojOatGda5mVsLKC34AYwc5XiN8OuImHRK%26wd%3D%26eqid%3D8e38ba9300429d7d000000035cedf53a',
}  # request parameters
response = requests.post(url=login_url, headers=headers, data=data)
print(response.text)
print(response.status_code)  # 200 means the HTTP request succeeded; check the response body to confirm the login itself


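Requirement: scrape the current user's information (the user information shown on the personal profile page).

A property of the http/https protocols: they are stateless.
Why the expected page data was not returned:
    When the second request is sent for the profile page, the server has no way of knowing that it comes from a logged-in client, so it will not return the profile data. (The server does not remember the login made by the first request.)
cookie: lets the server keep track of the client's state.
    - Manual handling: capture the cookie value with a packet-capture tool and put it in the request headers. (Not recommended: it is tedious, the cookie has to be fetched by hand, and it expires.)
    - Automatic handling (recommended):
        - Where does the cookie value come from?
            - The server creates it in response to the first simulated-login POST request.
        - The session object:
            - What it does:
                1. It can send requests.
                2. If a cookie is produced during a request, it is automatically stored in the session object and sent with later requests.
        - Create a session object: session = requests.Session()
        - Send the simulated-login POST request with the session object (the cookie is then stored in the session)
        - Send the GET request for the profile page with the same session object (the cookie is sent along)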
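
The difference between a plain request and a session request can be checked without touching the network. The sketch below is only an illustration: the URL, the cookie name `sessionid`, and its value are made-up placeholders, and the cookie is set by hand to stand in for what a server would send back after a successful login.

```python
import requests

PROFILE_URL = "http://example.com/profile"  # placeholder URL, not from the original

# A plain request is stateless: nothing ties it to an earlier login,
# so the prepared request carries no Cookie header.
plain = requests.Request("GET", PROFILE_URL).prepare()
print(plain.headers.get("Cookie"))

# A Session owns a cookie jar. The cookie is set manually here to stand in
# for the Set-Cookie header a server would return after the login POST.
session = requests.Session()
session.cookies.set("sessionid", "abc123", domain="example.com", path="/")

# prepare_request() merges the session's cookies into the outgoing request,
# which is what session.get() / session.post() do before sending.
with_session = session.prepare_request(requests.Request("GET", PROFILE_URL))
print(with_session.headers.get("Cookie"))
```

Nothing is sent here; `prepare()` and `prepare_request()` only build the request, which is enough to see that the session attaches the stored cookie while the plain request does not.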

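# Coding workflow:
# 1. Recognize the captcha: obtain the text from the captcha image
# 2. Send the POST request (prepare the request parameters)
# 3. Persist the response data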

from CodeClass import YDMHttp
import requests
from lxml import etree
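# Wrap captcha-image recognition in a function
def getCodeText(imgPath, codeType):
    # Regular-user username
    username = 'bobo328410948'

    # Regular-user password
    password = 'bobo328410948'

    # Software ID, required for developer revenue sharing. Get it from "My Software" in the developer console.
    appid = 6003

    # Software key, required for developer revenue sharing. Get it from "My Software" in the developer console.
    appkey = '1f4b564483ae5c907a1d34f8e2f2776c'

    # Image file: path of the captcha image to be recognized
    filename = imgPath

    # Captcha type, e.g. 1004 means 4 alphanumeric characters. Types are priced differently.
    # Fill it in accurately, otherwise the recognition rate suffers. All types: http://www.yundama.com/price.html
    codetype = codeType

    # Timeout in seconds
    timeout = 20
    result = None
    # Sanity check
    if username == 'username':
        print('Please set the parameters before testing')
    else:
        # Initialize
        yundama = YDMHttp(username, password, appid, appkey)

        # Log in to Yundama
        uid = yundama.login()
        print('uid: %s' % uid)

        # Check the balance
        balance = yundama.balance()
        print('balance: %s' % balance)

        # Start recognition: image path, captcha type ID, timeout (seconds); returns the recognition result
        cid, result = yundama.decode(filename, codetype, timeout)
        print('cid: %s, result: %s' % (cid, result))
    return result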

# Automatic cookie handling, step 1: create a session object
session = requests.Session()

# 1. Fetch and recognize the captcha image
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
}
url = 'http://www.renren.com/SysHome.do'
# Use the session here as well, so the captcha is issued to the same session that logs in
page_text = session.get(url=url, headers=headers).text
tree = etree.HTML(page_text)
code_img_src = tree.xpath('//*[@id="verifyPic_login"]/@src')[0]
code_img_data = session.get(url=code_img_src, headers=headers).content
with open('./code.jpg', 'wb') as fp:
    fp.write(code_img_data)

# Recognize the captcha image with the sample code provided by Yundama
result = getCodeText('code.jpg', 1000)

# Send the POST request (simulated login)
login_url = 'http://www.renren.com/ajaxLogin/login?1=1&uniqueTimestamp=2019431046983'
data = {
    'email': '[email protected]',
    'icode': result,
    'origURL': 'http://www.renren.com/home',
    'domain': 'renren.com',
    'key_id': '1',
    'captcha_type': 'web_login',
    'password': '06768edabba49f5f6b762240b311ae5bfa4bcce70627231dd1f08b9c7c6f4375',
    'rkey': '3d1f9abdaae1f018a49d38069fe743c8',
    'f':'',
}
# Step 2: send the POST request with the session
response = session.post(url=login_url, headers=headers, data=data)
print(response.status_code)

# Scrape the page data of the current user's profile page
detail_url = 'http://www.renren.com/289676607/profile'  # URL of the current user's profile page
# Manual cookie handling (put the cookie in the headers). Not recommended: it is tedious,
# the cookie has to be fetched by hand, and it expires.
# headers = {
#     'Cookie':'xxxx'
# }
# Step 3: send the GET request with the session, which now carries the cookie
detail_page_text = session.get(url=detail_url, headers=headers).text
with open('bobo.html', 'w', encoding='utf-8') as fp:
    fp.write(detail_page_text)


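Proxies: a way around the anti-scraping mechanism of IP banning. (The server monitors how many requests a given IP sends per unit of time.)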
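
To see why spreading requests over several IPs helps, here is a rough sketch of how a server might count requests per IP. It is an assumption made for illustration, not a description of any particular site: the class name `IpRateLimiter`, the limit of 3 requests per 10 seconds, and the IP addresses are all invented.

```python
from collections import defaultdict, deque


class IpRateLimiter:
    """Allow at most max_requests per IP within window_seconds (sliding window)."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip, now):
        hits = self._hits[ip]
        # Forget requests that have fallen out of the time window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # too many recent requests: this IP gets blocked
        hits.append(now)
        return True


limiter = IpRateLimiter(max_requests=3, window_seconds=10)

# Five requests in five seconds from one IP: the 4th and 5th are rejected.
single_ip = [limiter.allow("192.0.2.1", t) for t in range(5)]
print(single_ip)

# The same five requests alternated over two other IPs all pass.
ips = ["203.0.113.1", "203.0.113.2", "203.0.113.1", "203.0.113.2", "203.0.113.1"]
rotated = [limiter.allow(ip, t) for t, ip in enumerate(ips)]
print(rotated)
```

Because the counter is keyed by IP, the same request volume sent through different proxy IPs stays under the per-IP limit.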
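What a proxy is:
    - A proxy server.
What a proxy does:
    - Gets around access limits placed on your own IP.
    - Hides your real IP.
Proxy-related websites:
    - 快代理 (Kuaidaili)
    - 西祠代理 (Xicidaili)
    - www.goubanjia.com
Types of proxy IP:
    - http: used for URLs that use the http protocol
    - https: used for URLs that use the https protocol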
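
Since the proxy type has to match the protocol of the target URL, one way to avoid mismatches is to build the `proxies` dict from the URL itself. The following is a minimal sketch under assumptions: `PROXY_POOL` and `proxies_for` are hypothetical names, and the addresses are placeholders from documentation ranges, not working proxies.

```python
import random
from urllib.parse import urlparse

# Hypothetical proxy pool, grouped by proxy type. Placeholder addresses only.
PROXY_POOL = {
    "http": ["192.0.2.10:8080", "192.0.2.11:8080"],
    "https": ["198.51.100.20:3128"],
}


def proxies_for(url, pool=PROXY_POOL):
    """Return a proxies dict whose key matches the scheme of `url`."""
    scheme = urlparse(url).scheme  # "http" or "https"
    candidates = pool.get(scheme)
    if not candidates:
        raise ValueError(f"no proxy available for scheme {scheme!r}")
    # Pick one proxy at random so repeated calls spread over the pool.
    return {scheme: random.choice(candidates)}


print(proxies_for("https://www.baidu.com/s?wd=ip"))
```

The result can be passed straight to requests, for example `requests.get(url, headers=headers, proxies=proxies_for(url))`.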

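Anonymity levels of proxy IPs:
    - Transparent: the server knows the request went through a proxy and also knows the real IP behind it
    - Anonymous: the server knows a proxy was used but does not know the real IP
    - Elite (high anonymity): the server does not know a proxy was used, let alone the real IP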
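
The three levels can be pictured from the server's side. A common convention is that a proxy may add `Via` and `X-Forwarded-For` headers to the request it forwards; the sketch below classifies a request on that basis. It is a simplified heuristic under that assumption, real servers can use other signals, and the function name `classify_proxy` and the sample addresses are invented for the example.

```python
def classify_proxy(headers_seen_by_server, real_ip):
    """Guess the anonymity level from the headers the target server receives.

    headers_seen_by_server: dict of request headers (exact-case keys assumed).
    real_ip: the client's true IP, known here only because this is a demo.
    """
    forwarded = headers_seen_by_server.get("X-Forwarded-For", "")
    via = headers_seen_by_server.get("Via", "")
    forwarded_ips = [part.strip() for part in forwarded.split(",") if part.strip()]

    if real_ip in forwarded_ips:
        return "transparent"  # proxy revealed and real IP leaked
    if forwarded_ips or via:
        return "anonymous"    # proxy revealed, real IP hidden
    return "elite"            # nothing hints at a proxy


REAL_IP = "203.0.113.7"  # placeholder client address

print(classify_proxy({"Via": "1.1 proxy", "X-Forwarded-For": REAL_IP}, REAL_IP))
print(classify_proxy({"Via": "1.1 proxy"}, REAL_IP))
print(classify_proxy({"User-Agent": "Mozilla/5.0"}, REAL_IP))
```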

# Requirement: send a request through a proxy (here, Baidu's search results for "ip")
import requests
url = 'https://www.baidu.com/s?wd=ip'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
}
# Use a proxy: proxies={"https":'222.110.147.50:3128'} (the key matches the URL's protocol)
page_text = requests.get(url=url, headers=headers, proxies={"https":'222.110.147.50:3128'}).text

with open('ip.html', 'w', encoding='utf-8') as fp:
    fp.write(page_text)

# Anti-scraping mechanism: IP banning
# Counter-strategy: send requests through a proxy