Simulating Zhihu Login with requests and Scrapy

阿新 • Published: 2019-01-21

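Obtaining the parameters sent at login

Inspecting the login request shows that four parameters are sent and that the request goes to https://www.zhihu.com/login/phone_num. On the desktop site, however, the captcha asks the user to click the upside-down characters, which I have no way of handling yet. The mobile site instead asks the user to type in a captcha, so we can log in with a mobile User-Agent.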

Simulating login with requests

Logging in through the mobile site requires sending the following four parameters:

data = {
    '_xsrf': _xsrf,
    'password': password,
    'phone_num': phonenumber,
    'captcha': captcha
}

Here password and phone_num are the password and the username (the phone number), _xsrf is a hidden value embedded in the page, and captcha is the verification code.

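Obtaining the xsrf parameter

Since the username and password are already known, we define two functions below to obtain the _xsrf and captcha parameters. As the analysis above shows, getting _xsrf only requires fetching the page and parsing the relevant value out of it, for example with a regular expression.

Fetching the page content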

import requests
def get_xsrf():
    response = requests.get('https://www.zhihu.com')
    print(response.text)

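The request does not return the expected page. This is because no request headers are configured; adding them solves the problem: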

header = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Mobile Safari/537.36',
    'Host': 'www.zhihu.com',
    'Referer': 'https://www.zhihu.com/',
}

def get_xsrf():
    response = requests.get('https://www.zhihu.com', headers=header)
    print(response.text)

The page can now be accessed normally.

Obtaining the xsrf parameter

from bs4 import BeautifulSoup

# Sample markup standing in for the real page content
text = '<input type="hidden" name="_xsrf" value="f559db84fb92c29de2b277a48a3bdd62"/>'
soup = BeautifulSoup(text, 'html.parser')
# Select the hidden input by its name attribute and read its value
xsrf = soup.select('input[name="_xsrf"]')[0]['value']
print(xsrf)

This extracts the xsrf parameter correctly, so all that remains is to replace text with the real page content:

from bs4 import BeautifulSoup

def get_xsrf():
    response = session.get('https://www.zhihu.com', headers=header)
    # Name the parser explicitly so that bs4 does not have to guess one
    soup = BeautifulSoup(response.text, 'html.parser')
    xsrf = soup.select('input[name="_xsrf"]')[0]['value']
    print(xsrf)
    return xsrf
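
The analysis earlier mentions regular expressions, while the function above uses BeautifulSoup. As a rough sketch of the regular-expression route, the following pulls the value out of the same kind of hidden input. The helper name extract_xsrf is hypothetical, and the pattern assumes that the name attribute comes directly before value, as in the sample markup; a real page may order its attributes differently.

```python
import re

def extract_xsrf(html):
    """Return the value of the hidden _xsrf input, or None if it is absent."""
    # Assumes name="_xsrf" is immediately followed by value="..."
    match = re.search(r'name="_xsrf"\s+value="([^"]*)"', html)
    return match.group(1) if match else None

sample = '<input type="hidden" name="_xsrf" value="f559db84fb92c29de2b277a48a3bdd62"/>'
print(extract_xsrf(sample))
```

Returning None instead of raising lets the caller decide what to do when the token is missing.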

Fetching the captcha

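import time
from PIL import Image

def get_captcha():
    # The r parameter is the current time in milliseconds
    t = str(int(time.time() * 1000))
    captcha_url = 'https://www.zhihu.com/captcha.gif?r=' + t + "&type=login"
    print(captcha_url)
    response = session.get(captcha_url, headers=header)
    # Save the image; the with block closes the file automatically
    with open('captcha.gif', 'wb') as f:
        f.write(response.content)
    # Show the image so that a person can read it; ignore display failures
    try:
        im = Image.open('captcha.gif')
        im.show()
        im.close()
    except Exception:
        pass

    captcha = input('Enter the captcha: ')
    return captcha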

This downloads the captcha image, which a person then reads and types in; the typed value is assigned to captcha.
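
The captcha URL is built by string concatenation above. As a small illustrative sketch, the same URL can be assembled with urllib.parse.urlencode, which also makes the timestamp easy to pin down in a test. The helper name build_captcha_url and its now parameter are hypothetical and not part of the original code.

```python
import time
from urllib.parse import urlencode

def build_captcha_url(now=None):
    """Build the captcha URL with a millisecond timestamp as the r parameter."""
    if now is None:
        now = time.time()
    # Dict insertion order is preserved, so r comes before type
    query = urlencode({'r': int(now * 1000), 'type': 'login'})
    return 'https://www.zhihu.com/captcha.gif?' + query

print(build_captcha_url())
```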

Logging in with requests

import requests
from http import cookiejar

session = requests.session()
# Keep cookies in an LWP-format cookie jar backed by cookies.txt
session.cookies = cookiejar.LWPCookieJar(filename='cookies.txt')
try:
    session.cookies.load(ignore_discard=True)
except OSError:
    # Raised when cookies.txt is missing or cannot be parsed
    print("Could not load cookies")

def zhihu_login(username, passwd):
    login_url = 'https://www.zhihu.com/login/phone_num'
    login_data = {
        '_xsrf': get_xsrf(),
        'phone_num': username,
        'password': passwd,
        'captcha': get_captcha()
    }
    response = session.post(login_url, data=login_data, headers=header)
    print(response.text)
    session.cookies.save()  # save the cookies

zhihu_login('phone number', 'password')

Running this prints a response showing that the login has succeeded.
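
The login code loads cookies with ignore_discard=True but saves them with the default arguments. As a minimal sketch of what that flag changes, the following round-trips a session cookie (one with no expiry date) through an LWPCookieJar using only the standard library. The cookie name session_token and both helper functions are hypothetical and chosen only for illustration.

```python
import os
import tempfile
from http import cookiejar

def make_session_cookie(name, value, domain):
    """Build a cookie with no expiry, which the jar marks as discardable."""
    return cookiejar.Cookie(
        version=0, name=name, value=value, port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path='/', path_specified=True, secure=False, expires=None,
        discard=True, comment=None, comment_url=None, rest={})

def roundtrip(ignore_discard):
    """Save one session cookie, reload the file, and return the names found."""
    path = os.path.join(tempfile.mkdtemp(), 'cookies.txt')
    jar = cookiejar.LWPCookieJar(filename=path)
    jar.set_cookie(make_session_cookie('session_token', 'abc', 'www.zhihu.com'))
    jar.save(ignore_discard=ignore_discard)
    loaded = cookiejar.LWPCookieJar(filename=path)
    loaded.load(ignore_discard=True)
    return [cookie.name for cookie in loaded]

print(roundtrip(False))
print(roundtrip(True))
```

With the default save the session cookie is not written to the file, so if the site's login cookie has no expiry date, passing ignore_discard=True to save() as well may be needed for it to survive a restart.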

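Checking whether the login succeeded

Once the user has logged in, the private-message (inbox) page becomes accessible.
If the user has logged out, or the login did not succeed, the request is redirected to the login page with a 302 status code, and then automatically follows the redirect to https://www.zhihu.com/?next=%2Finbox, which returns a 200 status code.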
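
Because the redirect target itself answers with 200, the check only works if redirects are not followed. A minimal sketch of that idea, assuming a requests-style session object is passed in (the function name is_logged_in is hypothetical):

```python
def is_logged_in(session, headers):
    """Return True when the inbox page is served directly with HTTP 200."""
    response = session.get(
        'https://www.zhihu.com/inbox',
        headers=headers,
        # Do not follow the 302, otherwise the final status would be 200
        allow_redirects=False,
    )
    return response.status_code == 200
```

Returning a boolean rather than printing makes the result easy to reuse, for example to decide whether a fresh login is needed.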

Complete code for logging in with requests

# -*- coding: utf-8 -*-

import requests
from http import cookiejar
from bs4 import BeautifulSoup
import time

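# Create the session
session = requests.session()
# Keep cookies in an LWP-format cookie jar backed by cookies.txt
session.cookies = cookiejar.LWPCookieJar(filename='cookies.txt')
# Load the cookies saved by an earlier successful login, if there are any
try:
    session.cookies.load(ignore_discard=True)
except OSError:
    print("Could not load cookies")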

# Set the request headers (mobile User-Agent)
header = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Mobile Safari/537.36',
    'Host': 'www.zhihu.com',
    "Referer": "https://www.zhihu.com/",
}

# Get the xsrf token
def get_xsrf():
    response = session.get('https://www.zhihu.com', headers=header)
    # Name the parser explicitly so that bs4 does not have to guess one
    soup = BeautifulSoup(response.text, 'html.parser')
    xsrf = soup.select('input[name="_xsrf"]')[0]['value']
    print(xsrf)
    return xsrf

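# Get the captcha
def get_captcha():
    # The r parameter is the current time in milliseconds
    t = str(int(time.time() * 1000))
    captcha_url = 'https://www.zhihu.com/captcha.gif?r=' + t + "&type=login"
    print(captcha_url)
    response = session.get(captcha_url, headers=header)
    # Save the image; the with block closes the file automatically
    with open('captcha.gif', 'wb') as f:
        f.write(response.content)
    from PIL import Image
    # Show the image so that a person can read it; ignore display failures
    try:
        im = Image.open('captcha.gif')
        im.show()
        im.close()
    except Exception:
        pass

    captcha = input('Enter the captcha: ')
    return captcha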

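# Check whether the login succeeded
def is_login():
    inbox_url = 'https://www.zhihu.com/inbox'
    # Do not follow redirects: a 302 means we were sent to the login page
    response = session.get(inbox_url, headers=header, allow_redirects=False)
    if response.status_code == 200:
        print('Login succeeded')
    else:
        print('Login failed')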

# Log in
def zhihu_login(username, passwd):
    login_url = 'https://www.zhihu.com/login/phone_num'
    login_data = {
        '_xsrf': get_xsrf(),
        'phone_num': username,
        'password': passwd,
        'captcha': get_captcha()
    }
    response = session.post(login_url, data=login_data, headers=header)
    print(response.text)
    session.cookies.save()  # save the cookies


# get_captcha()
# get_xsrf()
# zhihu_login('phone number', 'password')
is_login()

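Simulating login with Scrapy

Before starting, create the project and the spider. The spider name must differ from the project name, and it should match the name used in the code below:

 scrapy startproject zhihu
 cd zhihu
 scrapy genspider zhihuspider www.zhihu.com

The complete code is as follows: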

# -*- coding: utf-8 -*-
import scrapy
from bs4 import BeautifulSoup
import json

class ZhihuspiderSpider(scrapy.Spider):
    name = 'zhihuspider'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['https://www.zhihu.com/']
    # Request headers
    header = {
        # Use a mobile User-Agent
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Mobile Safari/537.36',
        'Host': 'www.zhihu.com',
        "Referer": "https://www.zhihu.com/",
    }

    def parse(self, response):
        print(response.text)

    # Entry point of the spider
    def start_requests(self):
        # Request the login page https://www.zhihu.com/login/phone_num and handle it in the do_login callback
        return [scrapy.Request('https://www.zhihu.com/login/phone_num', headers=self.header, callback=self.do_login)]

    def do_login(self, response):
        soup = BeautifulSoup(response.text, 'html.parser')
        # Parse the page to get the xsrf token; skip the login if it is missing
        tokens = soup.select('input[name="_xsrf"]')
        if tokens:
            login_data = {
                '_xsrf': tokens[0]['value'],
                'phone_num': 'phone number',
                'password': 'password',
                'captcha': ''
            }
            # Login needs a captcha, so fetch it first. login_data travels in meta
            # to the do_login_after_captcha callback, which fills in the captcha.
            import time
            t = str(int(time.time() * 1000))
            captcha_url = 'https://www.zhihu.com/captcha.gif?r=' + t + "&type=login"
            yield scrapy.Request(captcha_url, headers=self.header, meta={'login_data': login_data},
                                 callback=self.do_login_after_captcha)


    def do_login_after_captcha(self, response):
        # Save the captcha image; the with block closes the file automatically
        with open('captcha.gif', 'wb') as f:
            f.write(response.body)
        from PIL import Image
        # Show the image so that a person can read it; ignore display failures
        try:
            im = Image.open('captcha.gif')
            im.show()
            im.close()
        except Exception:
            pass

        captcha = input('Enter the captcha: ')

        # Log in
        login_data = response.meta.get("login_data", {})
        login_data['captcha'] = captcha
        login_url = 'https://www.zhihu.com/login/phone_num'
        # FormRequest submits the form; check_login verifies whether the login succeeded
        return [scrapy.FormRequest(
            url=login_url,
            formdata=login_data,
            headers=self.header,
            callback=self.check_login
        )]

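    def check_login(self, response):
        # Check whether the login succeeded
        text_json = json.loads(response.text)
        # "登录成功" ("login succeeded") is assumed to be the message Zhihu returns, in Simplified Chinese
        if "msg" in text_json and text_json["msg"] == "登录成功":
            # Request the start URLs again, now as a logged-in user; parse() handles them
            for url in self.start_urls:
                yield scrapy.Request(url, dont_filter=True, headers=self.header)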

At this point, the content of response.text can be saved as a local web page to view the result.
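
The check_login callback assumes that the login endpoint answers with JSON containing a msg field. As a hedged sketch, the same test can be written as a small standard-library function that also tolerates a non-JSON body, such as an HTML error page. The function name login_succeeded is hypothetical, and the success message is taken as a parameter because its exact text depends on what the server returns.

```python
import json

def login_succeeded(body, success_msg):
    """Return True if body is a JSON object whose msg equals success_msg."""
    try:
        payload = json.loads(body)
    except ValueError:
        # Not JSON at all, for example an HTML error page
        return False
    return isinstance(payload, dict) and payload.get('msg') == success_msg

print(login_succeeded('{"msg": "ok"}', 'ok'))
```

Inside the spider, the callback could call it with response.text and only schedule further requests when it returns True.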
