Crawler cookie expiry: a Python Instagram crawler
阿新 • Published: 2021-02-05
Tech tags: crawler cookie expiry
Here are the concrete steps and the points to watch out for:
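Caveats for crawling Instagram
- Instagram's home page is server-side rendered, so the first 11 or 12 posts on the page are embedded in the HTML as a JSON structure (additionalData); only the posts loaded afterwards come from ajax requests.
- Before 2019/06, Instagram had an anti-crawling check: each request had to carry an 'X-Instagram-GIS' field in the request headers. It was computed as follows:
1. Combine rhx_gis and queryVariables, joined by a colon.
rhx_gis can be read from the sharedData JSON structure in the home page.
2. Take the md5 hash of the combined string.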
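e.g.
# hashStr is assumed to be a helper returning the md5 hex digest of a string;
# user_id, cursor, GIS_rhx_gis and headers must already be defined.
queryVariables = '{"id":"' + user_id + '","first":12,"after":"' + cursor + '"}'
print(queryVariables)
headers['X-Instagram-GIS'] = hashStr(GIS_rhx_gis + ":" + queryVariables)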
- Since 2019/06, however, Instagram no longer validates X-Instagram-GIS, so there is no need to generate it; treat the previous point as history.
- The first visit to the Instagram home page sets some cookies. The response headers look like this:
set-cookie: rur=PRN; Domain=.instagram.com; HttpOnly; Path=/; Secure
set-cookie: ds_user_id=11859524403; Domain=.instagram.com; expires=Mon, 15-Jul-2019 09:22:48 GMT; Max-Age=7776000; Path=/; Secure
set-cookie: urlgen="{"45.63.123.251": 20473}:1hGKIi:7bh3mEau4gMVhrzWRTvtjs9hJ2Q"; Domain=.instagram.com; HttpOnly; Path=/; Secure
set-cookie: csrftoken=Or4nQ1T3xidf6CYyTE7vueF46B73JmAd; Domain=.instagram.com; expires=Tue, 14-Apr-2020 09:22:48 GMT; Max-Age=31449600; Path=/; Secure
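- About query_hash: this hash generally needs no special handling and can be hard-coded.
- Important: send custom headers with every request, and make sure they include user-agent; only then can rhx_gis-signed requests go through and return data. Remember, this applies to every single request. For example:
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
}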
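For historical reference, the X-Instagram-GIS signing step described in the list above could be sketched roughly as follows. The function name `build_gis_signature` and the placeholder values are illustrative and not from the original post; since Instagram dropped this check after 2019/06, current requests should not need it.

```python
import hashlib
import json


def build_gis_signature(rhx_gis: str, variables: str) -> str:
    """Return the md5 hex digest of "<rhx_gis>:<variables>"."""
    return hashlib.md5(f"{rhx_gis}:{variables}".encode("utf-8")).hexdigest()


# Compact separators avoid the spaces json.dumps inserts by default,
# so the signed string is the same text that goes into the query string.
variables = json.dumps(
    {"id": "12345", "first": 12, "after": ""},  # placeholder user id and cursor
    separators=(",", ":"),
)
signature = build_gis_signature("placeholder_rhx_gis", variables)
print(variables)
print(signature)
```

The resulting hex string is what would have been sent as the `x-instagram-gis` request header.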
Most API calls only return data when the cookie in the request headers carries sessionid. A normal set of request headers looks like this:
:authority: www.instagram.com
:method: GET
:path: /graphql/query/?query_hash=ae21d996d1918b725a934c0ed7f59a74&variables=%7B%22fetch_media_count%22%3A0%2C%22fetch_suggested_count%22%3A30%2C%22ignore_cache%22%3Atrue%2C%22filter_followed_friends%22%3Atrue%2C%22seen_ids%22%3A%5B%5D%2C%22include_reel%22%3Atrue%7D
:scheme: https
accept: */*
accept-encoding: gzip, deflate, br
accept-language: zh-CN,zh;q=0.9,en;q=0.8,la;q=0.7
cache-control: no-cache
cookie: mid=XI-joQAEAAHpP4H2WkiI0kcY3sxg; csrftoken=Or4nQ1T3xidf6CYyTE7vueF46B73JmAd; ds_user_id=11859524403; sessionid=11859524403%3Al965tcIRCjXmVp%3A25; rur=PRN; urlgen="{"45.63.123.251": 20473}:1hGKIj:JvyKtYz_nHgBsLZnKrbSq0FEfeg"
pragma: no-cache
referer: https://www.instagram.com/
user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36
x-ig-app-id: 936619743392459
x-instagram-gis: 8f382d24b07524ad90b4f5ed5d6fccdb
x-requested-with: XMLHttpRequest
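- Pay attention to user-agent, x-ig-app-id (taken from sharedData in the HTML), x-instagram-gis, and the sessionid entry in the cookie.
API pagination (requesting the next page), e.g. a user's post list
For a paginated ajax request on Instagram, the request parameters are generally obtained like this: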
# Page info from the profile page data
page_info = js_data["entry_data"]["ProfilePage"][0]["graphql"]["user"]["edge_owner_to_timeline_media"]['page_info']
# Cursor for the next page, e.g. AQCSnXw1JsoV6LPOD2Of6qQUY7HWyXRc_CBSMWB6WvKlseC-7ibKho3Em0PEG7_EP8vwoXw5zwzsAv_mNMR8yX2uGFZ5j6YXdyoFfdbHc6942w
cursor = page_info['end_cursor']
# Whether there is a next page
flag = page_info['has_next_page']
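- end_cursor is the value to pass as after; has_next_page tells whether there is a next page.
If there is a next page, send the first paginated request. When its response comes back, keep id and first unchanged, set after to the end_cursor from the response's page_info, rebuild variables, and send it together with query_hash to request the following page.
Then check has_next_page in the response's page_info again and keep looping until all the data has been fetched. To stop early, use the count value in the response's edge_owner_to_timeline_media, which is the user's total number of media items.
- Video posts and image posts have different data structures; check the is_video field in the response.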
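The cursor loop described above can be sketched as a generator. This is a minimal illustration under stated assumptions: `iter_timeline_nodes`, `fetch_page`, `FAKE_PAGES` and `fake_fetch` are hypothetical names, and the fake pages stand in for real HTTP responses so the loop can be followed offline. Only the response shape (`data.user.edge_owner_to_timeline_media` with `edges` and `page_info`) is taken from the post.

```python
import json
from typing import Callable, Dict, Iterator, Optional


def iter_timeline_nodes(
    fetch_page: Callable[[str], Dict],
    user_id: str,
    page_size: int = 12,
    limit: Optional[int] = None,
) -> Iterator[Dict]:
    """Yield media nodes page by page until has_next_page is False or limit is reached."""
    cursor = ""
    yielded = 0
    while True:
        # id and first stay fixed; only after changes between pages.
        variables = json.dumps(
            {"id": user_id, "first": page_size, "after": cursor},
            separators=(",", ":"),
        )
        timeline = fetch_page(variables)["data"]["user"]["edge_owner_to_timeline_media"]
        for edge in timeline["edges"]:
            if limit is not None and yielded >= limit:
                return
            yield edge["node"]
            yielded += 1
        page_info = timeline["page_info"]
        if not page_info["has_next_page"] or (limit is not None and yielded >= limit):
            return
        cursor = page_info["end_cursor"]


# Offline stand-in for the HTTP call: two fake pages keyed by cursor.
FAKE_PAGES = {
    "": {
        "edges": [{"node": {"id": "1", "is_video": False}},
                  {"node": {"id": "2", "is_video": True}}],
        "page_info": {"has_next_page": True, "end_cursor": "c1"},
    },
    "c1": {
        "edges": [{"node": {"id": "3", "is_video": False}}],
        "page_info": {"has_next_page": False, "end_cursor": None},
    },
}


def fake_fetch(variables: str) -> Dict:
    after = json.loads(variables)["after"]
    return {"data": {"user": {"edge_owner_to_timeline_media": FAKE_PAGES[after]}}}


ids = [node["id"] for node in iter_timeline_nodes(fake_fetch, "12345")]
print(ids)
```

In real use, `fetch_page` would wrap the GET request to the graphql endpoint with the hard-coded query_hash; passing it in as a callable keeps the paging logic separate from the network code.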
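- When crawling with a logged-in Instagram account, a valid, unexpired sessionid in the cookie of the request headers is enough to call the API directly; no signature needs to be computed.
The most direct approach: open a browser, log in to Instagram, press F12 to inspect the XHR requests, and copy the cookie from the request headers for use in a script like the following: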
# -*- coding:utf-8 -*-
import requests
import re
import json
import urllib.parse
import hashlib  # only needed by the commented-out generate_instagram_gis helper

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
BASE_URL = 'https://www.instagram.com'
ACCOUNT_MEDIAS = "https://www.instagram.com/graphql/query/?query_hash=42323d64886122307be10013ad2dcc44&variables=%s"
ACCOUNT_PAGE = 'https://www.instagram.com/%s'
# Local proxy; adjust or remove to match your environment
proxies = {
    'http': 'http://127.0.0.1:1087',
    'https': 'http://127.0.0.1:1087',
}
# To set the proxy only once, attach it to a session so that the proxies argument
# does not have to be passed on every requests call
# s = requests.session()
# s.proxies = {'http': '121.193.143.249:80'}
def get_shared_data(html=''):
    """get window._sharedData from page,return the dict loaded by window._sharedData str
    """
    if html:
        target_text = html
    else:
        header = generate_header()
        response = requests.get(BASE_URL, proxies=proxies, headers=header)
        target_text = response.text
    regx = r"\s*.*\s*<script.*?>.*_sharedData\s*=\s*(.*?);</script>"
    match_result = re.match(regx, target_text, re.S)
    if match_result is None:
        # page layout changed, or a login/blocked page was returned instead
        raise ValueError('window._sharedData not found in page')
    data = json.loads(match_result.group(1))
    return data

# def get_rhx_gis():
#     """get the rhx_gis value from sharedData
#     """
#     share_data = get_shared_data()
#     return share_data['rhx_gis']
def get_account(user_name):
    """get the account info by username
    :param user_name:
    :return:
    """
    url = get_account_link(user_name)
    header = generate_header()
    response = requests.get(url, headers=header, proxies=proxies)
    data = get_shared_data(response.text)
    account = resolve_account_data(data)
    return account
def get_media_by_user_id(user_id, count=50, max_id=''):
    """get media info by user id
    :param user_id:
    :param count:
    :param max_id:
    :return:
    """
    index = 0
    medias = []
    has_next_page = True
    while index <= count and has_next_page:
        # Without explicit separators, json.dumps puts a space after each ':' and ','
        # because its default separators are (', ', ': ')
        variables = json.dumps({
            'id': str(user_id),
            'first': count,
            'after': str(max_id)
        }, separators=(',', ':'))
        url = get_account_media_link(variables)
        header = generate_header()
        response = requests.get(url, headers=header, proxies=proxies)
        media_json_data = json.loads(response.text)
        timeline = media_json_data['data']['user']['edge_owner_to_timeline_media']
        media_raw_data = timeline['edges']
        if not media_raw_data:
            return medias
        for item in media_raw_data:
            if index == count:
                return medias
            index += 1
            medias.append(general_resolve_media(item['node']))
        # move the cursor forward for the next page
        max_id = timeline['page_info']['end_cursor']
        has_next_page = timeline['page_info']['has_next_page']
    return medias
def get_media_by_url(media_url):
    response = requests.get(get_media_url(media_url), proxies=proxies, headers=generate_header())
    media_json = json.loads(response.text)
    return general_resolve_media(media_json['graphql']['shortcode_media'])

def get_account_media_link(variables):
    return ACCOUNT_MEDIAS % urllib.parse.quote(variables)

def get_account_link(user_name):
    return ACCOUNT_PAGE % user_name

def get_media_url(media_url):
    return media_url.rstrip('/') + '/?__a=1'

# def generate_instagram_gis(varibles):
#     rhx_gis = get_rhx_gis()
#     gis_token = rhx_gis + ':' + varibles
#     x_instagram_token = hashlib.md5(gis_token.encode('utf-8')).hexdigest()
#     return x_instagram_token
def generate_header(gis_token=''):
    # todo: if have session, add the session key:value to header
    header = {
        'user-agent': USER_AGENT,
    }
    if gis_token:
        header['x-instagram-gis'] = gis_token
    return header
def general_resolve_media(media):
    # posts without a caption have an empty edges list
    caption_edges = media['edge_media_to_caption']['edges']
    res = {
        'id': media['id'],
        'type': media['__typename'][5:].lower(),
        'content': caption_edges[0]['node']['text'] if caption_edges else '',
        'title': media.get('title') or '',
        'shortcode': media['shortcode'],
        'preview_url': BASE_URL + '/p/' + media['shortcode'],
        'comments_count': media['edge_media_to_comment']['count'],
        'likes_count': media['edge_media_preview_like']['count'],
        'dimensions': media.get('dimensions') or {},
        'display_url': media['display_url'],
        'owner_id': media['owner']['id'],
        'thumbnail_src': media.get('thumbnail_src') or '',
        'is_video': media['is_video'],
        # video_url is only present on video posts
        'video_url': media.get('video_url') or ''
    }
    return res
def resolve_account_data(account_data):
    user = account_data['entry_data']['ProfilePage'][0]['graphql']['user']
    account = {
        'country': account_data['country_code'],
        'language': account_data['language_code'],
        'biography': user['biography'],
        'followers_count': user['edge_followed_by']['count'],
        'follow_count': user['edge_follow']['count'],
        'full_name': user['full_name'],
        'id': user['id'],
        'is_private': user['is_private'],
        'is_verified': user['is_verified'],
        'profile_pic_url': user['profile_pic_url_hd'],
        'username': user['username'],
    }
    return account
account = get_account('shaq')
result = get_media_by_user_id(account['id'], 56)
media = get_media_by_url('https://www.instagram.com/p/Bw3-Q2XhDMf/')
print(len(result))
print(result)
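The script's `generate_header` leaves attaching the session as a todo. One possible way to fill that gap, assuming the cookie string was copied from the browser as described earlier, is sketched below. `parse_cookie_header` and `header_with_session` are hypothetical helpers, and the cookie values are placeholders rather than real credentials.

```python
from typing import Dict


def parse_cookie_header(raw: str) -> Dict[str, str]:
    """Split a 'name=value; name2=value2' cookie string into a dict."""
    cookies = {}
    for part in raw.split(";"):
        name, sep, value = part.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


def header_with_session(base_header: Dict[str, str], raw_cookie: str) -> Dict[str, str]:
    """Return a copy of base_header with the cookie attached; fail early if sessionid is missing."""
    if "sessionid" not in parse_cookie_header(raw_cookie):
        raise ValueError("cookie string has no sessionid; log in again and re-copy it")
    return {**base_header, "cookie": raw_cookie}


# Placeholder values only; paste the string copied from the browser's request headers.
raw = "csrftoken=PLACEHOLDER; ds_user_id=0; sessionid=PLACEHOLDER%3Aabc%3A25; rur=PRN"
header = header_with_session({"user-agent": "Mozilla/5.0"}, raw)
print(sorted(parse_cookie_header(raw)))
```

Checking for sessionid before sending anything gives a clearer error than a failed API call when the copied cookie is incomplete. It cannot tell whether the session has expired on the server side, so a login redirect or an error response should still be handled by re-copying the cookie.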
This has been packaged as a library!