
Python scraping: scraping idol Li Yifeng's Weibo comments to see what his fans care about (source code included)

The text and images in this article come from the internet and are for learning and exchange only, with no commercial use. Copyright belongs to the original author; if there is any problem, please contact us promptly so we can handle it.

This article comes from Tencent Cloud. Author: 孤獨的明月

Today's target: Weibo

Take Li Yifeng's Weibo as an example:

https://weibo.com/liyifeng2007?is_all=1

Then open the comment page and check the XHR panel to find the real request address:

https://weibo.com/aj/v6/comment/big?ajwvr=6&id=4353796790279702&from=singleWeiBo

Clearly the page is dynamic, so scraping it follows the same methods I've written about before, which I won't repeat here. The key part is that string of digits in the URL: if we can pull it out of the first page, we're halfway done. That calls for the re regular-expression module. I'm not great at regex, but it should still be manageable:

import requests
import re

target = 'https://weibo.com/liyifeng2007?is_all=1'

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36',
    'cookie': 'SUB=_2AkMowDDgf8NxqwJRmPoSyWnqao53ywzEieKenME7JRMxHRl-yT9kqnEjtRB6A0AeDzsLF_aeZGlWOMf4mEl-MBZZXqc_; SUBP=0033WrSXqPxfM72-Ws9jqgMF55529P9D9WWcpq860VQlJcIRRNP9pzqS; SINAGLOBAL=1033839351635.7524.1604108279474; login_sid_t=c071efc77911ceace152df2be5986e09; cross_origin_proto=SSL; WBStorage=8daec78e6a891122|undefined; _s_tentry=-; Apache=8275565331127.246.1604195643561; ULV=1604195643568:3:1:1:8275565331127.246.1604195643561:1604122447982; wb_view_log=1920*10801; UOR=,,editor.csdn.net'
}

html = requests.get(target, headers=headers).text

for each in re.findall('<a name=(.*?)date=', html):
    real_id = each.split(" ")[0]
    filename = each.split("\\")[-2].replace('"', "").replace(":", ".")
    print(real_id, filename)

The output looks like this:

The first field is the ID we need; the one after it is the time the post was published, which we use as the filename when saving the comment data.
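The split logic can be illustrated on a made-up capture string shaped like Weibo's backslash-escaped HTML (the href path and timestamp below are invented for illustration, not real values):

```python
# Hypothetical capture from re.findall('<a name=(.*?)date=', html);
# the href and timestamp are invented for illustration
each = r'4353796790279702 href=\"/liyifeng2007/JzABC\" title=\"11-01 20:06\" '
real_id = each.split(" ")[0]  # digits before the first space -> the post ID
filename = each.split("\\")[-2].replace('"', "").replace(":", ".")
print(real_id, filename)  # 4353796790279702 11-01 20.06
```

Note the `.replace(":", ".")`: colons are not allowed in Windows filenames, so the timestamp's `:` becomes `.` before it is used as a filename.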

Then we pass the ID into the second URL:

comment_url = f'https://weibo.com/aj/v6/comment/big?ajwvr=6&id={real_id}&from=singleWeiBo'

That one returns comments sorted by popularity; if you want the latest replies instead, use this one:

comment_url = f'https://weibo.com/aj/v6/comment/big?ajwvr=6&id={real_id}&page=1'

Once you have this, the rest is easy: the response is JSON, which you can paste into any online JSON viewer to inspect and locate the data we need. Here's the code:

comment_url = f'https://weibo.com/aj/v6/comment/big?ajwvr=6&id={real_id}&page=1'
res = requests.get(comment_url, headers=headers).json()["data"]["html"]

# Extract the commenters and comment text
comments = re.findall('ucardconf="type=1">(.*?)</div>', res)

for each in comments:
    # Strip the emoji/image tags out of the content
    each = re.sub('<.*?>', '', each)
    print(each)
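One caveat: the chained `["data"]["html"]` lookup raises a KeyError whenever Weibo returns an error payload instead of comment data (for example when the cookie has expired). A minimal defensive helper, assuming the normal response shape is `{"data": {"html": "..."}}` (the error payload below is invented for illustration):

```python
def extract_html(payload):
    # Assumed normal response shape: {"data": {"html": "..."}}.
    # Returns "" for error payloads instead of raising KeyError.
    return (payload or {}).get("data", {}).get("html", "")

print(extract_html({"data": {"html": "<div>hi</div>"}}))  # <div>hi</div>
print(extract_html({"code": "100006", "msg": "need login"}))  # prints empty line
```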


Compare with the original:

The emoji have been stripped out. Comments that contained only emoji and no text will show just the commenter's name, which is expected; everything else matches the page exactly.
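The tag-stripping `re.sub` works because Weibo renders each emoji as an `<img>` tag inside the comment HTML; removing every `<...>` pair leaves only the plain text. A quick illustration (the sample comment is invented):

```python
import re

# Invented sample of the HTML fragment Weibo returns for one comment
raw = '<a href="/u/123">粉丝小明</a>:哥哥加油<img alt="[心]" class="W_img_face">'
clean = re.sub('<.*?>', '', raw)
print(clean)  # 粉丝小明:哥哥加油
```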
Now that we have the data, let's save it locally. The full code:

# -*- coding: utf-8 -*-
"""
Created on 2020-11-18

@author: 李運辰
"""

#https://weibo.com/liyifeng2007?is_all=1

import requests
import re, os

os.makedirs("./images", exist_ok=True)  # make sure the output folder exists

url = 'https://s.weibo.com/?topnav=1&wvr=6'
target = 'https://weibo.com/liyifeng2007?is_all=1'

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36',
    'cookie': 'SUB=_2AkMowDDgf8NxqwJRmPoSyWnqao53ywzEieKenME7JRMxHRl-yT9kqnEjtRB6A0AeDzsLF_aeZGlWOMf4mEl-MBZZXqc_; SUBP=0033WrSXqPxfM72-Ws9jqgMF55529P9D9WWcpq860VQlJcIRRNP9pzqS; SINAGLOBAL=1033839351635.7524.1604108279474; login_sid_t=c071efc77911ceace152df2be5986e09; cross_origin_proto=SSL; WBStorage=8daec78e6a891122|undefined; _s_tentry=-; Apache=8275565331127.246.1604195643561; ULV=1604195643568:3:1:1:8275565331127.246.1604195643561:1604122447982; wb_view_log=1920*10801; UOR=,,editor.csdn.net'
}

html = requests.get(target,headers=headers).text

for each in re.findall('<a name=(.*?)date=',html):
    real_id = each.split(" ")[0]
    filename = each.split("\\")[-2].replace('"',"").replace(":",".")
    # print(real_id,filename)

    # print(filename)
    for page in range(1,11):
        comment_url = f'https://weibo.com/aj/v6/comment/big?ajwvr=6&id={real_id}&page={page}'
        res = requests.get(comment_url,headers=headers).json()["data"]["html"]

        # Extract the commenters and comment text
        comments = re.findall('ucardconf="type=1">(.*?)</div>', res)
        # comments = re.findall('</i></a>(.*?) </div>', res)
        for each in comments:
            # Strip the emoji/image tags out of the content
            each = re.sub('<.*?>','',each)
            print(each)
            f_name = "./images/"+filename
            with open(f_name+"_李運辰.txt","a",encoding="utf-8") as f:
                f.write(each)
                f.write("\n")

This was just a test, so I only scraped a dozen or so pages:
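For a longer run than this test, it is safer to pause between page requests so the account doesn't get rate-limited; a minimal sketch (the one-to-two-second delay is an assumption, not a documented Weibo limit):

```python
import random
import time

def polite_pause(base=1.0, jitter=1.0):
    """Sleep a random base..base+jitter seconds; call once per page request."""
    delay = base + random.random() * jitter
    time.sleep(delay)
    return delay

# e.g. inside the page loop, right after requests.get(...):
# polite_pause()
```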


After downloading, you can compare the results yourself:

Done!
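With the comments saved, a quick frequency count hints at what the fans actually focus on. A minimal sketch that counts overlapping two-character fragments as a crude stand-in for real Chinese word segmentation (a library such as jieba would do better); the sample comments are invented:

```python
import re
from collections import Counter

def top_bigrams(comments, n=5):
    # Count overlapping two-character CJK fragments across all comments
    counter = Counter()
    for text in comments:
        for run in re.findall(r'[\u4e00-\u9fff]+', text):
            counter.update(run[i:i + 2] for i in range(len(run) - 1))
    return counter.most_common(n)

sample = ["哥哥太帅了", "新剧加油", "哥哥新剧什么时候播", "加油加油"]
print(top_bigrams(sample, 3))  # [('加油', 3), ('哥哥', 2), ('新剧', 2)]
```

In a real run you would load the lines from the saved `.txt` files instead of the `sample` list.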