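A Worked Example of Scraping Weibo Comments with Python

阿新 • Published: 2021-01-18

Web scraping with Python is a skill most programmers pick up sooner or later, and Weibo is a popular target for practice. Weibo is served through several front ends, and the difficulty of scraping differs between them, so both Python beginners and experienced programmers can use it for practice. This article walks through a code example that scrapes Weibo comments with Python.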

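1. Scraping Weibo

As with a QQ Zone (Qzone) crawler, a Weibo crawler can collect Sina Weibo users' profile information, posts, followers, followees, and comments.

A crawler can fetch more than 13 million posts per day, depending on network conditions.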

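Difficulty ranking: web version > mobile-phone version > mobile version. The mobile version (the example below requests m.weibo.cn) is the easiest Weibo front end to scrape.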

2. Scraping Weibo comments with Python

Step 1: Determine the id to request comments for

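# -*- coding: utf-8 -*-
import re
import time
import pandas as pd
import requests
# {} in the URL template is filled with the page number
urls = 'https://m.weibo.cn/api/comments/show?id=4073157046629802&page={}'
# The request header is named 'Cookie' (singular); substitute your own cookie string
headers = {
    'Cookie': 'Your cookies',
    'User-Agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'),
}
# Lists that get_comment() appends to in the later steps
comments, ids = [], []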

Step 2: Build a pattern that matches HTML tags

tags = re.compile(r'</?\w+[^>]*>')
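As a quick check of what this pattern removes, here is a minimal sketch. The `strip_tags` helper and the sample string are made up for illustration; they are not taken from a real API response.

```python
import re

# Same pattern as above, written as a raw string
TAGS = re.compile(r'</?\w+[^>]*>')


def strip_tags(text):
    """Remove anything that looks like an HTML tag and keep the text between tags."""
    return TAGS.sub('', text)


# Hypothetical comment markup, invented for this example
sample = 'Reply to <a href="/n/someone">@someone</a>: nice post<br/>'
print(strip_tags(sample))  # Reply to @someone: nice post
```

The pattern only strips tags. HTML entities such as `&amp;` are left untouched; the standard library's `html.unescape` can decode them if needed.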

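Step 3: Define the function that extracts comments

def get_comment(url):
    # Fetch one page of comments and parse the JSON body
    j = requests.get(url, headers=headers).json()
    comment_data = j['data']['data']
    for data in comment_data:
        try:

Step 4: Strip the HTML tags from the text with the regular expression

            comment = tags.sub('', data['text'])  # strip HTML tags
            reply = tags.sub('', data['reply_text'])
            weibo_id = data['id']
            reply_id = data['reply_id']
            comments.append(comment)
            comments.append(reply)
            ids.append(weibo_id)
            ids.append(reply_id)
        except KeyError:
            # Skip entries that lack any of these fields (for example, no reply_text)
            continue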
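The extraction logic can be tried without network access by running it on an already-parsed payload. The sketch below is a variant under stated assumptions, not the article's function: `extract_comments` and the sample payload are hypothetical, and the payload only mirrors the field names used above (`data.data[]`, `text`, `reply_text`, `id`, `reply_id`). Unlike the `try` block above, it uses `dict.get`, so a comment without reply fields is kept rather than skipped.

```python
import re

TAGS = re.compile(r'</?\w+[^>]*>')


def extract_comments(payload):
    """Return (id, text) pairs from a parsed comment payload.

    Assumes the shape payload['data']['data'] -> list of dicts, as in the code above.
    """
    rows = []
    for item in payload.get('data', {}).get('data', []):
        rows.append((item['id'], TAGS.sub('', item['text'])))
        # Assumption: reply_text / reply_id appear only when the comment answers another one
        if item.get('reply_text') is not None and item.get('reply_id') is not None:
            rows.append((item['reply_id'], TAGS.sub('', item['reply_text'])))
    return rows


# Hypothetical payload, invented for illustration
sample_payload = {
    'data': {
        'data': [
            {'id': 1, 'text': 'first <b>comment</b>'},
            {'id': 2, 'text': 'an answer', 'reply_id': 1, 'reply_text': 'first <b>comment</b>'},
        ]
    }
}
print(extract_comments(sample_payload))
# [(1, 'first comment'), (2, 'an answer'), (1, 'first comment')]
```

The repeated `(1, 'first comment')` pair is expected here: a quoted reply and the original comment are the same row, which is why a deduplication pass is useful before saving.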

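Step 5: Scrape the comments and save them

for page in range(1, 11):  # assumed page range; the original does not state one
    get_comment(urls.format(page))
    time.sleep(1)  # pause between requests
df = pd.DataFrame({'ID': ids, '評論': comments})  # '評論' means "comment"
df = df.drop_duplicates()
df.to_csv('觀察者網.csv', index=False, encoding='gb18030')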
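For readers without pandas, the same "deduplicate, then save" step can be sketched with the standard library only. `save_unique` and the sample rows are hypothetical names and data; the `gb18030` encoding is the one used above, and the file is written to a temporary directory rather than to `觀察者網.csv`.

```python
import csv
import os
import tempfile


def save_unique(rows, path, encoding='gb18030'):
    """Write (id, comment) rows to a CSV file, dropping exact duplicates but keeping order."""
    unique_rows = list(dict.fromkeys(rows))  # dicts preserve insertion order
    with open(path, 'w', newline='', encoding=encoding) as f:
        writer = csv.writer(f)
        writer.writerow(['ID', '評論'])
        writer.writerows(unique_rows)
    return unique_rows


# Hypothetical rows, invented for illustration
rows = [(1, '第一條評論'), (2, 'second comment'), (1, '第一條評論')]
path = os.path.join(tempfile.mkdtemp(), 'comments.csv')
kept = save_unique(rows, path)
print(len(kept))  # 2
```

Rows are tuples so that they are hashable; like `drop_duplicates`, this keeps the first occurrence of each identical row.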

Extended example:

# -*- coding: utf-8 -*-
# Created: 2018/8/26 18:33
# author: GuoLi

import json
import re
import time

import requests
from lxml import etree
 
 
class Weibospider:
    def __init__(self):
        # Home page of the account to scrape
        self.start_url = 'https://weibo.com/u/5644764907?page=1&is_all=1'

        self.headers = {
            "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
            # Note: 'br' responses can only be decoded if the brotli package is installed
            "accept-encoding": "gzip,deflate,br",
            "accept-language": "zh-CN,zh;q=0.9,en;q=0.8",
            "cache-control": "max-age=0",
            "cookie": "YOUR_COOKIE",  # use the cookie from your own logged-in browser session
            "referer": "https://www.weibo.com/u/5644764907?topnav=1&wvr=6&topsug=1",
            "upgrade-insecure-requests": "1",
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.96 Safari/537.36",
        }
        # Candidate HTTP proxies; none of the requests below use them.
        # Stored as a list because a dict literal with a repeated key keeps only its last entry.
        self.proxy = [
            'http://180.125.70.78:9999',
            'http://117.90.4.230:9999',
            'http://111.77.196.229:9999',
            'http://111.177.183.57:9999',
            'http://123.55.98.146:9999',
        ]

    def parse_home_url(self, url):
        # Parse a home-page listing and return the comment URL of every post on it
        # (run() also calls this on the two ajax-loaded fragments of each page)
        res = requests.get(url, headers=self.headers)
        response = res.content.decode().replace("\\", "")
        # every_url = re.compile(r'target="_blank" href="(/\d+/\w+\?from=\w+&wvr=6&mod=weibotime)" rel="external nofollow"  ', re.S).findall(response)
        every_id = re.compile(r'name=(\d+)', re.S).findall(response)  # ids needed for the second-level pages
        home_url = []
        for weibo_id in every_id:
            base_url = 'https://weibo.com/aj/v6/comment/big?ajwvr=6&id={}&from=singleWeiBo'
            home_url.append(base_url.format(weibo_id))
        return home_url

    def parse_comment_info(self, url):
        # Scrape details of the people who commented directly (name, info, time, info_url)
        res = requests.get(url, headers=self.headers)
        response = res.json()
        count = response['data']['count']
        tree = etree.HTML(response['data']['html'])
        name = tree.xpath("//div[@class='list_li S_line1 clearfix']/div[@class='WB_face W_fl']/a/img/@alt")  # commenter names
        info = tree.xpath("//div[@node-type='replywrap']/div[@class='WB_text']/text()")  # comment text
        info = "".join(info).replace(" ", "").split("\n")
        info.pop(0)
        comment_time = tree.xpath("//div[@class='WB_from S_txt2']/text()")  # comment time
        name_url = tree.xpath("//div[@class='WB_face W_fl']/a/@href")  # commenter profile URLs
        name_url = ["https:" + i for i in name_url]
        comment_info_list = []
        for i in range(len(name)):
            item = {
                "name": name[i],  # commenter's screen name
                "comment_info": info[i],  # comment text
                "comment_time": comment_time[i],  # comment time
                "comment_url": name_url[i],  # commenter's profile page
            }
            comment_info_list.append(item)
        return count, comment_info_list

    def write_file(self, path_name, content_list):
        # Append one JSON object per line
        with open(path_name, "a", encoding="UTF-8") as f:
            for content in content_list:
                f.write(json.dumps(content, ensure_ascii=False))
                f.write("\n")

    def run(self):
        start_url = 'https://weibo.com/u/5644764907?page={}&is_all=1'
        start_ajax_url1 = 'https://weibo.com/p/aj/v6/mblog/mbloglist?ajwvr=6&domain=100406&is_all=1&page={0}&pagebar=0&pl_name=Pl_Official_MyProfileFeed__20&id=1004065644764907&script_uri=/u/5644764907&pre_page={0}'
        start_ajax_url2 = 'https://weibo.com/p/aj/v6/mblog/mbloglist?ajwvr=6&domain=100406&is_all=1&page={0}&pagebar=1&pl_name=Pl_Official_MyProfileFeed__20&id=1004065644764907&script_uri=/u/5644764907&pre_page={0}'
        for i in range(12):  # the account has 12 pages of posts
            home_url = self.parse_home_url(start_url.format(i + 1))  # posts on each page
            ajax_url1 = self.parse_home_url(start_ajax_url1.format(i + 1))  # posts loaded by the first ajax call
            ajax_url2 = self.parse_home_url(start_ajax_url2.format(i + 1))  # posts loaded by the second ajax call
            all_url = home_url + ajax_url1 + ajax_url2
            for j in range(len(all_url)):
                print(all_url[j])
                # File name reads "Comments on post number {}"
                path_name = "第{}條微博相關評論.txt".format(i * 45 + j + 1)
                all_count, comment_info_list = self.parse_comment_info(all_url[j])
                self.write_file(path_name, comment_info_list)
                for num in range(1, 10000):
                    if num * 15 >= int(all_count) + 15:
                        break  # the paging test steps by 15 comments; stop once past the end
                    comment_url = all_url[j] + "&page={}".format(num + 1)
                    print(comment_url)
                    try:
                        count, comment_info_list = self.parse_comment_info(comment_url)
                        self.write_file(path_name, comment_info_list)
                    except Exception as e:
                        print("Error:", e)
                        time.sleep(60)
                        # Retry once after the pause
                        count, comment_info_list = self.parse_comment_info(comment_url)
                        self.write_file(path_name, comment_info_list)
                    del count
                    time.sleep(0.2)

                print("Finished fetching comments for post {}".format(i * 45 + j + 1))


if __name__ == '__main__':
    weibo = Weibospider()
    weibo.run()
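The paging and retry logic in `run` can be isolated and exercised without touching the network. The sketch below is one possible arrangement under stated assumptions: `page_count`, `fetch_with_retry`, and the fake fetcher are hypothetical helpers, 15 comments per page is inferred from the `num * 15` test above, and `count` is assumed to be the total number of top-level comments.

```python
import math
import time


def page_count(total, per_page=15):
    """Number of pages needed to show `total` comments (at least 1)."""
    return max(1, math.ceil(int(total) / per_page))


def fetch_with_retry(fetch, url, retries=2, delay=60, sleep=time.sleep):
    """Call fetch(url); on failure wait `delay` seconds and try again, up to `retries` extra times."""
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except Exception as exc:
            if attempt == retries:
                raise  # out of attempts: let the caller see the error
            print("Error:", exc)
            sleep(delay)


# Usage with a fake fetcher that fails once, so no request is sent
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) == 1:
        raise RuntimeError("temporary failure")
    return "ok"


result = fetch_with_retry(flaky_fetch, "https://example.invalid/comments?page=2",
                          sleep=lambda seconds: None)
print(result, len(calls), page_count("31"))  # ok 2 3
```

Injecting `sleep` keeps the example fast to run; in a real crawl the default `time.sleep` applies the 60-second pause used above.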
