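Crawling Sina Weibo for a Keyword with Scrapy and Filtering Duplicate Content Across Different URLs

阿新 • Published: 2018-12-14

For work I needed to crawl Weibo posts on a given topic together with their comments. I went straight to Scrapy and found that part of the content came back duplicated (same title, same content, but different URLs).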

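1. Crawling Weibo posts with Scrapy

To make crawling easier, I crawl the mobile version of Weibo directly (open the mobile site in a desktop browser, then press F12 to bring up the developer console):

Click the search box and type the search keyword:

The keyword we search for, "范冰冰", is in fact URL-encoded in the request: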

import datetime
import json
from urllib.parse import quote

import scrapy
from scrapy import Request

class SinaspiderSpider(scrapy.Spider):
    name = 'weibospider'
    allowed_domains = ['m.weibo.cn']
    start_urls = ['https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3D%E8%8C%83%E5%86%B0%E5%86%B0&page_type=searchall']
    # Referer header sent with every request; quote() URL-encodes the keyword
    Referer = {"Referer": "https://m.weibo.cn/p/searchall?containerid=100103type%3D1%26q%3D"+quote("范冰冰")}
    def start_requests(self):
        # Start at page 1 and carry the page number and keyword along in meta
        yield Request(url="https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3D"+quote("范冰冰")+"&page_type=searchall&page=1",headers=self.Referer,meta={"page":1,"keyword":"范冰冰"})
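As a quick check of the encoding step, the sketch below rebuilds the search URL with `urllib.parse.quote` from the standard library. The helper name `build_search_url` is only illustrative and is not part of the original spider.

```python
from urllib.parse import quote

# Base of the mobile search API, copied from the spider above.
API = "https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3D"


def build_search_url(keyword, page=1):
    """Return the search API URL for `keyword` (hypothetical helper)."""
    # quote() percent-encodes the UTF-8 bytes of the keyword.
    return API + quote(keyword) + "&page_type=searchall&page=" + str(page)


print(quote("范冰冰"))
print(build_search_url("范冰冰", 2))
```

The first line printed is the same `%E8%8C%83...` sequence that appears in `start_urls`, so the keyword can be swapped without hand-encoding it.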

Scrolling further down the results shows that the URLs follow a pattern:

 https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3D%E8%8C%83%E5%86%B0%E5%86%B0&page_type=searchall&page=2
 https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3D%E8%8C%83%E5%86%B0%E5%86%B0&page_type=searchall&page=3
 https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3D%E8%8C%83%E5%86%B0%E5%86%B0&page_type=searchall&page=4
 https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3D%E8%8C%83%E5%86%B0%E5%86%B0&page_type=searchall&page=5
 https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3D%E8%8C%83%E5%86%B0%E5%86%B0&page_type=searchall&page=6
 https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3D%E8%8C%83%E5%86%B0%E5%86%B0&page_type=searchall&page=7

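Compared with the original URL, one parameter has been added, e.g. "&page=2". Where does this parameter come from, and how do we tell at which page the results run out?

Open the URL we started with:

Copy the JSON it returns and format it with an online JSON formatter so that the pattern is easier to spot:

Online tools offer plenty of features for inspecting JSON more comfortably:

We find that the JSON holds the page information we need:

The other fields can be located the same way, by inspecting the JSON or the URL: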
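To make the paging field concrete, here is a minimal sketch that reads the next page number out of a response. The `sample` payload is made up and trimmed to the keys the spider reads (`data.cardlistInfo.page`), and the assumption that the field is absent or empty on the last page is mine, not something the original post states.

```python
import json

# A trimmed, made-up response containing only the keys that the spider reads.
sample = json.loads('{"ok": 1, "data": {"cardlistInfo": {"page": 2}, "cards": []}}')


def next_page(results):
    """Return the next page number, or None when the field is missing."""
    info = (results.get("data") or {}).get("cardlistInfo") or {}
    return info.get("page")


print(next_page(sample))
print(next_page({"ok": 0}))
```

Returning `None` for a missing field gives the spider a simple stop condition: only schedule another request when a page number actually comes back.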

The parse function for Weibo posts:

    def parse(self, response):
        base_url = "https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3D"+quote("范冰冰")+"&page_type=searchall&page="
        # json.loads() no longer accepts an encoding argument; response.text is already str
        results = json.loads(response.text)
        page = response.meta.get("page")
        keyword = response.meta.get("keyword")
        data = results.get("data") or {}
        # Next page: its number is stored in cardlistInfo.page of the current response
        next_page = (data.get("cardlistInfo") or {}).get("page")
        # Stop when the field is missing or no longer advances
        if next_page and page != next_page:
            yield Request(url=base_url+str(next_page), headers=self.Referer, meta={"page":next_page,"keyword":keyword})
        # Extract the posts
        for j in data.get("cards") or []:
            card_type = j.get("card_type")
            show_type = j.get("show_type")
            # Keep only the cards that hold a group of posts
            if show_type == 1 and card_type == 11:
                for i in j.get("card_group") or []:
                    mblog = i.get("mblog")
                    if not mblog:
                        # Some entries in card_group are not posts
                        continue
                    reposts_count = mblog.get("reposts_count")
                    comments_count = mblog.get("comments_count")
                    attitudes_count = mblog.get("attitudes_count")
                    # Skip posts whose repost, comment and like counts are all 0
                    if reposts_count or comments_count or attitudes_count:
                        message_id = mblog.get("id")
                        status_url = "https://m.weibo.cn/comments/hotflow?id=%s&mid=%s&max_id_type=0"
                        # Schedule the crawl of this post's comments
                        yield Request(url=status_url%(message_id,message_id),callback=self.commentparse, meta={"keyword":keyword,"message_id":message_id})
                        title = keyword
                        page_info = mblog.get("page_info")
                        if page_info:
                            content = page_info.get("page_title")
                            content1 = page_info.get("content1")
                            content2 = page_info.get("content2")
                        else:
                            content = ""
                            content1 = ""
                            content2 = ""
                        # Keep the text as str; the MySQL connection takes care of the encoding
                        text = mblog.get("text")
                        textLength = mblog.get("textLength")
                        isLongText = mblog.get("isLongText")
                        create_time = mblog.get("created_at")
                        spider_time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
                        user = mblog.get("user").get("screen_name")
                        message_url = i.get("scheme")
                        # longText can be absent even when isLongText is set
                        longText = (mblog.get("longText") or {}).get("longTextContent", "") if isLongText else ""
                        weiboitemloader = WeiBoItemLoader(item=WeibopachongItem())
                        weiboitemloader.add_value("title", title)
                        weiboitemloader.add_value("message_id", message_id)
                        weiboitemloader.add_value("content", content)
                        weiboitemloader.add_value("content1", content1)
                        weiboitemloader.add_value("content2", content2)
                        weiboitemloader.add_value("text", text)
                        weiboitemloader.add_value("textLength", textLength)
                        weiboitemloader.add_value("create_time", create_time)
                        weiboitemloader.add_value("spider_time", spider_time)
                        weiboitemloader.add_value("user1", user)
                        weiboitemloader.add_value("message_url", message_url)
                        weiboitemloader.add_value("longText1", longText)
                        weiboitemloader.add_value("reposts_count", reposts_count)
                        weiboitemloader.add_value("comments_count", comments_count)
                        weiboitemloader.add_value("attitudes_count", attitudes_count)
                        yield weiboitemloader.load_item()

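2. Crawling Weibo comments with Scrapy

Scrolling down inside a post's page reveals the URL pattern for its comments. Here is the function that parses them: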

    # Needs "from snownlp import SnowNLP" at the top of the spider module
    def commentparse(self,response):
        status_after_url = "https://m.weibo.cn/comments/hotflow?id=%s&mid=%s&max_id=%s&max_id_type=%s"
        message_id = response.meta.get("message_id")
        keyword = response.meta.get("keyword")
        # json.loads() no longer accepts an encoding argument
        results = json.loads(response.text)
        if results.get("ok"):
            max_id = results.get("data").get("max_id")
            max_id_type = results.get("data").get("max_id_type")
            if max_id:
                # Comments come in batches of 10; the JSON of one batch points to the next one
                yield Request(url=status_after_url%(message_id,message_id,str(max_id),str(max_id_type)),callback=self.commentparse,meta={"keyword":keyword,"message_id":message_id})
            datas = results.get("data").get("data")
            for data in datas:
                text1 = data.get("text")
                like_count = data.get("like_count")
                user1 = data.get("user").get("screen_name")
                user_url = data.get("user").get("profile_url")
                # Sentiment score of the comment text, computed by SnowNLP
                emotion = SnowNLP(text1).sentiments
                weibocommentitem = WeiboCommentItem()
                weibocommentitem["title"] = keyword
                weibocommentitem["message_id"] = message_id
                weibocommentitem["text1"] = text1
                weibocommentitem["user1"] = user1
                weibocommentitem["user_url"] = user_url
                weibocommentitem["emotion"] = emotion
                yield weibocommentitem
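The cursor-style paging of the comments can be isolated in a few lines. The sketch below only builds the URL of the next batch from the `max_id` and `max_id_type` values of the current one; the helper name `next_comment_url` and the sample values are made up for illustration.

```python
# URL template copied from commentparse above.
STATUS_AFTER_URL = ("https://m.weibo.cn/comments/hotflow"
                    "?id=%s&mid=%s&max_id=%s&max_id_type=%s")


def next_comment_url(message_id, data):
    """Return the URL of the next comment batch, or None if max_id is 0 or absent."""
    max_id = data.get("max_id")
    if not max_id:
        return None
    return STATUS_AFTER_URL % (message_id, message_id, max_id,
                               data.get("max_id_type", 0))


# Made-up values standing in for results["data"] of one response.
print(next_comment_url("42", {"max_id": 123456, "max_id_type": 0}))
print(next_comment_url("42", {"max_id": 0, "max_id_type": 0}))
```

Unlike the post search, which counts pages, the comment endpoint hands back a cursor, so the spider cannot know the next URL before it has parsed the current response.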

Finally, the items are written to MySQL asynchronously:

Items:

import scrapy
from scrapy.loader import ItemLoader
# On recent Scrapy releases the processors live in the itemloaders package:
# from itemloaders.processors import Compose
from scrapy.loader.processors import Compose

def get_First(values):
    # Output processor: keep only the first value collected for a field
    if values:
        return values[0]

class WeiBoItemLoader(ItemLoader):
    default_output_processor = Compose(get_First)

class WeibopachongItem(scrapy.Item):
    # One field per column of t_public_opinion_realtime_weibo
    title = scrapy.Field()
    message_id = scrapy.Field()
    content = scrapy.Field()
    content1 = scrapy.Field()
    content2 = scrapy.Field()
    text = scrapy.Field()
    textLength = scrapy.Field()
    create_time = scrapy.Field()
    spider_time = scrapy.Field()
    user1 = scrapy.Field()
    message_url = scrapy.Field()
    longText1 = scrapy.Field()
    reposts_count = scrapy.Field()
    comments_count = scrapy.Field()
    attitudes_count = scrapy.Field()

    def get_insert_sql(self):
        # Parameterized INSERT; the driver fills in the %s placeholders
        insert_sql = """
        insert into  t_public_opinion_realtime_weibo(title,message_id,content,content1,content2,text,textLength,create_time,spider_time,user1,message_url,longText1,reposts_count,comments_count,attitudes_count)values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
        """
        fields = ("title","message_id","content","content1","content2","text","textLength","create_time","spider_time","user1","message_url","longText1","reposts_count","comments_count","attitudes_count")
        # .get() yields None (NULL) for fields the loader left unset instead of raising KeyError
        parms = tuple(self.get(f) for f in fields)
        return insert_sql, parms

class WeiboCommentItem(scrapy.Item):
    # One field per column of t_public_opinion_realtime_weibo_comment
    title = scrapy.Field()
    message_id = scrapy.Field()
    text1 = scrapy.Field()
    user1 = scrapy.Field()
    user_url = scrapy.Field()
    emotion = scrapy.Field()

    def get_insert_sql(self):
        insert_sql = """
        insert into  t_public_opinion_realtime_weibo_comment(title,message_id,text1,user1,user_url,emotion)
        values (%s,%s,%s,%s,%s,%s)
        """
        parms = (self["title"],self["message_id"],self["text1"],self["user1"],self["user_url"],self["emotion"])
        return insert_sql, parms

Pipeline: asynchronous insert:

import MySQLdb
import MySQLdb.cursors
from twisted.enterprise import adbapi

# Insert items through Twisted's asynchronous connection pool
class MysqlTwistedPipline(object):
    def __init__(self,dbpool):
        self.dbpool=dbpool
    @classmethod
    def from_settings(cls,setting):
        dbparms=dict(
                host=setting["MYSQL_HOST"],
                db=setting["MYSQL_DBNAME"],
                user=setting["MYSQL_USER"],
                passwd=setting["MYSQL_PASSWORD"],
                charset='utf8mb4',
                cursorclass=MySQLdb.cursors.DictCursor,
                use_unicode=True,
        )
        dbpool=adbapi.ConnectionPool("MySQLdb",**dbparms)
        return cls(dbpool)
    # Run the MySQL insert asynchronously
    def process_item(self, item, spider):
        query=self.dbpool.runInteraction(self.do_insert,item)
        query.addErrback(self.handle_error,item,spider)
        # Hand the item on to any pipeline that runs after this one
        return item

    def handle_error(self,failure,item,spider):
        # Handle exceptions raised by the asynchronous insert
        print (failure)
    def do_insert(self,cursor,item):
        # Each item class supplies its own SQL statement and parameters
        insert_sql,parms=item.get_insert_sql()
        print(parms)
        cursor.execute(insert_sql, parms)

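Even with a spider written according to these rules, some duplicates are unavoidable:

So the data has to be de-duplicated before it is inserted.

3. Filtering duplicate posts with Scrapy + Redis

Here I use a Redis Set for this; when the data volume is small, a Python set would do just as well. Redis's SADD command returns 1 when a new member is inserted successfully and 0 when the member is already in the set.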

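import redis

redis_db = redis.Redis(host='127.0.0.1', port=6379, db=0)
# The first SADD adds a new member, the second one hits the existing member
result = redis_db.sadd("wangliuqi","12323")
print(result)
result1 = redis_db.sadd("wangliuqi","12323")
print(result1)

Output (when "12323" is not yet in the set):
        1
        0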
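For comparison, the Python set alternative mentioned above can be sketched as follows. The function `sadd` is a hypothetical stand-in that mimics the 1/0 return value of Redis SADD for a single member; it keeps its state in process memory only, so unlike Redis it forgets everything between runs.

```python
def sadd(seen, member):
    """Mimic Redis SADD for one member: 1 if newly added, 0 if already present."""
    if member in seen:
        return 0
    seen.add(member)
    return 1


seen = set()
print(sadd(seen, "12323"))
print(sadd(seen, "12323"))
```

This is enough for a single crawl process; Redis becomes the better choice once the seen IDs have to survive restarts or be shared between several spiders.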

Add one more pipeline to Scrapy that checks every item about to be stored and drops it if it is a duplicate post:

RemoveReDoPipline:

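import MySQLdb
import pandas as pd
import redis
from scrapy.exceptions import DropItem
from weibopachong.items import WeibopachongItem  # assumes the project's default items.py

class RemoveReDoPipline(object):
    def __init__(self,host):
        self.conn = MySQLdb.connect(host, 'root', 'root', 'meltmedia', charset="utf8", use_unicode=True)
        self.redis_db = redis.Redis(host='127.0.0.1', port=6379, db=0)
        sql = "SELECT message_id FROM t_public_opinion_realtime_weibo"
        # Fetch every stored message_id; it is what identifies one and the same post
        df = pd.read_sql(sql, self.conn)
        # Load them all into Redis (Series.get_values() was removed in pandas 1.0)
        for mid in df['message_id'].tolist():
            # Store IDs as str so they compare equal to the IDs taken from the API
            self.redis_db.sadd("weiboset", str(mid))
    # Read the configuration from the settings file
    @classmethod
    def from_settings(cls,setting):
        host=setting["MYSQL_HOST"]
        return cls(host)

    def process_item(self, item, spider):
        # Only post items are filtered; comment items are returned unchanged
        if isinstance(item,WeibopachongItem):
            # sadd() returns 1 for a new message_id and 0 for one seen before
            if self.redis_db.sadd("weiboset",str(item["message_id"])):
                return item
            else:
                print("Duplicate content:", item['text'])
                raise DropItem("same title in %s" % item['text'])
        else:
            return item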

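Finally, do not forget to register the pipeline in the settings file. It has to come before the pipeline that stores the data (Scrapy runs pipelines in ascending order of their number), otherwise the filter has no effect: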

ITEM_PIPELINES = {
   'weibopachong.pipelines.MysqlTwistedPipline': 200,
   'weibopachong.pipelines.RemoveReDoPipline': 100,
}
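The effect of the two numbers can be checked without running a crawl. This small sketch simply sorts the setting by value, which is the order in which Scrapy calls the pipelines; the variable name `ordered` is only for illustration.

```python
# Same setting as above.
ITEM_PIPELINES = {
    'weibopachong.pipelines.MysqlTwistedPipline': 200,
    'weibopachong.pipelines.RemoveReDoPipline': 100,
}

# Pipelines with lower values run first.
ordered = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
for path in ordered:
    print(ITEM_PIPELINES[path], path)
```

With 100 against 200, the duplicate filter sees each item first, and an item it drops with DropItem never reaches the MySQL pipeline.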
