Crawling Sina Weibo by keyword with Scrapy, and filtering duplicate content that appears under different URLs

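For work I needed to crawl Weibo posts on a given topic together with their comments. I went straight to Scrapy and found that some content came back duplicated (same title, same content, but different URLs).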

1. Crawling Weibo posts with Scrapy

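To keep the crawl simple, target the mobile site directly: open the mobile version of Weibo in a desktop browser, then press F12 to bring up the developer console.

Click the search box and enter the search keyword:

The keyword we search for, "范冰冰", is actually URL-encoded in the request: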

import datetime
import json
from urllib.parse import quote

import scrapy
from scrapy import Request
from snownlp import SnowNLP
from weibopachong.items import WeiBoItemLoader, WeibopachongItem, WeiboCommentItem  # assumed module path

class SinaspiderSpider(scrapy.Spider):
    name = 'weibospider'
    allowed_domains = ['m.weibo.cn']
    start_urls = ['https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3D%E8%8C%83%E5%86%B0%E5%86%B0&page_type=searchall']
    Referer = {"Referer": "https://m.weibo.cn/p/searchall?containerid=100103type%3D1%26q%3D"+quote("范冰冰")}
    def start_requests(self):
        yield Request(url="https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3D"+quote("范冰冰")+"&page_type=searchall&page=1",headers=self.Referer,meta={"page":1,"keyword":"范冰冰"})

Scrolling further down the page shows that the URLs follow a regular pattern:

 https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3D%E8%8C%83%E5%86%B0%E5%86%B0&page_type=searchall&page=2
 https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3D%E8%8C%83%E5%86%B0%E5%86%B0&page_type=searchall&page=3
 https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3D%E8%8C%83%E5%86%B0%E5%86%B0&page_type=searchall&page=4
 https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3D%E8%8C%83%E5%86%B0%E5%86%B0&page_type=searchall&page=5
 https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3D%E8%8C%83%E5%86%B0%E5%86%B0&page_type=searchall&page=6
 https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3D%E8%8C%83%E5%86%B0%E5%86%B0&page_type=searchall&page=7

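Compared with the original URL, a new parameter such as "&page=2" has been appended. Where does this parameter come from, and how do we know at which page the results run out?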
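As a rough illustration of that pattern, the paged URL can be assembled from the keyword and a page number. The helper name `build_search_url` is hypothetical; the URL layout is copied from the URLs listed above.

```python
from urllib.parse import quote

# Prefix copied from the search URLs above; the keyword is appended URL-encoded.
SEARCH_PREFIX = "https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3D"


def build_search_url(keyword, page):
    """Hypothetical helper: search URL for `keyword` at result page `page`."""
    return SEARCH_PREFIX + quote(keyword) + "&page_type=searchall&page=" + str(page)


if __name__ == "__main__":
    print(quote("范冰冰"))
    print(build_search_url("范冰冰", 2))
```

`quote()` encodes the keyword as UTF-8 percent-escapes, which is why the Chinese characters never appear literally in the request URL.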

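Open the very first URL:

Copy the JSON it returns and format it with an online JSON formatter so the structure is easier to inspect:

Online tools offer plenty of features for viewing JSON more comfortably:

The JSON turns out to hold the paging information we need:

The remaining fields can be found the same way, by inspecting the JSON or the URL: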
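A minimal sketch of reading the paging field out of such a response. The payload below is a hand-written stand-in, not real API output; only the key path data, cardlistInfo, page is taken from the parse function in this post, and the helper name `next_page_of` is made up for illustration.

```python
import json


def next_page_of(payload):
    """Return the next page number reported in the payload, or None if absent."""
    data = payload.get("data") or {}
    card_info = data.get("cardlistInfo") or {}
    return card_info.get("page")


# Hand-written sample with the same nesting as the key path used by the spider.
sample = json.loads('{"ok": 1, "data": {"cardlistInfo": {"page": 2}, "cards": []}}')

if __name__ == "__main__":
    print(next_page_of(sample))
    print(next_page_of({"ok": 0}))
```

Falling back to an empty dict at each level keeps the lookup from raising when a key is missing.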

The parse function for Weibo posts:

    def parse(self, response):
        base_url = "https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3D"+quote("范冰冰")+"&page_type=searchall&page="
        results = json.loads(response.text)  # json.loads() no longer accepts an encoding argument
        page = response.meta.get("page")
        keyword = response.meta.get("keyword")
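        # Next page: the response reports the next page number in cardlistInfo.page
        next_page = results.get("data").get("cardlistInfo").get("page")
        # Stop when no next page is reported or it equals the current page
        if next_page and page != next_page:
            yield Request(url=base_url+str(next_page), headers=self.Referer, meta={"page":next_page,"keyword":keyword})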
        result = results.get("data").get("cards")
        # Extract the posts
        for j in result:
            card_type = j.get("card_type")
            show_type = j.get("show_type")
            # Filter by card type
            if show_type ==1 and card_type ==11 :
                for i in j.get("card_group"):
                    reposts_count = i.get("mblog").get("reposts_count")
                    comments_count = i.get("mblog").get("comments_count")
                    attitudes_count = i.get("mblog").get("attitudes_count")
                    # Skip posts whose repost, comment and like counts are all 0
                    if reposts_count or comments_count or attitudes_count:
                        message_id = i.get("mblog").get("id")
                        status_url = "https://m.weibo.cn/comments/hotflow?id=%s&mid=%s&max_id_type=0"
                        # Schedule the comment crawl for this post
                        yield Request(url=status_url%(message_id,message_id),callback=self.commentparse, meta={"keyword":keyword,"message_id":message_id})
                        title = keyword
                        status_url = "https://m.weibo.cn/status/%s"
                        # response1 = requests.get(status_url%message_id)
                        if i.get("mblog").get("page_info"):
                            content = i.get("mblog").get("page_info").get("page_title")
                            content1 = i.get("mblog").get("page_info").get("content1")
                            content2 = i.get("mblog").get("page_info").get("content2")
                        else:
                            content = ""
                            content1 = ""
                            content2 = ""
                        text = i.get("mblog").get("text").encode(encoding="utf-8")
                        textLength = i.get("mblog").get("textLength")
                        isLongText = i.get("mblog").get("isLongText")
                        create_time = i.get("mblog").get("created_at")
                        spider_time =  datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
                        user = i.get("mblog").get("user").get("screen_name")
                        message_url = i.get("scheme")
                        longText = i.get("mblog").get("longText").get("longTextContent") if isLongText else ""
                        weiboitemloader = WeiBoItemLoader(item=WeibopachongItem())
                        weiboitemloader.add_value("title",title )
                        weiboitemloader.add_value("message_id",message_id )
                        weiboitemloader.add_value("content",content )
                        weiboitemloader.add_value("content1",content1 )
                        weiboitemloader.add_value("content2",content2 )
                        weiboitemloader.add_value("text",text )
                        weiboitemloader.add_value("textLength",textLength )
                        weiboitemloader.add_value("create_time",create_time )
                        weiboitemloader.add_value("spider_time",spider_time )
                        weiboitemloader.add_value("user1",user )
                        weiboitemloader.add_value("message_url",message_url )
                        weiboitemloader.add_value("longText1",longText )
                        weiboitemloader.add_value("reposts_count",reposts_count )
                        weiboitemloader.add_value("comments_count",comments_count )
                        weiboitemloader.add_value("attitudes_count",attitudes_count )
                        yield weiboitemloader.load_item()
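
The filter comment in the function above describes skipping posts whose repost, comment and like counts are all 0. A small sketch of that check in isolation, assuming missing counts should be treated as 0; the helper name `has_engagement` is hypothetical:

```python
def has_engagement(mblog):
    """True if at least one of the three counters is non-zero."""
    counts = (
        mblog.get("reposts_count") or 0,
        mblog.get("comments_count") or 0,
        mblog.get("attitudes_count") or 0,
    )
    return any(counts)


if __name__ == "__main__":
    print(has_engagement({"reposts_count": 0, "comments_count": 3, "attitudes_count": 0}))
    print(has_engagement({"reposts_count": 0, "comments_count": 0, "attitudes_count": 0}))
```

Combining the three counts with `and` instead would also drop posts where only one counter is 0, which is stricter than the comment describes.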

2. Crawling Weibo comments with Scrapy

Scrolling down on a post's detail page reveals the URL pattern for its comments. Below is the parse function for comments:

    def commentparse(self,response):
        status_after_url = "https://m.weibo.cn/comments/hotflow?id=%s&mid=%s&max_id=%s&max_id_type=%s"
        message_id = response.meta.get("message_id")
        keyword = response.meta.get("keyword")
        results = json.loads(response.text)  # json.loads() no longer accepts an encoding argument
        if results.get("ok"):
            max_id = results.get("data").get("max_id")
            max_id_type = results.get("data").get("max_id_type")
            if max_id:
                # Comments come in batches of 10; the next batch is defined in the current batch's JSON
                yield Request(url=status_after_url%(message_id,message_id,str(max_id),str(max_id_type)),callback=self.commentparse,meta={"keyword":keyword,"message_id":message_id})
            datas = results.get("data").get("data")
            for data in datas:
                text1 = data.get("text")
                like_count = data.get("like_count")
                user1 = data.get("user").get("screen_name")
                user_url = data.get("user").get("profile_url")
                emotion = SnowNLP(text1).sentiments
                weibocommentitem = WeiboCommentItem()
                weibocommentitem["title"] = keyword
                weibocommentitem["message_id"] = message_id
                weibocommentitem["text1"] = text1
                weibocommentitem["user1"] = user1
                weibocommentitem["user_url"] = user_url
                weibocommentitem["emotion"] = emotion
                yield weibocommentitem
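
The paging rule used in the function above can be isolated as a small helper. This is only a sketch: `next_comment_url` is a hypothetical name, the dictionaries passed to it are hand-written stand-ins for the decoded response, and it assumes, as the function above does, that a falsy max_id means there is no further batch.

```python
COMMENT_URL = "https://m.weibo.cn/comments/hotflow?id=%s&mid=%s&max_id=%s&max_id_type=%s"


def next_comment_url(message_id, results):
    """Return the URL of the next comment batch, or None when there is none."""
    if not results.get("ok"):
        return None
    data = results.get("data") or {}
    max_id = data.get("max_id")
    if not max_id:
        return None
    return COMMENT_URL % (message_id, message_id, max_id, data.get("max_id_type"))


if __name__ == "__main__":
    print(next_comment_url("123", {"ok": 1, "data": {"max_id": 456, "max_id_type": 0}}))
    print(next_comment_url("123", {"ok": 1, "data": {"max_id": 0, "max_id_type": 0}}))
```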

Finally, store the results in MySQL asynchronously:

Items:

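import scrapy
from scrapy.loader import ItemLoader
# Recent Scrapy versions ship the processors in the itemloaders package;
# on old versions use: from scrapy.loader.processors import Compose
from itemloaders.processors import Compose


def get_First(values):
    # Return the first value collected for a field
    if values:
        return values[0]


class WeiBoItemLoader(ItemLoader):
    default_output_processor = Compose(get_First)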

class WeibopachongItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    message_id = scrapy.Field()
    content = scrapy.Field()
    content1 = scrapy.Field()
    content2 = scrapy.Field()
    text = scrapy.Field()
    textLength = scrapy.Field()
    create_time = scrapy.Field()
    spider_time = scrapy.Field()
    user1 = scrapy.Field()
    message_url = scrapy.Field()
    longText1 = scrapy.Field()
    reposts_count = scrapy.Field()
    comments_count = scrapy.Field()
    attitudes_count = scrapy.Field()

    def get_insert_sql(self):
        insert_sql = """
        insert into  t_public_opinion_realtime_weibo(title,message_id,content,content1,content2,text,textLength,create_time,spider_time,user1,message_url,longText1,reposts_count,comments_count,attitudes_count)values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
        """
        parms = (self["title"],self["message_id"],self["content"],self["content1"],self["content2"],self["text"],self["textLength"],self["create_time"],self["spider_time"],self["user1"],self["message_url"],self["longText1"],self["reposts_count"],self["comments_count"],self["attitudes_count"])
        return insert_sql, parms

class WeiboCommentItem(scrapy.Item):
    title = scrapy.Field()
    message_id = scrapy.Field()
    text1 = scrapy.Field()
    user1 = scrapy.Field()
    user_url = scrapy.Field()
    emotion = scrapy.Field()

    def get_insert_sql(self):
        insert_sql = """
        insert into  t_public_opinion_realtime_weibo_comment(title,message_id,text1,user1,user_url,emotion)
        values (%s,%s,%s,%s,%s,%s)
        """
        parms = (self["title"],self["message_id"],self["text1"],self["user1"],self["user_url"],self["emotion"])
        return insert_sql, parms

Pipeline: asynchronous insert:

import MySQLdb
import MySQLdb.cursors
from twisted.enterprise import adbapi


# Insert items through a Twisted connection pool
class MysqlTwistedPipline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, setting):
        dbparms = dict(
            host=setting["MYSQL_HOST"],
            db=setting["MYSQL_DBNAME"],
            user=setting["MYSQL_USER"],
            passwd=setting["MYSQL_PASSWORD"],
            charset='utf8mb4',
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=True,
        )
        dbpool = adbapi.ConnectionPool("MySQLdb", **dbparms)
        return cls(dbpool)

    def process_item(self, item, spider):
        # Run the MySQL insert asynchronously on the connection pool
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error, item, spider)
        # Return the item so that any later pipeline still receives it
        return item

    def handle_error(self, failure, item, spider):
        # Handle exceptions raised by the asynchronous insert
        print(failure)

    def do_insert(self, cursor, item):
        # Each item class builds its own SQL statement and parameters
        insert_sql, parms = item.get_insert_sql()
        print(parms)
        cursor.execute(insert_sql, parms)

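Even with the crawler written according to these rules, some duplicates are hard to avoid:

So the data has to be de-duplicated before it is inserted.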

3. Filtering duplicate posts with Scrapy + Redis

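This is done with a Redis Set; when the data volume is small, a plain Python set would work too. The Redis SADD command returns 1 when the member was newly added and 0 when it was already in the set.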

import redis

redis_db = redis.Redis(host='127.0.0.1', port=6379, db=0)
result = redis_db.sadd("wangliuqi","12323")
print(result)
result1 = redis_db.sadd("wangliuqi","12323")
print(result1)

Result:
        1
        0
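
For small data volumes the same check can be done with a plain Python set, as mentioned above. A minimal sketch; the class `SeenSet` is hypothetical and simply mimics the 1/0 return value of SADD. Unlike Redis, its contents are lost when the process exits.

```python
class SeenSet:
    """In-memory stand-in for a Redis Set, keeping only the sadd() semantics."""

    def __init__(self):
        self._members = set()

    def sadd(self, member):
        # Redis stores members as byte strings, so normalise to str here
        # to make 12323 and "12323" count as the same member.
        key = str(member)
        # Return 1 if the member was newly added, 0 if it was already present.
        if key in self._members:
            return 0
        self._members.add(key)
        return 1


if __name__ == "__main__":
    seen = SeenSet()
    print(seen.sadd("12323"))  # 1
    print(seen.sadd("12323"))  # 0
```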

Add a pipeline to Scrapy that checks every item about to be saved and drops it if the post is a duplicate:

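RemoveReDoPipline:
import MySQLdb
import pandas as pd
import redis
from scrapy.exceptions import DropItem

from weibopachong.items import WeibopachongItem  # assumed module path


class RemoveReDoPipline(object):
    def __init__(self, host):
        self.conn = MySQLdb.connect(host, 'root', 'root', 'meltmedia', charset="utf8", use_unicode=True)
        self.redis_db = redis.Redis(host='127.0.0.1', port=6379, db=0)
        sql = "SELECT message_id FROM t_public_opinion_realtime_weibo"
        # Load every stored message_id; it identifies whether two posts are the same
        df = pd.read_sql(sql, self.conn)
        # Put all of them into the Redis set (Series.get_values() was removed in pandas 1.0)
        for mid in df['message_id'].tolist():
            self.redis_db.sadd("weiboset", mid)

    # Read the configuration from the settings file
    @classmethod
    def from_settings(cls, setting):
        host = setting["MYSQL_HOST"]
        return cls(host)

    def process_item(self, item, spider):
        # Only post items are filtered; comment items are returned unchanged
        if isinstance(item, WeibopachongItem):
            # sadd returns 1 for a new message_id and 0 for one already seen
            if self.redis_db.sadd("weiboset", item["message_id"]):
                return item
            else:
                print("Duplicate content:", item['text'])
                raise DropItem("same title in %s" % item['text'])
        else:
            return item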

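Finally, remember to register the pipeline in the settings file, and give it a lower number than the pipeline that stores the data so that it runs first. Otherwise the filter has no effect: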

ITEM_PIPELINES = {
   'weibopachong.pipelines.MysqlTwistedPipline': 200,
   'weibopachong.pipelines.RemoveReDoPipline': 100,
}
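
Scrapy runs item pipelines in ascending order of these numbers. The following sketch only illustrates that ordering with plain Python; it does not call Scrapy, and `run_order` is a hypothetical helper.

```python
ITEM_PIPELINES = {
    'weibopachong.pipelines.MysqlTwistedPipline': 200,
    'weibopachong.pipelines.RemoveReDoPipline': 100,
}


def run_order(pipelines):
    """List pipeline paths from the lowest number (runs first) to the highest."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


if __name__ == "__main__":
    for path in run_order(ITEM_PIPELINES):
        print(path)
```

With the values above, the de-duplication pipeline (100) sees each item before the storage pipeline (200), so a dropped item never reaches MySQL.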