
Crawling Twitter Data by Location and Keyword, Storing It in MongoDB, and Generating a Word Cloud

Please credit the source when reposting.

  • Fetching data with tweepy
  • Generating the word cloud

Fetching data with tweepy

1. Define the document model (model.py)

from mongoengine import (Document, ObjectIdField, StringField,
                         IntField, DateTimeField)

class twitter_post(Document):
    _id = ObjectIdField(primary_key=True)
    screen_name = StringField(max_length=128)
    text = StringField(required=True, max_length=2048)
    text_id = IntField(required=True)
    created_at = DateTimeField(required=True)
    in_reply_to_screen_name = StringField(max_length=64)
    retweet_count = IntField()
    favorite_count = IntField()
    source = StringField(max_length=1024)
    longitude = StringField(max_length=32)
    latitude = StringField(max_length=32)
    location = StringField(max_length=256)
    country_code = StringField(max_length=64)
    lang = StringField(max_length=4)
    time_zone = StringField(max_length=64)
    province = StringField(max_length=64)
    city = StringField(max_length=64)
    district = StringField(max_length=64)
    street = StringField(max_length=64)
    street_number = StringField(max_length=64)

    meta = {
        'ordering': ['created_at', 'screen_name'],
        'collection': 'twitter_posts'
    }
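
As a quick sanity check, the model can be exercised against a local MongoDB instance before wiring up the crawler. This is a minimal sketch, assuming mongoengine is installed and MongoDB is listening on the default port:

import datetime
from bson import ObjectId
from mongoengine import connect

connect('ANXIETY', host='localhost', port=27017)  # same database the crawler uses

post = twitter_post(
    _id=ObjectId(),
    text_id=1,
    text='hello world',
    created_at=datetime.datetime.utcnow(),
)
post.save()
print(twitter_post.objects.count())  # should print at least 1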

2. Query the Baidu Maps API to get province, city, and street information from coordinates

import requests

def GetAddress(lat, lon):
    """Reverse-geocode via the Baidu Maps geocoder; Baidu expects 'latitude,longitude'."""
    url = 'http://api.map.baidu.com/geocoder/v2/'
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36'}
    payload = {
        'output': 'json',
        'ak': 'pAjezQsQBe8v1c1Lel87r4vprwXiGCEn',  # Baidu Maps API key
        'location': '{0},{1}'.format(lat, lon),
    }
    print(lat, lon)
    try:
        content = requests.get(url, params=payload, headers=header).json()
        content = content['result']['addressComponent']
        # Street-level information is unavailable for some coordinates.
        if content['street'] is None:
            content['street'] = 'NULL'
        if content['street_number'] is None:
            content['street_number'] = 'NULL'
    except Exception:
        content = {'province': 'NULL', 'city': 'NULL', 'district': 'NULL',
                   'street': 'NULL', 'street_number': 'NULL'}
    return content

print(GetAddress(40.07571952, 116.60609467))

Below is the location information returned for three sample coordinate pairs (screenshot of the three results omitted).
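
For reference, the addressComponent object that GetAddress returns has roughly the following shape (the values here are illustrative, not real API output):

# Illustrative only -- not a real Baidu response.
{
    'country': '中国',
    'province': '北京市',
    'city': '北京市',
    'district': '顺义区',
    'street': 'NULL',          # normalized by GetAddress when Baidu returns null
    'street_number': 'NULL'
}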

3. Crawl data through the Twitter API via tweepy

# Twitter API credentials -- replace with your own application's keys.
consumer_key = 'I1XowkiAc72fEp2CXPv0'
consumer_secret = 'drfnZHVUQrq1dyeqepCrbKyGWeYJCeTFQZpkLcXkgKFw3P'
access_key = '936432882482143235-jNLGPsCpZaSqR1D2WarSEshgQcyi'
access_secret = 'YF4ddleSgGxj8BsfmH2DELr7TsNNKAp08ZvqC'

# A spare set of credentials, kept commented out:
# consumer_key = 'qEgHKHnL55g7k4U9xih'
# consumer_secret= 'QcUDHJS04wK5hrmlxV5C4gweiRPDca9JQoc4gp7ft'
# access_key= '863573499436122112-LA60oJLBzwVnhZjGOUPzRsJc'
# access_secret= '8CKFpp6qyxkAk1KfjWJPoHKloppPrvd7Tjiwllyk'

import tweepy
from tweepy.parsers import JSONParser
from bson import ObjectId
from mongoengine import connect

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth_handler=auth, parser=JSONParser(), proxy='127.0.0.1:1080', wait_on_rate_limit=True)

conn = connect('ANXIETY', alias='default', host='localhost', port=27017, username='', password='')  # connect to local MongoDB

Twitter's search API only serves tweets from the past week, each call to api.search() returns at most 100 tweets, and Twitter officially states that roughly 450 consecutive requests are allowed per rate-limit window, so at most we can collect around 40,000 tweets. To gather as much data as possible we set MAX_QUERIES and issue repeated queries, paging backwards through the results. This program only crawls data for Beijing, Shanghai, Hong Kong, Macau, and Taiwan; the topics to crawl are listed in line.csv:
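The loop below relies on MAX_QUERIES, until, and GMT_FORMAT, none of which are defined in the original listing; the following is a plausible setup consistent with the surrounding text (the concrete values are assumptions):

import datetime

MAX_QUERIES = 400  # assumed; stay under Twitter's ~450-requests-per-window search limit
# The search API only covers the past week, so a date safely in the future
# effectively means "no upper bound" (the text says to push the deadline out a week).
until = (datetime.date.today() + datetime.timedelta(days=7)).strftime('%Y-%m-%d')
# Twitter timestamps look like 'Wed Aug 27 13:08:45 +0000 2008'
GMT_FORMAT = '%a %b %d %H:%M:%S +0000 %Y'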

regions = ["beijing","shanghai","hongkong","macau","taiwan"]
for line in open("normal_user.csv"):
    #try:
        #u = api.get_user(line)
    #ms = myStream.filter(track=[line])
    #print(results[0])
    for r in regions:#對每一個地理位置
        places = api.geo_search(query=r)["result"]["places"][0]#首先獲取地理位置ID
        print(places)
        place_id = places["id"]
        tweet_id = []
        i = MAX_QUERIES
        MAX_ID = 10
        while i > 0:

            if MAX_ID == 10:
                #tweets = api.search(q="place:%s AND line" % place_id)#根據地理位置和關鍵詞同時過去爬取
                for tweet in api.search(q="place:%s" % place_id,count = 100, until=until)["statuses"]:#把截止日期放到七日後同時設定沒次爬取最多數目一百,保證資料量
                    #print(tweet)
                    #j.write(json.dumps(tweet)+'\n')
                    tweet_id.append(tweet["id"])
                    Obj_id = ObjectId()
                    tweet_item = twitter_post(#生成一條mongo中的資料
                        _id = Obj_id,
                        text_id = tweet["id"],
                        created_at = datetime.datetime.strptime(tweet["created_at"], GMT_FORMAT),
                        screen_name = tweet["user"]["screen_name"],
                        favorite_count = tweet["favorite_count"],
                        retweet_count = tweet["retweet_count"],
                        text = tweet["text"],
                        source = tweet["source"],
                        country_code = (tweet["place"]["country_code"] if tweet['place'] != None else 'NULL'),
                        location = tweet["user"]["location"],
                        latitude = str(tweet["coordinates"]["coordinates"][0] if tweet["coordinates"] != None else 'NULL'),#根據返回的json檔案拿到維度,注意返回時緯度在前,但是訪問百度介面時,經度在前
                        longitude = str(tweet["coordinates"]["coordinates"][1] if tweet["coordinates"] != None else 'NULL'),
                        time_zone = tweet["user"]["time_zone"],
                        lang = tweet["lang"],
                        province = (GetAddress(tweet["coordinates"]["coordinates"][1],tweet["coordinates"]["coordinates"][0])['province'] if tweet["coordinates"] != None else 'NULL'),
                        city = (GetAddress(tweet["coordinates"]["coordinates"][1],tweet["coordinates"]["coordinates"][0])['city'] if tweet["coordinates"] != None else 'NULL'),
                        district = (GetAddress(tweet["coordinates"]["coordinates"][1],tweet["coordinates"]["coordinates"][0])['district'] if tweet["coordinates"] != None else 'NULL'),
                        street = (GetAddress(tweet["coordinates"]["coordinates"][1],tweet["coordinates"]["coordinates"][0])['street'] if tweet["coordinates"] != None else 'NULL'),
                        street_number = (GetAddress(tweet["coordinates"]["coordinates"][1],tweet["coordinates"]["coordinates"][0])['street_number'] if tweet["coordinates"] != None else 'NULL')
                        )
                    try:
                        tweet_item.save()#存入資料庫
                    except:
                        continue
                MAX_ID = min(tweet_id)
                #print(MAX_ID)

            else:
                for tweet in api.search(q="place:%s",count = 100, max_id = MAX_ID-1)["statuses"]:
                    #print(tweet)
                    #j.write(json.dumps(tweet)+'\n')
                    tweet_id.append(tweet["id"])
                    Obj_id = ObjectId()
                    tweet_item = twitter_post(
                        _id = Obj_id,
                        text_id = tweet["id"],
                        created_at = datetime.datetime.strptime(tweet["created_at"], GMT_FORMAT),
                        screen_name = tweet["user"]["screen_name"],
                        favorite_count = tweet["favorite_count"],
                        retweet_count = tweet["retweet_count"],
                        text = tweet["text"],
                        source = tweet["source"],
                        country_code = (tweet["place"]["country_code"] if tweet['place'] != None else 'NULL'),
                        location = tweet["user"]["location"],
                        latitude = str(tweet["coordinates"]["coordinates"][0] if tweet["coordinates"] != None else 'NULL'),
                        longitude = str(tweet["coordinates"]["coordinates"][1] if tweet["coordinates"] != None else 'NULL'),
                        time_zone = tweet["user"]["time_zone"],
                        lang = tweet["lang"],
                        province = (GetAddress(tweet["coordinates"]["coordinates"][1],tweet["coordinates"]["coordinates"][0])['province'] if tweet["coordinates"] != None else 'NULL'),
                        city = (GetAddress(tweet["coordinates"]["coordinates"][1],tweet["coordinates"]["coordinates"][0])['city'] if tweet["coordinates"] != None else 'NULL'),
                        district = (GetAddress(tweet["coordinates"]["coordinates"][1],tweet["coordinates"]["coordinates"][0])['district'] if tweet["coordinates"] != None else 'NULL'),
                        street = (GetAddress(tweet["coordinates"]["coordinates"][1],tweet["coordinates"]["coordinates"][0])['street'] if tweet["coordinates"] != None else 'NULL'),
                        street_number = (GetAddress(tweet["coordinates"]["coordinates"][1],tweet["coordinates"]["coordinates"][0])['street_number'] if tweet["coordinates"] != None else 'NULL')
                        )
                    try:
                        tweet_item.save()
                    except:
                        continue
                MAX_ID = min(tweet_id)
                #print(MAX_ID)
            i -= 1

The resulting data looks like the following (the original screenshots are omitted).
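
In place of the screenshots, the stored documents can be inspected directly with pymongo; a minimal sketch:

import pymongo

client = pymongo.MongoClient('127.0.0.1', 27017)
db = client['ANXIETY']
print(db.twitter_posts.count_documents({}))  # total number of tweets stored
for doc in db.twitter_posts.find({'lang': 'zh'}).limit(3):
    print(doc['screen_name'], doc['province'], doc['text'][:40])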

Generating the word cloud (input: a region and the number of words; output: the word cloud)

(The tricky part: processing Chinese and English text at the same time.)
Approach: use the lang field to process the texts of each language separately into word counts, then sort and merge them.

1. Connect to the database

import pymongo
import traceback

class MongoConn():
    def __init__(self, db_name):
        try:
            url = '127.0.0.1:27017'
            self.client = pymongo.MongoClient(url, connect=True)
            self.db = self.client[db_name]
        except Exception as e:
            print('Failed to connect to MongoDB!')
            traceback.print_exc()

    def destroy(self):
        self.client.close()

    def getDb(self):
        return self.db

    def __del__(self):
        self.client.close()
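
Typical usage of the helper:

conn = MongoConn('ANXIETY')
db = conn.getDb()
print(db.twitter_posts.count_documents({}))  # sanity check
conn.destroy()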

2. English text preprocessing

from nltk.tokenize import RegexpTokenizer

def docs_preprocessor(docs):
    tokenizer = RegexpTokenizer(r'\w+')
    for idx in range(len(docs)):
        docs[idx] = docs[idx].lower()  # Convert to lowercase.
        docs[idx] = tokenizer.tokenize(docs[idx])  # Split into words.

    # Remove numbers, but not words that contain numbers.
    docs = [[token for token in doc if not token.isdigit()] for doc in docs]

    # Remove stop words (_STOP_WORDS is assumed to be defined elsewhere).
    docs = [[token for token in doc if token not in _STOP_WORDS] for doc in docs]

    # (Lemmatization could be added here as a further step.)

    return docs
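
A quick demonstration, with a toy stop-word set standing in for the _STOP_WORDS the original never shows:

_STOP_WORDS = {'a', 'the', 'is', 'in', 'to'}  # assumed; use a real stop list in practice
docs = ["The weather in Beijing is great today, 2018!"]
print(docs_preprocessor(docs))
# [['weather', 'beijing', 'great', 'today']]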

3. Chinese text preprocessing

import jieba

def nltk_tokenize(text):
    """Segment Chinese text with jieba and keep informative tokens as features."""
    features = []
    stop_words = stop.load_stopwords()  # project helper that loads a Chinese stop-word list

    try:
        # Segment the text; jieba.cut yields tokens lazily.
        tokens_cut = jieba.cut(text)
        for word in tokens_cut:
            # Keep the word if it is not a stop word, is 2-4 characters long,
            # and is not a number (is_number is another project helper).
            if word not in stop_words and 1 < len(word) < 5 and not is_number(word):
                features.append(word)
    except Exception:
        pass
    return features
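
A quick usage sketch (the exact segmentation depends on jieba's dictionary and the loaded stop-word list):

print(nltk_tokenize('今天北京的天气真好'))
# e.g. ['今天', '北京', '天气', '真好']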

4. The word cloud generator

import collections

class cloudProducer():

    def __init__(self):

        self.mon = MongoConn('ANXIETY')
        self.db = self.mon.getDb()

    def getMainData(self, region_type, region):
        # fetch the last week's data for the given region
        en_docs = []
        ch_docs = []

        twitter_in_english = self.db.twitter_posts.find({region_type:region,"lang":"en"})
        twitter_in_chinese = self.db.twitter_posts.find({region_type:region,"lang":"zh"})

        for x in twitter_in_english:
            #print(x)
            en_docs.append(x["text"])
        print(len(en_docs))
        for x in twitter_in_chinese:
            ch_docs.append(x["text"])

        return [en_docs,ch_docs]

    def produce_en_Cloud(self,region_type,region,num):
        #main page
        docs = self.getMainData(region_type,region)[0]
        print(len(docs))
        words_dump = []

        docs = docs_preprocessor(docs)

        for text in docs:
            #print(text)
            #features = text
            #print(features)
            words_dump = words_dump + text
        cloud = collections.Counter(words_dump).most_common(num)
        print(cloud)
        #json.dump(cloud,open("wordCloud.json","w",encoding="utf-8"))

        return cloud

    def produce_ch_Cloud(self,region_type,region,num):
        #main page
        docs = self.getMainData(region_type,region)[1]
        print(len(docs))
        words_dump = []

        for text in docs:
            features = nltk_tokenize(text)
            #print(features)
            words_dump = words_dump + features
        cloud = collections.Counter(words_dump).most_common(num)  # most_common returns a list of (word, count) tuples
        print(cloud)
        #json.dump(cloud,open("wordCloud.json","w",encoding="utf-8"))

        return cloud

    def produce_cloud(self,region_type,region,num):
        en_cloud = self.produce_en_Cloud(region_type,region,num)
        ch_cloud = self.produce_ch_Cloud(region_type,region,num)
        cloud = en_cloud + ch_cloud
        cloud = sorted(cloud,key=lambda t: t[1],reverse=True)  # merge and sort the English and Chinese counts by frequency
        return cloud[0:num]

cp = cloudProducer()
cloud = cp.produce_cloud("province","北京市",15)
print(cloud)
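
produce_cloud returns a list of (word, count) tuples rather than an image. To render an actual picture, the third-party wordcloud package can consume these frequencies; a minimal sketch, assuming the package is installed and a CJK-capable font file is available at the (hypothetical) path below:

from wordcloud import WordCloud

freqs = dict(cloud)  # [(word, count), ...] -> {word: count}
wc = WordCloud(font_path='msyh.ttf',  # hypothetical path to a font with CJK glyphs
               width=800, height=400, background_color='white')
wc.generate_from_frequencies(freqs)
wc.to_file('wordcloud.png')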