Twitter Data Visualization with R

阿新 • Published: 2020-10-18

Author | Audhi Aprilliant
Translator | VK
Source | Towards Data Science

Overview

For this project, we use raw Twitter data collected by crawling on 28-29 May 2019. The data is in CSV (comma-separated) format and can be downloaded here:

https://github.com/audhiaprilliant/Indonesia-Public-Election-Twitter-Sentiment-Analysis/tree/master/Datasets

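The data covers two topics: tweets about Joko Widodo, collected with the keyword "Joko Widodo", and tweets about Prabowo Subianto, collected with the keyword "Prabowo Subianto". Each dataset includes several variables and pieces of information used to determine user sentiment. In total, the data has 16 variables (attributes) and more than 1,000 observations. Table 1 lists some of the variables.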

# Import libraries
library(ggplot2)
library(lubridate)
library(dplyr)  # provides %>%, arrange() and mutate(), used for the pie charts below

# Load the Joko Widodo data
data.jokowi.df = read.csv(file = 'data-joko-widodo.csv',
                          header = TRUE,
                          sep = ',')
senti.jokowi = read.csv(file = 'sentiment-joko-widodo.csv',
                        header = TRUE,
                        sep = ',')
                        
# Load the Prabowo Subianto data
data.prabowo.df = read.csv(file = 'data-prabowo-subianto.csv',
                           header = TRUE,
                           sep = ',')
senti.prabowo = read.csv(file = 'sentiment-prabowo-subianto.csv',
                         header = TRUE,
                         sep = ',')

Data visualization

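Data exploration aims to extract information from the Twitter data. Note that the data has already been through text preprocessing. We explore the variables that we consider interesting.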

# Bar plot of tweets - JOKO WIDODO
data.jokowi.df$created = ymd_hms(data.jokowi.df$created,
                                 tz = 'Asia/Jakarta')
# Another way to create the "date" and "hour" variables
data.jokowi.df$date = date(data.jokowi.df$created)
data.jokowi.df$hour = hour(data.jokowi.df$created)
# Date 2019-05-29
# (the plot subtitle and the text below refer to 28 May 2019;
#  adjust the date here if your data calls for it)
data.jokowi.date1 = subset(x = data.jokowi.df,
                           date == '2019-05-29')
data.hour.date1 = data.frame(table(data.jokowi.date1$hour))
colnames(data.hour.date1) = c('Hour','Total.Tweets')
# Create the data visualization
ggplot(data.hour.date1)+
  geom_bar(aes(x = Hour,
               y = Total.Tweets,
               fill = I('blue')),
           stat = 'identity',
           alpha = 0.75,
           show.legend = FALSE)+
  geom_hline(yintercept = mean(data.hour.date1$Total.Tweets),
             col = I('black'),
             size = 1)+
  geom_text(aes(fontface = 'italic',
                label = paste('Average:',
                              ceiling(mean(data.hour.date1$Total.Tweets)),
                              'Tweets per hour'),
                x = 8,
                y = mean(data.hour.date1$Total.Tweets)+20),
            hjust = 'left',
            size = 4)+
  labs(title = 'Total Tweets per Hours - Joko Widodo',
       subtitle = '28 May 2019',
       caption = 'Twitter Crawling 28 - 29 May 2019')+
  xlab('Time of Day')+
  ylab('Total Tweets')+
  scale_fill_brewer(palette = 'Dark2')+
  theme_bw()
  
# Bar plot of tweets - PRABOWO SUBIANTO
data.prabowo.df$created = ymd_hms(data.prabowo.df$created,
                                  tz = 'Asia/Jakarta')

# Another way to create the "date" and "hour" variables
data.prabowo.df$date = date(data.prabowo.df$created)
data.prabowo.df$hour = hour(data.prabowo.df$created)

# Date 2019-05-28
data.prabowo.date1 = subset(x = data.prabowo.df,
                            date == '2019-05-28')
data.hour.date1 = data.frame(table(data.prabowo.date1$hour))
colnames(data.hour.date1) = c('Hour','Total.Tweets')

# Date 2019-05-29
data.prabowo.date2 = subset(x = data.prabowo.df,
                            date == '2019-05-29')
data.hour.date2 = data.frame(table(data.prabowo.date2$hour))
colnames(data.hour.date2) = c('Hour','Total.Tweets')
data.hour.date3 = rbind(data.hour.date1,data.hour.date2)
data.hour.date3$Date = c(rep(x = '2019-05-28',
                             len = nrow(data.hour.date1)),
                         rep(x = '2019-05-29',
                             len = nrow(data.hour.date2)))
data.hour.date3$Labels = c(letters,'A','B')
data.hour.date3$Hour = as.character(data.hour.date3$Hour)
data.hour.date3$Hour = as.numeric(data.hour.date3$Hour)

# Data preprocessing: blank out every second hour label so the axis stays readable
for (i in 1:nrow(data.hour.date3)) {
  if (i%%2 == 0) {
    data.hour.date3[i,'Hour'] = ''
  }
  if (i%%2 == 1) {
    data.hour.date3[i,'Hour'] = data.hour.date3[i,'Hour']
  }
}
data.hour.date3$Hour = as.factor(data.hour.date3$Hour)

# Data visualization
ggplot(data.hour.date3)+
  geom_bar(aes(x = Labels,
               y = Total.Tweets,
               fill = Date),
           stat = 'identity',
           alpha = 0.75,
           show.legend = TRUE)+
  geom_hline(yintercept = mean(data.hour.date3$Total.Tweets),
             col = I('black'),
             size = 1)+
  geom_text(aes(fontface = 'italic',
                label = paste('Average:',
                              ceiling(mean(data.hour.date3$Total.Tweets)),
                              'Tweets per hour'),
                x = 5,
                y = mean(data.hour.date3$Total.Tweets)+6),
            hjust = 'left',
            size = 3.8)+
  scale_x_discrete(limits = data.hour.date3$Labels,
                   labels = data.hour.date3$Hour)+
  labs(title = 'Total Tweets per Hours - Prabowo Subianto',
       subtitle = '28 - 29 May 2019',
       caption = 'Twitter Crawling 28 - 29 May 2019')+
  xlab('Time of Day')+
  ylab('Total Tweets')+
  ylim(c(0,100))+
  theme_bw()+
  theme(legend.position = 'bottom',
        legend.title = element_blank())+
  scale_fill_brewer(palette = 'Dark2')

Based on Figure 1, we can conclude that the number of tweets obtained by crawling differs between the two keywords ("Joko Widodo" and "Prabowo Subianto"), even for the same date.

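For example, Figure 1 (left) shows that tweets with the keyword "Joko Widodo" were obtained only between 03:00 and 17:00 WIB on 28 May 2019. In Figure 1 (right), tweets with the keyword "Prabowo Subianto" were obtained across 28-29 May 2019: between 12:00 and 23:59 WIB on 28 May and between 00:00 and 15:00 WIB on 29 May.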
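The hourly counts behind these bar charts can be reproduced on a small scale. The following is a minimal sketch in base R, using a hypothetical vector of timestamps (not the article's data) that is assumed to be in Jakarta local time. Unlike `table()` applied to the raw hour values, it keeps hours with no tweets as zero counts, which makes collection gaps such as the ones described above easier to see.

```r
# Hypothetical timestamps, assumed to be Jakarta local time (WIB)
created = as.POSIXct(c('2019-05-28 03:15:00',
                       '2019-05-28 03:40:00',
                       '2019-05-28 05:05:00',
                       '2019-05-28 17:59:00'),
                     tz = 'Asia/Jakarta')

# Extract the hour of day (0-23) from each timestamp
hour.of.day = as.integer(format(created, '%H'))

# Count tweets per hour; fixing the factor levels keeps empty hours as zeros
hour.counts = data.frame(table(factor(hour.of.day, levels = 0:23)))
colnames(hour.counts) = c('Hour','Total.Tweets')

# Hours in which at least one tweet was collected
active.hours = as.character(hour.counts$Hour[hour.counts$Total.Tweets > 0])
print(active.hours)
```

The resulting `hour.counts` data frame has the same two columns as the ones plotted in the article, so it could be passed to the same `ggplot()` calls.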

# Tweets on 2019-05-28
ggplot(data.hour.date1)+
  geom_bar(aes(x = Hour,
               y = Total.Tweets,
               fill = I('red')),
           stat = 'identity',
           alpha = 0.75,
           show.legend = FALSE)+
  geom_hline(yintercept = mean(data.hour.date1$Total.Tweets),
             col = I('black'),
             size = 1)+
  geom_text(aes(fontface = 'italic',
                label = paste('Average:',
                              ceiling(mean(data.hour.date1$Total.Tweets)),
                              'Tweets per hour'),
                x = 6.5,
                y = mean(data.hour.date1$Total.Tweets)+5),
            hjust = 'left',
            size = 4)+
  labs(title = 'Total Tweets per Hours - Prabowo Subianto',
       subtitle = '28 May 2019',
       caption = 'Twitter Crawling 28 - 29 May 2019')+
  xlab('Time of Day')+
  ylab('Total Tweets')+
  ylim(c(0,100))+
  theme_bw()+
  scale_fill_brewer(palette = 'Dark2')
  
# Tweets on 2019-05-29
ggplot(data.hour.date2)+
  geom_bar(aes(x = Hour,
               y = Total.Tweets,
               fill = I('red')),
           stat = 'identity',
           alpha = 0.75,
           show.legend = FALSE)+
  geom_hline(yintercept = mean(data.hour.date2$Total.Tweets),
             col = I('black'),
             size = 1)+
  geom_text(aes(fontface = 'italic',
                label = paste('Average:',
                              ceiling(mean(data.hour.date2$Total.Tweets)),
                              'Tweets per hour'),
                x = 1,
                y = mean(data.hour.date2$Total.Tweets)+6),
            hjust = 'left',
            size = 4)+
  labs(title = 'Total Tweets per Hours - Prabowo Subianto',
       subtitle = '29 May 2019',
       caption = 'Twitter Crawling 28 - 29 May 2019')+
  xlab('Time of Day')+
  ylab('Total Tweets')+
  ylim(c(0,100))+
  theme_bw()+
  scale_fill_brewer(palette = 'Dark2')

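Based on Figure 2, there is a marked difference between users who tweet with the keyword "Joko Widodo" and those who tweet with "Prabowo Subianto". Tweets with the keyword "Joko Widodo" are concentrated in a specific window (07:00-09:00 WIB), peaking at 08:00 WIB with 348 tweets. In contrast, tweets with the keyword "Prabowo Subianto" are spread steadily across 28-29 May 2019, averaging 36 tweets per hour over those two days.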
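The two summary figures quoted here, the peak hour and the average number of tweets per hour, are simple to compute from an hourly count table. The sketch below uses made-up hourly totals (not the article's data) and rounds the average up with `ceiling()`, as the plot labels in the code above do.

```r
# Hypothetical hourly totals (illustrative values only)
hour.counts = data.frame(Hour = c(6, 7, 8, 9, 10),
                         Total.Tweets = c(12, 90, 150, 60, 9))

# Hour with the largest number of tweets
peak.hour = hour.counts$Hour[which.max(hour.counts$Total.Tweets)]

# Average tweets per hour, rounded up as in the plot labels
avg.tweets = ceiling(mean(hour.counts$Total.Tweets))

print(paste('Peak hour:', peak.hour))
print(paste('Average:', avg.tweets, 'Tweets per hour'))
```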

# JOKO WIDODO
# Keep only negative and positive tweets (%in% avoids the recycling bug of ==)
df.score.1 = subset(senti.jokowi, class %in% c('Negative','Positive'))
colnames(df.score.1) = c('Score','Text','Sentiment')
# Data viz
ggplot(df.score.1)+
  geom_density(aes(x = Score,
                   fill = Sentiment),
               alpha = 0.75)+
  xlim(c(-11,11))+
  labs(title = 'Density Plot of Sentiment Scores',
       subtitle = 'Joko Widodo',
       caption = 'Twitter Crawling 28 - 29 May 2019')+
  xlab('Score')+ 
  ylab('Density')+
  theme_bw()+
  scale_fill_brewer(palette = 'Dark2')+
  theme(legend.position = 'bottom',
        legend.title = element_blank())
        
# PRABOWO SUBIANTO
# Keep only negative and positive tweets (%in% avoids the recycling bug of ==)
df.score.2 = subset(senti.prabowo, class %in% c('Negative','Positive'))
colnames(df.score.2) = c('Score','Text','Sentiment')
ggplot(df.score.2)+
  geom_density(aes(x = Score,
                   fill = Sentiment),
               alpha = 0.75)+
  xlim(c(-11,11))+
  labs(title = 'Density Plot of Sentiment Scores',
       subtitle = 'Prabowo Subianto',
       caption = 'Twitter Crawling 28 - 29 May 2019')+
  xlab('Score')+ 
  ylab('Density')+
  theme_bw()+
  scale_fill_brewer(palette = 'Dark2')+
  theme(legend.position = 'bottom',
        legend.title = element_blank())

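Figure 3 is a bar chart of the number of tweets with the keywords "Joko Widodo" and "Prabowo Subianto" on 28-29 May 2019. Figure 3 (left) shows that Twitter users talk about Prabowo Subianto less often between 19:00 and 23:59 WIB, which corresponds to the time when people in Indonesia rest. Tweets on this topic nevertheless keep appearing around midnight, because some users live abroad and others are still active. Users then become active at 04:00 WIB, activity peaks at 07:00 WIB, and it declines until it rises again at 12:00 WIB.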

# JOKO WIDODO
df.senti.score.1 = data.frame(table(senti.jokowi$score))
colnames(df.senti.score.1) = c('Score','Freq')
# Data preprocessing: give negative scores a negative frequency so their bars point downward
df.senti.score.1$Score = as.character(df.senti.score.1$Score)
df.senti.score.1$Score = as.numeric(df.senti.score.1$Score)
Score1 = df.senti.score.1$Score
sign(df.senti.score.1[1,1])
for (i in 1:nrow(df.senti.score.1)) {
  sign.row = sign(df.senti.score.1[i,'Score'])
  for (j in 1:ncol(df.senti.score.1)) {
    df.senti.score.1[i,j] = df.senti.score.1[i,j] * sign.row
  }
}
df.senti.score.1$Label = c(letters[1:nrow(df.senti.score.1)])
df.senti.score.1$Sentiment = ifelse(df.senti.score.1$Freq < 0,
                                    'Negative','Positive')
df.senti.score.1$Score1 = Score1
# Data visualization
ggplot(df.senti.score.1)+
  geom_bar(aes(x = Label,
               y = Freq,
               fill = Sentiment),
           stat = 'identity',
           show.legend = FALSE)+
  # Positive sentiment
  geom_hline(yintercept = mean(abs(df.senti.score.1[which(df.senti.score.1$Sentiment == 'Positive'),'Freq'])),
             col = I('black'),
             size = 1)+
  geom_text(aes(fontface = 'italic',
                label = paste('Average Freq:',
                              ceiling(mean(abs(df.senti.score.1[which(df.senti.score.1$Sentiment == 'Positive'),'Freq'])))),
                x = 10,
                y = mean(abs(df.senti.score.1[which(df.senti.score.1$Sentiment == 'Positive'),'Freq']))+30),
            hjust = 'right',
            size = 4)+
  # Negative sentiment
  geom_hline(yintercept = mean(df.senti.score.1[which(df.senti.score.1$Sentiment == 'Negative'),'Freq']),
             col = I('black'),
             size = 1)+
  geom_text(aes(fontface = 'italic',
                label = paste('Average Freq:',
                              ceiling(mean(abs(df.senti.score.1[which(df.senti.score.1$Sentiment == 'Negative'),'Freq'])))),
                x = 5,
                y = mean(df.senti.score.1[which(df.senti.score.1$Sentiment == 'Negative'),'Freq'])-15),
            hjust = 'left',
            size = 4)+
  labs(title = 'Barplot of Sentiments',
       subtitle = 'Joko Widodo',
       caption = 'Twitter Crawling 28 - 29 May 2019')+
  xlab('Score')+
  scale_x_discrete(limits = df.senti.score.1$Label,
                   labels = df.senti.score.1$Score1)+
  theme_bw()+
  scale_fill_brewer(palette = 'Dark2')

# PRABOWO SUBIANTO
df.senti.score.2 = data.frame(table(senti.prabowo$score))
colnames(df.senti.score.2) = c('Score','Freq')
# Data preprocessing: give negative scores a negative frequency so their bars point downward
df.senti.score.2$Score = as.character(df.senti.score.2$Score)
df.senti.score.2$Score = as.numeric(df.senti.score.2$Score)
Score2 = df.senti.score.2$Score
sign(df.senti.score.2[1,1])
for (i in 1:nrow(df.senti.score.2)) {
  sign.row = sign(df.senti.score.2[i,'Score'])
  for (j in 1:ncol(df.senti.score.2)) {
    df.senti.score.2[i,j] = df.senti.score.2[i,j] * sign.row
  }
}
df.senti.score.2$Label = c(letters[1:nrow(df.senti.score.2)])
df.senti.score.2$Sentiment = ifelse(df.senti.score.2$Freq < 0,
                                    'Negative','Positive')
df.senti.score.2$Score1 = Score2
# Data visualization
ggplot(df.senti.score.2)+
  geom_bar(aes(x = Label,
               y = Freq,
               fill = Sentiment),
           stat = 'identity',
           show.legend = FALSE)+
  # Positive sentiment
  geom_hline(yintercept = mean(abs(df.senti.score.2[which(df.senti.score.2$Sentiment == 'Positive'),'Freq'])),
             col = I('black'),
             size = 1)+
  geom_text(aes(fontface = 'italic',
                label = paste('Average Freq:',
                              ceiling(mean(abs(df.senti.score.2[which(df.senti.score.2$Sentiment == 'Positive'),'Freq'])))),
                x = 11,
                y = mean(abs(df.senti.score.2[which(df.senti.score.2$Sentiment == 'Positive'),'Freq']))+20),
            hjust = 'right',
            size = 4)+
  # Negative sentiment
  geom_hline(yintercept = mean(df.senti.score.2[which(df.senti.score.2$Sentiment == 'Negative'),'Freq']),
             col = I('black'),
             size = 1)+
  geom_text(aes(fontface = 'italic',
                label = paste('Average Freq:',
                              ceiling(mean(abs(df.senti.score.2[which(df.senti.score.2$Sentiment == 'Negative'),'Freq'])))),
                x = 9,
                y = mean(df.senti.score.2[which(df.senti.score.2$Sentiment == 'Negative'),'Freq'])-10),
            hjust = 'left',
            size = 4)+
  labs(title = 'Barplot of Sentiments',
       subtitle = 'Prabowo Subianto',
       caption = 'Twitter Crawling 28 - 29 May 2019')+
  xlab('Score')+
  scale_x_discrete(limits = df.senti.score.2$Label,
                   labels = df.senti.score.2$Score1)+
  theme_bw()+
  scale_fill_brewer(palette = 'Dark2')

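Figure 4 shows density plots of the sentiment scores for tweets containing the keywords "Joko Widodo" and "Prabowo Subianto". A tweet's score is the average of the scores of the root words that make it up, where each root word is assigned a score between -10 and 10. The lower the score, the more negative the sentiment in the tweet, and vice versa. From Figure 4 (left), negative-sentiment tweets containing the keyword "Joko Widodo" score between -10 and -1, with a middle value of -4. The same applies to positive sentiment (with positive scores, of course). The density plot in Figure 4 (left) also shows that the positive sentiment scores have a fairly small variance, so we conclude that positive sentiment in tweets containing the keyword "Joko Widodo" is not very varied.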
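To make the scoring rule concrete, here is a minimal sketch of averaging per-word scores over a tweet. The lexicon, its English words, and the helper `score.tweet()` are all hypothetical and exist only for illustration; the article does not show its lexicon or scoring code. The sketch also assumes that words missing from the lexicon are ignored and that a tweet with no scored words gets a score of 0.

```r
# Hypothetical lexicon: root word -> score in [-10, 10]
lexicon = c(good = 5, great = 8, bad = -4, terrible = -9)

# Average the scores of the words in a tweet that appear in the lexicon
score.tweet = function(text, lexicon) {
  # Lowercase the text and split on anything that is not a letter
  words = strsplit(tolower(text), '[^a-z]+')[[1]]
  scores = lexicon[words[words %in% names(lexicon)]]
  # Assumption: tweets without any scored word are treated as neutral (0)
  if (length(scores) == 0) return(0)
  mean(scores)
}

score.mixed = score.tweet('A great day, not bad!', lexicon)
score.none = score.tweet('Nothing to report', lexicon)
print(c(score.mixed, score.none))
```

In this toy case "great" (8) and "bad" (-4) average to a mildly positive score, which shows how a single strongly scored word can outweigh a weaker one of opposite sign.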

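Figure 4 (right) shows the density plot of sentiment scores for tweets containing the keyword "Prabowo Subianto". It differs from Figure 4 (left) in that the negative sentiment scores range from -8 to -1, meaning the tweets are not strongly negative (negative sentiment is present, but not extreme). The distribution of negative sentiment scores also has two peaks, between -4 and -1. Positive sentiment scores range from 1 to 10. Compared with Figure 4 (left), positive sentiment in Figure 4 (right) has a higher variance, with two peaks in the range 3 to 10. This suggests that tweets containing the keyword "Prabowo Subianto" carry strongly positive sentiment.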

# JOKO WIDODO
df.senti.3 = as.data.frame(table(senti.jokowi$class))
colnames(df.senti.3) = c('Sentiment','Freq')
# Data preprocessing: compute proportions and label positions (needs dplyr)
df.pie.1 = df.senti.3
df.pie.1$Prop = df.pie.1$Freq/sum(df.pie.1$Freq)
df.pie.1 = df.pie.1 %>%
  arrange(desc(Sentiment)) %>%
  mutate(lab.ypos = cumsum(Prop) - 0.5*Prop)
# Data visualization
ggplot(df.pie.1,
       aes(x = 2,
           y = Prop,
           fill = Sentiment))+
  geom_bar(stat = 'identity',
           col = 'white',
           alpha = 0.75,
           show.legend = TRUE)+
  coord_polar(theta = 'y', 
              start = 0)+
  geom_text(aes(y = lab.ypos,
                label = Prop),
            color = 'white',
            fontface = 'italic',
            size = 4)+
  labs(title = 'Piechart of Sentiments',
       subtitle = 'Joko Widodo',
       caption = 'Twitter Crawling 28 - 29 May 2019')+
  xlim(c(0.5,2.5))+
  theme_void()+
  scale_fill_brewer(palette = 'Dark2')+
  theme(legend.title = element_blank(),
        legend.position = 'right')
        
# PRABOWO SUBIANTO
df.senti.4 = as.data.frame(table(senti.prabowo$class))
colnames(df.senti.4) = c('Sentiment','Freq')
# Data preprocessing: compute proportions and label positions (needs dplyr)
df.pie.2 = df.senti.4
df.pie.2$Prop = df.pie.2$Freq/sum(df.pie.2$Freq)
df.pie.2 = df.pie.2 %>%
  arrange(desc(Sentiment)) %>%
  mutate(lab.ypos = cumsum(Prop) - 0.5*Prop)
# Data visualization
ggplot(df.pie.2,
       aes(x = 2,
           y = Prop,
           fill = Sentiment))+
  geom_bar(stat = 'identity',
           col = 'white',
           alpha = 0.75,
           show.legend = TRUE)+
  coord_polar(theta = 'y', 
              start = 0)+
  geom_text(aes(y = lab.ypos,
                label = Prop),
            color = 'white',
            fontface = 'italic',
            size = 4)+
  labs(title = 'Piechart of Sentiments',
       subtitle = 'Prabowo Subianto',
       caption = 'Twitter Crawling 28 - 29 May 2019')+
  xlim(c(0.5,2.5))+
  theme_void()+
  scale_fill_brewer(palette = 'Dark2')+
  theme(legend.title = element_blank(),
        legend.position = 'right')

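Figure 5 summarizes the sentiment scores of the tweets, grouped into negative, neutral, and positive sentiment. Negative sentiment means a score below zero, neutral a score equal to zero, and positive a score above zero. Figure 5 shows that the percentage of negative sentiment for tweets with the keyword "Joko Widodo" is lower than for tweets with the keyword "Prabowo Subianto", a difference of 6.3%. Tweets containing the keyword "Joko Widodo" also have higher shares of neutral and positive sentiment than tweets with the keyword "Prabowo Subianto". The pie charts therefore suggest that tweets with the keyword "Joko Widodo" tend to have a higher proportion of positive sentiment. The density plots, however, show from the distribution of positive and negative scores that tweets containing the keyword "Prabowo Subianto" tend to have higher sentiment scores than those containing "Joko Widodo". Further analysis is needed.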
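The three-way split by score sign and the proportions shown in the pie charts can be sketched as follows. The score vector is invented for illustration and is not taken from the article's sentiment files.

```r
# Hypothetical tweet scores (illustrative values only)
scores = c(-4, -1, 0, 0, 2, 3, 5, 0, -2, 1)

# Classify by sign: below zero, exactly zero, above zero
sentiment = ifelse(scores < 0, 'Negative',
                   ifelse(scores == 0, 'Neutral', 'Positive'))

# Share of each class, as plotted in a pie chart
prop = prop.table(table(sentiment))
print(round(100 * prop, 1))
```

Rounding the proportions before using them as labels, as in the last line, also keeps long decimal values out of the chart.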

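Figure 6 shows the terms or words that users frequently used in tweets uploaded on 28-29 May 2019 (keywords "Joko Widodo" and "Prabowo Subianto"). This word cloud visualization reveals the trending topics discussed under each keyword. For tweets containing the keyword "Joko Widodo", the five most frequent terms are "tuang", "petisi", "negara", "aman", and "nusantara". For tweets containing the keyword "Prabowo Subianto", the five most frequent terms are "Prabowo", "Subianto", "kriminalisasi", "selamat", and "dubai". This indirectly reveals a pattern in tweets uploaded with the keyword "Prabowo Subianto": almost every such tweet contains the name "Prabowo Subianto" directly rather than through a mention (@), since mentions (@) were removed during text preprocessing.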
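The term counts that feed a word cloud, and the effect of stripping mentions first, can be illustrated with base R. The three tweets below are made up, and the regular expression for mentions is an assumption about what the preprocessing might look like, not the author's actual cleaning code.

```r
# Hypothetical tweets (illustrative only)
tweets = c('@user1 petisi negara aman',
           'negara aman @user2',
           'petisi negara')

# Remove mentions such as @user1 (assumed pattern: "@" followed by word characters)
clean = gsub('@\\w+', '', tweets)

# Split into words on whitespace and drop empty strings
words = unlist(strsplit(trimws(clean), '\\s+'))
words = words[words != '']

# Term frequencies, most frequent first
freq = sort(table(words), decreasing = TRUE)
print(head(freq, 2))
```

Because mentions are removed before counting, a name can only rank among the top terms if it was typed directly in the tweet text, which is the pattern described above.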

The code is available in my GitHub repo: https://github.com/audhiaprilliant/Indonesia-Public-Election-Twitter-Sentiment-Analysis


Original article: https://towardsdatascience.com/twitter-data-visualization-fb4f45b63728
