Creating word clouds in R (word cloud generation)
阿新 • Published: 2019-01-02
1. Select packages
library("tm")            # text mining
library("SnowballC")     # word stemming, if necessary
library("wordcloud2")    # word cloud generation
library("RColorBrewer")  # color palettes for the word cloud
library("webshot")       # save a word cloud as an image
library("htmlwidgets")   # save the widget as HTML (saveWidget)
2. Clean text
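The text I am processing is song lyrics, so punctuation, digits, and some common special symbols (Unicode characters, $, @, and so on) have to be removed. Note that R gives no special treatment to contractions such as I'm, I'd, and We'd when punctuation is removed: the apostrophe is simply deleted. For this project's requirements, we turn I'm into I m instead.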
clean.text <- function(x) {
  # remove "rt" (disabled)
  ## x = gsub("rt", "", x)
  # remove @mentions
  x = gsub("@\\w+", "", x)
  # I'm to I m: replace straight and curly apostrophes with a space.
  # This has to run before punctuation is removed, otherwise I'm becomes Im.
  x = gsub("['\u2019]", " ", x)
  # remove punctuation
  x = gsub("[[:punct:]]", "", x)
  # remove numbers
  x = gsub("[[:digit:]]", "", x)
  # remove links (disabled)
  # x = gsub("http\\w+", "", x)
  # remove tabs (disabled)
  # x = gsub("[ |\t]{2,}", "", x)
  # remove blank spaces at the beginning (disabled)
  # x = gsub("^ ", "", x)
  # remove blank spaces at the end
  x = gsub(" $", "", x)
  # replace unicode (non-ASCII) characters with a space
  x = gsub("[^\x20-\x7E]", " ", x)
  return(x)
}
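The order of the substitutions matters for contractions. A minimal base-R check, using a made-up sample line rather than the actual lyrics: stripping punctuation first glues the two halves of a contraction together, while replacing the apostrophe first gives the intended I m form.

```r
# Made-up sample line (not taken from the lyrics data set)
line <- "I'm sure we'd go"

# Punctuation stripped first: the apostrophe disappears and the halves merge
punct_first <- gsub("[[:punct:]]", "", line)

# Apostrophe replaced by a space first, then the remaining punctuation stripped
apostrophe_first <- gsub("[[:punct:]]", "", gsub("'", " ", line))

print(punct_first)       # "Im sure wed go"
print(apostrophe_first)  # "I m sure we d go"
```

The sample uses straight apostrophes only; lyrics copied from web pages often contain the curly form as well, which would need its own substitution.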
3. Generate word clouds
generate_sep <- function() {
  for (i in 0:254) {
    filePath <- file.path("C:/Users/ntu/Downloads/LEARNING/wordcloud/participants_lyrics",
                          paste("p", toString(i), ".txt", sep = ""))
    # read the lyrics file and clean its contents (not the path string itself)
    text <- clean.text(readLines(filePath))
    #print(text[i])
    docs <- Corpus(VectorSource(text))
    dtm <- TermDocumentMatrix(docs)
    m <- as.matrix(dtm)  # m is a matrix with terms as rows and documents as columns
    # total frequency of each term across all documents, hence rowSums
    v <- sort(rowSums(m), decreasing = TRUE)
    d <- data.frame(word = names(v), freq = v)
    #print(m)
    #write.csv(m, "word_freq.csv")
    set.seed(1234)
    #wc <- wordcloud2(words = d$word, freq = d$freq, min.freq = 1,
    #                 max.words=200, random.order=FALSE, rot.per=0.35,
    #                 colors=brewer.pal(8, "Dark2"))
    wc <- wordcloud2(d, size = 1.5)
    # paste0 avoids the spaces that paste() would insert into the file names
    htmlFile <- paste0("tmp", i, ".html")
    saveWidget(wc, htmlFile, selfcontained = FALSE)
    # save as png
    webshot(htmlFile, paste0("wordcloud_", i, ".png"),
            delay = 5, vwidth = 480, vheight = 480)
  }
}
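wordcloud2 takes a data frame with one column of words and one column of frequencies. The sketch below builds such a data frame in base R only, from a made-up three-line sample. It splits on whitespace, so it is a stand-in for tm's TermDocumentMatrix, not a copy of it: tm applies its own tokenization rules and the counts can differ.

```r
# Made-up sample standing in for one cleaned lyrics file
lyrics <- c("love me do", "you know i love you", "love love me do")

# Split every line on whitespace and pool the words
words <- unlist(strsplit(lyrics, "\\s+"))

# Count each word and put the most frequent first
counts <- sort(table(words), decreasing = TRUE)

# Two columns, word and freq, in the shape wordcloud2 expects
d <- data.frame(word = names(counts),
                freq = as.integer(counts),
                stringsAsFactors = FALSE)
print(head(d, 1))  # the top row is "love" with a frequency of 4
```

A data frame like `d` can be passed directly to `wordcloud2(d, size = 1.5)`.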
results
4. A little trick
If you want size and color to encode two different quantities, for example word size for recall and color for precision, the following code does it.
library("openxlsx")
library("wordcloud")
library("RColorBrewer")
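generate_wordclouds <- function() {
  # pick the Excel file interactively; it is expected to contain the
  # columns word, recall and precision
  xlsxfile <- file.choose()
  df <- read.xlsx(xlsxfile, colNames = TRUE, rowNames = FALSE)
  png("wordcloud_consci31.png", width = 800, height = 800)
  # word size follows recall; with ordered.colors = TRUE, colors must give
  # one color per word, picked here by the rank of its precision value
  # (the palette has 9 colors, so this assumes at most 9 distinct values)
  wc <- wordcloud(words = df$word, freq = df$recall, min.freq = 1,
                  scale = c(6, 0.2), max.words = 1000, random.order = FALSE,
                  rot.per = 0.1, ordered.colors = TRUE,
                  colors = brewer.pal(9, "Blues")[factor(df$precision)],
                  family = "script", font = 2)
  dev.off()
}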
4.1 wordcloud parameters:
- freq = df$recall  # word size represents recall
- colors = brewer.pal(9, "Blues")[factor(df$precision)], family = "script", font = 2  # 9 is the number of colors in the palette
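Indexing a 9-color palette with factor codes returns NA for the 10th and any later distinct precision value. One possible workaround, sketched here under assumptions, is to bin precision into nine intervals with `cut()` first. The precision values are made up, and `colorRampPalette` from base grDevices stands in for `brewer.pal` so that the snippet runs without extra packages.

```r
# Made-up precision values, one per word (more than 9 distinct values)
precision <- c(0.05, 0.20, 0.33, 0.42, 0.50, 0.61, 0.75, 0.80, 0.88, 0.93, 0.97)

# Nine-step light-to-dark blue ramp, a stand-in for brewer.pal(9, "Blues")
palette9 <- grDevices::colorRampPalette(c("#DEEBF7", "#08306B"))(9)

# Bin precision into 9 equal-width intervals on [0, 1]; labels = FALSE
# returns the bin index (1 to 9) instead of a factor
bins <- cut(precision, breaks = seq(0, 1, length.out = 10),
            include.lowest = TRUE, labels = FALSE)

# One color per word, in the same order as the words
word_colors <- palette9[bins]
print(anyNA(word_colors))  # FALSE
```

`word_colors` can then be passed as `colors = word_colors` together with `ordered.colors = TRUE`.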