Creating word clouds in R (word cloud generation)

Contents

  • 1. Select packages

  • 2. Clean text

  • 3. Generate word clouds

  • 4. A little trick

1. Select packages

library("tm") #text mining
library("SnowballC")  #word stemming if necessary
library("wordcloud2") #word cloud generation
library("RColorBrewer") #color of word cloud
library("webshot") #save a word cloud as image
library("htmlwidgets") #save a widget as an HTML file (saveWidget)
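
Before loading these packages it can help to check that they are installed. The following is a minimal sketch of that check, not part of the original workflow; `missing_packages` is a hypothetical helper name introduced here for illustration. The commented lines at the end show the install step, including PhantomJS, which webshot uses to render HTML to an image.

```r
# Hypothetical helper: report which of the given packages are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

required <- c("tm", "SnowballC", "wordcloud2", "RColorBrewer",
              "webshot", "htmlwidgets")
to_install <- missing_packages(required)
print(to_install)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
# webshot needs PhantomJS to turn the HTML widget into a PNG:
# webshot::install_phantomjs()
```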

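2. Clean text

Because I am working with song lyrics, I need to remove punctuation, digits, and some common special symbols (Unicode characters, $, @, and so on). Note that R gives contractions such as I'm, I'd, and We'd no special treatment when it removes punctuation, so the apostrophe is simply dropped. For this project we turn I'm into I m instead.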
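clean.text <- function(x)
{
  
  # remove rt (disabled)
  ##x = gsub("rt", "", x)
  # remove @mentions
  x = gsub("@\\w+", "", x)
  # I'm to I m: replace straight and curly apostrophes with a space.
  # This has to run BEFORE punctuation removal, otherwise the apostrophe
  # is already gone and I'm becomes Im
  x = gsub("['\u2019]", " ", x)
  # remove punctuation
  x = gsub("[[:punct:]]", "", x)
  # remove numbers
  x = gsub("[[:digit:]]", "", x)
  # remove links http (disabled)
  # x = gsub("http\\w+", "", x)
  ## remove tabs (disabled)
  #x = gsub("[ |\t]{2,}", "", x)
  # remove blank spaces at the beginning (disabled)
  #x = gsub("^ ", "", x)
  # replace anything outside printable ASCII (unicode symbols) with a space
  x = gsub("[^\x20-\x7E]", " ",x)
  # remove blank spaces at the end
  x = gsub(" +$", "", x)
  return(x)
}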
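
The order of the substitutions matters for contractions. As a small illustration (the sample line is made up), the sketch below compares removing punctuation first with replacing the apostrophe first, using only base R `gsub`:

```r
line <- "I'm sure we'd sing!"

# Removing punctuation first deletes the apostrophe, so "I'm" collapses to "Im".
punct_first <- gsub("[[:punct:]]", "", line)

# Replacing the apostrophe with a space first keeps the two parts separate;
# the remaining punctuation is removed afterwards.
apostrophe_first <- gsub("[[:punct:]]", "", gsub("'", " ", line))

print(punct_first)
print(apostrophe_first)
```

The second form produces the "I m" style the project asks for.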

3. Generate word clouds

generate_sep<-function(){
	for (i in (0:254)){
		filePath <- file.path("C:/Users/ntu/Downloads/LEARNING/wordcloud/participants_lyrics",paste0("p",i,".txt"))
		#print(text[i])
		# read the lyrics file and clean its contents (not the path string)
		text <- clean.text(readLines(filePath, warn = FALSE))
		docs <- Corpus(VectorSource(text))
		dtm <- TermDocumentMatrix(docs)
		m <- as.matrix(dtm) # m is a matrix with terms as rows and documents as columns
		# total frequency of each term across all documents (row sums)
		v <- sort(rowSums(m),decreasing=TRUE)
		d <- data.frame(word = names(v),freq=v)
		#print(m)
		#write.csv(m, "word_freq.csv")

		set.seed(1234)
		#wc <- wordcloud2(words = d$word, freq = d$freq, min.freq = 1,
		#          max.words=200, random.order=FALSE, rot.per=0.35, 
		#          colors=brewer.pal(8, "Dark2"))
		wc <- wordcloud2(d, size=1.5)
		# paste0 avoids the spaces that paste() would insert into the file names
		htmlFile <- paste0("tmp",i,".html")
		saveWidget(wc,htmlFile,selfcontained = FALSE)
		# save in png
		webshot(htmlFile,paste0("wordcloud_",i,".png"), delay =5, vwidth = 480, vheight=480)
	}
}
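
To make the shape of the frequency table `d` concrete, here is a minimal base-R sketch that builds the same kind of two-column data frame (word, freq) without tm. The lyric lines are made up, and the tokenization here is a simplification: tm applies its own tokenization and filtering defaults, so its counts may differ.

```r
# Made-up lyric lines standing in for one participant's file.
lyrics <- c("we will we will rock you", "rock you tonight")

# Split on anything that is not a lowercase letter and drop empty tokens.
words <- unlist(strsplit(tolower(lyrics), "[^a-z]+"))
words <- words[nzchar(words)]

# Count each word and sort from most to least frequent.
counts <- sort(table(words), decreasing = TRUE)
d <- data.frame(word = names(counts), freq = as.integer(counts),
                stringsAsFactors = FALSE)
print(d)
```

A data frame of this shape, with the word in the first column and its frequency in the second, is what the `wordcloud2(d, size=1.5)` call consumes.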
Results

Word cloud of one user's lyrics

4. A little trick

If you want size and color in the word cloud to represent different data, for example word size for recall and color for precision, the following code does it.
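library("openxlsx") #read .xlsx files
library("wordcloud") #word cloud with per-word colors
library("RColorBrewer") #color palettes

generate_wordclouds<-function(){

	# pick the spreadsheet interactively; it needs the columns word, recall and precision
	xlsxfile <- file.choose()
	df <- read.xlsx(xlsxfile, colNames = TRUE, rowNames = FALSE)
	png("wordcloud_consci31.png", width=800,height=800)
	# word size follows recall, color follows precision.
	# ordered.colors=TRUE assigns colors to words in order, so colors needs one entry per word.
	# factor() turns each distinct precision value into a palette index, so this
	# assumes at most 9 distinct precision values (the palette has 9 colors)
	wc <- wordcloud(words = df$word, freq = df$recall, min.freq = 1,
		scale = c(6, 0.2),max.words=1000, random.order=FALSE,
		rot.per=0.1, ordered.colors=TRUE,
		colors=brewer.pal(9, "Blues")[factor(df$precision)],family = "script",font=2)
	dev.off()
}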
4.1 wordcloud arguments:
  • freq = df$recall # word size follows recall
  • colors = brewer.pal(9, "Blues")[factor(df$precision)], family = "script", font = 2 # 9 is the number of colors in the palette
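
Indexing the palette with `factor(df$precision)` works when precision has at most 9 distinct values. If it has more, one possible alternative is to bin precision into 9 intervals with `cut()`. The sketch below uses made-up data and, to stay self-contained, builds a nine-step blue ramp with `colorRampPalette` as a stand-in for `brewer.pal(9, "Blues")`; both choices are assumptions for illustration.

```r
# Made-up example data; the column names mirror the spreadsheet used above.
df <- data.frame(
  word      = c("love", "night", "dance", "heart", "rain"),
  recall    = c(0.90, 0.75, 0.60, 0.40, 0.20),
  precision = c(0.95, 0.31, 0.58, 0.77, 0.12),
  stringsAsFactors = FALSE
)

# Stand-in for brewer.pal(9, "Blues"): nine shades from light to dark blue.
palette9 <- colorRampPalette(c("#DEEBF7", "#08306B"))(9)

# cut() assigns each precision value to one of 9 equal-width bins (1..9).
bin <- cut(df$precision, breaks = 9, labels = FALSE)
word_colors <- palette9[bin]

print(data.frame(word = df$word, precision = df$precision, bin = bin))
```

The resulting `word_colors` vector has one entry per word, so it could be passed as `colors` together with `ordered.colors=TRUE`.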
4.2 Visualization result

[Image: the resulting word cloud]

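Note: I used two packages here, wordcloud and wordcloud2. I originally wanted to implement this last part with wordcloud2 as well, but that package does not seem to support it. If you have a working example, please share it with me. Thanks!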
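
One possibility worth trying: the wordcloud2 documentation describes its `color` argument as also accepting a vector of colors, one per row of the data. The sketch below is an assumption-laden illustration, not a confirmed solution to the author's case: the data are made up, the two-color threshold is arbitrary, and the rendering has not been checked against the original spreadsheet.

```r
# Made-up example data: recall drives word size, precision drives color.
df <- data.frame(
  word      = c("love", "night", "dance", "heart", "rain"),
  recall    = c(0.90, 0.75, 0.60, 0.40, 0.20),
  precision = c(0.95, 0.31, 0.58, 0.77, 0.12),
  stringsAsFactors = FALSE
)

# One color per word: dark blue for high precision, light blue otherwise.
# The 0.5 threshold is an arbitrary choice for this illustration.
word_colors <- ifelse(df$precision > 0.5, "#08306B", "#9ECAE1")

# Only draw the cloud if wordcloud2 is installed.
if (requireNamespace("wordcloud2", quietly = TRUE)) {
  # wordcloud2 reads the word from the first column and the frequency from the second.
  wc <- wordcloud2::wordcloud2(df[, c("word", "recall")],
                               size = 1.5, color = word_colors)
}
```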