Statistical summary in R
阿新 • Published: 2019-01-02
- Summarize the statistics of the data
- Visualize the statistics (boxplot and histogram)
- View the data
library("AzureML")
ws <- workspace()
dat <- download.datasets(ws, "Automobile price data (Raw)")
head(dat)
head(dat) prints the first rows of the dataset.
- Process the data frame:
cols = c('price','bore','stroke','horsepower','peak.rpm')
## Convert ? to an NA
dat[,cols] = lapply(dat[,cols], function(x) ifelse(x == '?', NA, x))
## Remove rows with NAs
dat = dat[complete.cases(dat),]
## Convert character columns to numeric
dat[,cols] = lapply(dat[,cols], as.numeric)
str(dat)
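complete.cases(dataframe) returns a logical vector that is TRUE for every row without an NA, so indexing the data frame with it keeps only the complete rows. lapply(dataframe, function) applies the function to each column of the data frame, which is how the selected columns are cleaned and converted in one step. str(dat) then prints the structure of the processed data frame.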
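The same three steps can be tried on a small data frame without an Azure ML workspace. This is only a minimal sketch: the data frame `toy` and its values are made up for illustration and are not taken from the automobile price data.

```r
# Toy data frame (made-up values); '?' marks a missing entry
toy <- data.frame(price = c("13495", "?", "16500"),
                  bore  = c("3.47", "2.68", "?"),
                  make  = c("a", "b", "c"),
                  stringsAsFactors = FALSE)
cols <- c("price", "bore")

# Replace '?' with NA, column by column
toy[, cols] <- lapply(toy[, cols], function(x) ifelse(x == "?", NA, x))

# Logical vector with one entry per row: TRUE if the row has no NA
keep <- complete.cases(toy)
toy <- toy[keep, ]

# Convert the cleaned character columns to numeric
toy[, cols] <- lapply(toy[, cols], as.numeric)
str(toy)
```

Only the first row survives here, because each of the other two rows contains one '?'.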
- View the summary of the data frame:
describe = function(df, col){
  tmp = df[, col]
  sumry = summary(tmp)
  nms = names(sumry)
  nms = c(nms, 'std')
  out = c(sumry, sd(tmp))
  names(out) = nms
  out
}
describe(dat, 'horsepower')
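The result contains the minimum, the first quartile Q1 (25%), the median Q2 (50%), the mean, the third quartile Q3 (75%), the maximum, and the standard deviation (std).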
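To see where these numbers come from, the summary can be cross-checked against direct calls to quantile() and sd(). The sketch below uses the built-in mtcars data set as a stand-in, which is an assumption made so that the example runs without the automobile price data.

```r
# Stand-in data: built-in mtcars, not the automobile price data
x <- mtcars$mpg

# Summary statistics plus the standard deviation, as in describe()
out <- c(summary(x), std = sd(x))

# Direct computation of the three quartiles for comparison
q <- quantile(x, c(0.25, 0.5, 0.75))

print(out)
print(q)
```

The entries named "1st Qu.", "Median" and "3rd Qu." in `out` should match the 25%, 50% and 75% quantiles in `q`.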
- Visualize the statistics:
options(repos = c(CRAN = "http://cran.rstudio.com"))
install.packages('gridExtra')
plotstats = function(df, col, bins = 30){
  require(ggplot2)
  require(gridExtra)
  ## Dummy factor used as the grouping variable of the boxplot
  dat = as.factor('')
  ## Compute bin width
  bin.width = (max(df[, col]) - min(df[, col]))/ bins
  ## Plot a histogram
  p1 = ggplot(df, aes_string(col)) + geom_histogram(binwidth = bin.width)
  ## A simple boxplot
  p2 = ggplot(df, aes_string(dat, col)) + geom_boxplot() + coord_flip() + ylab('')
  ## Now stack the plots
  grid.arrange(p2, p1, nrow = 2)
}
plotstats(dat, 'price')
The result is a horizontal boxplot of price stacked above a histogram of price.
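The bin-width rule and the stacked layout can also be tried with base graphics only, which avoids installing ggplot2 and gridExtra. This is a rough sketch, not the method used above: the helper name `plotstats_base` is hypothetical, and the built-in mtcars data set stands in for the automobile price data.

```r
# Hypothetical helper: base-graphics analogue of plotstats()
plotstats_base <- function(x, bins = 30, label = "value") {
  # Same rule as above: range of the data divided by the number of bins
  bin_width <- (max(x) - min(x)) / bins
  # bins + 1 equally spaced break points from min(x) to max(x)
  breaks <- seq(min(x), max(x), length.out = bins + 1)

  # Two stacked panels; restore the old settings when the function exits
  op <- par(mfrow = c(2, 1))
  on.exit(par(op))

  boxplot(x, horizontal = TRUE)
  h <- hist(x, breaks = breaks, main = "", xlab = label)

  # Return the computed values invisibly so they can be inspected
  invisible(list(bin_width = bin_width, counts = h$counts))
}

# Stand-in data: horsepower column of the built-in mtcars data set
res <- plotstats_base(mtcars$hp, bins = 10, label = "hp")
print(res$bin_width)
```

Because the break points span exactly the range of the data, every observation falls into one of the bins, so the bin counts add up to the number of rows.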