Statistical summary in R

  • Summarize the statistics of the data

  • Visualize the statistics (box plot and histogram)

  1. View the data:

    library("AzureML")
    ws <- workspace()
    dat <- download.datasets(ws, "Automobile price data (Raw)")
    head(dat)

    Result: head(dat) prints the first six rows of the data frame.

  2. Process the data frame:

    cols = c('price','bore','stroke','horsepower','peak.rpm')
    ## Convert '?' to NA
    dat[,cols] = lapply(dat[,cols], function(x) ifelse(x == '?', NA, x))
    ## Remove rows with NAs
    dat = dat[complete.cases(dat),]
    ## Convert character columns to numeric
    dat[,cols] = lapply(dat[,cols], as.numeric)
    str(dat)

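    complete.cases(dataframe) returns a logical vector that is TRUE for every row without an NA, so indexing the data frame with it keeps only the complete rows. lapply(dataframe, function) applies a function to each column of the data frame; here it is used first to replace '?' with NA and then to convert the selected columns to numeric. str(dat) then prints the structure of the processed data frame.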
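The cleaning idiom above can be tried on a small example. The following is a minimal sketch: the data frame `toy` and its values are invented for illustration and are not taken from the automobile dataset.

```r
## Toy data frame (hypothetical values); '?' marks a missing entry
toy = data.frame(price = c('100', '?', '300'),
                 bore  = c('3.1', '3.2', '?'),
                 make  = c('a', 'b', 'c'),
                 stringsAsFactors = FALSE)
cols = c('price', 'bore')
## Replace '?' with NA in the selected columns
toy[, cols] = lapply(toy[, cols], function(x) ifelse(x == '?', NA, x))
## Logical vector: TRUE for rows that contain no NA
keep = complete.cases(toy)
toy = toy[keep, ]
## Convert the cleaned character columns to numeric
toy[, cols] = lapply(toy[, cols], as.numeric)
str(toy)
```

Only the first row survives, because each of the other two rows has a '?' in one of the selected columns.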

  3. View the summary of the data frame:

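    describe = function(df, col){
      ## Extract the column as a vector
      tmp = df[, col]
      ## Min, quartiles, median, mean and max
      sumry = summary(tmp)
      nms = names(sumry)
      ## Append the standard deviation under the name 'std'
      nms = c(nms, 'std')
      out = c(sumry, sd(tmp))
      names(out) = nms
      out
    }
    describe(dat, 'horsepower')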

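    The result includes the minimum, the first quartile (Q1, 25%), the median (Q2, 50%), the mean, the third quartile (Q3, 75%), the maximum, and the standard deviation (std).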
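To make the shape of the output concrete, here is a small sketch that runs the same helper on made-up data. The data frame `toy` is hypothetical; the function is repeated only so that the snippet runs on its own.

```r
## Same helper as above, repeated so the snippet is self-contained
describe = function(df, col){
  tmp = df[, col]
  sumry = summary(tmp)
  out = c(sumry, sd(tmp))
  names(out) = c(names(sumry), 'std')
  out
}
## Hypothetical data: the numbers 1 to 5
toy = data.frame(x = c(1, 2, 3, 4, 5))
res = describe(toy, 'x')
print(names(res))
## "Min." "1st Qu." "Median" "Mean" "3rd Qu." "Max." "std"
print(res[['Median']])   # 3
print(res[['std']])      # sqrt(2.5), about 1.58
```

The standard deviation is simply appended as a seventh named element, so individual statistics can be picked out by name.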

  4. Visualize the statistics: 

    options(repos = c(CRAN = "http://cran.rstudio.com"))
    install.packages('gridExtra')
    plotstats = function(df, col, bins = 30){
      require(ggplot2)
      require(gridExtra)
      ## Compute bin width so the data range is split into `bins` bins
      bin.width = (max(df[, col]) - min(df[, col])) / bins
      ## Plot a histogram; .data[[col]] replaces the deprecated
      ## aes_string() and needs ggplot2 >= 3.0.0
      p1 = ggplot(df, aes(x = .data[[col]])) +
        geom_histogram(binwidth = bin.width) + xlab(col)
      ## A simple boxplot; the constant x = '' yields a single box
      p2 = ggplot(df, aes(x = '', y = .data[[col]])) +
        geom_boxplot() + coord_flip() + ylab('')
      ## Now stack the plots: boxplot on top, histogram below
      grid.arrange(p2, p1, nrow = 2)
    }
    
    plotstats(dat, 'price')

    Result: a horizontal box plot of price stacked above a histogram of price.
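The numbers behind the two plots can also be inspected without drawing anything. The sketch below uses base R only and a made-up sample `x`; it illustrates the bin-width formula and the box-plot statistics, and is not a reproduction of what ggplot2 computes internally. In particular, geom_histogram may align its bins differently and geom_boxplot derives its hinges from quantiles, so the values can differ slightly.

```r
## Hypothetical sample; 9 is a deliberate outlier
x = c(1, 2, 2, 3, 3, 3, 4, 4, 5, 9)
bins = 4
## Same bin-width formula as in plotstats
bin.width = (max(x) - min(x)) / bins
## Bin edges from min to max; length.out avoids floating-point drift
breaks = seq(min(x), max(x), length.out = bins + 1)
## Count observations per bin without drawing a plot
counts = hist(x, breaks = breaks, plot = FALSE)$counts
## Lower whisker, lower hinge, median, upper hinge, upper whisker
box = boxplot.stats(x)
print(bin.width)   # 2
print(counts)      # 6 3 0 1
print(box$stats)   # 1 2 3 4 5
print(box$out)     # 9
```

The outlier stretches the range, so most observations fall into the first bin; the box plot reports the same point separately in `box$out`.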