
Handling Missing Values in R

Missing values

1. is.na(): locating missing values

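Note: missing values are considered non-comparable, even with themselves. This means comparison operators cannot be used to test for missing values. For example, the logical test myvar == NA never evaluates to TRUE. Instead, you must use functions designed for missing values (such as those described in this section) to identify the missing values in an R data object.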
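The point above can be checked with a small sketch. The vector `myvar` and its values are made up for illustration; only base R functions are used.

```r
# Illustrative vector; the values are made up for this sketch
myvar <- c(1, NA, 3, NA)

# Comparing with NA yields NA for every element, never TRUE
cmp <- myvar == NA

# is.na() returns TRUE exactly where a value is missing
missing_flags <- is.na(myvar)

# which() turns the flags into positions; anyNA() answers "is anything missing?"
missing_pos <- which(missing_flags)
has_missing <- anyNA(myvar)

print(cmp)
print(missing_flags)
print(missing_pos)
```

Because the comparison result is itself NA, a condition such as `if (myvar[2] == NA)` raises an error rather than silently taking a branch.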

2. na.omit(): deleting incomplete observations
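
As a minimal sketch of what na.omit() does, the example below applies it to a small made-up data frame (the name `df` and its values are assumptions for illustration).

```r
# Small made-up data frame with missing values in rows 2 and 3
df <- data.frame(x = c(1, 2, NA, 4), y = c("a", NA, "c", "d"))

# na.omit() keeps only the rows that contain no NA
complete_df <- na.omit(df)
print(complete_df)

# The indices of the dropped rows are stored in the "na.action" attribute
dropped <- attr(complete_df, "na.action")
print(as.integer(dropped))
```

Note that a row is dropped as soon as any one of its columns is missing, so on wide data sets this can discard a large share of the observations.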


manyNAs

manyNAs(data, nORp = 0.2)
Arguments

data
A data frame with the data set.

nORp
A number controlling when a row is considered to have too many NA values (defaults to 0.2, i.e. 20% of the columns). If no rows satisfy the constraint indicated by the user, a warning is generated.
Rows are flagged according to their proportion of missing values.
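
The proportion rule can be approximated in base R. The sketch below is not DMwR's implementation: `rows_with_many_nas` is a hypothetical helper and the data are made up, so boundary behaviour may differ from manyNAs().

```r
# Made-up data: 5 columns, so 20% of the columns corresponds to 1 column
df <- data.frame(
  a = c(1, NA, 3, NA),
  b = c(1, NA, 3, 4),
  c = c(1, 2, 3, 4),
  d = c(NA, 2, 3, 4),
  e = c(1, 2, 3, 4)
)

# Hypothetical helper (not part of DMwR): indices of rows whose share of
# NA columns exceeds the proportion p
rows_with_many_nas <- function(data, p = 0.2) {
  which(rowMeans(is.na(data)) > p)
}

idx <- rows_with_many_nas(df)
print(idx)

# Drop the flagged rows, mirroring the data[-manyNAs(data), ] idiom
df_kept <- if (length(idx) > 0) df[-idx, ] else df
```

The `length(idx) > 0` guard matters: indexing with `-integer(0)` selects no rows at all, which is presumably why manyNAs() warns when nothing is flagged.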

3. knnImputation(): k-nearest-neighbour imputation

library(DMwR)
knnImputation(data, k = 10, scale = T, meth = "weighAvg", distData = NULL)

Arguments

data
A data frame with the data set.

k
The number of nearest neighbours to use (defaults to 10).

scale
Boolean setting whether the data should be scaled before finding the nearest neighbours (defaults to T).

meth
String indicating the method used to calculate the value to fill in each NA. Available values are 'median' or 'weighAvg' (the default).

distData
Optionally, you may specify here a data frame containing the data set that should be used to find the neighbours. This is useful when filling in NA values on a test set, where you should use only information from the training set. This defaults to NULL, which means that the neighbours will be searched for in data.

Details
This function uses the k-nearest neighbours to fill in the unknown (NA) values in a data set. For each case with any NA value it will search for its k most similar cases and use the values of these cases to fill in the unknowns.

If meth='median', the function will use either the median (in case of numeric variables) or the most frequent value (in case of factors) of the neighbours to fill in the NAs. If meth='weighAvg', the function will use a weighted average of the values of the neighbours. The weights are given by exp(-dist(k,x)), where dist(k,x) is the Euclidean distance between the case with NAs (x) and the neighbour k.
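
To make the weighted-average rule concrete, here is a simplified base R sketch for a single missing value. It is not the DMwR implementation: the data are made up, no scaling is applied, and k is set to 2 only to keep the example small.

```r
# Made-up numeric data; row 1 has a missing value in column z
dat <- data.frame(
  x = c(1.0, 1.1, 0.9, 5.0, 5.2),
  y = c(2.0, 2.1, 1.8, 9.0, 8.8),
  z = c(NA, 10.0, 12.0, 50.0, 52.0)
)
k <- 2  # number of neighbours used in this sketch

target <- dat[1, ]
candidates <- dat[complete.cases(dat), ]

# Euclidean distance computed on the columns observed in the target row
obs_cols <- c("x", "y")
diffs <- sweep(as.matrix(candidates[, obs_cols]), 2, unlist(target[, obs_cols]))
d <- sqrt(rowSums(diffs^2))

# Keep the k nearest neighbours and weight each one by exp(-distance)
nearest <- order(d)[seq_len(k)]
w <- exp(-d[nearest])
z_hat <- sum(w * candidates$z[nearest]) / sum(w)
print(z_hat)
```

The two nearest neighbours have z values of 10 and 12, and the closer one receives the larger weight, so the estimate falls between 10 and 11 rather than at the plain mean of 11.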

Example:

# Load the package first, then clean the data
library(DMwR) 
data(algae) 
algae <- algae[-manyNAs(algae), ] 
clean.algae <- knnImputation(algae[,1:12],k=10) 
> head(clean.algae)
  season  size  speed mxPH mnO2     Cl    NO3     NH4    oPO4     PO4 Chla   a1
1 winter small medium 8.00  9.8 60.800  6.238 578.000 105.000 170.000 50.0  0.0
2 spring small medium 8.35  8.0 57.750  1.288 370.000 428.750 558.750  1.3  1.4
3 autumn small medium 8.10 11.4 40.020  5.330 346.667 125.667 187.057 15.6  3.3
4 spring small medium 8.07  4.8 77.364  2.302  98.182  61.182 138.700  1.4  3.1
5 autumn small medium 8.06  9.0 55.350 10.416 233.700  58.222  97.580 10.5  9.2
6 winter small   high 8.25 13.1 65.750  9.248 430.000  18.250  56.667 28.4 15.1

4. centralImputation(): central-value imputation

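Fills in missing values using the median of the non-missing observations (for nominal columns, where no median exists, DMwR uses the most frequent value instead).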

data(algae)
cleanAlgae <- centralImputation(algae)
summary(cleanAlgae)
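
For readers without DMwR installed, the same idea can be sketched in base R. The helper `fill_central` is hypothetical and the data are made up; it mirrors the rule described above rather than reproducing the package's code.

```r
# Made-up data with one numeric column and one factor column
df <- data.frame(
  height = c(1.5, NA, 1.7, 1.9),
  group  = factor(c("a", "b", NA, "b"))
)

# Hypothetical helper: median for numeric columns, most frequent level otherwise
fill_central <- function(col) {
  if (is.numeric(col)) {
    col[is.na(col)] <- median(col, na.rm = TRUE)
  } else {
    # table() ignores NA by default; which.max() picks the first most frequent level
    col[is.na(col)] <- names(which.max(table(col)))
  }
  col
}

# Apply the helper column by column and rebuild the data frame
filled <- as.data.frame(lapply(df, fill_central))
print(filled)
```

Central-value imputation is fast, but it ignores relationships between columns and shrinks the variance of the imputed variable, which is one reason to consider knnImputation() instead.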

5. complete.cases(): finding complete observations

x <- airquality[, -1] # x is a regression design matrix
y <- airquality[,  1] # y is the corresponding response
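# Check that, for a vector, complete.cases() is the element-wise complement of is.na()
stopifnot(complete.cases(y) != is.na(y))
# Logical vector marking the rows that are complete in both x and y
ok <- complete.cases(x, y)
# Number of rows with at least one missing value
sum(!ok) # how many are not "ok" ?
# Keep only the complete rows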
x <- x[ok,]
y <- y[ok]
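
complete.cases() also accepts a whole data frame. The short sketch below, using the airquality data set that ships with R, shows that subsetting by its result selects the same rows as na.omit().

```r
# airquality ships with R (datasets package)
ok <- complete.cases(airquality)

# Subsetting by the logical vector keeps only the rows without any NA
aq_complete <- airquality[ok, ]

# na.omit() selects the same rows (and additionally records what it dropped)
aq_omit <- na.omit(airquality)

print(c(total = nrow(airquality), complete = nrow(aq_complete)))
```

The logical vector is handy when the same row filter has to be applied to several objects, as with x and y above, whereas na.omit() is the shorter choice for a single data frame.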

6. na.fail(): raising an error when missing values are present

DF <- data.frame(x = c(1, 2, 3), y = c(0, 10, NA))
na.fail(DF)

Error in na.fail.default(DF) : missing values in object
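
In a script, the error raised by na.fail() can be intercepted with tryCatch(). The sketch below is one possible pattern; returning NULL from the handler is a choice made for this example, not a convention of na.fail().

```r
DF <- data.frame(x = c(1, 2, 3), y = c(0, 10, NA))

# na.fail() signals an error when any value is missing; tryCatch() intercepts it
result <- tryCatch(
  na.fail(DF),
  error = function(e) {
    message("Missing values detected: ", conditionMessage(e))
    NULL  # sentinel value chosen for this sketch
  }
)

# With complete data, na.fail() simply returns its argument unchanged
clean <- na.fail(DF[1:2, ])
print(clean)
```

This makes na.fail() useful as a guard at the start of an analysis step that must not silently run on incomplete data.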