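Importing Only Selected Columns of a Text File in R
阿新 • Published: 2019-02-06
Background: the text data has too many columns. Reading only the columns that are actually needed reduces memory pressure and makes the analysis easier.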
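Method 1: read.table and read.csv
Use the built-in functions with the colClasses argument. It takes the class of every column in the file (you can read just a few rows with the nrows argument to check the classes); any column you do not need is given the class "NULL" and is skipped. For example: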
<span style="font-family:Courier New;font-size:14px;">
# Build a small example data frame and write it to a text file
dat <- structure(list(Year = 2009:2011,
    Jan = c(-41L, -41L, -21L), Feb = c(-27L, -27L, -27L),
    Mar = c(-25L, -25L, -2L), Apr = c(-31L, -31L, -6L),
    May = c(-31L, -31L, -10L), Jun = c(-39L, -39L, -32L),
    Jul = c(-25L, -25L, -13L), Aug = c(-15L, -15L, -12L),
    Sep = c(-30L, -30L, -27L), Oct = c(-27L, -27L, -30L),
    Nov = c(-21L, -21L, -38L), Dec = c(-25L, -25L, -29L)),
    .Names = c("Year", "Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"),
    class = "data.frame", row.names = c(NA, -3L))
write.table(dat, "test.txt", row.names = FALSE)

## Inspect the class of each column by reading only the first rows
df <- read.table("test.txt", nrows = 2, header = TRUE)
# sapply works column by column; apply() would first coerce the data frame to a matrix
sapply(df, class)
#      Year       Jan       Feb       Mar       Apr       May       Jun       Jul       Aug       Sep       Oct       Nov       Dec
# "integer" "integer" "integer" "integer" "integer" "integer" "integer" "integer" "integer" "integer" "integer" "integer" "integer"

# Keep the first 7 columns and skip the remaining 6
df <- read.table("test.txt", colClasses = c(rep("integer", 7), rep("NULL", 6)), header = TRUE)
# > df
#   Year Jan Feb Mar Apr May Jun
# 1 2009 -41 -27 -25 -31 -31 -39
# 2 2010 -41 -27 -25 -31 -31 -39
# 3 2011 -21 -27  -2  -6 -10 -32

# The same approach works with read.csv
write.csv(dat, "test.csv", row.names = FALSE)
df <- read.csv("test.csv", colClasses = c(rep("integer", 7), rep("NULL", 6)), header = TRUE)
# > df
#   Year Jan Feb Mar Apr May Jun
# 1 2009 -41 -27 -25 -31 -31 -39
# 2 2010 -41 -27 -25 -31 -31 -39
# 3 2011 -21 -27  -2  -6 -10 -32
</span>
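Counting column positions by hand gets tedious when the columns to keep are scattered across a wide file. The following is a minimal sketch of picking columns by name instead: it reads only the header row, then builds the colClasses vector from it. The helper name read_columns is hypothetical (it is not part of base R), and the sketch assumes a comma-separated file with a header row whose names are already valid R names.

```r
# Example data: the first five columns of the data set used above
dat <- data.frame(Year = 2009:2011,
                  Jan = c(-41L, -41L, -21L),
                  Feb = c(-27L, -27L, -27L),
                  Mar = c(-25L, -25L, -2L),
                  Apr = c(-31L, -31L, -6L))
path <- tempfile(fileext = ".csv")
write.csv(dat, path, row.names = FALSE)

# Hypothetical helper: read only the columns named in `keep`
read_columns <- function(path, keep) {
  # Read a single data row just to obtain the column names
  header <- names(read.csv(path, nrows = 1))
  unknown <- setdiff(keep, header)
  if (length(unknown) > 0) {
    stop("columns not found: ", paste(unknown, collapse = ", "))
  }
  # "NULL" skips a column; NA lets read.csv detect the class itself
  classes <- rep("NULL", length(header))
  classes[header %in% keep] <- NA
  read.csv(path, colClasses = classes)
}

df <- read_columns(path, c("Year", "Apr"))
print(df)
```

The columns come back in file order, not in the order given in keep. Using NA instead of an explicit class trades a little speed for not having to know the classes in advance.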
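Method 2: the colbycol package
I could not get this package to install, and it no longer appears to be supported.
Method 3: a package with database functionality (RJDBC)
This effectively uses Java to solve the problem. It is too complicated, so I did not try it myself.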
<span style="font-family:Courier New;">
library(RJDBC)

# Write the example data as a CSV file in the working directory
write.table(x = mtcars, file = "mtcars.csv", sep = ",", row.names = FALSE, col.names = TRUE)

# Load the CsvJdbc driver from its jar file
path.to.jdbc.driver <- "jdbc//csvjdbc-1.0-18.jar"
drv <- JDBC("org.relique.jdbc.csv.CsvDriver", path.to.jdbc.driver)
# Connect to the working directory; the CSV file is then queried as table "mtcars"
conn <- dbConnect(drv, sprintf("jdbc:relique:csv:%s", getwd()))

head(dbGetQuery(conn, "select * from mtcars"), 3)
#    mpg cyl disp  hp drat    wt  qsec vs am gear carb
# 1   21   6  160 110  3.9  2.62 16.46  0  1    4    4
# 2   21   6  160 110  3.9 2.875 17.02  0  1    4    4
# 3 22.8   4  108  93 3.85  2.32 18.61  1  1    4    1

# Select only the columns that are needed
head(dbGetQuery(conn, "select mpg, gear from mtcars"), 3)
</span>
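Method 4: Linux command-line tools
Fast and convenient, but you need to know the syntax of awk and cut.
cut does one job and suits cleanly delimited data, while awk is considerably more powerful (awk usage 1, awk usage 2).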
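To illustrate the awk side of this comparison, here is a minimal sketch that selects two columns and filters rows in the same pass, which cut alone cannot do. It assumes awk is available on the PATH (Linux or macOS); the row filter on Year is an illustrative choice, not something from the original workflow.

```r
# Example data: the first five columns of the data set used in this post
dat <- data.frame(Year = 2009:2011,
                  Jan = c(-41L, -41L, -21L),
                  Feb = c(-27L, -27L, -27L),
                  Mar = c(-25L, -25L, -2L),
                  Apr = c(-31L, -31L, -6L))
path <- tempfile(fileext = ".txt")
# Tab-separated, unquoted, so awk sees plain field values
write.table(dat, path, row.names = FALSE, sep = "\t", quote = FALSE)

# awk program: keep the header line (NR == 1) and rows with Year >= 2010,
# and print only fields 1 and 5, tab-separated
cmd <- sprintf(
  "awk -F'\\t' 'BEGIN { OFS = \"\\t\" } NR == 1 || $1 >= 2010 { print $1, $5 }' %s",
  shQuote(path)
)

# R reads the command's output as if it were a file
df <- read.table(pipe(cmd), header = TRUE, sep = "\t")
print(df)
```

Because the filtering happens before R sees the data, neither the discarded rows nor the discarded columns are ever loaded into memory.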
<span style="font-family:Courier New;font-size:14px;">
dat <- structure(list(Year = 2009:2011,
    Jan = c(-41L, -41L, -21L), Feb = c(-27L, -27L, -27L),
    Mar = c(-25L, -25L, -2L), Apr = c(-31L, -31L, -6L),
    May = c(-31L, -31L, -10L), Jun = c(-39L, -39L, -32L),
    Jul = c(-25L, -25L, -13L), Aug = c(-15L, -15L, -12L),
    Sep = c(-30L, -30L, -27L), Oct = c(-27L, -27L, -30L),
    Nov = c(-21L, -21L, -38L), Dec = c(-25L, -25L, -29L)),
    .Names = c("Year", "Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"),
    class = "data.frame", row.names = c(NA, -3L))

# Tab-separated file (tab is cut's default delimiter)
write.table(dat, "test.txt", row.names = FALSE, sep = "\t")
# Read only fields 1 and 5 through a pipe
df <- read.table(pipe("cut -f 1,5 test.txt"), header = TRUE)
df
system("cut -f 1,5 test.txt")
#   Year Apr
# 1 2009 -31
# 2 2010 -31
# 3 2011  -6

# Space-separated file (the delimiter must be given with -d)
write.table(dat, "test.txt", row.names = FALSE, sep = " ")
df <- read.table(pipe("cut -d ' ' -f 1,5 test.txt"), header = TRUE)
df
system("cut -d ' ' -f 1,5 test.txt")
#   Year Apr
# 1 2009 -31
# 2 2010 -31
# 3 2011  -6
</span>
Summary
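(1) When the data is not very large and you know its contents, read it with read.table and a suitable colClasses argument.
(2) When the data is large and speed matters, use the data-processing tools available on Linux.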