R語言 Julia以及全基因組選擇
阿新 • • 發佈:2018-11-03
小編著:
最近在學Julia語言,想測試一下和R的區別,發現前輩的部落格,翻譯時不禁感慨,這是2018年了,部落格是2010年的,8年已過,我才聽說Julia。但……不晚!
文章來源: https://www.r-bloggers.com/r-julia-and-genome-wide-selection/
我想起一些瑣事的事情,以及一些斷裂的程式碼,在2010年我參加了基因組選擇的summer school(http://taurus.ansci.iastate.edu/wiki/pages/E4o0S0C7/Course_materials.html),共有2000個個體,20000個SNP(0,1,2),然後使用MCMC 計算育種值,使用的是R語言。
nmarkers = 2000; # number of markers
startMarker = 1981; # set to 1 to use all
numiter = 2000; # number of iterations
vara = 1.0/20.0;
# input data
data = matrix(scan("trainData.out0"),ncol=nmarkers+2,byrow=TRUE);
nrecords = dim(data)[1];
beg = Sys.time()
# x has the mean followed by the markers
x = cbind(1,data[,startMarker:nmarkers]);
y = data[,nmarkers+1];
a = data[,nmarkers+2];
# inital values
nmarkers = nmarkers - startMarker + 1;
mean2pq = 0.5; # just an approximation
scalea = 0.5*vara/(nmarkers*mean2pq); # 0.5 = (v-2)/v for v=4
size = dim(x)[2];
b = array(0.0,size);
meanb = b;
b[1 ] = mean(y);
var = array(0.0,size);
# adjust y
ycorr = y - x%*%b;
# MCMC sampling
for (iter in 1:numiter){
# sample vare
vare = ( t(ycorr)%*%ycorr )/rchisq(1,nrecords + 3);
# sample intercept
ycorr = ycorr + x[,1]*b[1];
rhs = sum(ycorr)/vare;
invLhs = 1.0/(nrecords/vare);
mean = rhs*invLhs;
b[1] = rnorm(1,mean,sqrt(invLhs));
ycorr = ycorr - x[,1]*b[1];
meanb[1] = meanb[1] + b[1];
# sample variance for each locus
for (locus in 2:size){
var[locus] = (scalea*4+b[locus]*b[locus])/rchisq(1,4.0+1)
}
# sample effect for each locus
for (locus in 2:size){
# unadjust y for this locus
ycorr = ycorr + x[,locus]*b[locus];
rhs = t(x[,locus])%*%ycorr/vare;
lhs = t(x[,locus])%*%x[,locus]/vare + 1.0/var[locus];
invLhs = 1.0/lhs;
mean = invLhs*rhs;
b[locus]= rnorm(1,mean,sqrt(invLhs));
#adjust y for the new value of this locus
ycorr = ycorr - x[,locus]*b[locus];
meanb[locus] = meanb[locus] + b[locus];
}
}
Sys.time() - beg
meanb = meanb/numiter;
aHat = x %*% meanb;
然後,我們需要定義幾個新的變數,將基因組資料,表型資料以及育種值資料讀進矩陣裡面,簡歷幾個迴圈,進行向量的運算。
我使用Julia去做類似的事情:
nmarkers = 2000 # Number of markers
startmarker = 1981 # Set to 1 to use all
numiter = 2000 # Number of iterations
data = dlmread("markers.csv", ',')
(nrecords, ncols) = size(data)
tic()
#this is the mean and markers matrix
X = hcat(ones(Float64, nrecords), data[:, startmarker:nmarkers])
y = data[:, nmarkers + 1]
a = data[:, nmarkers + 2]
nmarkers = nmarkers - startmarker + 1
vara = 1.0/nmarkers
mean2pq = 0.5
scalea = 0.5*vara/(nmarkers*mean2pq) # 0.5 = (v-2)/v for v=4
ndesign = size(X, 2)
b = zeros(Float64, ndesign)
meanb = zeros(Float64, ndesign)
b[1] = mean(y)
varian = zeros(Float64, ndesign)
# adjust y
ycorr = y - X * b
# MCMC sampling
for i = 1:numiter
# sample vare
vare = dot(ycorr, ycorr )/randchi2(nrecords + 3)
# sample intercept
ycorr = ycorr + X[:, 1] * b[1];
rhs = sum(ycorr)/vare;
invlhs = 1.0/(nrecords/vare);
mn = rhs*invlhs;
b[1] = randn() * sqrt(invlhs) + mn;
ycorr = ycorr - X[:, 1] * b[1];
meanb[1] = meanb[1] + b[1];
# sample variance for each locus
for locus = 2:ndesign
varian[locus] = (scalea*4 + b[locus]*b[locus])/randchi2(4.0 + 1);
end
# sample effect for each locus
for locus = 2:ndesign
# unadjust y for this locus
ycorr = ycorr + X[:, locus] * b[locus];
rhs = dot(X[:, locus], ycorr)/vare;
lhs = dot(X[:, locus], X[:, locus])/vare + 1.0/varian[locus];
invlhs = 1.0/lhs;
mn = invlhs * rhs;
b[locus] = randn() * sqrt(invlhs) + mn;
#adjust y for the new value of this locus
ycorr = ycorr - X[:, locus] * b[locus];
meanb[locus] = meanb[locus] + b[locus];
end
end
toc()
meanb = meanb/numiter;
aHat = X * meanb;
這兩個程式碼比較相似,但是也有一些不同:
- 第一個讀入的資料是二進位制的,我不知道Julia如何操作,所以我就將其轉為csv,然後讀取。
- 為了防止名稱重複,R中可以隨意命名,但是Julia不行,所以我在Julia程式中進行了修改。
- R中可以賦值,a=b,你改變a和b都沒問題,但是Julia中你動了b,a也變了。語法不太一樣。
- Julia中向量和陣列不太一樣,太令人困惑了。
比較有意思的是,Julia的程式碼在速度上不是很突出,因為我的程式碼太粗糙了。我變化了marker的數目,發現Julia的運算速度大約是R的2.8倍。Julia在官網上稱進行數值運算時,速度是R的100倍,但是我的沒有達到這麼高。
在1996年或者1997年是,我由SAS轉到了ASReml進行基因組資料分析,它大約提高了1~2倍的速度,而且支援了更多的模型。
現在,又到了更換軟體的時候了,由R轉到Julia,特別是基因組選擇方面,Julia的速度是R的3倍,Julia效能非常優秀。