Supervised analysis with ADMIXTURE

Overview

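ADMIXTURE fits its model by maximum likelihood (block relaxation by default, with an EM algorithm available as an option) and is typically used to estimate each individual's ancestry for a specified number of subpopulations K. When the population structure of the material is unknown, you can instead run it with cross-validation over a range of K values and take the K with the lowest cross-validation error as the recommended number of subpopulations. If the population labels of some individuals are known in advance (with complete certainty), the supervised mode is worth considering: fixing those labels can improve the accuracy of the assignment.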
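The cross-validation procedure can be sketched in shell. This is only a sketch: it assumes the `admixture` binary is on `PATH`, that the input is a PLINK binary fileset, and that each log contains a line of the form `CV error (K=3): 0.47835`. `run_cv` and `best_k` are hypothetical helper names, not part of ADMIXTURE.

```shell
# run_cv PREFIX MAXK: run ADMIXTURE with cross-validation for K = 1..MAXK,
# saving the output of each run to logK.out (hypothetical helper).
run_cv() {
    prefix=$1
    max_k=$2
    k=1
    while [ "$k" -le "$max_k" ]; do
        admixture --cv "$prefix.bed" "$k" | tee "log${k}.out"
        k=$((k + 1))
    done
}

# best_k LOG...: print the K whose "CV error (K=...)" value is smallest.
# Assumes log lines look like: CV error (K=3): 0.47835
best_k() {
    awk '/^CV error/ {
        k = $3
        gsub(/[^0-9]/, "", k)          # "(K=3):" becomes "3"
        if (best == "" || $4 + 0 < min) { min = $4 + 0; best = k }
    }
    END { print best }' "$@"
}

# Example (not run here):
#   run_cv hapmap3 5
#   best_k log*.out
```

Keeping the run and the parsing in separate functions means the logs can be re-read later without repeating the expensive fits.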

At the time of writing, the current ADMIXTURE release is 1.3.0, and the manual was updated recently.

Rather than risk a faulty paraphrase, here is the relevant passage from the official manual:

Estimating P and Q from the SNP matrix G, without any additional information, can be
viewed as an unsupervised learning problem. However it is not uncommon that some or
all of the individuals in our data sample will have known ancestries, allowing us to set
some rows in the matrix Q to known constants. This allows more accurate estimation of
the ancestries of the remaining individuals, and of the ancestral allele frequencies. Viewing
these reference individuals as training samples, the problem is transformed into a supervised
learning problem.

Supervised learning mode is enabled with the flag --supervised and requires an additional
file with a .pop suffix, specifying the ancestries of the reference individuals. It is assumed
that all reference samples have 100% ancestry from some ancestral population. Each line
of the .pop file corresponds to individual listed on the same line number in the .fam or
.ped file. If the individual is a population reference, the .pop file line should be a string
(beginning with an alphanumeric character) designating the population. If the individual
is of unknown ancestry, use “-” (or a blank line, or any non-alphanumeric character) to
indicate that the ancestry should be estimated.

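The manual says to prepare a population file with a .pop suffix that assigns each individual a label (a string); individuals of unknown ancestry can be marked with "-". Creating this file on Windows is not recommended, because Windows uses different line endings (CRLF instead of LF).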
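If the .pop file did come from Windows, the carriage returns can be stripped before use. A minimal sketch, assuming standard POSIX `tr`, `grep`, and `mktemp`; `has_cr` and `strip_cr` are hypothetical helper names.

```shell
# has_cr FILE: succeed if FILE contains at least one carriage return (CR).
has_cr() {
    cr=$(printf '\r')
    grep -q "$cr" "$1"
}

# strip_cr FILE: remove all carriage returns so lines end in LF only.
# Writes through a temporary file, then copies the result back in place.
strip_cr() {
    file=$1
    tmp=$(mktemp)
    tr -d '\r' < "$file" > "$tmp" && cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Example (not run here):
#   has_cr hapmap3.pop && strip_cr hapmap3.pop
```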

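How do you validate the .pop file? The author suggests running paste on the .fam and .pop files to check that the number of individuals matches (wouldn't wc -l be simpler?).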
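Both checks fit in a few lines of shell. The sketch below assumes the two files share a prefix (for example `hapmap3.fam` and `hapmap3.pop`); `check_pop` is a hypothetical helper name. `wc -l` confirms the counts, while `paste` additionally lets you eyeball whether each label sits next to the intended individual.

```shell
# check_pop PREFIX: verify PREFIX.pop has one line per individual in PREFIX.fam.
# wc -l counts newline characters, so both files should end with a newline.
check_pop() {
    prefix=$1
    n_fam=$(($(wc -l < "$prefix.fam")))
    n_pop=$(($(wc -l < "$prefix.pop")))
    if [ "$n_fam" -ne "$n_pop" ]; then
        echo "mismatch: $n_fam individuals in .fam, $n_pop lines in .pop" >&2
        return 1
    fi
    echo "ok: $n_fam individuals"
}

# Example (not run here): view individuals and labels side by side, then count.
#   paste hapmap3.fam hapmap3.pop | head
#   check_pop hapmap3
```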

Here is the catch: the author never really spells out how to run it. I tried it myself, and these are my brief notes.

Hands-on

Download the sample data from the official site:
http://dalexander.github.io/admixture/download.html

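After unpacking you get data in PLINK format: a matching bed, bim, and fam set, but the ped file that should accompany the map file is missing. A small oversight by the author, but PLINK can do the conversion: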

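# Download and unpack the sample data
wget http://dalexander.github.io/admixture/hapmap3-files.tar.gz
tar -xvf hapmap3-files.tar.gz
# Convert the binary fileset (bed/bim/fam) to text format (ped/map)
plink --bfile hapmap3 --recode --out hapmap3 --noweb
# Check the line counts of the resulting files
wc -l hapmap3*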

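Prepare the hapmap3.pop file (its prefix must match the PLINK data, and it must sit in the same directory). R, awk, or similar tools will do; here is an arbitrary simulated one: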

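# 6 consecutive blocks of 54 individuals: labels A, B, C alternate with unknowns ("-")
dat = data.frame(V1 = rep(c("A","-","B","-","C","-"),each=54))
# One label per line, without header, row names, or quotes
write.table(dat,"hapmap3.pop",row.names=F,col.names=F,quote=F,sep="\t")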
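The same file can be produced without R. A minimal shell sketch, assuming the same layout as the R snippet (six consecutive blocks of 54 lines, 324 in total); `make_pop` is a hypothetical helper name and the labels are arbitrary.

```shell
# make_pop N LABEL...: print each LABEL N times, one per line, in the order given.
make_pop() {
    n=$1
    shift
    for label in "$@"; do
        i=0
        while [ "$i" -lt "$n" ]; do
            printf '%s\n' "$label"
            i=$((i + 1))
        done
    done
}

# Example (not run here): 6 blocks of 54 lines, as in the R snippet.
#   make_pop 54 A - B - C - > hapmap3.pop
```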

Add --supervised and run admixture:

admixture hapmap3.ped 3 --supervised

Now compare the runs with and without --supervised. The result without it:

The result with it:

The two differ considerably. How this affects downstream results is not explored here.
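One rough way to put numbers on the difference is to tabulate, for each label in the .pop file, which ancestry column dominates in each individual's row of the Q file. The sketch below rests on several assumptions: the Q file has one whitespace-separated row per individual in the same order as the .pop file (for K=3 the output is expected to be named `hapmap3.3.Q`), unknown individuals are marked with "-" rather than a blank line, and the Q file from each run was copied aside, since both runs are expected to write to the same output name. `top_cluster_by_label` is a hypothetical helper name.

```shell
# top_cluster_by_label POP_FILE Q_FILE:
# for every individual, find the Q column with the largest value, then count
# individuals per (label, column) pair. Output lines: label column count.
# Assumes no blank lines in POP_FILE (unknowns are written as "-").
top_cluster_by_label() {
    paste "$1" "$2" | awk '{
        best = 2                                   # field 1 is the label
        for (i = 3; i <= NF; i++) if ($i + 0 > $best + 0) best = i
        count[$1 " " (best - 1)]++                 # report 1-based Q column
    }
    END { for (key in count) print key, count[key] }' | LC_ALL=C sort
}

# Example (not run here), with Q files copied aside after each run:
#   top_cluster_by_label hapmap3.pop unsupervised.3.Q
#   top_cluster_by_label hapmap3.pop supervised.3.Q
```

Column numbers from an unsupervised run carry no fixed meaning, so compare how individuals group together across the two tables rather than the column indices themselves.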