Performing the Same Operation in Java
阿新 • Published: 2017-12-22
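I couldn't resist: I wanted to see how much faster Java would handle the task from the previous post, since a compiled language is usually much faster than a scripting language for this kind of line-by-line file processing. Without further ado, here is the code: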
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class Split {
    public static void main(String[] args) throws IOException {
        long startTime = System.currentTimeMillis();
        // Large (5,000,000-char) buffers cut down the number of disk reads and writes.
        // try-with-resources flushes and closes both streams, even on an exception.
        try (BufferedReader reader = new BufferedReader(new FileReader("head_10000000.vcf"), 5000000);
             BufferedWriter writer = new BufferedWriter(new FileWriter("result.tsv"), 5000000)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Skip VCF header lines; checking inside the loop avoids a
                // NullPointerException when the file ends with a header line.
                if (line.startsWith("#")) {
                    continue;
                }
                String info = line.split("\t")[7];            // INFO is the 8th column
                // Assumes INFO contains ";AF=", i.e. AF is present and not the first key.
                String af = info.split(";AF=")[1].split(";")[0];
                writer.write(line + "\t" + af);               // append AF as a new tab-separated column
                writer.newLine();
            }
        }
        long endTime = System.currentTimeMillis();
        System.out.println("run time:" + (endTime - startTime) + "ms");
    }
}
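The split on ";AF=" in the program relies on AF never being the first key in the INFO column. As a minimal sketch of a stricter lookup, assuming INFO is a semicolon-separated list of KEY=VALUE pairs, the helper below looks for the exact key AF. The names AfExtractor and extractAf are hypothetical and not part of the original program.

```java
import java.util.Optional;

final class AfExtractor {
    /** Returns the value of the exact key "AF" in a VCF INFO string, if present. */
    static Optional<String> extractAf(String info) {
        for (String field : info.split(";")) {
            // "AFR_AF=..." and "LDAF=..." do not start with "AF=", so they are skipped.
            if (field.startsWith("AF=")) {
                return Optional.of(field.substring(3));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String info = "AN=2184;LDAF=0.0018;AC=3;AF=0.0014;ASN_AF=0.01";
        System.out.println(extractAf(info).orElse("NA")); // prints 0.0014
    }
}
```

Returning an Optional lets the caller decide what to write for records without an AF entry, instead of failing with an ArrayIndexOutOfBoundsException.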
Program output:
run time:47473ms
Checking the result:
$ wc -l result.tsv
10000000 result.tsv
$ sed -n '3435534p' result.tsv
2 29509274 rs114511873 C A 100 PASS AA=C;AN=2184;AVGPOST=0.9997;VT=SNP;THETA=0.0006;AC=14;SNPSOURCE=LOWCOV;LDAF=0.0065;ERATE=0.0003;RSQ=0.9798;AF=0.01;AFR_AF=0.03 0.01
$ sed -n '7546563p' result.tsv
3 84580386 rs191768644 T C 100 PASS RSQ=0.6088;AA=T;AN=2184;VT=SNP;AVGPOST=0.9991;SNPSOURCE=LOWCOV;AC=1;THETA=0.0007;ERATE=0.0002;LDAF=0.0008;AF=0.0005;AFR_AF=0.0020 0.0005
$ sed -n '987345p' result.tsv
1 74709013 rs185004386 A C 100 PASS AN=2184;LDAF=0.0018;THETA=0.0005;VT=SNP;AA=A;SNPSOURCE=LOWCOV;RSQ=0.7110;ERATE=0.0003;AVGPOST=0.9987;AC=3;AF=0.0014;ASN_AF=0.01 0.0014
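Spot checks like these could also be automated for every line. The following is only a sketch, assuming the result file is tab-separated with INFO in the 8th column and the extracted value in the last column; the names ResultCheck and isConsistent are hypothetical.

```java
final class ResultCheck {
    /** True if the last tab-separated column equals the AF value in the INFO column (index 7). */
    static boolean isConsistent(String resultLine) {
        String[] cols = resultLine.split("\t");
        if (cols.length < 9) {
            return false; // expect at least 8 VCF columns plus the appended one
        }
        String appended = cols[cols.length - 1];
        for (String field : cols[7].split(";")) {
            if (field.startsWith("AF=")) {
                return field.substring(3).equals(appended);
            }
        }
        return false; // no exact AF key in INFO
    }

    public static void main(String[] args) {
        String line = "1\t74709013\trs185004386\tA\tC\t100\tPASS\t"
                + "AN=2184;AC=3;AF=0.0014;ASN_AF=0.01\t0.0014";
        System.out.println(isConsistent(line)); // prints true
    }
}
```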
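We checked the total line count of the file and spot-checked a few lines at random, and the results are correct. Compared with the running time of the earlier R version, this result is astonishing. The gap is huge!
Time(writing the Java code + compiling + running) < Time(running the R script)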