1. 程式人生 > >生信指令碼練習(12)求fasta檔案各序列長度並統計作圖

生信指令碼練習(12)求fasta檔案各序列長度並統計作圖

題目要求是要從一個fasta檔案中統計出每條序列的長度分佈,並作圖。

程式碼如下:

import os
import getpass
import matplotlib.pyplot as plt
usr = getpass.getuser()
os.chdir('c:/Users/' + usr + '/Desktop')
seq_len = {}
# 把fasta檔案全部讀取做成字典,鍵是帶‘>’的那一行,值是序列
def readfasta(filename):
    fa = open(filename, 'r')
    res = {}
    ID = ''
    for
line in fa: if line.startswith('>'): ID = line.strip('\n') res[ID] = '' else: res[ID] += line.strip('\n') return res seq = readfasta('Test1.fa') # 做成另外一個字典字典,鍵是帶‘>’的那一行,值是序列長度: for k,v in seq.items(): seq_len[k] = len
(v) seq_len_sort = sorted(seq_len.items(), key=lambda x: x[1]) seq_len_sort = dict(seq_len_sort) location = {} lenth_a = 0 lenth_b = 500 bar = 500 # 通過迴圈的方式找到每個視窗區間內的長度分佈,這裡選的視窗是500 # 這裡的4500是排序之後根據最大值看出來的 while lenth_b < 4500: count = 0 ran = range(lenth_a,lenth_b) for k,v in seq_len_sort.items
(): if v in ran: count += 1 location[(str(lenth_a)+ '-'+ str(lenth_b))] = count lenth_a += bar lenth_b += bar #print(location) f = open('t1.txt','w') f.write('lenth' + '\t' + 'number'+ '\n') lenth = [] number = [] for k,v in location.items(): f.write(str(k)+ '\t' + str(v) + '\n') lenth.append(k) number.append(v) f.close() plt.bar(range(len(lenth)),number) # 最後matplotlib畫圖 #xlabels = [x[index] for index in lenth] #plt.xticks(number, lenth, rotation='vertical') plt.show()

然而我還不會更改x軸座標值。因為我用的字串的形式。。。
這裡寫圖片描述

畫圖程式碼參照matplotlib官網的barh_demo改了一下:

import matplotlib.pyplot as plt
plt.rcdefaults()
fig, ax = plt.subplots()

y_pos = np.arange(len(lenth))
error = np.random.rand(len(lenth))
ax.barh(y_pos, number, xerr=error, align='center',
        color='lightblue', ecolor='black')
ax.set_yticks(y_pos)
ax.set_yticklabels(lenth)
ax.invert_yaxis()  # labels read top-to-bottom
ax.set_xlabel('nummber')
ax.set_title('lenth location')
plt.show()

errorbar 怎麼也去不掉???
這裡寫圖片描述

下面是我用R畫的

這裡寫圖片描述

library(ggplot2)
x <- c('0-500', '500-1000', '1000-1500', '1500-2000', '2000-2500', '2500-3000', '3000-3500', '3500-4000')
y <- c( 4, 16, 10, 11, 3, 5, 0, 1)
dt = data.frame(length= x, number = y)
dt$length = factor(dt$length, levels=c('0-500', '500-1000', '1000-1500', '1500-2000', '2000-2500', '2500-3000', '3000-3500', '3500-4000'))
ggplot(dt, aes(x = length, y = number, fill = length)) + 
geom_bar(stat = "identity") +
theme_set(theme_bw())

# 誰能告訴我為什麼下面這樣的寫法就畫不出來呢??
library(ggplot2)
setwd("c:\\Desktop") 
dt <- read.table("t1.txt",header = T) 
print(dt)
dt$length =factor(dt$length, levels=c('0-500', '500-1000', '1000-1500', '1500-2000', '2000-2500', '2500-3000', '3000-3500', '3500-4000'))
ggplot(dt, aes(length = length, number = number, fill = length)) + 
geom_bar(stat = "identity") +
theme_set(theme_bw())

# 報錯:
Error in `$<-.data.frame`(`*tmp*`, "length", value = structure(integer(0), .Label = c("0-500", : replacement has 0 rows, data has 8
Traceback:

1. `$<-`(`*tmp*`, "length", value = structure(integer(0), .Label = c("0-500", 
 . "500-1000", "1000-1500", "1500-2000", "2000-2500", "2500-3000", 
 . "3000-3500", "3500-4000"), class = "factor"))
2. `$<-.data.frame`(`*tmp*`, "length", value = structure(integer(0), .Label = c("0-500", 
 . "500-1000", "1000-1500", "1500-2000", "2000-2500", "2500-3000", 
 . "3000-3500", "3500-4000"), class = "factor"))
3. stop(sprintf(ngettext(N, "replacement has %d row, data has %d", 
 .     "replacement has %d rows, data has %d"), N, nrows), domain = NA)