生信指令碼練習(12)求fasta檔案各序列長度並統計作圖
阿新 • • 發佈:2019-02-08
題目要求是要從一個fasta檔案中統計出每條序列的長度分佈,並作圖。
程式碼如下:
import os
import getpass
import matplotlib.pyplot as plt
usr = getpass.getuser()
os.chdir('c:/Users/' + usr + '/Desktop')
seq_len = {}
# 把fasta檔案全部讀取做成字典,鍵是帶‘>’的那一行,值是序列
def readfasta(filename):
fa = open(filename, 'r')
res = {}
ID = ''
for line in fa:
if line.startswith('>'):
ID = line.strip('\n')
res[ID] = ''
else:
res[ID] += line.strip('\n')
return res
seq = readfasta('Test1.fa')
# 做成另外一個字典字典,鍵是帶‘>’的那一行,值是序列長度:
for k,v in seq.items():
seq_len[k] = len (v)
seq_len_sort = sorted(seq_len.items(), key=lambda x: x[1])
seq_len_sort = dict(seq_len_sort)
location = {}
lenth_a = 0
lenth_b = 500
bar = 500
# 通過迴圈的方式找到每個視窗區間內的長度分佈,這裡選的視窗是500
# 這裡的4500是排序之後根據最大值看出來的
while lenth_b < 4500:
count = 0
ran = range(lenth_a,lenth_b)
for k,v in seq_len_sort.items ():
if v in ran:
count += 1
location[(str(lenth_a)+ '-'+ str(lenth_b))] = count
lenth_a += bar
lenth_b += bar
#print(location)
f = open('t1.txt','w')
f.write('lenth' + '\t' + 'number'+ '\n')
lenth = []
number = []
for k,v in location.items():
f.write(str(k)+ '\t' + str(v) + '\n')
lenth.append(k)
number.append(v)
f.close()
plt.bar(range(len(lenth)),number)
# 最後matplotlib畫圖
#xlabels = [x[index] for index in lenth]
#plt.xticks(number, lenth, rotation='vertical')
plt.show()
然而我還不會更改x軸座標值。因為我用的字串的形式。。。
畫圖程式碼參照matplotlib官網的barh_demo改了一下:
import matplotlib.pyplot as plt
plt.rcdefaults()
fig, ax = plt.subplots()
y_pos = np.arange(len(lenth))
error = np.random.rand(len(lenth))
ax.barh(y_pos, number, xerr=error, align='center',
color='lightblue', ecolor='black')
ax.set_yticks(y_pos)
ax.set_yticklabels(lenth)
ax.invert_yaxis() # labels read top-to-bottom
ax.set_xlabel('nummber')
ax.set_title('lenth location')
plt.show()
errorbar 怎麼也去不掉???
下面是我用R畫的
library(ggplot2)
x <- c('0-500', '500-1000', '1000-1500', '1500-2000', '2000-2500', '2500-3000', '3000-3500', '3500-4000')
y <- c( 4, 16, 10, 11, 3, 5, 0, 1)
dt = data.frame(length= x, number = y)
dt$length = factor(dt$length, levels=c('0-500', '500-1000', '1000-1500', '1500-2000', '2000-2500', '2500-3000', '3000-3500', '3500-4000'))
ggplot(dt, aes(x = length, y = number, fill = length)) +
geom_bar(stat = "identity") +
theme_set(theme_bw())
# 誰能告訴我為什麼下面這樣的寫法就畫不出來呢??
library(ggplot2)
setwd("c:\\Desktop")
dt <- read.table("t1.txt",header = T)
print(dt)
dt$length =factor(dt$length, levels=c('0-500', '500-1000', '1000-1500', '1500-2000', '2000-2500', '2500-3000', '3000-3500', '3500-4000'))
ggplot(dt, aes(length = length, number = number, fill = length)) +
geom_bar(stat = "identity") +
theme_set(theme_bw())
# 報錯:
Error in `$<-.data.frame`(`*tmp*`, "length", value = structure(integer(0), .Label = c("0-500", : replacement has 0 rows, data has 8
Traceback:
1. `$<-`(`*tmp*`, "length", value = structure(integer(0), .Label = c("0-500",
. "500-1000", "1000-1500", "1500-2000", "2000-2500", "2500-3000",
. "3000-3500", "3500-4000"), class = "factor"))
2. `$<-.data.frame`(`*tmp*`, "length", value = structure(integer(0), .Label = c("0-500",
. "500-1000", "1000-1500", "1500-2000", "2000-2500", "2500-3000",
. "3000-3500", "3500-4000"), class = "factor"))
3. stop(sprintf(ngettext(N, "replacement has %d row, data has %d",
. "replacement has %d rows, data has %d"), N, nrows), domain = NA)