1. 程式人生 > >Molecular Phylogenetics using Bio.Phylo

Molecular Phylogenetics using Bio.Phylo

Phylogenetic Trees

Phylogenetic trees represent evolutionary relationships between organisms or genes. The pattern of branching in a phylogenetic tree reflects how species or other groups have evolved from a series of common ancestors. An example of a phylogenetic tree is the Tree of Life which denotes how various species of organisms have evolved since the birth of Earth.

In a phylogenetic tree, the species or groups of interest are found at the tips of lines known as branches. The points where branches are divided are called branch points.

Two species are more related if they have a more recent common ancestor and less related if they have a less recent common ancestor.

Given here is a phylogenetic tree for primates based on their genetic data. Gorillas and Orangutans have diverged earlier than other primate groups. The Homo lineage (humans) has moved along one path where as the Pan lineage has moved along another path. Later on, the Pan lineage has divided, yielding Chimpanzees and Bonobos.

Algorithms used for Phylogenetic Inference

There are three main categories of algorithms that are used for phylogenetic inference from any type of biological data. They are,

  1. Distance-based methods
  2. Maximum Parsimony (MP) methods
  3. Probabilistic methods

1. Distance-based Methods

Distance-based methods compute an evolutionary distance, which is the number of changes that have occurred for two species considered to diverge from a common ancestor. However, these methods face problems with accuracy when it comes to dealing with large volumes of data which have very distant relationships.

2. Maximum Parsimony (MP) methods

MP methods infer a tree that minimizes the total number of changes, known as mutations, required to explain the data. Under the maximum parsimony criterion, the shortest possible tree that explains the data is considered as the best tree. This best tree is known as the most-parsimonious tree. Heuristic search is performed to quickly generate the most-parsimonious tree. Since this methods considers the shortest possible tree as the best tree, actual evolutionary changes that have occurred may be underestimated.

3. Probabilistic methods

Probabilistic methods, such as Maximum Likelihood (ML)andBayesian inference, attempt to find a tree that maximizes the conditional or posterior probability of observing the data. Phylogenetic studies at present, widely utilize Bayesian frameworks due to the possibility of account for the phylogenetic uncertainty, availability of efficient algorithms and their implementation as various computer programs.

Bio.Phylo — Time to Practice

Since we have a basic idea about phylogenetic trees, it is time to try out some coding. I have introduced a set of Python tools named Biopython in one of my previous articles, which can be used to analyze biological data. If you haven’t gone through it make sure to check it out as well.

I will be using the Bio.Phylo module which provides classes, functions and I/O support for working with phylogenetic trees. You can go through the official documentation to get more details about this module.

Task — Construct the phylogenetic tree for the given DNA sequences

Consider you are provided five DNA sequences with their labels in the beginning of each line. You can find these sequences in a file named as msa.phy in the official biopython test material for tree construction. The sequences considered are given below.

  1. Alpha AACGTGGCCACAT
  2. Beta AAGGTCGCCACAC
  3. Gamma CAGTTCGCCACAA
  4. Delta GAGATTTCCGCCT
  5. Epsilon GAGATCTCCGCCC

We are given the task of constructing the phylogenetic tree to represent these sequences based on distance-based phylogenetic inference methods.

Currently, Bio.Phylo module has two types of tree constructors: DistanceTreeConstructor and ParsimonyTreeConstructor. We will be using DistanceTreeConstructor for this task.

Furthermore, the DistanceTreeConstructor supports two heuristic algorithms: UPGMA (Unweighted Pair Group Method with Arithmetic Mean) and NJ (Neighbor Joining). We will be using the UPGMA algorithm. You can read more about the UPGMA algorithm from this link.

Solution

Firstly, make sure you have downloaded the msa.phy file which contains the input sequences and include it in your current working directory.

Given below is the python code to create the phylogenetic tree for the given DNA sequences. Note how we have used Bio.Phylo module and its functionality.

By running the code, we can get the phylogenetic tree as a graphical visualization as well as get it printed in the terminal as shown below.

Graphical visualization of the phylogenetic tree using UPGMA
The phylogenetic tree using UPGMA printed in the terminal at the end

If you use NJ algorithm instead of UPGMA algorithm, the resulting tree will be changed as shown below.

Graphical visualization of the phylogenetic tree using NJ
The phylogenetic tree using NJ printed in the terminal at the end

Hope you enjoyed reading this article and learned useful and interesting things about molecular genetics and how to use Biopython to construct phylogenetic trees from a given set of sequences. I would love to hear your thoughts and ideas.

Thanks for reading…