This part is based on Introduction to Bioinformatics by Lesk, especially chapters 1 and 5. This book shares some of the playful character of Hello World, and the emphasis on general principles. Lesk uses Perl for the programming examples, but the algorithms are very clearly explained and it is not difficult to write equivalents in Python.
Bioinformatics is all about sequences of bases and amino acids in different organisms, but before trying that, we can get a feel for what is involved by examining a more familiar kind of sequence: texts in different languages.
Consider this well- known short text. On each of these web pages, the part between
font: 24.0px Times">and
<p>----needs be extracted, for it is the actual text.
Then some html code like <h1> (heading 1), <p> (paragraph), <ol> (ordered list), <li> (list item), <b> (bold), and/or their respective terminations (eg. </h1>) need to be removed.
Then the text needs to have its punctuation removed with
s.translate(None,".,;()='")and then turned into lower case, and split into words.
For each word list, one can construct a dictionary of words with their locations. Then one can find the common words between two word lists, such that
for w in common: print w, dic1[w], dic2[w]
gives all the common words and their locations.
Implement this, so as to output the common words between all pairs of texts.
We can try and infer the relationship between languages from the lists of common words. There are all sorts of statistical techniques one can use, but simple examination already gives some clues. We can type our guessed relationship into a text file.
(((Romansh, Italian), French), (English, German))
called langs.txt
(say). This is the Newick format.
Then
from Bio import Phylo tree = Phylo.read('langs.txt','newick') tree.rooted = True Phylo.draw(tree)
will draw a tree for us.
The comparison of gene sequences (or protein sequences) in different species is a fundamental problem in bioinformatics. We will consider the simplest form of sequence comparison: pairwise dot plots of matching sub-sequences. To see how this works, let us consider the following strings:
x = "My care is loss of care, by old care done" y = "Your care is gain of care, by new care won"
and then plot a dot whenever a region of five characters matches
between x
and y
. We get the following.
We can see how the five-character
substring " care"
produces a dot at
(x=2,y=4), and then more dots as it appears again.
Dot plots provide a convenient visual representation of the degree of mismatch between sequences. They also help understand sequence alignment: an alignment is a path along a dot plot that optimizes some measure of matching.
Write a program to produce a similar dot plot. The method is easily applied to gene and protein sequences, we
just need to get a sequence into a string. We already know how to
write a GenBank file with sequence data. Let fname
be a
string with the name of such a file, then try the following.
from Bio import SeqIO f = open(fname) p = SeqIO.parse(f,'genbank').next() f.close() print p.annotations['organism'] print p.seq
The sequence is now in the string seq
and ready to be
analyzed.
The insulin receptor is part of the signalling pathway that mediates blood glucose levels. It is found in many organisms, including humans, mice, xenopus and rats. Download the linked FASTA files and make dot plots comparing the corresponding amino-acid sequence in different pairs of species. Try and infer the phylogenetic relationships of the species from the dot plots, express it in Newick, and hence produce a phylogenetic tree.
Multiple sequence alignment is a famously hard problem. The computer resources needed to guarantee finding the optimal alignment grow like
[Sequence length][number of sequences]
Basically, it is the product of all the sequence lengths. As the number of sequences increases, finding an optimal solution soon becomes infeasible. (Checking a proposed optimal alignment can, however, be done more efficiently.) Computational questions of such complexity are said to be NP-complete. Practical algorithms aim to find a near-optimal solution, or to have a good (but not perfect) chance of finding the optimal solution.
This section is optional.
Chapter 1 of Lesk has an interesting section on the algorithmic problems in shotgun sequencing. Lesk takes some text passages from Shakespeare and fragments them in different ways. Here is his first example.
the men and women merely players;\n one man in his time All the world's their entrances,\nAnd one man stage,\nAnd all the men and women They have their exits and their entrances,\n world's a stage,\nAnd all their entrances,\nAnd one man in his time plays many parts. merely players;\nThey have
Here the last 16 characters of the 0th fragment match the first 16 characters of the 9th fragment. We can express the various overlaps as follows.
0 [(9, 17)] 1 [(8, 11), (9, 2)] 2 [(6, 7), (4, 1)] 3 [(1, 7)] 4 [(0, 17)] 5 [(3, 17), (7, 17)] 6 [(4, 14)] 7 [(1, 7)] 8 [] 9 [(5, 9)]Write a function that will take a list of strings and work out the overlaps.
In this example, concatenating the fragments according to the longest overlaps will reconstruct the original text. But that is not always the case. The following challenges given by Lesk are more difficult.
Kate, when France France is mine is mine and and I am\nyours yours then then yours is France France and you are mine
One woman is woman is fair, is fair, yet I am yet I am well; I am well; another another is wise, yet I am well; yet I am well; another virtuous, another virtuous, yet I am well; well; but till all all graces be be in one woman one woman, one one woman shall shall not come in my grace.