Home

Bioinformatics

This part is based on Introduction to Bioinformatics by Lesk, especially chapters 1 and 5. This book shares some of the playful character of Hello World, and the emphasis on general principles. Lesk uses Perl for the programming examples, but the algorithms are very clearly explained and it is not difficult to write equivalents in Python.

Phylogeny of Languages

Bioinformatics is all about sequences of bases and amino acids in different organisms, but before trying that, we can get a feel for what is involved by examining a more familiar kind of sequence: texts in different languages.

Consider this well- known short text. On each of these web pages, the part between

font: 24.0px Times">

and

<p>----

needs be extracted, for it is the actual text.

Then some html code like <h1> (heading 1), <p> (paragraph), <ol> (ordered list), <li> (list item), <b> (bold), and/or their respective terminations (eg. </h1>) need to be removed.

Then the text needs to have its punctuation removed with

s.translate(None,".,;()='")

and then turned into lower case, and split into words.

For each word list, one can construct a dictionary of words with their locations. Then one can find the common words between two word lists, such that

for w in common:
    print  w, dic1[w], dic2[w]

gives all the common words and their locations.

Implement this, so as to output the common words between all pairs of texts.

Submit the program

We can try and infer the relationship between languages from the lists of common words. There are all sorts of statistical techniques one can use, but simple examination already gives some clues. We can type our guessed relationship into a text file.

(((Romansh, Italian), French), (English, German))

called langs.txt (say). This is the Newick format. Then

from Bio import Phylo
tree = Phylo.read('langs.txt','newick')
tree.rooted = True
Phylo.draw(tree)

will draw a tree for us.

Sequence Alignment

The comparison of gene sequences (or protein sequences) in different species is a fundamental problem in bioinformatics. We will consider the simplest form of sequence comparison: pairwise dot plots of matching sub-sequences. To see how this works, let us consider the following strings:

x = "My care is loss of care, by old care done"
y = "Your care is gain of care, by new care won"

and then plot a dot whenever a region of five characters matches between x and y. We get the following.

We can see how the five-character substring " care" produces a dot at (x=2,y=4), and then more dots as it appears again.

Dot plots provide a convenient visual representation of the degree of mismatch between sequences. They also help understand sequence alignment: an alignment is a path along a dot plot that optimizes some measure of matching.

Write a program to produce a similar dot plot.

The method is easily applied to gene and protein sequences, we just need to get a sequence into a string. We already know how to write a GenBank file with sequence data. Let fname be a string with the name of such a file, then try the following.

from Bio import SeqIO
f = open(fname)
p = SeqIO.parse(f,'genbank').next()
f.close()
print p.annotations['organism']
print p.seq

The sequence is now in the string seq and ready to be analyzed.

The insulin receptor is part of the signalling pathway that mediates blood glucose levels. It is found in many organisms, including humans, mice, xenopus and rats. Download the linked FASTA files and make dot plots comparing the corresponding amino-acid sequence in different pairs of species. Try and infer the phylogenetic relationships of the species from the dot plots, express it in Newick, and hence produce a phylogenetic tree.

Submit the program

Multiple sequence alignment is a famously hard problem. The computer resources needed to guarantee finding the optimal alignment grow like

[Sequence length]^{[number of sequences]}

Basically, it is the product of all the sequence lengths. As the number of sequences increases, finding an optimal solution soon becomes infeasible. (Checking a proposed optimal alignment can, however, be done more efficiently.) Computational questions of such complexity are said to be NP-complete. Practical algorithms aim to find a near-optimal solution, or to have a good (but not perfect) chance of finding the optimal solution.

Unscrambling Shakespeare

This section is optional.

Chapter 1 of Lesk has an interesting section on the algorithmic problems in shotgun sequencing. Lesk takes some text passages from Shakespeare and fragments them in different ways. Here is his first example.

the men and women merely players;\n
one man in his time
All the world's
their entrances,\nAnd one man
stage,\nAnd all the men and women
They have their exits and their entrances,\n
world's a stage,\nAnd all
their entrances,\nAnd one man
in his time plays many parts.
merely players;\nThey have

Here the last 16 characters of the 0th fragment match the first 16 characters of the 9th fragment. We can express the various overlaps as follows.

0 [(9, 17)]
1 [(8, 11), (9, 2)]
2 [(6, 7), (4, 1)]
3 [(1, 7)]
4 [(0, 17)]
5 [(3, 17), (7, 17)]
6 [(4, 14)]
7 [(1, 7)]
8 []
9 [(5, 9)]

Write a function that will take a list of strings and work out the overlaps.

Submit the program

In this example, concatenating the fragments according to the longest overlaps will reconstruct the original text. But that is not always the case. The following challenges given by Lesk are more difficult.

Kate, when France
France is mine
is mine and
and I am\nyours
yours then
then yours is France
France and you are mine

One woman is
woman is fair,
is fair, yet I am
yet I am well;
I am well; another
another is wise, yet I am well;
yet I am well; another virtuous,
another virtuous, yet I am well;
well; but till all
all graces be
be in one woman
one woman, one
one woman shall
shall not come in my grace.