Primate phylogeny practical
Goals
- Investigate whether humans are, indeed, related to apes
- Brief exposure to phylogenetic tree-building software
Procedure
- The first part of this practical involves making trees from a multiple alignment of amino acid sequences for a subunit of the ATP synthase enzyme.
- Examine the ATP6 alignment.
- Use a program like Jalview or belvu to visualize the alignment, or just page through it with a Unix text viewer (like more).
- here is the alignment file: ATP6.stockholm
more ATP6.stockholm
- Identify (and describe) one highly conserved column, and one variable column.
- Extract the human and chimp sequences, either manually or with this Perl one-liner:
cat ATP6.stockholm | perl -e 'while(<>){if(/(homo|chimp)(\S+)\s+(.*)/){$seq{$1.$2}.=$3}}
while(($name,$seq)=each%seq){print"$name $seq\n"}' >human-chimp.stockholm
- What proportion of the amino acid sites are identical between the human and chimp sequences, and what proportion differ?
- Assume each different amino acid site indicates one or more mutations, while each identical amino acid site indicates no mutations.
- Why might this assumption not be valid? How could this bias your estimates of evolutionary divergence?
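The site comparison above can also be done programmatically. The following is a minimal Python sketch, not part of the course software. It assumes the two-column "name sequence" layout written by the Perl one-liner, and it assumes that columns with a gap in either sequence should be skipped. The function names and the toy sequences are hypothetical.

```python
def identity_fraction(seq_a, seq_b, gap_chars="-."):
    """Return (identical, compared) counts over aligned columns without gaps."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    identical = 0
    compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in gap_chars or b in gap_chars:
            continue  # assumption: ignore columns with a gap in either sequence
        compared += 1
        if a == b:
            identical += 1
    return identical, compared


def read_name_seq_lines(path):
    """Read 'name sequence' lines (the layout printed by the Perl one-liner)."""
    seqs = {}
    with open(path) as handle:
        for line in handle:
            parts = line.split()
            if len(parts) == 2:
                seqs[parts[0]] = parts[1]
    return seqs


# Toy aligned fragments (made up for illustration, not real ATP6 data).
human = "MNENLFASFI-APT"
chimp = "MNENLFASFAAAPT"
same, total = identity_fraction(human, chimp)
print(f"{same} of {total} compared sites are identical")
```

To apply it to the real data, read human-chimp.stockholm with read_name_seq_lines and pass the two sequences to identity_fraction.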
- Suppose that the number of mutations at a site follows a Poisson distribution (Wikipedia), with (on average) 1 mutation per site per "unit" of time. Write down expressions for
- the probability that a site experiences no mutations after time t
- the probability that a site experiences one or more mutations after time t
- Suppose that there are N sites. What is the probability that, after time t, K of these sites have not experienced a mutation (as a function of t)?
- Sketch this function (likelihood vs time) for the case N=100 and K=50.
- If a site "has not experienced a mutation", is this the same as saying that "the human and chimp amino acids are identical"? If not, why not -- and how would this affect your analysis?
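Once you have written down your own expressions, one way to check them and the sketch is to evaluate the likelihood numerically. The following is a minimal Python sketch under the assumptions stated in the questions: a Poisson rate of 1 mutation per site per unit time, plus the added assumption that sites evolve independently, so the count of unmutated sites is binomial. The function names are hypothetical.

```python
import math


def p_no_mutation(t):
    """Poisson probability of zero mutations at rate 1 per site per unit time."""
    return math.exp(-t)


def likelihood(t, n_sites, k_unmutated):
    """Binomial probability that exactly k of n independent sites are unmutated at time t."""
    p0 = p_no_mutation(t)
    return (math.comb(n_sites, k_unmutated)
            * p0 ** k_unmutated
            * (1.0 - p0) ** (n_sites - k_unmutated))


N, K = 100, 50

# Evaluate the likelihood on a grid of times and find where it peaks.
grid = [i / 1000 for i in range(1, 5001)]
values = [likelihood(t, N, K) for t in grid]
best_t = grid[values.index(max(values))]

print(f"likelihood peaks near t = {best_t:.3f}")
print(f"analytic maximum at -ln(K/N) = {-math.log(K / N):.3f}")
```

Plotting values against grid (for example with a plotting library of your choice) gives the likelihood-versus-time curve to compare against your hand sketch.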
- Estimate a "distance matrix" for these species:
~be131/dart/bin/tkfdistance --nocountindel -log 6 ATP6.stockholm >ATP6.distance
- What do the entries of the distance matrix represent?
- (See if you can answer this by e.g. searching for "phylogenetic distance matrix" on Google; if not, ask your neighbor, the professor or the GSI.)
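To make the idea of a distance matrix concrete, here is a minimal Python sketch that builds one from pairwise comparisons. It is not what tkfdistance computes: it swaps in the much simpler Poisson correction d = -ln(1 - p), where p is the fraction of differing sites, which follows from the Poisson model in the earlier questions. The function names, the gap handling, and the toy sequences are all assumptions made for illustration.

```python
import math


def poisson_distance(seq_a, seq_b, gap_chars="-."):
    """Poisson-corrected distance: -ln(1 - p), with p the fraction of differing sites."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a in gap_chars or b in gap_chars:
            continue  # assumption: skip columns with a gap in either sequence
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    p = differing / compared
    if p >= 1.0:
        return math.inf  # every site differs: the correction diverges
    return -math.log(1.0 - p)


def distance_matrix(alignment):
    """Return (names, matrix) where matrix[i][j] is the distance between sequences i and j."""
    names = list(alignment)
    matrix = [
        [0.0 if i == j else poisson_distance(alignment[a], alignment[b])
         for j, b in enumerate(names)]
        for i, a in enumerate(names)
    ]
    return names, matrix


# Toy alignment (made-up fragments, not real ATP6 data).
toy = {
    "human": "MNENLFASFA",
    "chimp": "MNENLFASFT",
    "lemur": "MNKNLFSSFT",
}
names, matrix = distance_matrix(toy)
for name, row in zip(names, matrix):
    print(name, " ".join(f"{d:.3f}" for d in row))
```

Each entry is an estimate of the expected number of substitutions per site separating two sequences; the matrix is symmetric with zeros on the diagonal.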
- Estimate a tree by "weighted neighbor-joining"
~be131/dart/bin/weighbor -i ATP6.distance -o ATP6.tree -vvv
- Draw the tree
- e.g. using Phylodendron (select the "phenogram" option) or ATV. First remove the residue-range suffixes (e.g. "/1-226") from the sequence names in the tree file.
- You can strip these suffixes with
cat ATP6.tree | perl -e 'while(<>){s/\/\d+-\d+//g; print;}' >ATP6_2.tree
- Does the tree appear to "make sense"?
- Try building some trees from other protein (or DNA) alignments (but make sure you don't use a huge number of sequences!). For example...
- any of the Pfam top twenty
- the globins in the PAML examples directory: abglobin.aa, abglobin.nuc
- Globins are often used as canonical proteins for drawing phylogenetic trees. Why do you think this might be?
- Mitochondrial DNA - try e.g.
- The kind of phylogenetic analysis that you have been doing here assumes a random model of evolution (see e.g. the Poisson analysis). Sometimes one also assumes a neutral model (essentially ruling out natural selection; see also here). This has always been a source of controversy among evolutionists, most recently providing ammunition to the advocates of "intelligent design". What do you think about this? In what sense might evolution be thought to be random or deterministic? Leaving aside the somewhat politicised issue of intelligent design, is there any way you could prove evolution to be nonrandom?
Software