Phylogeny homework exercise

The homework exercise for this week is to implement (in Perl) the UPGMA algorithm for estimating phylogenetic trees (as covered in class). You will submit the code for this exercise as hw7.pl. You will also apply your program to the ATP6 alignment used in the Primate Phylogeny lab, print out and visualize the tree, and compare it to the tree generated by weighbor.

To make things (slightly) easier, you can assume that the input to your program is a "tidy" version of the PHYLIP "distance matrix" format that we used in this lab. (For an explanation of the "tidy distance matrix" format, see the Input files section, below.)

Your program should output the estimated phylogenetic tree in Newick format (the same output format used by the weighbor program in the lab).

Grading scheme

Grading for the exercise will take the following into consideration (marks shown in parentheses):

  • Correctness of your UPGMA implementation (40)
  • Correctly outputting a tree in Newick format (20)
  • Programming style (10)
  • Comparison of the trees generated by UPGMA versus weighbor (20)

The tree comparison should be an electronic text file that comments on the similarities/differences between the trees (roughly 1-2 paragraphs should be plenty). You may include visualizations of the trees if they aid your explanation, but they are not mandatory (you'll need them to come up with your explanation, though). Submit this file as hw7_text in your favorite format (doc, pdf, ps, txt, rtf, etc).

If you're unable to finish the UPGMA implementation you can still get credit for the tree comparison by using a pre-existing implementation of UPGMA (it's up to you to find this).

You are free to re-use third-party Perl code libraries or snippets, such as BioPerl, to achieve part (but not all) of your project. If you incorporate third-party code into your program, please clearly designate which code is imported/derived from third-party code (and the nature of any changes you have made) and which code is entirely original.

Practice exercises

As an extra/practice exercise (for which you can be awarded an alternate/discretionary grade; see the Alternate grading scheme section, below) you may wish to implement one of the other algorithmically constructed trees we've mentioned in passing.

The GSI is encouraged to give much more detailed troubleshooting help with these practice exercises than with the homework exercise.

Alternate grading scheme

If you are having difficulty implementing UPGMA, you can be awarded partial credit for that section if you can demonstrate working code that both

  1. constructs a tree data structure;
  2. traverses the tree structure in some well-defined order, to print it out to the screen in a nested format (like Newick format, or otherwise).

The tree can be another algorithmic tree that we have covered in class (see the Practice exercises section, above).
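
As an illustration of the two requirements above, here is a minimal sketch of one possible tree data structure (nested hash references) and a recursive traversal that prints it in Newick format. The helper names make_leaf, make_node and to_newick, the hash keys, and the three-leaf example tree are all made up for this sketch; other representations would work just as well.

```perl
use strict;
use warnings;

# A leaf is a hash with a 'name'; an internal node is a hash with 'children'.
# Either kind may carry 'length', the branch length to its parent.
# make_leaf, make_node and to_newick are hypothetical helper names.
sub make_leaf {
    my ($name, $length) = @_;
    return { name => $name, length => $length };
}

sub make_node {
    my ($children, $length) = @_;
    return { children => $children, length => $length };
}

# Post-order traversal: render all children first, then wrap them in
# parentheses and append this node's branch length (if it has one).
sub to_newick {
    my ($node) = @_;
    my $text = exists $node->{children}
        ? '(' . join(',', map { to_newick($_) } @{ $node->{children} }) . ')'
        : $node->{name};
    $text .= ':' . $node->{length} if defined $node->{length};
    return $text;
}

# Made-up three-leaf tree (not derived from the ATP6 data).
my $tree = make_node([
    make_node([ make_leaf('A', 0.1), make_leaf('B', 0.1) ], 0.2),
    make_leaf('C', 0.3),
]);

print to_newick($tree), ";\n";
```

The root is built without a branch length, so nothing is printed after its closing parenthesis; the trailing semicolon that ends a Newick tree is added by the caller.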

Input files

The "tidy distance matrix" format is as follows. Suppose that there were N sequences in the original alignment. In the tidy version of the distance matrix, there are exactly N lines in the file, i.e. one per sequence. Each line contains the name of the corresponding sequence, followed by N numbers giving the evolutionary distances from that sequence to each of the N sequences (itself included). If you look at the input file, you'll see that this essentially corresponds to a square N*N matrix with row labels. (Note that the diagonal elements of the matrix are zero, because the "distance" from a sequence to itself is zero.)
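
The format described above can be read with a few lines of Perl. The following is only a sketch: parse_distmat is a hypothetical helper name, the choice to return two array references is one option among many, and the three-sequence matrix is invented for the example (it is not the ATP6 data).

```perl
use strict;
use warnings;

# Read a "tidy distance matrix": one line per sequence, holding the sequence
# name followed by N distances. Returns (\@names, \@matrix).
# parse_distmat is a hypothetical helper name, not part of the assignment.
sub parse_distmat {
    my ($fh) = @_;
    my (@names, @matrix);
    while (my $line = <$fh>) {
        my @fields = split ' ', $line;   # split on whitespace, ignoring leading blanks
        next unless @fields;             # skip empty lines
        push @names,  shift @fields;     # first field is the row label
        push @matrix, [@fields];         # the remaining fields are the distances
    }
    # Sanity check: every row must hold exactly N distances.
    my $n = @names;
    for my $i (0 .. $n - 1) {
        my $count = @{ $matrix[$i] };
        die "Row $names[$i] has $count distances, expected $n\n" unless $count == $n;
    }
    return (\@names, \@matrix);
}

# Tiny made-up matrix, read from an in-memory string instead of a file.
my $example = <<'END';
     A    0.0   0.2   0.6
     B    0.2   0.0   0.5
     C    0.6   0.5   0.0
END

open my $fh, '<', \$example or die "Cannot open in-memory file: $!";
my ($names, $dist) = parse_distmat($fh);
close $fh;

print "Distance from $names->[0] to $names->[2] is $dist->[0][2]\n";
```

To read a real file, open it by name instead, e.g. open my $fh, '<', $ARGV[0], and pass the handle to the same function.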

Here is the input file to use for your program: ATP6.distmat (listed under the attachments at the end of this page).

And just for reference, here is the Perl one-liner that was used to convert the standard PHYLIP distance matrix format into the one-row-per-line form:

cat ATP6.distance | perl -e '$dummy=<>;while(<>){@f=split;if(/^\S/){print"\n"if$n++>0;
 $name=shift@f;printf"% 20s",$name}print map(sprintf("% 10s",$_),@f)}print"\n"' > ATP6.distmat
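
For readability, here is an expanded sketch of the same logic as a function. It follows the one-liner's rules (skip the first line; a line starting with a non-blank character begins a new row; any other line continues the current row), but tidy_rows is a hypothetical name, the sample input lines are invented, and the output is not guaranteed to match the one-liner's byte for byte.

```perl
use strict;
use warnings;

# Convert PHYLIP-style distance-matrix lines to one row per line.
# tidy_rows is a hypothetical name for this sketch.
sub tidy_rows {
    my @lines = @_;
    shift @lines;                        # discard the first line, as the one-liner does
    my @rows;
    for my $line (@lines) {
        my @fields = split ' ', $line;
        if ($line =~ /^\S/) {
            # A line starting in column one begins a new row; its first field
            # is the sequence name, right-justified in 20 columns.
            push @rows, sprintf('%20s', shift @fields);
        }
        next unless @rows;               # ignore stray lines before the first row
        # Append each distance, right-justified in 10 columns.
        $rows[-1] .= join '', map { sprintf('%10s', $_) } @fields;
    }
    return map { $_ . "\n" } @rows;
}

# Made-up input: a first line that is skipped, then three wrapped rows.
my @tidy = tidy_rows(
    "    3\n",
    "A         0.0  0.2\n",
    "  0.6\n",
    "B         0.2  0.0\n",
    "  0.5\n",
    "C         0.6  0.5\n",
    "  0.0\n",
);

print @tidy;
```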

Attachment: ATP6.distmat (7.7 K, 04 Nov 2007 - 08:34, IanHolmes): ATP6 distance matrix (one row per line)