UPGMA

From Biowiki
Jump to: navigation, search

UPGMA (Unweighted Pair Group Mean Average) is an algorithm for constructing phylogenetic trees.

The algorithm only works well if the input distance matrix is Wikipedia:ultrametric.

-- Ian Holmes - 04 Nov 2007

The UPGMA algorithm can be summarized in the following pseudocode:

  • Inputs:
    • N, a list of nodes to be joined into a tree (initially containing all leaf nodes);
    • D(X,Y), the distance between nodes X and Y, which must be symmetric (so D(X,Y) = D(Y,X)).
  • Let H(X) be the height of node X.
    • Initialize H(X) = 0 for all X in N.
  • Repeat while there are two or more nodes in N:
    • Find I and J, the closest two nodes in N.
    • Remove I and J from N.
    • Create a new node, K, that will be the parent of I and J in the tree.
    • Define the height of node K to be H(K) = .5 * (H(I) + H(J) + D(I,J))
      • Create a branch from K to I of length H(K) - H(I)
      • Create a branch from K to J of length H(K) - H(J)
    • Define distances from K to all other nodes as follows:
      • D(K,L) = (D(I,L) + D(J,L)) / 2 for all nodes L in N;
    • Insert K into N.
  • Outputs:
    • The final node in N is the root node of the phylogenetic tree to be returned.

The key data structure that is being created here is the Wikipedia:tree_data_structure. In fact, the only operation that creates the tree structure here is the following:

  • Create a new node, K, that is the parent of I and J in the tree.

One typically also needs to implement some kind of Wikipedia:tree_traversal algorithm, in order to print out all the nodes.

If the output tree is to be returned as a Newick Format string, there is a particularly succinct implementation of UPGMA as an Wikipedia:In-place_algorithm, as follows.

Since each node in N is represented as a string, the "create node K" operation amounts to creating the following string ( red text indicates string literals; black italics indicate variable names):

( I  : B(K,I) , J  : B(K,J) )

where B(K,I) is the branch length from K to I.

-- Ian Holmes - 05 Nov 2007