UPGMA
UPGMA (Unweighted Pair Group Mean Average) is an algorithm for constructing phylogenetic trees.
The algorithm only works well if the input distance matrix is Wikipedia:ultrametric.
-- Ian Holmes - 04 Nov 2007
The UPGMA algorithm can be summarized in the following pseudocode:
- Inputs:
- N, a list of nodes to be joined into a tree (initially containing all leaf nodes);
- D(X,Y), the distance between nodes X and Y, which must be symmetric (so D(X,Y) = D(Y,X)).
- Let H(X) be the height of node X.
- Initialize H(X) = 0 for all X in N.
- Repeat while there are two or more nodes in N:
- Find I and J, the closest two nodes in N.
- Remove I and J from N.
- Create a new node, K, that will be the parent of I and J in the tree.
- Define the height of node K to be H(K) = .5 * (H(I) + H(J) + D(I,J))
- Create a branch from K to I of length H(K) - H(I)
- Create a branch from K to J of length H(K) - H(J)
- Define distances from K to all other nodes as follows:
- D(K,L) = (D(I,L) + D(J,L)) / 2 for all nodes L in N;
- Insert K into N.
- Outputs:
- The final node in N is the root node of the phylogenetic tree to be returned.
The key data structure that is being created here is the Wikipedia:tree_data_structure. In fact, the only operation that creates the tree structure here is the following:
- Create a new node, K, that is the parent of I and J in the tree.
One typically also needs to implement some kind of Wikipedia:tree_traversal algorithm, in order to print out all the nodes.
If the output tree is to be returned as a Newick Format string, there is a particularly succinct implementation of UPGMA as an Wikipedia:In-place_algorithm, as follows.
Since each node in N is represented as a string, the "create node K" operation amounts to creating the following string ( red text indicates string literals; black italics indicate variable names):
( I : B(K,I) , J : B(K,J) )
where B(K,I) is the branch length from K to I.
-- Ian Holmes - 05 Nov 2007