# UPGMA

UPGMA (Unweighted Pair Group Mean Average) is an algorithm for constructing phylogenetic trees.

The algorithm only works well if the input distance matrix is Wikipedia:ultrametric.

-- Ian Holmes - 04 Nov 2007

The UPGMA algorithm can be summarized in the following pseudocode:

• Inputs:
• N, a list of nodes to be joined into a tree (initially containing all leaf nodes);
• D(X,Y), the distance between nodes X and Y, which must be symmetric (so D(X,Y) = D(Y,X)).
• Let H(X) be the height of node X.
• Initialize H(X) = 0 for all X in N.
• Repeat while there are two or more nodes in N:
• Find I and J, the closest two nodes in N.
• Remove I and J from N.
• Create a new node, K, that will be the parent of I and J in the tree.
• Define the height of node K to be H(K) = .5 * (H(I) + H(J) + D(I,J))
• Create a branch from K to I of length H(K) - H(I)
• Create a branch from K to J of length H(K) - H(J)
• Define distances from K to all other nodes as follows:
• D(K,L) = (D(I,L) + D(J,L)) / 2 for all nodes L in N;
• Insert K into N.
• Outputs:
• The final node in N is the root node of the phylogenetic tree to be returned.

The key data structure that is being created here is the Wikipedia:tree_data_structure. In fact, the only operation that creates the tree structure here is the following:

• Create a new node, K, that is the parent of I and J in the tree.

One typically also needs to implement some kind of Wikipedia:tree_traversal algorithm, in order to print out all the nodes.

If the output tree is to be returned as a Newick Format string, there is a particularly succinct implementation of UPGMA as an Wikipedia:In-place_algorithm, as follows.

Since each node in N is represented as a string, the "create node K" operation amounts to creating the following string ( red text indicates string literals; black italics indicate variable names):

( I  : B(K,I) , J  : B(K,J) )

where B(K,I) is the branch length from K to I.

-- Ian Holmes - 05 Nov 2007