Distance Format

From Biowiki

The Distance Format is a file format for specifying phylogenetic "distance matrices". It is used by the distance-matrix programs and related programs in PHYLIP.

A distance matrix can be viewed as a summary (albeit an incomplete one) of the probability distribution over phylogenetic trees. Entry (i,j) corresponds to the estimated evolutionary "distance" (or time) separating species i and j. Because the matrix is symmetric and its diagonal elements are zero, only the lower triangle of the matrix need be specified.
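
As a minimal sketch of why the lower triangle suffices, the following Python function rebuilds the full square matrix from lower-triangular rows. The function name expand_lower_triangle is hypothetical (it is not part of PHYLIP), and the sketch assumes that row i holds exactly i distances, from species i to species 0 through i-1.

```python
def expand_lower_triangle(rows):
    """Rebuild a full symmetric matrix from strictly lower-triangular rows.

    Assumption: rows[i] holds i distances, from species i to species 0 .. i-1.
    """
    n = len(rows)
    # The diagonal stays at zero: the distance from a species to itself.
    full = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(rows):
        if len(row) != i:
            raise ValueError(f"row {i} should have {i} entries, got {len(row)}")
        for j, d in enumerate(row):
            # Symmetry: entry (i,j) equals entry (j,i).
            full[i][j] = d
            full[j][i] = d
    return full
```

For three species, the rows [], [1.0], [2.0, 3.0] are enough to recover all nine entries of the square matrix.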

Distance matrices are typically used to make quick estimates of the phylogeny of a set of sequences, using heuristic algorithms like Neighbor Joining or UPGMA that approximate more rigorous likelihood-based methods.

The input format for distance data is straightforward. The first line of the input file contains the number of species. There follows species data, starting, as with all other programs, with a species name. The species name is ten characters long, and must be padded out with blanks if shorter. For each species there then follows a set of distances to all the other species (options selected in the programs' menus allow the distance matrix to be upper or lower triangular or square). The distances can continue to a new line after any of them. If the matrix is lower-triangular, the diagonal entries (the distances from a species to itself) will not be read by the programs. If they are included anyway, they will be ignored by the programs, except for the case where one of them starts a new line, in which case the program will mistake it for a species name and get very confused.

From http://evolution.genetics.washington.edu/phylip/doc/distance.html
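
The lower-triangular layout described in the quoted paragraph above can be illustrated with a small reader. This is only a sketch under stated assumptions, not PHYLIP's own parser: the function name parse_lower_triangular and the sample data in EXAMPLE are hypothetical, the species name is taken to occupy exactly the first ten characters of its line, and any values beyond the expected count on a line (such as a diagonal entry) are simply dropped.

```python
# Hypothetical sample file: 4 species, lower-triangular, no diagonal.
# Each name is padded with blanks to ten characters before the distances.
EXAMPLE = """\
    4
Alpha
Beta        1.0
Gamma       2.0  3.0
Delta       4.0  5.0
  6.0
"""


def parse_lower_triangular(text):
    """Parse a lower-triangular distance file into (names, rows).

    Assumptions: the first ten characters of a species line are the name,
    and species i (counting from 0) is followed by i distances, which may
    continue onto later lines.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n = int(lines[0].split()[0])  # first line: number of species
    pos = 1
    names, rows = [], []
    for i in range(n):
        line = lines[pos]
        pos += 1
        names.append(line[:10].strip())
        values = line[10:].split()
        # Distances may continue on following lines until i have been read.
        while len(values) < i:
            values.extend(lines[pos].split())
            pos += 1
        # Keep only the i expected distances; extras on the line are ignored.
        rows.append([float(v) for v in values[:i]])
    return names, rows


names, rows = parse_lower_triangular(EXAMPLE)
```

Note that the sketch shares the weakness mentioned in the quoted text: a diagonal entry that starts a new line would be read as the next species name.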

-- Ian Holmes - 05 Apr 2005