Gff Tools



This page describes tools for working with the GFF format for genome feature annotation.

The tools can be downloaded as a tarball from github, here:

The most common operations that one tends to want to perform on sets of GFF records include intersection, exclusion, union, filtration, sorting, transformation (to a new co-ordinate system) and dereferencing (access to the described sequence).

These operations form a basis for more sophisticated algorithms like clustering and joining-together by dynamic programming.

Programs to perform all of these tasks are described below, with links to local copies.
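
As a rough illustration of the intersection and exclusion operations, here is a minimal Python sketch (not the Perl code of the tools themselves). The tuple layout and the `min_overlap` parameter are assumptions made for this example; a zero or negative `min_overlap` is used here to admit near-neighbours. The naive all-against-all comparison is quadratic, so it only shows the definition and would not suit chromosome-sized files.

```python
from typing import Iterable, List, Tuple

# A feature is (seqname, start, end) with 1-based inclusive co-ordinates.
# This tuple layout is an assumption made for the sketch.
Feature = Tuple[str, int, int]


def intersects(a: Feature, b: Feature, min_overlap: int = 1) -> bool:
    """True if a and b share at least min_overlap bases on the same sequence.

    For non-overlapping features the computed overlap is minus the number of
    bases between them, so min_overlap=-4 accepts neighbours separated by up
    to 4 bases.
    """
    if a[0] != b[0]:
        return False
    overlap = min(a[2], b[2]) - max(a[1], b[1]) + 1
    return overlap >= min_overlap


def intersection(xs: Iterable[Feature], ys: List[Feature],
                 min_overlap: int = 1) -> List[Feature]:
    """Features of xs that intersect at least one feature of ys."""
    return [x for x in xs if any(intersects(x, y, min_overlap) for y in ys)]


def exclusion(xs: Iterable[Feature], ys: List[Feature],
              min_overlap: int = 1) -> List[Feature]:
    """Features of xs that intersect no feature of ys."""
    return [x for x in xs if not any(intersects(x, y, min_overlap) for y in ys)]
```

For example, `intersection([("chr1", 1, 10)], [("chr1", 8, 20)])` keeps the feature, because the two share bases 8 to 10.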

An older, probably obsolete format for representing NSE pairs, used by several of the programs listed below, is EXBLX, the format used by MSPCrunch (Sonnhammer & Durbin: An expert system for processing sequence homology data. Proc Int Conf Intell Syst Mol Biol 1994;2:363-8.).

The tools are now rather outdated, and probably unnecessary for anyone who knows how to use a relational database, and can therefore use something like Chado. Nonetheless, they still find occasional use in our lab. YMMV.

GFF tools

(The following tools were written by Ian Holmes. Please email if you require documentation for these programs.)

Several of these scripts duplicate functionality provided by Tim Hubbard's module, but are more efficient (or at least they were at one point). This is a significant consideration for chromosome-sized GFF files!

The programs fall into several categories:

  • Perl scripts that we still use reasonably often:
    • - efficiently finds the intersection (or exclusion) of two GFF streams, reporting intersection information in the Group field. Definition of "intersection" allows for near-neighbours and minimum-overlap
      • - used with to do reverse lookups and other manipulations on the results of an intersection test. Useful for e.g. pruning the lowest-scoring redundant entries from a GFF file
    • - uses a GFF file to mask out specified sections of a FASTA-format DNA database with "n"'s (or any other character)
    • - given chromosome co-ordinates, a clone database and a physical map co-ordinate file, returns the specified section of chromosomal sequence, even if it spans multiple clones. Requires and
    • - filters lines out of a GFF stream according to user-specified criteria
    • - sorts GFF streams by sequence name and startpoint
    • - merges sorted GFF streams
    • - cuts out everything in one GFF file that overlaps with features in a second. Useful e.g. to find all unannotated regions in a genome.
    • - a very simple script that returns the enclosing region for all GFF features
  • Programs that are only tangentially related to GFF, but complement the GFF tools well & are still useful:
    • - flags low-complexity regions in a FASTA DNA database. The complexity is calculated as the entropy of variable-length oligomer composition in a variable-length sliding window
    • Window Licker (part of DART)
    • Mercator Perl (a separate CVS distribution for working with the output of the Mercator Program)
  • Perl scripts that we hardly use at all these days, if ever:
  • Programs in languages other than Perl (gasp) -- these are also rarely used:
    • gffhitcount - a C++ program that counts the number of times each base in a set of sequences is spanned by a GFF record and returns the results in GFF format.
    • EXBLX dynamic programming: bigdp - a C++ program that assembles EXBLX segments using an affine gap penalty by doing linear-space divide-and-conquer dynamic programming, written by Ian Holmes. The program does not examine the sequences to which the EXBLX data refer, but finds optimal connections between the segments given their co-ordinates. GFF pair format can be converted to EXBLX using

A note regarding bigdp and the format it uses: EXBLX records are single lines comprising eight whitespace-delimited fields:


bigdp requires that the two NSEs are the same length (i.e. END#1 - START#1 = END#2 - START#2). The output of bigdp is modified EXBLX. Each line of the output describes a set of several input segments joined together; the percent-ID field is replaced by the number of input segments that were used, and a ninth field, compactly describing the co-ordinates of the input segments, is added.

The algorithm used by the program is documented more fully in Ian Holmes' PhD thesis.
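
To give a feel for what joining segments by dynamic programming involves, here is a small Python sketch of segment chaining with an affine gap penalty. It is emphatically not bigdp's algorithm: it is a plain quadratic-time chaining recursion rather than linear-space divide-and-conquer. The segment tuple layout, the gap-length definition and the default penalties are all assumptions made for illustration. Like bigdp, it looks only at co-ordinates and rejects segments whose two NSEs differ in length.

```python
from typing import List, Sequence, Tuple

# (score, start1, end1, start2, end2): a hypothetical in-memory layout,
# not the EXBLX column order.
Segment = Tuple[float, int, int, int, int]


def best_chain(segments: Sequence[Segment],
               gap_open: float = 5.0,
               gap_extend: float = 0.25) -> Tuple[float, List[Segment]]:
    """Return the best-scoring chain of co-linear segments and its score."""
    for _, s1, e1, s2, e2 in segments:
        if e1 - s1 != e2 - s2:
            raise ValueError("both NSEs of a segment must be the same length")

    # Sorting by start guarantees every possible predecessor comes earlier.
    segs = sorted(segments, key=lambda s: (s[1], s[3]))
    if not segs:
        return 0.0, []

    best = [s[0] for s in segs]   # best chain score ending at each segment
    prev = [-1] * len(segs)       # back-pointers for the traceback
    for i, (score_i, s1, _, s2, _) in enumerate(segs):
        for j in range(i):
            _, _, e1, _, e2 = segs[j]
            if e1 < s1 and e2 < s2:  # j ends before i starts in both sequences
                # Assumed gap length: the longer of the two unaligned stretches.
                gap = max(s1 - e1 - 1, s2 - e2 - 1)
                penalty = gap_open + gap_extend * gap if gap else 0.0
                candidate = best[j] + score_i - penalty
                if candidate > best[i]:
                    best[i], prev[i] = candidate, j

    # Trace back from the highest-scoring chain end.
    i = max(range(len(segs)), key=best.__getitem__)
    total = best[i]
    chain: List[Segment] = []
    while i != -1:
        chain.append(segs[i])
        i = prev[i]
    chain.reverse()
    return total, chain
```

The number of segments in the returned chain corresponds to the count that bigdp's output reports in place of the percent-ID field.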

  • More tangentially related programs that are rarely used:
    • - symmetrises an EXBLX file (ensures that for every A:B pair there is a single corresponding pair B:A)
    • - asymmetrises an EXBLX file (filters through only those pairs A:B for which B>A)
    • - builds optimal clusters from an EXBLX stream
    • - builds clusters from an EXBLX stream using a fast incremental heuristic
    • - builds optimal clusters from an EXBLX stream, ignoring sequence start and endpoint
    • - builds a quick lookup index for an EXBLX file
    • - filters through only non-overlapping entries from an EXBLX stream
    • - sorts an EXBLX stream
    • - tidies up an EXBLX stream (joins overlapping matches, prunes out lines corresponding to BLAST errors, etc.)
    • - transforms from one co-ordinate system to another (e.g. clones to chromosomes). Requires
    • - BLASTs a clone database against itself then transforms, sorts and merges the results into chromosome co-ordinates according to a physical (sequence) map file, which is in GFF format. Requires
    • - module to assist iterations on FASTA DNA databases; creates temporary files for each sequence
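
Two of the GFF operations listed earlier, sorting a stream by sequence name and startpoint and merging already-sorted streams, are easy to sketch in Python. This is an illustration only, not a port of the Perl scripts. It assumes tab-delimited GFF lines with the sequence name in field 1 and the start co-ordinate in field 4, as in the GFF specification; the function names are made up for this example.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple


def gff_key(line: str) -> Tuple[str, int]:
    """Sort key: (sequence name, start co-ordinate) of one GFF line."""
    fields = line.rstrip("\n").split("\t")
    return fields[0], int(fields[3])


def sort_gff(lines: Iterable[str]) -> List[str]:
    """Sort GFF lines in memory by sequence name, then startpoint."""
    return sorted(lines, key=gff_key)


def merge_gff(*streams: Iterable[str]) -> Iterator[str]:
    """Lazily merge streams that are each already sorted by gff_key."""
    return heapq.merge(*streams, key=gff_key)
```

Because `heapq.merge` is lazy, the merge step holds only one pending line per input stream, which is the property that makes merging practical for large files; the in-memory sort does not share it.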

Repository access

The bleeding-edge way to get gfftools is from github, here:

You can also download the entire repository as a tarball:

Tools from other places

Some history

GFF, a general-purpose genome annotation format, was conceived during a 1997 meeting on computational genefinding at the Isaac Newton Institute, Cambridge, UK.

GFF was designed to hold predicted subfeatures (exons, introns, splice sites, polypyrimidine tracts, promoters, etc.) in a common input format for multi-component genefinders that stitch together the results of specialized "sensor" or "predictor" programs in a Dynamic Programming (or particularly a Hidden Markov Model) framework. Examples of such genefinders are described in Reese et al.: Genie--gene finding in Drosophila melanogaster. Genome Res. 2000;10:529-38. and Howe et al.: GAZE: a generic framework for the integration of gene-prediction data by dynamic programming. Genome Res. 2002;12:1418-27. (downloadable from Sanger).

The 9 fields of a GFF record are, in order:

1 seqname, 2 source, 3 feature, 4 start, 5 end, 6 score, 7 strand, 8 frame, 9 group

At least the first three of these are useful enough, and the GFF syntax for null fields compact enough (1 byte), that GFF also caught on as a format for quick-and-dirty Perl-fuelled data mining at the Unix command line. Scorned by those who worship databases and ontologies, GFF gained a rep as a lo-tech genome-hackers' format, favored by sequence-mungers too busy for SQL.
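
As an illustration of how little code the format demands, here is a minimal Python sketch that splits one GFF line into its nine fields. The field names and the use of "." as the one-byte null value follow the GFF specification rather than anything stated on this page, and the `GffRecord` and `parse_gff_line` names are invented for the example.

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]   # None when the field is the null value "."
    strand: Optional[str]
    frame: Optional[int]
    group: Optional[str]


def parse_gff_line(line: str) -> GffRecord:
    """Parse one tab-delimited GFF line; the ninth (group) field may be absent."""
    fields = line.rstrip("\r\n").split("\t", 8)
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 tab-delimited fields, got {len(fields)}")
    if len(fields) == 8:
        fields.append(".")   # treat a missing group field as null
    seqname, source, feature, start, end, score, strand, frame, group = fields
    return GffRecord(
        seqname, source, feature, int(start), int(end),
        None if score == "." else float(score),
        None if strand == "." else strand,
        None if frame == "." else int(frame),
        None if group == "." else group,
    )
```

A line such as `chr1<TAB>genie<TAB>exon<TAB>100<TAB>200<TAB>.<TAB>+<TAB>0<TAB>gene1` parses to a record whose `score` is `None` and whose `start` and `end` are integers.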

Several other tools for these sorts of purposes exist (aside from, um, using a proper SQL database), for example Tim Hubbard's Perl modules. However, the gfftools have a small underground following because they were built explicitly to be user-driven, interactive and fast, with time-consuming steps (like sorting, indexing or I/O) directly under the analyst's control.

These scripts are flawed, and by no means the most elegant or algorithmically thoughtful thing I ever wrote, but some people still use them. For these reasons, I've finally got around to putting them up on a public CVS server.

This page was moved from the Sanger GFF page by Ian Holmes on 21 March 2005.