This page describes tools for working with the GFF format
for genome feature annotation.
The tools can be downloaded as a tarball from GitHub: https://github.com/ihh/gfftools/tarball/master
The most common operations that one tends to want to perform
on sets of GFF records include intersection, exclusion, union, filtration, sorting, transformation
(to a new co-ordinate system) and dereferencing (access to the described sequence).
These operations form a basis for more
sophisticated algorithms like clustering and joining-together by dynamic programming.
Programs to perform all of these tasks are described below, with links to local copies.
An older, probably obsolete format for representing pairs of NSEs (name, start, end triples), used by several of the programs listed below,
is EXBLX, as used by MSPCrunch (Sonnhammer EL, Durbin R. An expert system for processing sequence homology data.
Proc Int Conf Intell Syst Mol Biol. 1994;2:363-8.).
The tools are now rather outdated, and probably unnecessary for anyone who knows how to use a relational database, and can therefore use something like Chado.
Nonetheless, they still find occasional use in our lab. YMMV.
(The following tools were written by Ian Holmes.
Please email if you require documentation for these programs.)
Several of these scripts duplicate functionality provided
by Tim Hubbard's GFF.pm
module, but are more efficient (or at least they were at one point).
This is a significant consideration for chromosome-sized GFF files!
The programs fall into several categories:
- Perl scripts that we still use reasonably often:
gffintersect.pl - efficiently finds the intersection (or exclusion) of two GFF streams, reporting intersection information in the Group field. Definition of "intersection" allows for near-neighbours and minimum-overlap
intersectlookup.pl - used with gffintersect.pl to do reverse lookups and other manipulations on the results of an intersection test. Useful for e.g. pruning the lowest-scoring redundant entries from a GFF file
gffmask.pl - uses a GFF file to mask out specified sections of a FASTA-format DNA database with "n"'s (or any other character)
gff2seq.pl - given chromosome co-ordinates, a clone database and a physical map co-ordinate file, returns the specified section of chromosomal sequence, even if it spans multiple clones. Requires
gfffilter.pl - filters lines out of a GFF stream according to user-specified criteria
gffsort.pl - sorts GFF streams by sequence name and startpoint
gffmerge.pl - merges sorted GFF streams
gffsubtract.pl - cuts out everything in one GFF file that overlaps with features in a second. Useful e.g. to find all unannotated regions in a genome.
gffspan.pl - a very simple script that returns the enclosing region for all GFF features
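The kind of intersection test described for gffintersect.pl (overlap with an optional minimum overlap, plus near-neighbours) can be sketched in a few lines. This is an illustrative Python sketch, not the script's own code: the names `Feature`, `overlap_length`, `intersects` and `intersect`, and the `min_overlap` and `max_gap` parameters, are all hypothetical. It uses a simple quadratic scan for clarity, whereas a tool meant for chromosome-sized files would presumably work on sorted streams.

```python
from collections import namedtuple

# Minimal feature: sequence name plus 1-based inclusive start/end, as in GFF.
Feature = namedtuple("Feature", "seqname start end")


def overlap_length(a, b):
    """Number of shared bases; zero or negative means no overlap.

    For non-overlapping features, the negated value is the number of
    bases lying between them.
    """
    return min(a.end, b.end) - max(a.start, b.start) + 1


def intersects(a, b, min_overlap=1, max_gap=None):
    """True if a and b lie on the same sequence and either share at least
    min_overlap bases or (when max_gap is given) are disjoint but
    separated by at most max_gap bases."""
    if a.seqname != b.seqname:
        return False
    shared = overlap_length(a, b)
    if shared >= min_overlap:
        return True
    return max_gap is not None and shared <= 0 and -shared <= max_gap


def intersect(stream_a, stream_b, min_overlap=1, max_gap=None):
    """Yield (a, hits) for every feature a that matches at least one b."""
    b_list = list(stream_b)
    for a in stream_a:
        hits = [b for b in b_list if intersects(a, b, min_overlap, max_gap)]
        if hits:
            yield a, hits
```

Exclusion is the complementary filter: keep the features from the first stream for which `hits` is empty.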
- Format conversion utilities:
- Programs that are only tangentially related to GFF, but complement the GFF tools well & are still useful:
cfilter.pl - flags low-complexity regions in a FASTA DNA database. The complexity is calculated as the entropy of variable-length oligomer composition in a variable-length sliding window
- WindowLicker (part of DART)
- MercatorPerl (a separate CVS distribution for working with the output of the MercatorProgram)
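The complexity measure described for cfilter.pl, the entropy of oligomer composition in a sliding window, can be illustrated as follows. This Python sketch is not taken from cfilter.pl: the function names, the fixed window and oligomer lengths, and the `threshold` and `step` parameters are assumptions made for the example (the script itself is described as using variable lengths).

```python
import math
from collections import Counter


def kmer_entropy(window, k):
    """Shannon entropy, in bits, of the k-mer composition of a string."""
    kmers = [window[i:i + k] for i in range(len(window) - k + 1)]
    if not kmers:
        return 0.0
    total = len(kmers)
    counts = Counter(kmers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def low_complexity_windows(seq, window=20, k=2, threshold=1.0, step=1):
    """Yield (start, end, entropy) for each window whose k-mer entropy is
    below threshold. Coordinates are 1-based and inclusive, as in GFF."""
    for i in range(0, len(seq) - window + 1, step):
        h = kmer_entropy(seq[i:i + window], k)
        if h < threshold:
            yield i + 1, i + window, h
```

A homopolymer run has entropy zero and is always flagged, while a window in which every oligomer is equally frequent reaches the maximum entropy for that oligomer length.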
- Perl scripts that we hardly use at all these days, if ever:
- Programs in languages other than Perl (gasp) -- these are also rarely used:
gffhitcount - a C++ program that counts the number of times each base in a set of sequences is spanned by a GFF record and returns the results in GFF format.
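Counting how many records span each base, as gffhitcount is described as doing, is commonly done with a sweep over start and end events rather than a per-base array. The Python sketch below shows that technique under stated assumptions: it is not the C++ program's code, the function name `hit_counts` is hypothetical, and the output (runs of constant depth as tuples) is a simplification of whatever GFF the real program writes.

```python
from collections import defaultdict


def hit_counts(features):
    """Count, for each base, how many features span it.

    features: iterable of (seqname, start, end), 1-based inclusive.
    Returns {seqname: [(start, end, depth), ...]}, listing maximal runs
    of constant non-zero depth in coordinate order.
    """
    events = defaultdict(lambda: defaultdict(int))
    for seqname, start, end in features:
        events[seqname][start] += 1      # depth rises at the start
        events[seqname][end + 1] -= 1    # and falls just past the end
    runs = {}
    for seqname, deltas in events.items():
        depth, run_start, out = 0, None, []
        for pos in sorted(deltas):
            if deltas[pos] == 0:
                continue  # a start and an end cancel here; depth unchanged
            if depth > 0:
                out.append((run_start, pos - 1, depth))
            depth += deltas[pos]
            run_start = pos
        runs[seqname] = out
    return runs
```

Each output run could be written as one GFF line, with the depth in the score field.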
- EXBLX dynamic programming:
bigdp - a C++ program that assembles EXBLX segments using an affine gap penalty by doing linear-space divide-and-conquer dynamic programming, written by Ian Holmes. The program does not examine the sequences to which the EXBLX data refer, but finds optimal connections between the segments given their co-ordinates. GFF pair format can be converted to EXBLX using
A note regarding bigdp and the format it uses:
EXBLX records are single lines comprising eight whitespace-delimited fields:
(SCORE, PERCENT-ID, START#1, END#1, NAME#1, START#2, END#2, NAME#2).
bigdp requires that the two NSEs are the same length (i.e. END#1 - START#1 = END#2 - START#2).
The output of bigdp is modified EXBLX.
Each line of the output describes a set of several input segments joined together;
the percent-ID field is replaced by the number of input segments that were used
and a ninth field, compactly describing the co-ordinates of the input segments, is added.
The algorithm used by the program is documented more fully in Ian Holmes' PhD thesis.
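As a concrete illustration of the eight-field EXBLX record and the equal-length requirement, here is a minimal Python sketch of a parser. It is an assumption-laden example rather than code from the distribution: the names `ExblxRecord`, `parse_exblx_line` and `same_length` are hypothetical, and treating the score and percent-ID as floats and the co-ordinates as integers is a guess at sensible types.

```python
from collections import namedtuple

# Field order follows the record layout given above:
# (SCORE, PERCENT-ID, START#1, END#1, NAME#1, START#2, END#2, NAME#2)
ExblxRecord = namedtuple(
    "ExblxRecord",
    "score percent_id start1 end1 name1 start2 end2 name2",
)


def parse_exblx_line(line):
    """Parse one whitespace-delimited, eight-field EXBLX line."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}")
    score, pid, s1, e1, n1, s2, e2, n2 = fields
    return ExblxRecord(float(score), float(pid),
                       int(s1), int(e1), n1,
                       int(s2), int(e2), n2)


def same_length(rec):
    """True if both NSEs span the same number of bases."""
    return rec.end1 - rec.start1 == rec.end2 - rec.start2
```

For a made-up line such as `120 85 101 150 cloneA 201 250 cloneB`, both segments span 50 bases, so `same_length` returns `True`.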
- More tangentially related programs that are rarely used:
exblxsym.pl - symmetrises an EXBLX file (ensures that for every A:B pair there is a single corresponding pair B:A)
exblxasym.pl - asymmetrises an EXBLX file (filters through only those pairs A:B for which B>A)
exblxcluster.pl - builds optimal clusters from an EXBLX stream
exblxfastcluster.pl - builds clusters from an EXBLX stream using a fast incremental heuristic
seqcluster.pl - builds optimal clusters from an EXBLX stream, ignoring sequence start and endpoint
exblxindex.pl - builds a quick lookup index for an EXBLX file
exblxsingles.pl - filters through only non-overlapping entries from an EXBLX stream
exblxsort.pl - sorts an EXBLX stream
exblxtidy.pl - tidies up an EXBLX stream (joins overlapping matches, prunes out lines corresponding to BLAST errors, etc.)
exblxtransform.pl - transforms from one co-ordinate system to another (e.g. clones to chromosomes). Requires
blasttransform.pl - BLASTs a clone database against itself then transforms, sorts and merges the results into chromosome co-ordinates according to a physical (sequence) map file, which is in GFF format. Requires
SequenceIterator.pm - module to assist iterations on FASTA DNA databases; creates temporary files for each sequence
The bleeding-edge way to get gfftools
is from GitHub: https://github.com/ihh/gfftools
You can also download the entire repository as a tarball: https://github.com/ihh/gfftools/tarball/master
Tools from other places
GFF, a general-purpose genome annotation format,
was conceived during a 1997 meeting on computational genefinding at the Isaac Newton Institute, Cambridge, UK.
GFF was designed to hold predicted subfeatures
(exons, introns, splice sites, polypyrimidine tracts, promoters, etc.)
in a common input format for multi-component genefinders, such as
(downloadable from Sanger
that stitch together the results of specialized "sensor" or "predictor" programs
in a dynamic programming
(or, in particular, a hidden Markov model) framework.
The 9 fields of a GFF record are seqname, source, feature, start, end, score, strand, frame and group.
At least the first three of these are useful enough,
and the GFF syntax for null fields compact enough (1 byte),
that GFF also caught on as a format for quick-and-dirty
Perl-fuelled data mining at the Unix command line.
Scorned by those who worship databases and ontologies,
GFF gained a rep as a lo-tech genome-hackers' format,
favored by sequence-mungers too busy for SQL.
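To make the record layout concrete, here is a minimal Python sketch of a GFF line parser. The field names (seqname, source, feature, start, end, score, strand, frame, group) follow the GFF specification; the names `GffRecord` and `parse_gff_line`, the choice to map the one-byte null field "." to `None`, and the acceptance of lines without a group field are assumptions made for this example.

```python
from collections import namedtuple

GffRecord = namedtuple(
    "GffRecord",
    "seqname source feature start end score strand frame group",
)


def parse_gff_line(line):
    """Parse one tab-delimited GFF line into a GffRecord.

    The null field "." becomes None for score, strand and frame.
    The group field is treated as optional and defaults to "".
    """
    fields = line.rstrip("\n").split("\t", 8)
    if len(fields) == 8:
        fields.append("")  # no group field on this line
    if len(fields) != 9:
        raise ValueError(f"expected 8 or 9 fields, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame, group = fields
    return GffRecord(
        seqname, source, feature,
        int(start), int(end),
        None if score == "." else float(score),
        None if strand == "." else strand,
        None if frame == "." else int(frame),
        group,
    )
```

A parser of this size is typical of the command-line data mining described above: one line in, one record out, with no index or database required.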
Several other tools for these sorts of purposes exist (aside from, um, using a proper SQL database).
One example is Tim Hubbard's Perl modules.
have a small underground following because they were built explicitly to
, with time-consuming steps (like sorting, indexing or I/O) directly under the analyst's control.
These scripts are flawed, and by no means the most elegant or algorithmically thoughtful thing I ever wrote,
but some people still use them.
For these reasons, I've finally got around to putting them up on a public CVS server.
This page was moved from the Sanger GFF page on 21 March 2005.