See Mercator Format for notes on the Mercator format.
8-) Install DART
setenv DARTDIR /mnt/nfs/src/dart setenv PERL5LIB setenv PERL5LIB $DARTDIR/perl:$PERL5LIB
8-) Install gfftools
See Gff Tools for details (scroll down to "CVS access").
These scripts are in (surprise) /mnt/nfs/src/gfftools.
8-) Check out the mercator-perl CVS directory
setenv CVSROOT :ext:cvs.biowiki.org:/usr/local/cvs cvs co mercator-perl
8-) Check out Ian's ragtag collection of perl scripts
You probably don't actually need any of these scripts for this tutorial, but some of them are occasionally useful.
setenv CVSROOT :ext:email@example.com:/usr/local/cvs cvs co perl
8-) Ensure dart, gfftools, perl and mercator-perl script directories are in your path
e.g. add the following to .cshrc
setenv PATH $PATH:$DARTDIR/perl:/mnt/nfs/src/gfftools:/mnt/nfs/src/perl:/mnt/nfs/src/mercator-perl
8-) Locate the MERCATOR "genomes" and "map" files, and the top-level alignment directory
/mnt/nfs/data/genome/fly/align/12fly/mercator/genomes /mnt/nfs/data/genome/fly/align/12fly/mercator/map /mnt/nfs/data/genome/fly/align/12fly/mavid/
See Fruitfly Bioinformatics Resources for source links.
8-) Locate the Flybase GFF annotation files
8-) Hack the MERCATOR map file
The MERCATOR map file has slightly different names for the D.mel. chromosomes than does the Flybase annotation (MERCATOR prepends chr to every chromosome name, e.g. MERCATOR's chr2L is Flybase's 2L). To correct for this, pipe the MERCATOR map file through the throwaway hackmap.pl script in the mercator-perl dir, e.g.
cd /mnt/nfs/data/genome/fly/align/12fly mercator-perl/hackmap.pl mercator/map > mercator/hacked-map
You only need to do this once. If you look at the script you'll see it could easily be a one-liner. Note that the script currently assumes D.mel is the first entry in the map file.
8-) Use featurevole.pl to map the Flybase annotation GFF from genomic to alignment coords
e.g. for the noncoding RNAs:
mercator-perl/featurevole.pl [[Dro Mel]]_CAF1 -gff /mnt/nfs/data/genome/fly/annot/fb4.3/Dmel/AllChr/ncRNA.gff -out remapped-ncRNA.gff -genomes mercator/genomes -map mercator/hacked-map -align mercator/alignments/ >& remapped-ncRNA.log
A few things to note:
- The mandatory argument DroMel_CAF1 is the name of the species in the MERCATOR genomes file
- A GFF file, remapped-ncRNA.gff, is created in the current working directory
- The entries of the GFF file are in alignment co-ordinates; that is,
- the sequence name (field 0 of the GFF) is now the MERCATOR alignment ID;
- the start and endpoints of the feature (fields 3 and 4 of the GFF) now refer to alignment column numbers;
- the "strandedness" of the feature (field 6 of the GFF) may be flipped if the MAVID alignment is reverse-stranded with respect to the annotated species (though this is probably unlikely
- the original sequence name, startpoint, endpoint and strand (in genomic coords) are preserved at the end of the final "group" field in the GFF line
- The Flybase GFF entries contain quite a lot of additional annotation info in the "group" field, e.g. the Flybase gene number (FBgn); this is preserved by the mapping
- Several features are truncated because they only partially overlap MAVID alignments
- Some mapfile lines do not involve the reference species and are skipped with an error message
- As with most of these scripts, featurevole.pl prints a help message if you call it with the -h option
- It may be helpful to sort the GFF file by feature length. You can use the gfftools/gffsort.pl script to do this.
(Holmes Lab-specific footnote): I've started adding rules to /mnt/nfs/data/genome/fly/Makefile to automate some of these steps. Try e.g. make -n annot/featurevole/.signal_peptide.gff2dirs and see what happens.
8-) Split the remapped GFF file by alignment ID
The gff2dirs.pl script simply splits up the GFF files, putting a file named mavid.gff (or another filename that you specify) in each alignment subdirectory. This newly-created GFF file contains the annotations for that alignment.
mercator-perl/gff2dirs.pl -gff remapped-ncRNA.gff -align mavid/
Then examine the created files
8-) Convert MAVID alignments from FASTA to Stockholm format
You can use the fasta2stockholm.pl script for this (in the dart/perl directory). Alternatively, there are a couple of Makefile rules that are set up to auto-convert all the MAVID files from Fasta Format to Stockholm Format. The following command will convert all the CAF1 MAVID alignments:
NB I've already done this so you can skip this step (but it's good to understand it, as file format conversion is the daily grind of bioinformatics, and makefiles are a great way to automate this and other parts of the analysis).
NB for makefiles to be seamlessly useful, you need synchronized UIDs (so the NFS permissions work) and synchronized clocks on all machines (so you don't get "modification time in the future" errors and other mysterious behaviour from make). To work around such errors, you can use options like make -n which prints the commands that would be executed, but doesn't actually execute them (type man make for more useful options).
8-) Extract subalignments corresponding to the features, using the alignmonkey.pl script
This just involves passing alignmonkey the split-up GFF file and the converted-to-Stockholm alignment that you created in previous steps:
mercator-perl/alignmonkey.pl -gff mavid/2147/mavid.gff -align mavid/2147/mavid.stock
NB alignmonkey doesn't check the "seq" field of the GFF file, so it's essential that you split up the remapped GFF file by alignment ID, prior to this step. Note also that alignmonkey doesn't "speak" FASTA format so you also have to do the FASTA-to-Stockholm conversion.
8-) Extract subalignments corresponding to features plus some flanking context
This is slightly more involved, because you don't just want to extract a slightly bigger subalignment; you probably also want to annotate the alignment showing which columns are part of the feature, and which are flanking columns.
To annotate the alignment, you use the paintstock.pl script. To "stretch" the GFF features and add flanking context, use addflank.pl. To extract the subalignment for the feature+context, use alignmonkey.pl as before.
mercator-perl/paintstock.pl -gff mavid/2147/mavid.gff -align mavid/2147/mavid.stock >mavid/2147/painted.stock cat mavid/2147/mavid.gff | mercator-perl/addflank.pl -flank 100 | mercator-perl/alignmonkey.pl -align mavid/2147/painted.stock
The second line dumps the annotated alignment, plus context, to standard output. Note how the location of the ncRNA feature is marked up in the #=GC line. This dovetails nicely with the syntax for supervised learning with xgram.
8-) Extract and annotate multiple kinds of features in a single alignment
This can be done using tools like gffintersect.pl (in Gff Tools) combined with visual inspection, manual selection and hand-editing of co-ordinates for subalignments to extract.
For now, I will leave this section unfinished...... feel free to flesh it out!
Ian Holmes, 3/26/2006