Mercator Perl

From Biowiki
Jump to: navigation, search

Mercator Perl

Step-by-step instructions for extracting "training alignments" from MERCATOR/MAVID output using Flybase annotations on the cluster at /mnt/nfs...

See Mercator Tutorial for a tutorial on how to use Mercator and MAVID.

See Mercator Format for notes on the Mercator format.

8-) Install DART

See Downloading Dart and Building Dart. Note especially the section in Building Dart concerning the dart perl modules: you need to add something like the following to your .cshrc

setenv DARTDIR /mnt/nfs/src/dart

setenv PERL5LIB
setenv PERL5LIB $DARTDIR/perl:$PERL5LIB

8-) Install gfftools

See Gff Tools for details (scroll down to "CVS access").

These scripts are in (surprise) /mnt/nfs/src/gfftools.

8-) Check out the mercator-perl CVS directory

Like this:

setenv CVSROOT :ext:cvs.biowiki.org:/usr/local/cvs
cvs co mercator-perl

8-) Check out Ian's ragtag collection of perl scripts

You probably don't actually need any of these scripts for this tutorial, but some of them are occasionally useful.

setenv CVSROOT :ext:yam@cvs.biowiki.org:/usr/local/cvs
cvs co perl

8-) Ensure dart, gfftools, perl and mercator-perl script directories are in your path

e.g. add the following to .cshrc

setenv PATH $PATH:$DARTDIR/perl:/mnt/nfs/src/gfftools:/mnt/nfs/src/perl:/mnt/nfs/src/mercator-perl

8-) Locate the MERCATOR "genomes" and "map" files, and the top-level alignment directory

/mnt/nfs/data/genome/fly/align/12fly/mercator/genomes
/mnt/nfs/data/genome/fly/align/12fly/mercator/map
/mnt/nfs/data/genome/fly/align/12fly/mavid/

See Fruitfly Bioinformatics Resources for source links.

8-) Locate the Flybase GFF annotation files

ls /mnt/nfs/data/genome/fly/annot/fb4.3/Dmel/AllChr/

Again, see Fruitfly Bioinformatics Resources for source links. The GFF files were sorted by type using the script gffbytype.pl in the Gff Tools directory.

8-) Hack the MERCATOR map file

The MERCATOR map file has slightly different names for the D.mel. chromosomes than does the Flybase annotation (MERCATOR prepends chr to every chromosome name, e.g. MERCATOR's chr2L is Flybase's 2L). To correct for this, pipe the MERCATOR map file through the throwaway hackmap.pl script in the mercator-perl dir, e.g.

cd /mnt/nfs/data/genome/fly/align/12fly
mercator-perl/hackmap.pl mercator/map > mercator/hacked-map

You only need to do this once. If you look at the script you'll see it could easily be a one-liner. Note that the script currently assumes D.mel is the first entry in the map file.

8-) Use featurevole.pl to map the Flybase annotation GFF from genomic to alignment coords

e.g. for the noncoding RNAs:

mercator-perl/featurevole.pl [[Dro Mel]]_CAF1 
 -gff /mnt/nfs/data/genome/fly/annot/fb4.3/Dmel/AllChr/ncRNA.gff 
 -out remapped-ncRNA.gff -genomes mercator/genomes 
 -map mercator/hacked-map -align mercator/alignments/ >& remapped-ncRNA.log

A few things to note:

  • The mandatory argument DroMel_CAF1 is the name of the species in the MERCATOR genomes file
  • A GFF file, remapped-ncRNA.gff, is created in the current working directory
  • The entries of the GFF file are in alignment co-ordinates; that is,
    • the sequence name (field 0 of the GFF) is now the MERCATOR alignment ID;
    • the start and endpoints of the feature (fields 3 and 4 of the GFF) now refer to alignment column numbers;
    • the "strandedness" of the feature (field 6 of the GFF) may be flipped if the MAVID alignment is reverse-stranded with respect to the annotated species (though this is probably unlikely
    • the original sequence name, startpoint, endpoint and strand (in genomic coords) are preserved at the end of the final "group" field in the GFF line
  • The Flybase GFF entries contain quite a lot of additional annotation info in the "group" field, e.g. the Flybase gene number (FBgn); this is preserved by the mapping
  • Several features are truncated because they only partially overlap MAVID alignments
  • Some mapfile lines do not involve the reference species and are skipped with an error message
  • As with most of these scripts, featurevole.pl prints a help message if you call it with the -h option
  • It may be helpful to sort the GFF file by feature length. You can use the gfftools/gffsort.pl script to do this.

TWikiDocGraphics.warning.gif (Holmes Lab-specific footnote): I've started adding rules to /mnt/nfs/data/genome/fly/Makefile to automate some of these steps. Try e.g. make -n annot/featurevole/.signal_peptide.gff2dirs and see what happens.

8-) Split the remapped GFF file by alignment ID

The gff2dirs.pl script simply splits up the GFF files, putting a file named mavid.gff (or another filename that you specify) in each alignment subdirectory. This newly-created GFF file contains the annotations for that alignment.

mercator-perl/gff2dirs.pl -gff remapped-ncRNA.gff -align mavid/

Then examine the created files

more mavid/2147/mavid.gff

8-) Convert MAVID alignments from FASTA to Stockholm format

You can use the fasta2stockholm.pl script for this (in the dart/perl directory). Alternatively, there are a couple of Makefile rules that are set up to auto-convert all the MAVID files from Fasta Format to Stockholm Format. The following command will convert all the CAF1 MAVID alignments:

make all-stock

NB I've already done this so you can skip this step (but it's good to understand it, as file format conversion is the daily grind of bioinformatics, and makefiles are a great way to automate this and other parts of the analysis).

NB for makefiles to be seamlessly useful, you need synchronized UIDs (so the NFS permissions work) and synchronized clocks on all machines (so you don't get "modification time in the future" errors and other mysterious behaviour from make). To work around such errors, you can use options like make -n which prints the commands that would be executed, but doesn't actually execute them (type man make for more useful options).

8-) Extract subalignments corresponding to the features, using the alignmonkey.pl script

This just involves passing alignmonkey the split-up GFF file and the converted-to-Stockholm alignment that you created in previous steps:

mercator-perl/alignmonkey.pl -gff mavid/2147/mavid.gff -align mavid/2147/mavid.stock

NB alignmonkey doesn't check the "seq" field of the GFF file, so it's essential that you split up the remapped GFF file by alignment ID, prior to this step. Note also that alignmonkey doesn't "speak" FASTA format so you also have to do the FASTA-to-Stockholm conversion.

8-) Extract subalignments corresponding to features plus some flanking context

This is slightly more involved, because you don't just want to extract a slightly bigger subalignment; you probably also want to annotate the alignment showing which columns are part of the feature, and which are flanking columns.

To annotate the alignment, you use the paintstock.pl script. To "stretch" the GFF features and add flanking context, use addflank.pl. To extract the subalignment for the feature+context, use alignmonkey.pl as before.

mercator-perl/paintstock.pl -gff mavid/2147/mavid.gff -align mavid/2147/mavid.stock >mavid/2147/painted.stock
cat mavid/2147/mavid.gff | mercator-perl/addflank.pl -flank 100 | mercator-perl/alignmonkey.pl -align mavid/2147/painted.stock

The second line dumps the annotated alignment, plus context, to standard output. Note how the location of the ncRNA feature is marked up in the #=GC line. This dovetails nicely with the syntax for supervised learning with xgram.

8-) Extract and annotate multiple kinds of features in a single alignment

This can be done using tools like gffintersect.pl (in Gff Tools) combined with visual inspection, manual selection and hand-editing of co-ordinates for subalignments to extract.

For now, I will leave this section unfinished...... feel free to flesh it out!

Ian Holmes, 3/26/2006