Step-by-step instructions for extracting "training alignments" from
output using Flybase annotations
on the cluster at /mnt/nfs...
for a tutorial on how to use Mercator
for notes on the Mercator format.
Note especially the section in BuildingDart
concerning the dart perl modules:
you need to add something like the following to your
setenv DARTDIR /mnt/nfs/src/dart
setenv PERL5LIB $DARTDIR/perl:$PERL5LIB
for details (scroll down to "CVS access").
These scripts are in (surprise)
Check out the
mercator-perl CVS directory
setenv CVSROOT :ext:cvs.biowiki.org:/usr/local/cvs
cvs co mercator-perl
Check out Ian's ragtag collection of perl scripts
You probably don't actually need any of these scripts for this tutorial,
but some of them are occasionally useful.
setenv CVSROOT :ext:email@example.com:/usr/local/cvs
cvs co perl
Ensure dart, gfftools, perl and mercator-perl script directories are in your path
e.g. add the following to
setenv PATH $PATH:$DARTDIR/perl:/mnt/nfs/src/gfftools:/mnt/nfs/src/perl:/mnt/nfs/src/mercator-perl
Locate the MERCATOR "genomes" and "map" files, and the top-level alignment directory
for source links.
Locate the Flybase GFF annotation files
Again, see FruitflyBioinformaticsResources
for source links.
The GFF files were sorted by type using the script
in the GffTools
Hack the MERCATOR map file
The MERCATOR map file has slightly different names for the D.mel.
than does the Flybase annotation (MERCATOR prepends
to every chromosome
name, e.g. MERCATOR's
To correct for this, pipe the MERCATOR
file through the
script in the
mercator-perl/hackmap.pl mercator/map > mercator/hacked-map
You only need to do this once.
If you look at the script you'll see it could easily be a one-liner. Note that the script currently assumes D.mel is the first entry in the map file.
featurevole.pl to map the Flybase annotation GFF from genomic to alignment coords
e.g. for the noncoding RNAs:
-out remapped-ncRNA.gff -genomes mercator/genomes
-map mercator/hacked-map -align mercator/alignments/ >& remapped-ncRNA.log
A few things to note:
- The mandatory argument
DroMel_CAF1 is the name of the species in the MERCATOR
- A GFF file,
remapped-ncRNA.gff, is created in the current working directory
- The entries of the GFF file are in alignment co-ordinates; that is,
- the sequence name (field 0 of the GFF) is now the MERCATOR alignment ID;
- the start and endpoints of the feature (fields 3 and 4 of the GFF) now refer to alignment column numbers;
- the "strandedness" of the feature (field 6 of the GFF) may be flipped if the MAVID alignment is reverse-stranded with respect to the annotated species (though this is probably unlikely
- the original sequence name, startpoint, endpoint and strand (in genomic coords) are preserved at the end of the final "group" field in the GFF line
- The Flybase GFF entries contain quite a lot of additional annotation info in the "group" field, e.g. the Flybase gene number (FBgn); this is preserved by the mapping
- Several features are truncated because they only partially overlap MAVID alignments
- Some mapfile lines do not involve the reference species and are skipped with an error message
- As with most of these scripts,
featurevole.pl prints a help message if you call it with the
- It may be helpful to sort the GFF file by feature length. You can use the gfftools/gffsort.pl script to do this.
I've started adding rules to
to automate some of these steps. Try e.g.
make -n annot/featurevole/.signal_peptide.gff2dirs
and see what happens.
Split the remapped GFF file by alignment ID
script simply splits up the GFF files, putting a file named
(or another filename that you specify) in each alignment subdirectory.
This newly-created GFF file contains the annotations for that alignment.
mercator-perl/gff2dirs.pl -gff remapped-ncRNA.gff -align mavid/
Then examine the created files
Convert MAVID alignments from FASTA to Stockholm format
You can use the
script for this (in the dart/perl directory).
Alternatively, there are a couple of Makefile rules
that are set up to auto-convert all the MAVID files from FastaFormat
The following command will convert all the CAF1 MAVID alignments:
NB I've already done this so you can skip this step
(but it's good to understand it, as file format conversion is the daily grind of bioinformatics,
and makefiles are a great way to automate this and other parts of the analysis).
NB for makefiles to be seamlessly useful, you need synchronized UIDs (so the NFS permissions work)
synchronized clocks on all machines (so you don't get "modification time in the future"
errors and other mysterious behaviour from
To work around such errors, you can use options like
which prints the commands that
would be executed, but doesn't actually execute them (type
for more useful options).
Extract subalignments corresponding to the features, using the
This just involves passing
the split-up GFF file and the converted-to-Stockholm alignment
that you created in previous steps:
mercator-perl/alignmonkey.pl -gff mavid/2147/mavid.gff -align mavid/2147/mavid.stock
doesn't check the "seq" field of the GFF file, so it's essential
that you split up the remapped GFF file by alignment ID, prior to this step.
Note also that
doesn't "speak" FASTA format so you also have to do the
Extract subalignments corresponding to features plus some flanking context
This is slightly more involved, because you don't just want to extract a slightly bigger subalignment;
you probably also want to annotate
the alignment showing which columns are part of the feature,
and which are flanking columns.
To annotate the alignment, you use the
To "stretch" the GFF features and add flanking context, use
To extract the subalignment for the feature+context, use
mercator-perl/paintstock.pl -gff mavid/2147/mavid.gff -align mavid/2147/mavid.stock >mavid/2147/painted.stock
cat mavid/2147/mavid.gff | mercator-perl/addflank.pl -flank 100 | mercator-perl/alignmonkey.pl -align mavid/2147/painted.stock
The second line dumps the annotated alignment, plus context, to standard output.
Note how the location of the ncRNA feature is marked up in the
This dovetails nicely with the syntax for supervised learning with xgram.
Extract and annotate multiple kinds of features in a single alignment
This can be done using tools like
combined with visual inspection, manual selection and hand-editing of co-ordinates for subalignments to extract.
For now, I will leave this section unfinished......
feel free to flesh it out!