FlyBase
---
FlyBase notes
A good resource for Drosophila genomic data. Located here: [[1]]
CAUTION: Some of the data on FlyBase is not documented well. When in doubt, contact FlyBase with questions. They have been quite helpful to me in the past, and some notes on this page come directly from their e-mails to me. [AVU]
Nomenclature, notation, conventions
This was written based on FlyBase release 5.3 (September 2007). It seems to apply to earlier releases 5.1 and 4.3 also, but there may be exceptions.
Release numbering
dmel releases are numbered X.Y, where X is the sequence version and Y is the annotation version. So, for example, dmel annotations 5.1 and 5.3 are in the same coordinate system (dmel release 5 coords), whereas 4.3 and 5.1 are not.
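As a minimal sketch of that rule, the check below treats two release numbers as sharing a coordinate system when the part before the dot matches. The function name `same_coords` is made up for illustration.

```shell
# Hypothetical helper: succeed (exit 0) if two dmel release numbers of the
# form X.Y have the same sequence version X, i.e. the same coordinates.
same_coords() {
    [ "${1%%.*}" = "${2%%.*}" ]
}

# Example: compare annotation releases 5.1 and 5.3, then 4.3 and 5.1.
if same_coords 5.1 5.3; then echo "5.1 and 5.3: same coordinates"; fi
if ! same_coords 4.3 5.1; then echo "4.3 and 5.1: different coordinates"; fi
```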
GFF syntax and organization
Genes
GFF type gene (column 3) is the top-level annotation. It has no parents (no Parent attribute).
These annotations appear in the "Gene Span" track of FlyBase GBrowse. Their FlyBase IDs start with FBgn.
To figure out what kind of gene it is, you have to look in the attributes (column 9), since the type column obviously won't tell you, or look up the gene on FlyBase by its ID.
The ontology terms describing the gene are stored in the Ontology_term attribute, but not every gene has an ontology term. Ontology_term is only used with annotations of GFF type gene.
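A rough sketch of pulling this out of a GFF dump with awk follows. It assumes tab-separated GFF3 with attributes in column 9 written as semicolon-separated key=value pairs; the function name `gene_ontology` is hypothetical, and genes without an Ontology_term are printed with "-".

```shell
# List the ID of every "gene" feature with its Ontology_term value, if any.
# Assumes GFF3: type in column 3, key=value attributes in column 9.
gene_ontology() {
    awk -F'\t' '$3 == "gene" {
        id = "-"; onto = "-"
        n = split($9, attr, ";")
        for (i = 1; i <= n; i++) {
            if (attr[i] ~ /^ID=/)            id   = substr(attr[i], 4)
            if (attr[i] ~ /^Ontology_term=/) onto = substr(attr[i], 15)
        }
        print id "\t" onto
    }' "$1"
}

# Usage, on one of the dumps mentioned on this page:
# gene_ontology dmel-all-r5.3.gff
```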
Transcripts
A gene can have multiple transcripts and often does. Transcripts are children of gene - each transcript annotation in the GFF links back to its parent gene via the Parent attribute.
Transcript IDs start with FBtr. Unlike genes, the GFF type column of transcripts tells you what kind of transcript it is.
The following FlyBase types are used to describe transcripts: miRNA mRNA ncRNA pseudogene rRNA snoRNA snRNA tRNA
Note that there are only 7 types of transcripts relevant to the Twelve Fly Screen: miRNA ncRNA pseudogene rRNA snoRNA snRNA tRNA (every FBtr type except mRNA). Things like tmRNA, SRP RNA, etc. are aggregated under ncRNA.
The FBtr annotation is not solely for experimentally-verified transcription: http://flybase.org/static_pages/newhelp/transcript_help.html
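To see which transcript types a given dump actually contains, something like the following should work. It is only a sketch: it assumes transcripts can be recognized by an ID attribute starting with FBtr in column 9, and `transcript_types` is a made-up name.

```shell
# Count FBtr features by GFF type (column 3).
# Assumes the transcript ID appears as ID=FBtr... in the attributes column.
transcript_types() {
    awk -F'\t' '$9 ~ /(^|;)ID=FBtr/ { count[$3]++ }
        END { for (t in count) print t "\t" count[t] }' "$1" | LC_ALL=C sort
}

# Usage:
# transcript_types dmel-all-r5.3.gff
```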
Where to get more info
So, you see a term like "splign:na_dbEST_ncbi" used in a GFF dump. What does it mean? FlyBase doesn't define these sometimes, so you have to contact them and ask.
Sometimes going to the database where they pulled their data from is useful. For example, BDGP (Berkeley Drosophila Genome Project) nomenclature seems to be used sometimes (some terms seem to match those used here).
Release-specific notes
dmel release 5 (assembly)
What is Chromosome U (chrU)?
From the release notes:
The file na_armUextra.dmel.RELEASE5 contains 34,630 small scaffolds produced by the Celera shotgun assembler which could not be consistently joined with larger scaffolds.
So it appears that chrU is just a bunch of hodge-podge sequence.
It's probably best to stay away from analysis of chrU sequence. Again, choice excerpts from the above document (highlighting mine):
Nor can we exclude the possibility of contaminations from other organisms. We are making this data available as a resource for analysis of region which cannot be assembled well, such as satelites or simple repeats. Since some of this data is low quality, researchers are encouraged to contact either BDGP or DHGP for further details on this resource.
This indicates that it might be wise to stay away from chrU.
dmel release 5.1 (annotations)
How do we get a list of features in "cDNA and other aligned sequences" and "EST" GBrowse tracks?
How do we filter the FlyBase GFF dump down to only annotations that appear in the "cDNA and other aligned sequences" and "EST" GBrowse tracks?
Some notes about the tracks here (look under GENOME REAGENTS AND DATA). Apparently sim4 and splign are used to align sequences to the genome.
But what do we grep for in the GFF dump? This is what Andy Schroeder from FlyBase recommended:
For ESTs:
- splign:na_dbEST_ncbi from ncbi alignments to euchromatin [AVU note: this source doesn't occur in dmel r5.1 annotations]
- sim4:na_dbEST.diff.dmel for older BDGP alignments to euchromatin (and heterochromatin?) from other than the sequenced strain
- sim4:na_dbEST.same.dmel for older BDGP alignments to euchromatin (and heterochromatin?) from the sequenced strain
For cDNAs and/or other submitted mRNA type sequences:
- splign:na_cDNA_ncbi for ncbi cDNA FLI_CDNA alignments to euchromatin
- sim4tandem:na_gb.dmel for older BDGP alignments to euchromatin
- sim4:na_cDNA.dros for cDNAs from ncbi aligned to heterochromatin
I noticed something he missed, which must also be grepped for:
By the way, I also notice that in addition to sources you mentioned, features from source sim4_na_gb.tpa.dmel appear in the cDNA track also, e.g. take a look at feature BK002591.1 here:
http://flybase.org/cgi-bin/gbrowse/dmel/?name=3R:12713418..12717483;h_feat=_clear_;h_region=_clear_
Judging from the GenBank entry, I am guessing "tpa" stands for "third-party annotation."
The colons get replaced with underscores, so you are selecting using the following conditional:
...WHERE TYPE = 'match' AND (
    SOURCE = 'sim4_na_dbEST.diff.dmel' OR
    SOURCE = 'sim4_na_dbEST.same.dmel' OR
    SOURCE = 'splign_na_cDNA_ncbi' OR
    SOURCE = 'sim4tandem_na_gb.dmel' OR
    SOURCE = 'sim4_na_cDNA.dros' OR
    SOURCE = 'sim4_na_gb.tpa.dmel'
);
We select type "match" rather than "match_part" so that each feature spans the whole alignment, introns included.
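The same selection can be sketched directly against a GFF dump with awk. This is an illustration only, with a made-up function name `cdna_est_matches`. Since it is an assumption which spelling the dump uses, colons in column 2 are normalized to underscores before comparing, so either form is accepted.

```shell
# Keep only "match" features whose source is one of the cDNA/EST sources
# in the conditional above. Colons in the source column are turned into
# underscores first, so "sim4:na_gb.dmel" and "sim4_na_gb.dmel" both match.
cdna_est_matches() {
    awk -F'\t' '
        BEGIN {
            wanted["sim4_na_dbEST.diff.dmel"] = 1
            wanted["sim4_na_dbEST.same.dmel"] = 1
            wanted["splign_na_cDNA_ncbi"]     = 1
            wanted["sim4tandem_na_gb.dmel"]   = 1
            wanted["sim4_na_cDNA.dros"]       = 1
            wanted["sim4_na_gb.tpa.dmel"]     = 1
        }
        $3 == "match" {
            src = $2
            gsub(/:/, "_", src)
            if (src in wanted) print
        }' "$1"
}

# Usage:
# cdna_est_matches dmel-all-r5.1.gff > cdna-est-tracks.r5.1.gff
```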
Sources for computationally predicted annotations
When downloading GFF dumps of predicted annotations, use the following to decipher the "source" column (col 2).
<noautolink>
program | sourcename |
BATZ_Contrast | caf1 |
BATZ_Contrast_NA | caf1 |
BREN_N-Scan | caf1 |
CONGO | Dmel r4.3 |
DGIL_snap | caf1 |
DGIL_snap_homology | caf1 |
EISE_exonerate | caf1 |
EISE_genemapper | caf1 |
EISE_genewise | caf1 |
GLEANR | caf1 |
McPromoter3.0 | dummy |
NCBI_gnomon | caf1 |
OXFD_exonerate | caf1 |
PACH_genemapper | caf1 |
RGUI_geneid_v1.2 | caf1 |
RGUI_geneid_v1.2_u12 | caf1 |
ROBI_manual | caf1 |
Tandem_Repeat_Finder_75-20 | dummy |
assembly | path |
aubrey_cytolocator | cytology |
augustus | dummy |
bdgp_unknown_clonelocator | scaffoldBACs |
blastn | na_dbEST.dpse |
blastp | Dmel_proteomic |
blastx_masked | aa_SPTR.dmel |
blastx_masked | aa_SPTR.insect |
blastx_masked | aa_SPTR.othinv |
blastx_masked | aa_SPTR.othvert |
blastx_masked | aa_SPTR.plant |
blastx_masked | aa_SPTR.primate |
blastx_masked | aa_SPTR.rodent |
blastx_masked | aa_SPTR.worm |
blastx_masked | aa_SPTR.yeast |
blastx_masked | dmel-all-translation-r4.3.fasta |
blastx_masked | eisen_v2_orthologs_aa.fasta |
blastx_masked | inparanoid_dmel_orthologs_aa.fas |
blastz | rui chen |
dmel_r3_to_dmel_r4_migration | dmel_r3_affy_oligos |
genewise | Brian Bettencourt |
genie_masked | dummy |
genscan | Brian Bettencourt |
genscan_masked | dummy |
promoter | dummy |
prosplign | aa_ncbi_dmel |
prosplign | aa_ncbi_other |
repeat_runner_seg | dummy |
repeatmasker | dummy |
sim4 | all_r32_subject_ortho_exons_nuc. |
sim4 | dmel-all-ncRNA-r4.3.fasta |
sim4 | dmel-all-pseudogene-r4.3.fasta |
sim4 | dmel-all-transcript-r4.0.fasta |
sim4 | dmel-all-transcript-r4.2.fasta |
sim4 | dmel-all-transcript-r4.3.fasta |
sim4 | dmel-all-transposon-r4.3.fasta |
sim4 | eisen_v2_orthologs_nt.fasta |
sim4 | na_ARGs.dros |
sim4 | na_ARGsCDS.dros |
sim4 | na_DGC.in_process.dros |
sim4 | na_HDP_RNAi.dmel |
sim4 | na_HDP_mRNA.dmel |
sim4 | na_cDNA.dros |
sim4 | na_dbEST.diff.dmel |
sim4 | na_dbEST.same.dmel |
sim4 | na_gadfly.dros.RELEASE2 |
sim4 | na_gb.dmel |
sim4 | na_gb.tpa.dmel |
sim4 | na_het_transcript.dmel.RELEASE32 |
sim4 | na_re2.dros |
sim4 | na_smallRNA.dros |
sim4 | na_transcript.dmel.RELEASE31 |
sim4 | na_transcript.dmel.RELEASE32 |
sim4 | preR5_gadfly4U_transcripts.fasta |
sim4 | stencil_annotCDS.fa |
sim4tandem | na_gb.dmel |
splign | na_cDNA_ncbi |
splign | na_dbEST_ncbi |
tRNAscan-SE | dummy |
tblastn | Dmel r3.1 |
tblastx_masked | na_dbEST.insect |
tblastxwrap_masked | na_baylorf1_scfchunk.dpse |
tblastxwrap_masked | na_scf_chunk_agambiae.fa |
twinscan | Brian Bettencourt |
</noautolink>
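To check which of these sources actually occur in a particular dump, a quick tally of column 2 can help. This is a sketch with a made-up function name `source_counts`. It assumes the dump joins program and sourcename as program:sourcename, as in the "splign:na_dbEST_ncbi" term quoted earlier, and it skips comment lines and anything with fewer than 9 tab-separated fields.

```shell
# Tally features per distinct source value (column 2), most frequent first.
source_counts() {
    awk -F'\t' '!/^#/ && NF >= 9 { count[$2]++ }
        END { for (s in count) print count[s] "\t" s }' "$1" |
        LC_ALL=C sort -k1,1nr -k2,2
}

# Usage:
# source_counts dmel-all-r5.1.gff
```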
dmel release 5.3 (annotations)
Apparently contains many more snoRNAs than prior releases.
Notes from release 5.1 probably still apply, but there are some clear changes/improvements.
ribosomal RNA
The Twelve Fly Screen revealed a couple of issues with FlyBase's ribosomal RNA (rRNA) annotations that are worth knowing when estimating sensitivity in an ncRNA screen.
With the exception of Dmel, rRNA sequence is unassembled in the CAF1 datasets. This is why the Dmel rRNA sequence is unaligned. See this paper for an analysis of rRNA sequence from the sequencing reads.
The first issue: the only rRNA annotations in FlyBase (release 5.3) are for 5S rRNA and 2 mitochondrial rRNAs (lrRNA-RA and srRNA-RA).
5.8S, 18S, and 28S are not annotated. They are large (>150nt), but they should still give a good signal: their structure is strongly conserved, and at least some of their substructures should be smaller than 120-130nt and therefore discoverable even with our size-constrained phylo-grammars. It is conceivable that we are picking them up, but we simply don't know.
The second issue is that we know for a fact that Twelve Fly Screen's sensitivity for 5S rRNA is very low. To track down why, let's examine where these rRNAs are located and how well they align to the other 11 flies.
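The genomic distribution of rRNA annotations in release 5.3 is as follows (the 2 on dmel_mitochondrion_genome are presumably the mitochondrial rRNAs mentioned above; the rest are 5S):
chromosome | number of rRNA annotations |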
2R | 96 |
dmel_mitochondrion_genome | 2 |
U | 64 |
XHet | 2 |
These annotations were extracted from the full r5.3 GFF dump as follows:
ssh lorien
cd /home/projects/caf1screen/source/flybase/
awk '$3=="rRNA" {print}' dmel-all-r5.3.gff > dmel-rRNA-r5.3.gff
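The per-chromosome counts in the table can be reproduced with a short awk tally along these lines. It is a sketch, with `rrna_per_chromosome` as a made-up name; it counts features of type rRNA by the sequence name in column 1.

```shell
# Count rRNA features per chromosome (column 1 of the GFF).
rrna_per_chromosome() {
    awk -F'\t' '$3 == "rRNA" { count[$1]++ }
        END { for (c in count) print c "\t" count[c] }' "$1" | LC_ALL=C sort
}

# Usage:
# rrna_per_chromosome dmel-rRNA-r5.3.gff
```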
Interestingly, all 96 are clustered very close to each other in the range [15617067,15653783] (36,717 nt), all on the same strand. See them in the genome browser: http://flybase.org/cgi-bin/gbrowse/dmel/?name=2R:15617067..15653783
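The span and strand of such a cluster can be checked with a sketch like the one below. The name `rrna_span` is hypothetical. For one chromosome it prints the number of rRNA features, the smallest start, the largest end, the span length (end - start + 1, which is how 15653783 - 15617067 + 1 = 36,717 comes out), and the strand symbols seen.

```shell
# Summarize rRNA features on one chromosome:
# prints "count min_start max_end span_length strands".
rrna_span() {  # args: gff_file chromosome
    awk -F'\t' -v chr="$2" '$3 == "rRNA" && $1 == chr {
            if (n == 0 || $4 + 0 < min) min = $4 + 0
            if ($5 + 0 > max) max = $5 + 0
            strands[$7] = 1
            n++
        }
        END {
            if (n == 0) exit
            s = ""
            for (k in strands) s = s k
            print n, min, max, max - min + 1, s
        }' "$1"
}

# Usage:
# rrna_span dmel-rRNA-r5.3.gff 2R
```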
Clearly, there are 96 5S rRNAs on chr2R that we can pick up. But are they in the Pecan/Mercator alignment? Mitch Skinner has extracted this set from the alignment (after doing the release 5 -> release 4 coordinate transformation), which is here: http://biowiki.org/~mitch/rRNA-subalignments.stock
These alignments are very poor for our purposes:
- dmel sequence aligns to all gaps in other genomes
- at least one case of barely any dmel sequence in an alignment
which explains the poor sensitivity.
But why does sequence from the other genomes not align to these dmel 5S rRNAs? Rob noticed they are mostly (always?) repeat-masked. We don't know how that affects Pecan's alignment algorithm, but it's an idea to investigate.
Let's BLAST some dmel 5S rRNA sequence against the other flies, starting from the top of Mitch's file, to see if they are even in the assembly:
- FBtr0086444
  - against dpse (release 2.0, seems more complete than any of the other non-_dmel_ flies)
    - Top hits are against unassembled regions:
    - but a promising one in chromosome 2:
- FBtr0086442
  - against dpse (release 2.0)
    - Once again, top hits are against unassembled regions, but here's another one on chr 2:
    - But wait? That's the same hit from earlier... well of course, since the 5S rRNA are all highly homologous. So we don't even need to BLAST different dmel sequences, one should do.
Having learned that, let's try BLASTing FBtr0086442 against some other species:
- dsim: all matches are to chrU except for one, which is on 2R (see in genome browser)
- dyak: just under half of the matches are to chr2L, chr2R, and chrX, with the rest being to variants of chrU; interesting that it fragmented across multiple chromosomes like that
The "unrealistically long ncRNA annotation" problem (release 5.1 and 5.3)
There are some features of type ncRNA in FlyBase that simply cannot be real ncRNAs because they are too long. UPDATE: It is better to trust FlyBase on this and not throw them out. They seem odd, but valid. I'm keeping these notes here for the time being. [AVU]
The longest ncRNA in Rfam 8.0 is 800nt.
TODO: put in a makefile:
# get Rfam sequence lengths, sorting them
wget ftp://ftp.sanger.ac.uk/pub/databases/Rfam/CURRENT/Rfam.fasta.gz
gunzip Rfam.fasta.gz
cat Rfam.fasta | perl -pe 'if (/^>/){$_="\n"}else{chomp}' | perl -pe 'chomp; $_=length($_)."\n"' | awk '$1>0 {print}' | sort -n > Rfam.lengths
Then again, maybe really long ncRNAs are plausible:
- Seidl et al.: The imprinted Air ncRNA is an atypical RNAPII transcript that evades splicing and escapes nuclear export. EMBO J. 2006;25:3565-75.
- DeLisi et al.: Left ventricular enlargement associated with diagnostic outcome of schizophreniform disorder. Biol. Psychiatry 1992;32:199-201.
FlyBase GFF dumps:
- /nfs/projects/caf1screen/source/flybase/dmel-all-r5.1.gff
  - has 60 ncRNA features with length > 800 (above max Rfam length)
  - has 6 ncRNA features with length >= 10000
- /nfs/projects/caf1screen/source/flybase/dmel-all-r5.3.gff
  - has 66 ncRNA features with length > 800 (above max Rfam length)
  - has 11 ncRNA features with length >= 10000
CAUTION: These lengths are FlyBase annotation lengths, but the mature transcript may be shorter due to splicing!
TODO: put in a makefile:
# find nonsensically-long ncRNA annotations
awk '($3 == "ncRNA") && ($5-$4+1 > 800) {print "# length = " ($5-$4+1); print}' dmel-all-r5.3.gff > too-long-ncRNA.r5.3.gff
# which ones are really bad? (>= 10,000)
grep -A 1 "length = ....." too-long-ncRNA.r5.3.gff | grep ncRNA
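Because the annotation length can overstate the mature transcript length, a spliced length can be estimated by summing exon lengths. The sketch below does that for one transcript ID. It assumes exon features have type "exon" and list their transcript(s) in a comma-separated Parent attribute, which should be checked against the actual dump; `spliced_length` is a made-up name.

```shell
# Sum the lengths of all exons whose Parent attribute lists the given
# transcript ID. Prints 0 if no exon names that transcript.
spliced_length() {  # args: gff_file transcript_id
    awk -F'\t' -v tr="$2" '$3 == "exon" {
            n = split($9, attr, ";")
            for (i = 1; i <= n; i++) {
                if (attr[i] !~ /^Parent=/) continue
                m = split(substr(attr[i], 8), parents, ",")
                for (j = 1; j <= m; j++)
                    if (parents[j] == tr) total += $5 - $4 + 1
            }
        }
        END { print total + 0 }' "$1"
}

# Usage:
# spliced_length dmel-all-r5.3.gff FBtr0083341
```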
Examples (from r5.3):
- two transcripts of FBgn0041606
  - FBtr0080319
  - FBtr0080320
  - FlyBase page
  - GenBank source, although there are other sources on the FBgn0041606 FlyBase page
    - the type is listed as mRNA
    - excerpt: alternatively spliced; this transcript does not appear to have a protein coding region; may function at the transcript level
- 4 transcripts of FBgn0020556
  - FBtr0083341
    - longest FlyBase ncRNA annotation (length is 31,065nt in annotation, 1124nt after splicing (?))
    - FlyBase page
  - the 3 other transcripts are > 1200nt
It appears that in at least one case (FBgn0062978, go to "GENE MODEL & FEATURES" -> "COMMENTS ON GENE MODEL") FlyBase assigns the ncRNA type to transcripts that have a poor computed ORF.
The Final Word
We're keeping the long ncRNAs.
Here is the e-mail from FlyBase answering my question about how they got the ncRNA tag:
Dear Andrew,
> comments: I have noticed there are several annotations of type
> "ncRNA" that are much longer than anything I've seen in Rfam (e.g.
> FBtr0080319, FBtr0080320, FBtr0083341). What is the decision
> process/criteria for assigning "ncRNA" to an experimentally-
> detected transcript?
Generally, these are genes for which there is good evidence that they are transcribed, but the resulting transcripts have realtively small ORF's. More recently, we have been able to add the additional assessment of whether there are conserved "protein signatures" within these transcripts (conserved across the 12 sequenced Drosophila species); if there are not, it is additional evidence that the description as a ncRNA is correct.
The first 2 transcripts you cite are derived from the bft gene. We have classified bft as a ncRNA gene, based on descriptions of investigators (Hardiman et al., 2002, Genetics 161: 231; and sequence accession records submitted by the same authors).
The third transcript is derived from bxd. This has also been identified as a likely ncRNA in the literature (Martin et al., 1995, Proc. Natl. Acad. Sci. USA 92: 8398).
Clearly, our knowledge of large ncRNAs is still rudimentary. Rigorous criteria cannot be developed until we know at lot more about them. Certainly what constitutes a "relatively small ORF" is very subjective. In some cases, we have added a comment to the gene model that something annotated as protein-coding gene may actually produce non-coding RNA(s); we simply have no way of knowing at this point in time.
Sincerely,
Lynn Crosby
FlyBase
---
-- Created by: Andrew Uzilov on 11 May 2007