FlyBase


---

FlyBase notes

A good resource for Drosophila genomic data. Located here: http://flybase.org

CAUTION: Some of the data on FlyBase is not documented well. When in doubt, contact FlyBase with questions. They have been quite helpful to me in the past, and some notes on this page come directly from their e-mails to me. [AVU]

Nomenclature, notation, conventions

This was written based on FlyBase release 5.3 (September 2007). It seems to apply to earlier releases 5.1 and 4.3 also, but there may be exceptions.

Release numbering

dmel releases are numbered X.Y, where X is the sequence version and Y is the annotation version. So, for example, dmel annotations 5.1 and 5.3 are in the same coordinate system (dmel release 5 coords), whereas 4.3 and 5.1 are not.
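The rule above can be checked mechanically. This is only a minimal sketch: `same_coords` is a hypothetical helper name (not a FlyBase tool), and it assumes release numbers are written as plain X.Y strings.

```shell
# Hypothetical helper: succeeds if two dmel releases share a sequence version,
# i.e. the X in X.Y is the same, so their coordinates are comparable.
same_coords() {
    # ${1%%.*} strips everything from the first dot onward, leaving X
    [ "${1%%.*}" = "${2%%.*}" ]
}

same_coords 5.1 5.3 && echo "5.1 and 5.3: same coordinate system"
same_coords 4.3 5.1 || echo "4.3 and 5.1: different coordinate systems"
```

Annotations from releases that fail this check need a coordinate transformation before they can be compared.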

GFF syntax and organization


Genes

GFF type gene (column 3) is the top-level annotation. It has no parents (no Parent attribute).

These annotations appear in the "Gene Span" track of FlyBase GBrowse. Their FlyBase IDs start with FBgn.

To figure out what kind of gene it is, you have to look in the attributes (column 9), since the type column obviously won't tell you, or look up the gene on FlyBase by its ID.

The ontology terms describing the gene are stored in the Ontology_term attribute, but not every gene has an ontology term. Ontology_term is only used with annotations of GFF type gene.

Transcripts

A gene can have multiple transcripts and often does. Transcripts are children of gene - each transcript annotation in the GFF links back to its parent gene via the Parent attribute.

Transcript IDs start with FBtr. Unlike genes, the GFF type column of transcripts tells you what kind of transcript it is.

The following FlyBase types are used to describe transcripts: miRNA mRNA ncRNA pseudogene rRNA snoRNA snRNA tRNA

Note that there are only 7 types of transcripts relevant to the Twelve Fly Screen: miRNA ncRNA pseudogene rRNA snoRNA snRNA tRNA (every FBtr except for mRNA). Things like tmRNA, SRP RNA, etc. are aggregated under ncRNA.
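To see how many transcripts of each non-mRNA type a dump contains, something like the following should work. It is a sketch, not a tested pipeline step: `count_noncoding_transcripts` is a made-up name, and it assumes a tab-separated GFF3 dump in which transcript features carry an `ID=FBtr...` attribute in column 9.

```shell
# Hypothetical helper: tally FBtr features by GFF type (column 3), skipping mRNA.
# Assumes tab-separated GFF3 with ID=FBtr... somewhere in column 9.
count_noncoding_transcripts() {
    awk -F'\t' '
        $0 !~ /^#/ && $9 ~ /(^|;)ID=FBtr/ && $3 != "mRNA" { n[$3]++ }
        END { for (t in n) print t "\t" n[t] }
    ' "$@" | LC_ALL=C sort
}
```

Usage would be along the lines of `count_noncoding_transcripts dmel-all-r5.3.gff`; with no file argument it reads standard input.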

The FBtr annotation is not solely for experimentally-verified transcription: http://flybase.org/static_pages/newhelp/transcript_help.html

Where to get more info

So, you see a term like "splign:na_dbEST_ncbi" used in a GFF dump. What does it mean? FlyBase doesn't define these sometimes, so you have to contact them and ask.

Sometimes going to the database where they pulled their data from is useful. For example, BDGP (Berkeley Drosophila Genome Project) nomenclature seems to be used sometimes (some terms seem to match those used here).
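Before asking, it can help to know exactly which source terms occur in a given dump and how often. A rough sketch, assuming a tab-separated GFF dump with the source in column 2 (`count_sources` is a hypothetical name):

```shell
# Hypothetical helper: list each distinct source (column 2) with its feature count.
# Comment lines and lines with fewer than 9 columns (e.g. trailing FASTA) are skipped.
count_sources() {
    awk -F'\t' '
        $0 !~ /^#/ && NF >= 9 { n[$2]++ }
        END { for (s in n) print s "\t" n[s] }
    ' "$@" | LC_ALL=C sort
}
```

Running `count_sources` on a dump gives a list of terms to look up or to include in a question to FlyBase.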

Release-specific notes

dmel release 5 (assembly)

What is Chromosome U (chrU)?

From the release notes:

The file na_armUextra.dmel.RELEASE5 contains 34,630 small scaffolds produced by the Celera shotgun assembler which could not be consistently joined with larger scaffolds.

So it appears that chrU is just a bunch of hodge-podge sequence.

It's probably best to stay away from analysis of chrU sequence. Again, choice excerpts from the above document (highlighting mine):

Nor can we exclude the possibility of contaminations from other organisms. We are making this data available as a resource for analysis of region which cannot be assembled well, such as satelites or simple repeats. Since some of this data is low quality, researchers are encouraged to contact either BDGP or DHGP for further details on this resource.

This indicates that it might be wise to stay away from chrU.

dmel release 5.1 (annotations)

How do we get a list of features in "cDNA and other aligned sequences" and "EST" GBrowse tracks?

How do we filter the FlyBase GFF dump down to only annotations that appear in the "cDNA and other aligned sequences" and "EST" GBrowse tracks?

Some notes about the tracks here (look under GENOME REAGENTS AND DATA). Apparently sim4 and splign are used to align sequences to the genome.

But what do we grep for in the GFF dump? This is what Andy Schroeder from FlyBase recommended:

For ESTs:

  • splign:na_dbEST_ncbi from ncbi alignments to euchromatin [AVU note: this source doesn't occur in dmel r5.1 annotations]
  • sim4:na_dbEST.diff.dmel for older BDGP alignments to euchromatin (and heterochromatin?) from other than the sequenced strain
  • sim4:na_dbEST.same.dmel for older BDGP alignments to euchromatin (and heterochromatin?) from the sequenced strain

For cDNAs and/or other submitted mRNA type sequences:

  • splign:na_cDNA_ncbi for ncbi cDNA FLI_CDNA alignments to euchromatin
  • sim4tandem:na_gb.dmel for older BDGP alignments to euchromatin
  • sim4:na_cDNA.dros for cDNAs from ncbi aligned to heterochromatin

I noticed something he missed, which must also be grepped for:

By the way, I also notice that in addition to sources you mentioned, features from source sim4_na_gb.tpa.dmel appear in the cDNA track also, e.g. take a look at feature BK002591.1 here:

http://flybase.org/cgi-bin/gbrowse/dmel/?name=3R:12713418..12717483;h_feat=_clear_;h_region=_clear_

Judging from the GenBank entry, I am guessing "tpa" stands for "third-party annotation."

The colons get replaced with underscores, so you are selecting using the following conditional:

...WHERE TYPE = 'match' AND
(
 SOURCE = 'sim4_na_dbEST.diff.dmel'
 OR
 SOURCE = 'sim4_na_dbEST.same.dmel'
 OR
 SOURCE = 'splign_na_cDNA_ncbi'
 OR
 SOURCE = 'sim4tandem_na_gb.dmel'
 OR
 SOURCE = 'sim4_na_cDNA.dros'
 OR
 SOURCE = 'sim4_na_gb.tpa.dmel'
);

We select features of type "match" rather than "match_part" so that each hit spans the whole alignment, introns included.
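The same selection can be done directly on a GFF dump with awk. This is a sketch under assumptions: `filter_aligned_matches` is a hypothetical name, the dump is tab-separated with the source in column 2 and the type in column 3, and colons in the source are normalized to underscores so that either spelling is accepted. As in the SQL, splign_na_dbEST_ncbi is left out because it does not occur in r5.1.

```shell
# Hypothetical helper: keep "match" features from the EST/cDNA sources listed above.
filter_aligned_matches() {
    awk -F'\t' '
        BEGIN {
            wanted["sim4_na_dbEST.diff.dmel"] = 1
            wanted["sim4_na_dbEST.same.dmel"] = 1
            wanted["splign_na_cDNA_ncbi"]     = 1
            wanted["sim4tandem_na_gb.dmel"]   = 1
            wanted["sim4_na_cDNA.dros"]       = 1
            wanted["sim4_na_gb.tpa.dmel"]     = 1
        }
        $3 == "match" {
            src = $2
            gsub(/:/, "_", src)      # accept sim4:na_gb.dmel as well as sim4_na_gb.dmel
            if (src in wanted) print
        }
    ' "$@"
}
```

For example, `filter_aligned_matches dmel-all-r5.1.gff > aligned-matches.r5.1.gff` (the output file name is only an illustration).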

Sources for computationally predicted annotations

When downloading GFF dumps of predicted annotations, use the following to decipher the "source" column (col 2).

<noautolink>

program sourcename
BATZ_Contrast caf1
BATZ_Contrast_NA caf1
BREN_N-Scan caf1
CONGO Dmel r4.3
DGIL_snap caf1
DGIL_snap_homology caf1
EISE_exonerate caf1
EISE_genemapper caf1
EISE_genewise caf1
GLEANR caf1
McPromoter3.0 dummy
NCBI_gnomon caf1
OXFD_exonerate caf1
PACH_genemapper caf1
RGUI_geneid_v1.2 caf1
RGUI_geneid_v1.2_u12 caf1
ROBI_manual caf1
Tandem_Repeat_Finder_75-20 dummy
assembly path
aubrey_cytolocator cytology
augustus dummy
bdgp_unknown_clonelocator scaffoldBACs
blastn na_dbEST.dpse
blastp Dmel_proteomic
blastx_masked aa_SPTR.dmel
blastx_masked aa_SPTR.insect
blastx_masked aa_SPTR.othinv
blastx_masked aa_SPTR.othvert
blastx_masked aa_SPTR.plant
blastx_masked aa_SPTR.primate
blastx_masked aa_SPTR.rodent
blastx_masked aa_SPTR.worm
blastx_masked aa_SPTR.yeast
blastx_masked dmel-all-translation-r4.3.fasta
blastx_masked eisen_v2_orthologs_aa.fasta
blastx_masked inparanoid_dmel_orthologs_aa.fas
blastz rui chen
dmel_r3_to_dmel_r4_migration dmel_r3_affy_oligos
genewise Brian Bettencourt
genie_masked dummy
genscan Brian Bettencourt
genscan_masked dummy
promoter dummy
prosplign aa_ncbi_dmel
prosplign aa_ncbi_other
repeat_runner_seg dummy
repeatmasker dummy
sim4 all_r32_subject_ortho_exons_nuc.
sim4 dmel-all-ncRNA-r4.3.fasta
sim4 dmel-all-pseudogene-r4.3.fasta
sim4 dmel-all-transcript-r4.0.fasta
sim4 dmel-all-transcript-r4.2.fasta
sim4 dmel-all-transcript-r4.3.fasta
sim4 dmel-all-transposon-r4.3.fasta
sim4 eisen_v2_orthologs_nt.fasta
sim4 na_ARGs.dros
sim4 na_ARGsCDS.dros
sim4 na_DGC.in_process.dros
sim4 na_HDP_RNAi.dmel
sim4 na_HDP_mRNA.dmel
sim4 na_cDNA.dros
sim4 na_dbEST.diff.dmel
sim4 na_dbEST.same.dmel
sim4 na_gadfly.dros.RELEASE2
sim4 na_gb.dmel
sim4 na_gb.tpa.dmel
sim4 na_het_transcript.dmel.RELEASE32
sim4 na_re2.dros
sim4 na_smallRNA.dros
sim4 na_transcript.dmel.RELEASE31
sim4 na_transcript.dmel.RELEASE32
sim4 preR5_gadfly4U_transcripts.fasta
sim4 stencil_annotCDS.fa
sim4tandem na_gb.dmel
splign na_cDNA_ncbi
splign na_dbEST_ncbi
tRNAscan-SE dummy
tblastn Dmel r3.1
tblastx_masked na_dbEST.insect
tblastxwrap_masked na_baylorf1_scfchunk.dpse
tblastxwrap_masked na_scf_chunk_agambiae.fa
twinscan Brian Bettencourt

</noautolink>

dmel release 5.3 (annotations)

Apparently contains many more snoRNAs than prior releases.

Notes from release 5.1 probably still apply, but there are some clear changes/improvements.

ribosomal RNA

There are a couple of issues revealed by the Twelve Fly Screen regarding FlyBase's ribosomal RNA (rRNA) annotations that are good to know for figuring out sensitivity in a ncRNA screen.

With the exception of Dmel, rRNA sequence is unassembled in the CAF1 datasets. This is why the Dmel rRNA sequence is unaligned. See this paper for an analysis of rRNA sequence from the sequencing reads.

The first issue: the only rRNA annotations in FlyBase (release 5.3) are for 5S rRNA and 2 mitochondrial rRNAs (lrRNA-RA and srRNA-RA).

5.8S, 18S, and 28S are not annotated; however, they are large (>150nt) and should give a good signal due to their strongly conserved structure and because at least some of their substructures should be smaller than 120-130nt, therefore discoverable even with our size-constrained phylo-grammars. It is conceivable we are picking them up, but we simply don't know.

The second issue is that we know for a fact that Twelve Fly Screen's sensitivity for 5S rRNA is very low. To track down why, let's examine where these rRNAs are located and how well they align to the other 11 flies.

The 5S rRNA genomic distribution in release 5.3 is:

chromosome                  number of 5S rRNA annotations
2R                          96
dmel_mitochondrion_genome   2
U                           64
XHet                        2

These annotations were extracted as follows:

ssh lorien
cd /home/projects/caf1screen/source/flybase/
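# input is assumed to be the full r5.3 dump; reading from and writing to the same file would truncate it
awk '$3=="rRNA" {print}' dmel-all-r5.3.gff > dmel-rRNA-r5.3.gff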
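The per-chromosome table above can be reproduced by tallying column 1 of the rRNA features. A minimal sketch, assuming tab-separated GFF; `count_rrna_by_seqid` is a hypothetical name:

```shell
# Hypothetical helper: count rRNA features (column 3) per sequence ID (column 1).
count_rrna_by_seqid() {
    awk -F'\t' '
        $3 == "rRNA" { n[$1]++ }
        END { for (c in n) print c "\t" n[c] }
    ' "$@" | LC_ALL=C sort
}
```

It can be run on either the full dump or the extracted rRNA file, e.g. `count_rrna_by_seqid dmel-rRNA-r5.3.gff`. Note that it counts every feature of type rRNA, so the two mitochondrial rRNAs show up alongside the 5S ones.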

Interestingly, all 96 are clustered very close to each other in the range [15617067,15653783] (36,717 nt), all on the same strand. See them in the genome browser: http://flybase.org/cgi-bin/gbrowse/dmel/?name=2R:15617067..15653783

Clearly, there are 96 5S rRNAs on chr2R that we can pick up. But are they in the Pecan/Mercator alignment? Mitch Skinner has extracted this set from the alignment (after doing the release 5 -> release 4 coordinate transformation), which is here: http://biowiki.org/~mitch/rRNA-subalignments.stock

These alignments are very poor for our purposes:

  • dmel sequence aligns to all gaps in other genomes
  • at least one case of barely any dmel sequence in an alignment

which explains the poor sensitivity.

But why are these dmel 5S rRNAs not aligning to the other genomes? Rob noticed they are mostly (always?) repeat-masked, but we don't know how that affects Pecan's alignment algorithm; it's an idea to investigate.

Let's BLAST some dmel 5S rRNA sequence against the other flies, starting from the top of Mitch's file, to see if they are even in the assembly:

  • FBtr0086444
    • against dpse (release 2.0, seems more complete than any of the other non-dmel flies)
      • Top hits are against unassembled regions (see match in genome browser)
      • but a promising one in chromosome 2 (see match in genome browser)

  • FBtr0086442
    • against dpse (release 2.0)
      • Once again, top hits are against unassembled regions, but here's another one on chr 2 (see match in genome browser)
      • But wait, that's the same hit from earlier... well of course, since the 5S rRNAs are all highly homologous. So we don't even need to BLAST different dmel sequences; one should do.

Having learned that, let's try BLASTing FBtr0086442 against some other species:

  • dsim: all matches are to chrU except for one, which is on 2R (see in genome browser)
  • dyak: just under half of the matches are to chr2L, chr2R, and chrX, with the rest being to variants of chrU; interesting that it fragmented across multiple chromosomes like that

The "unrealistically long ncRNA annotation" problem (release 5.1 and 5.3)

There are some features of type ncRNA in FlyBase that simply cannot be real ncRNAs because they are too long. UPDATE: It is better to trust FlyBase on this and not throw them out. They seem odd, but valid. I'm keeping these notes here for the time being. [AVU]

The longest ncRNA in Rfam 8.0 is 800nt.

TODO: put in a makefile:

# get Rfam sequence lengths, sorting them
wget ftp://ftp.sanger.ac.uk/pub/databases/Rfam/CURRENT/Rfam.fasta.gz
gunzip Rfam.fasta.gz
cat Rfam.fasta | perl -pe 'if (/^>/){$_="\n"}else{chomp}' | perl -pe 'chomp; $_=length($_)."\n"' | awk '$1>0 {print}' | sort -n > Rfam.lengths

Then again, maybe really long ncRNAs are plausible:

FlyBase GFF dumps:

  • /nfs/projects/caf1screen/source/flybase/dmel-all-r5.1.gff
    • has 60 ncRNA features with length > 800 (above max Rfam length)
    • has 6 ncRNA features with length >= 10000
  • /nfs/projects/caf1screen/source/flybase/dmel-all-r5.3.gff
    • has 66 ncRNA features with length > 800 (above max Rfam length)
    • has 11 ncRNA features with length >= 10000

CAUTION: These lengths are FlyBase annotation lengths, but the mature transcript may be shorter due to splicing!
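To get closer to the mature transcript length, one option is to sum exon lengths per transcript instead of using the transcript span. The following is only a sketch: `spliced_lengths` is a hypothetical name, and it assumes the dump has exon features whose `Parent` attribute lists one or more FBtr IDs separated by commas, and that exons of the same transcript do not overlap.

```shell
# Hypothetical helper: sum exon lengths for each parent transcript.
# Assumes tab-separated GFF3 with type "exon" in column 3 and Parent=... in column 9.
spliced_lengths() {
    awk -F'\t' '
        $3 == "exon" {
            parents = ""
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++)
                if (attrs[i] ~ /^Parent=/) parents = substr(attrs[i], 8)
            m = split(parents, p, ",")     # an exon may be shared by several transcripts
            for (j = 1; j <= m; j++) len[p[j]] += $5 - $4 + 1
        }
        END { for (t in len) print t "\t" len[t] }
    ' "$@" | LC_ALL=C sort
}
```

The output is one line per transcript ID with its summed exon length, which can then be compared against the 800nt Rfam maximum.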

TODO: put in a makefile:

# find nonsensically-long ncRNA annotations
awk '($3 == "ncRNA") && ($5-$4+1 > 800) {print "# length = " $5-$4+1; print}' dmel-all-r5.3.gff > too-long-ncRNA.r5.3.gff
# which ones are really bad? (>= 10,000)
grep -A 1 "length = ....." too-long-ncRNA.r5.3.gff | grep ncRNA

Examples (from r5.3):

  • the type is listed as mRNA
  • excerpt: alternatively spliced; this transcript does not appear to have a protein coding region; may function at the transcript level

  • 4 transcripts of FBgn0020556
    • FBtr0083341
      • longest FlyBase ncRNA annotation (length is 31,065nt in annotation, 1124nt after splicing (?))
      • FlyBase page
    • the 3 other transcripts are > 1200nt

It appears that in at least one case (FBgn0062978, go to "GENE MODEL & FEATURES" -> "COMMENTS ON GENE MODEL") FlyBase assigns the ncRNA type to transcripts that have a poor computed ORF.

The Final Word

We're keeping the long ncRNAs.

Here is the e-mail from FlyBase answering my question about how they got the ncRNA tag:

Dear Andrew,

> comments: I have noticed there are several annotations of type "ncRNA" that are much longer than anything I've seen in Rfam (e.g. FBtr0080319, FBtr0080320, FBtr0083341). What is the decision process/criteria for assigning "ncRNA" to an experimentally-detected transcript?

Generally, these are genes for which there is good evidence that they are transcribed, but the resulting transcripts have relatively small ORFs. More recently, we have been able to add the additional assessment of whether there are conserved "protein signatures" within these transcripts (conserved across the 12 sequenced Drosophila species); if there are not, it is additional evidence that the description as a ncRNA is correct.

The first 2 transcripts you cite are derived from the bft gene. We have classified bft as a ncRNA gene, based on descriptions of investigators (Hardiman et al., 2002, Genetics 161: 231; and sequence accession records submitted by the same authors).

The third transcript is derived from bxd. This has also been identified as a likely ncRNA in the literature (Martin et al., 1995, Proc. Natl. Acad. Sci. USA 92: 8398).

Clearly, our knowledge of large ncRNAs is still rudimentary. Rigorous criteria cannot be developed until we know a lot more about them. Certainly what constitutes a "relatively small ORF" is very subjective. In some cases, we have added a comment to the gene model that something annotated as protein-coding gene may actually produce non-coding RNA(s); we simply have no way of knowing at this point in time.

Sincerely,

Lynn Crosby
FlyBase

---

-- Created by: Andrew Uzilov on 11 May 2007