Viral Alignments

From Biowiki

Database of annotated genomic viral alignments

A project to generate a comprehensive database of annotated alignments of viral genomes, for the training, benchmarking and application of bioinformatics tools for alignment analysis (motif finding, phylogenetics, etc.).

The concept is a loose amalgam of several existing databases, including the RNA virus database, the LANL HIV database, Rhino Base, parts of GenBank, Rfam and other sources.

Annotated genome/alignment features to include:

  • protein-coding genes and protein transcripts (accounting for overlapping ORFs, frameshifts, peptide cleavage...)
  • structural elements, incl. (where available) programmed frameshifts, IRESes, regulatory elements & complete genome structure (e.g. HIV)
  • a maintained library of tools for easy automated import/synchronization with other viral alignment/annotation file formats & databases (GenBank, literature, etc.)
  • both predicted & experimentally validated content, with solid distinctions drawn (GO evidence codes?)
  • phylogeny; subtypes (incl. both ARGs and sets of trees à la RecHMM)

The database can be downloaded as a set of documents, or browsed intuitively using JBrowse.

Implementation

IH 8/30: I propose we start with Stockholm-format alignments combined with GFF-format annotations. We can eventually map these to JSON documents for quick porting to JBrowse and compatibility with CouchDB, which we envisage as the eventual JBrowse database.

For example, consider the following Stockholm alignment, taken from the page Wikipedia:Stockholm_format:

# STOCKHOLM 1.0
#=GF ID    UPSK
#=GF SE    Predicted; Infernal
#=GF SS    Published; PMID:9223489
#=GF RN    [1]
#=GF RM    9223489
#=GF RT    The role of the pseudoknot at the 3' end of turnip yellow mosaic
#=GF RT    virus RNA in minus-strand synthesis by the viral RNA-dependent RNA
#=GF RT    polymerase.
#=GF RA    Deiman BA, Kortlever RM, Pleij CW;
#=GF RL    J Virol 1997;71:5990-5996.

AF035635.1/619-641             UGAGUUCUCGAUCUCUAAAAUCG
M24804.1/82-104                UGAGUUCUCUAUCUCUAAAAUCG
J04373.1/6212-6234             UAAGUUCUCGAUCUUUAAAAUCG
M24803.1/1-23                  UAAGUUCUCGAUCUCUAAAAUCG
#=GC SS_cons                   .AAA....<<<<aaa....>>>>
//

We could render this more or less directly as a JSON document:

{
 "Format": "STOCKHOLM 1.0",
 "Row": {
  "AF035635.1/619-641": "UGAGUUCUCGAUCUCUAAAAUCG",
  "M24804.1/82-104":    "UGAGUUCUCUAUCUCUAAAAUCG",
  "J04373.1/6212-6234": "UAAGUUCUCGAUCUUUAAAAUCG",
  "M24803.1/1-23":      "UAAGUUCUCGAUCUCUAAAAUCG"
 },
 "GC": {
  "SS_cons": ".AAA....<<<<aaa....>>>>"
 },
 "GF": {
  "ID": "UPSK",
  "SE": "Predicted; Infernal",
  "SS": "Published; PMID:9223489",
  "RN": "[1]",
  "RM": "9223489",
  "RT": [
   "The role of the pseudoknot at the 3' end of turnip yellow mosaic",
   "virus RNA in minus-strand synthesis by the viral RNA-dependent RNA",
   "polymerase."
  ],
  "RA": "Deiman BA, Kortlever RM, Pleij CW;",
  "RL": "J Virol 1997;71:5990-5996."
 }
}
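To make the mapping concrete, here is a minimal Python sketch of such a conversion. It is only an illustration under stated assumptions: the function name `stockholm_to_dict` is hypothetical, only the header, `#=GF`, `#=GC` and sequence lines are handled (`#=GS` and `#=GR` lines are skipped), and repeated `#=GF` tags such as `RT` are collected into a list as in the JSON above.

```python
import json

def stockholm_to_dict(text):
    """Map one Stockholm alignment onto the Format/Row/GC/GF layout.

    Hypothetical helper: covers only the line types used in the example.
    """
    doc = {"Format": None, "Row": {}, "GC": {}, "GF": {}}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "//":          # end-of-alignment marker
            break
        if line.startswith("# STOCKHOLM"):
            doc["Format"] = line[2:].strip()
        elif line.startswith("#=GF"):
            parts = line.split(None, 2)
            tag = parts[1]
            value = parts[2] if len(parts) > 2 else ""
            if tag in doc["GF"]:
                # A repeated tag (e.g. a multi-line RT) becomes a list.
                previous = doc["GF"][tag]
                if isinstance(previous, list):
                    previous.append(value)
                else:
                    doc["GF"][tag] = [previous, value]
            else:
                doc["GF"][tag] = value
        elif line.startswith("#=GC"):
            _, tag, value = line.split(None, 2)
            # Concatenate so that interleaved blocks are joined.
            doc["GC"][tag] = doc["GC"].get(tag, "") + value
        elif line.startswith("#"):
            continue              # #=GS, #=GR and comments: not handled here
        else:
            name, seq = line.split(None, 1)
            doc["Row"][name] = doc["Row"].get(name, "") + seq
    return doc

if __name__ == "__main__":
    example = "# STOCKHOLM 1.0\n#=GF ID UPSK\nM24803.1/1-23 UAAGUUCUCGAUCUCUAAAAUCG\n//\n"
    print(json.dumps(stockholm_to_dict(example), indent=1))
```

Keeping the sequence names as keys of `Row` mirrors the literal rendering above; a list of records might suit CouchDB views better, but that is a design choice left open here.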

Similar literal translations can be imagined for GFF.
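As a rough sketch of what such a translation might look like, the Python snippet below turns one GFF feature line into a JSON-ready dictionary. It assumes the nine tab-separated GFF columns and GFF3-style `key=value;key=value` attributes; the function name `gff_line_to_dict`, the JSON key names and the sample feature line are all invented for illustration and are not an agreed schema.

```python
import json

# Names for the first eight GFF columns (the ninth holds the attributes).
GFF_COLUMNS = ["seqid", "source", "type", "start", "end",
               "score", "strand", "phase"]

def gff_line_to_dict(line):
    """Convert one tab-separated GFF feature line to a dict (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(fields))
    record = dict(zip(GFF_COLUMNS, fields[:8]))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    if fields[8] != ".":          # "." means no attributes
        for pair in fields[8].split(";"):
            if pair:
                key, _, value = pair.partition("=")
                attributes[key.strip()] = value.strip()
    record["attributes"] = attributes
    return record

if __name__ == "__main__":
    # Made-up feature line, loosely based on the alignment example.
    line = "AF035635.1\texample\tpseudoknot\t619\t641\t.\t+\t.\tID=UPSK"
    print(json.dumps(gff_line_to_dict(line), indent=1))
```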

We should stick to straight Stockholm and GFF initially... but JSON versions of those formats could be useful.

  • Converters to common output formats (.fa, .stk, other flat files; is it possible to export to Geneious?)
  • Customizable output format adaptors via MapReduce (cf. Exonerate's 'roll your own' feature, so nice...)
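A converter from the JSON layout to FASTA could be as small as the following Python sketch. It assumes a document with a `Row` map of sequence name to aligned sequence, as in the JSON example; the function name `to_fasta` and its `keep_gaps` and `width` parameters are hypothetical, and the MapReduce adaptor idea is not shown.

```python
def to_fasta(doc, keep_gaps=True, width=60):
    """Render the Row map of a JSON alignment document as FASTA text.

    Hypothetical helper: with keep_gaps=False, the gap characters
    '-' and '.' are stripped to give unaligned sequences.
    """
    lines = []
    for name, seq in doc["Row"].items():
        if not keep_gaps:
            seq = seq.replace("-", "").replace(".", "")
        lines.append(">" + name)
        # Wrap the sequence at a fixed line width.
        for i in range(0, len(seq), width):
            lines.append(seq[i:i + width])
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    doc = {"Row": {"M24803.1/1-23": "UAAGUUCUCGAUCUCUAAAAUCG"}}
    print(to_fasta(doc), end="")
```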

Pilot genomes

Key points: These nucleotide (RNA) alignments have at their foundation a careful superimposition of crystal structures for all known resolved (encoded) protein structures. Whenever new sequences were added, initially by ClustalW profile-profile fits, the compilation was then edited (Wisconsin Package: SeqLab) to conform with that fit, and also with any other known information concerning protein or RNA conservation for these strains. The RNA alignments maintain open reading frames, from which the aligned polyproteins were translated directly. In addition, they respect all known 2D and 3D RNA structural motifs, including 5' and 3' stem elements, 5' pseudoknots, 5' IRES, poly(C) tracts, 3' polyadenylation sites, internal cre elements, etc. The protein alignments for each genus, species or type respect all known proteolytic cleavage sites, enzyme active sites, core protein structures (helices, sheets, turns), and mapped antigenic sites. The RNA alignments (.rna) may be translated directly to produce the respective aligned polyproteins (.p123).
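The direct translation of a frame-preserving RNA alignment into an aligned protein row can be sketched in Python as below. This is an illustration only, not the procedure used to build those alignments: it assumes the row starts in frame, that gaps fall on whole codons, and that the standard genetic code applies. The function name `translate_aligned` is hypothetical, and any codon with partial gaps or ambiguous bases is reported as `X`.

```python
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order.
BASES = "UCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}

def translate_aligned(rna):
    """Translate one gapped RNA alignment row codon by codon.

    A whole-codon gap ('---' or '...') becomes a single '-' so that
    the protein rows stay aligned; trailing partial codons are dropped.
    """
    rna = rna.upper().replace("T", "U")
    protein = []
    for i in range(0, len(rna) - len(rna) % 3, 3):
        codon = rna[i:i + 3]
        if codon in ("---", "..."):
            protein.append("-")
        else:
            # Partial gaps or ambiguity codes fall through to 'X'.
            protein.append(CODON_TABLE.get(codon, "X"))
    return "".join(protein)

if __name__ == "__main__":
    print(translate_aligned("AUGGCU---UAA"))
```

Because every row is translated with the same codon boundaries, the resulting protein rows share columns, which is what allows the polyprotein alignment to be derived from the RNA alignment without realigning.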

  • Wikipedia:HIV
    • Build on LANL HIV database, possibly other databases
    • 'Complete' secondary structure recently released. Make available in downloadable alignment format.
    • Many different divergent strains, with possibly gained/lost structures. Allow for downloading a selectable subset of the different serotypes

Features

  • Initial - not so computationally demanding
    • Alignment generation - browse e.g. RNA virus database and select species/types,
      • download non-aligned, or possibly align on server (?)
    • Feature finder/importer
      • Search various relevant databases/papers for features of given species/serotype

        • e.g. http://search.cpan.org/~cjfields/BioPerl-1.6.0/Bio/DB/HIV.pm
        • e.g. HIV whole-genome structure from Watts et al.: Architecture and secondary structure of an entire HIV-1 RNA genome. Nature 2009;460:711-6.

  • Manual importation (e.g. PV and related picornaviruses have 2-3 known structures each)
    • These could be shared also...though I'm not sure if the goal of this is a shared information source per se... (?)
    • JBrowse
      • View alignment/genome features in separate tracks
      • Genomic map (e.g. UTRs, ORFs, etc.). Ideally reasonably easily imported from GenBank, using a selected species as the reference for codons, etc.
    • local conservation, recombination rates, GC content, etc
    • multiple related features, e.g. known structured elements (in this and possibly related species/serotypes) vs. RNAdecoder/Xrate posterior probs vs. (within- vs. outside-group structure variability)
    • RecHMM output for multi-tree/recombinant alignments
  • Eventual
    • Computational tools - already exist, interface w/ db tool
      • Allow alignment of desired sequences
      • Tree construction (e.g. Structural EM program, can perform various algorithms)
      • Multi-tree/multi-alignment suggestions w/ RecHMM
      • Represent annotation dependencies on previously-trained models (e.g. using RNAdecoder predictions to train Xrate)
    • Computational ideas/goals
      • Enable sensitive, specific, flexible discovery of structured elements within protein-coding regions of viral genomes
      • RNAdecoder/Xrate + structural distance ANOVA + comparison w/known elements (?)

-- Oscar Westesson - 30 Aug 2009