Transmissible Gastroenteritis Virus
To see my JBrowse Transmissible Gastroenteritis genome annotation, please visit TGV
Transmissible Gastroenteritis Virus
Coronaviruses from Microbewiki
The functional viral proteins annotated within the Transmissible Gastroenteritis Virus genome:
Coronaviruses are single-stranded, enveloped RNA viruses with a helical nucleocapsid. They are the largest of the RNA viruses, with genomes of up to 31 kb, and are grouped in the order Nidovirales. The structure common to all coronaviruses consists of spike (S), envelope (E), membrane (M), and nucleocapsid (N) proteins. There are currently three different groups of coronaviruses. Some of the notable viruses are group 2b SARS, group 1 Human coronavirus, and TGEV.
Visit Fall09 Virus Phylogenetic Tree to see how coronaviruses are related to other students' viruses.
Transmissible Gastroenteritis Virus Biology
TGEV belongs to the coronavirus family. It is an enveloped virus with a positive-sense, single-stranded RNA genome. TGEV has three major structural proteins: the phosphoprotein (N), the integral membrane protein (E1), and the large glycoprotein (E2, also called the spike or S protein). The N protein encapsidates the genomic RNA, and the E2 (S) protein forms the viral surface projections. The nucleocapsid protein has also been shown to be important for viral genome replication.
The 3' segment of about 8000 nucleotides encodes the subgenomic RNAs. The remaining part of the genome encodes the viral replicase. The three major structural genes are arranged 5' to 3' in the order E2, E1, N. There are about seven other open reading frames that are not structurally related. The genes are densely packed, with very little overlap among them. A negative strand is synthesized to serve as a template for transcribing one genome-sized RNA and several subgenome-sized RNAs.
The E2 protein forms a petal-shaped, 20 nm long projection from the virus's surface. It is thought to be involved in pathogenesis by helping the virus enter the host cytoplasm. The E2 protein initially has 1447 residues, from which a short hydrophobic sequence is then cleaved. After glycosylation in the Golgi, the protein is incorporated into the new virus. There are several functional domains within the E2 protein. A 20-residue hydrophobic segment at the C-terminus anchors the protein in the lipid membrane. The rest of the protein is divided into two parts: a hydrophilic stretch that lies inside the virus and a cysteine-rich stretch that may contain fatty acylation sites. The E1 protein is mostly embedded in the lipid envelope and hence plays an essential role in virus architecture. The E1 protein is postulated to interact with the lymphocyte membrane, leading to the induction of IFN-coding genes.
Transmissible Gastroenteritis Virus Morphology
The morphology of TGEV was determined mostly by electron microscopy. It resembles that of myxoviruses and oncogenic viruses in having surface projections and an envelope. The particles are mainly circular in shape, with a diameter ranging from 100 to 150 nm including the surface projections. The projections were mainly petal-shaped and attached by a very narrow stalk. They seemed to detach very easily from the virus and were found only on select areas.
General morphology of coronaviruses from viralzone
Transmissible Gastroenteritis Virus Pathology
TGEV infects pigs, and in young pigs the mortality rate is close to 100%. The pathology of TGEV is similar to that of other coronaviruses. Coronaviruses enter the host by first attaching to the host cell using the spike glycoprotein. The S protein interacts with porcine aminopeptidase N (pAPN), a cellular receptor, to aid its entry. The same cell receptor is also a point of contact for Human Coronaviruses. pAPN recognizes a domain in the S protein, and transfecting nonpermissive cells with pAPN makes them susceptible to infection by TGEV. Once the virus infects the host, it multiplies in the cell lining of the small intestine, resulting in the loss of absorptive cells, which in turn leads to shortening of the villi. The infected swine then have a reduced capability for digesting food and die from dehydration.
The Transmissible Gastroenteritis Virus has been engineered as an expression vector. The vector was constructed by replacing the nonessential 3a and 3b ORFs, which are driven by transcription-regulating sequences (TRS), with green fluorescent protein. The resulting construct was still enteropathogenic, but with reduced growth. Infection with this altered virus elicits a specific lactogenic immune response against the heterologous protein. Applications of this vector include vaccine development and possibly gene therapy. The motivation for engineering the TGEV genome is that coronaviruses have large genomes, so they have room for the insertion of foreign genes. Coronaviruses also infect the respiratory tract, so they can be used to target antigens to that area and generate an immune response there.
Comparison To Other Viruses
The Transmissible Gastroenteritis Virus shares a great deal of homology with other coronaviruses of the family. For instance, the Mpro proteinase gene is highly conserved among the three coronaviruses assigned: Human Coronavirus, SARS, and TGEV. The substrates of the proteinases have remarkably conserved binding sites, and hence the active sites of the proteinases are also similar among the coronaviruses. The spike protein is also conserved among coronaviruses.
JBrowse Genome Annotation Tracks
The gene track involves simple annotations from the standard reference sequence chosen from GenBank. I chose my reference sequence, NC_002306.2, because it was a listed reference sequence, and though it had provisional review status, it had "replaced" the previous reference sequence for TGEV almost 5 years ago.
Although the 3'UTR and 5'UTR were not annotated, the listing of mat_peptide (mature peptide) features from the replicase was detailed enough, though with putative tags, that it seemed worth the risk of a "provisional review."
exons & mRNA
The exons and mRNA tracks basically mirror the gene track: the ssRNA virus is positive stranded, so directionality is not a problem, and the annotated genes were all subannotated with child exon/mRNA features. Thus, there are three identical tracks with parent/child labels. Since this is a positive-sense ssRNA virus, the mRNA is equivalent to the cDNA of the genome sequence noted.
The RefSeq record NC_002306.2 had the replicase, its cleaved post-processed mature peptides, and certain other putative proteins annotated from the massive replicase transcript; the purpose of some of these mature peptides is a little unclear, since they are all labeled putative and some of the actual NCBI references contain little more, but in general they relate to metal-binding domains, ribonucleases, and subunits of the larger replicase polypeptide.
colEntropy (window = 100, step = 5)
Column Shannon entropy, averaged over a sliding window of 100 columns with a step of 5, gives a general picture of conservation using column-relative conservation as the scoring metric. The entropy of a column is the probability-weighted sum of the -log probabilities of the nucleotides found in it. Low entropy means that one nucleotide dominates the column with high probability, i.e. the column is highly conserved. Averaging over the window with small step sizes gives us a low-granularity, smooth picture of how column entropy changes along the genome, relative to the alignment as a whole; i.e., how well conserved are all the aligned sequences over the given window range? This differs from Shannon information, which asks instead how many bits of information the nucleotide of the reference sequence carries, given that we have the alignment and know the probability distribution. Both address the same question of conservation, and both suggest that there is a highly conserved series of putative domain structures around the 3' end of the replicase construct, but they ask from different perspectives: either relative to the information present in the alignment or relative to the information added by the reference sequence.
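As a rough illustration of the sliding-window calculation described above, here is a minimal Python sketch. It is not the script used to build the track; the function names `column_entropy` and `windowed_entropy`, and the choice to ignore gaps and ambiguity codes, are assumptions made for the example.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of one alignment column.

    Assumption: gaps and ambiguity codes are simply ignored.
    """
    counts = Counter(c for c in column.upper() if c in "ACGTU")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Probability-weighted sum of -log2(p) over the nucleotides present.
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def windowed_entropy(alignment, window=100, step=5):
    """Mean column entropy per sliding window.

    `alignment` is a list of equal-length aligned sequences.
    Returns (window_start, mean_entropy) pairs with 0-based starts.
    """
    length = len(alignment[0])
    per_column = [
        column_entropy("".join(seq[i] for seq in alignment))
        for i in range(length)
    ]
    scores = []
    for start in range(0, length - window + 1, step):
        chunk = per_column[start:start + window]
        scores.append((start, sum(chunk) / window))
    return scores
```

A fully conserved column scores 0 bits and a column split evenly between two nucleotides scores 1 bit, so troughs in the windowed mean correspond to conserved stretches.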
The point of a window model is to pick up on local structure and to damp out single high-intensity signals by taking them relative to a wider window; it deals with anomalies and acts as a kind of normalizing structure. The problem, though, is that without word-size considerations, which I did not add into the basic model, I assume an I.I.D. model that does not effectively pick up on dinucleotide, trinucleotide, or k-nucleotide pattern repeats, meaning I damp anomalous signals without taking into account true local structure. I normalize with my window, which is necessary given the amount of noise I would have had with small window sizes, but without true word-size consideration in my sliding window entropy model, I cannot truly say I took local sequence distributions into account.
Thus, my column entropy track provides a general picture of information content and local conservation but it never fully realizes its capacity for distinct, precise local distribution modeling. Further extensions must be made to introduce word sizes to realize optimal analysis of the given sequence alignments.
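One way the word-size extension might look, purely as a sketch and not something implemented for these tracks: instead of scoring each column independently, treat the k-column word starting at each position as a single symbol and take the entropy over those words. The function name `word_entropy` and the handling of gaps (they are kept as ordinary characters) are assumptions for the example.

```python
import math
from collections import Counter


def word_entropy(alignment, k=3):
    """Entropy (bits) of the k-column words starting at each alignment position.

    With k=1 this reduces to plain per-column entropy (gaps counted as a symbol).
    """
    length = len(alignment[0])
    scores = []
    for i in range(length - k + 1):
        words = Counter(seq[i:i + k].upper() for seq in alignment)
        total = sum(words.values())
        scores.append(
            sum(-(n / total) * math.log2(n / total) for n in words.values())
        )
    return scores
```

For an alignment such as `["ACAC", "ACAC", "CACA", "CACA"]`, every single column carries 1 bit, so an I.I.D. model would assign 2 bits to any pair of adjacent columns, while the dinucleotide words carry only 1 bit because neighboring columns are perfectly correlated. That gap is the local structure the basic model misses.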
Again, in the same way that the mRNA and exons mirror the genes, so do the CDS. The only difference is that the CDS are tagged with proteins beyond the cleaved products of the replicase gene; they annotate spike, membrane, and envelope proteins as well, near the end of the genome, and these annotations can be pulled out
with my script (see genbank parsing from main page). CDS, mRNA, and exons do not tell us very much for ssRNA; in theory there could be untranslated exons or
variants of mRNA from processing/splicing or secondary structure variants, but from a pure sequence standpoint there is little variation in NC_002306.2.
Annotated restriction enzyme sites. See the credits below. Not my own Perl script.
gcContent (window = 30, step = 7)
I chose a window of 30 and a step of 7 so that consecutive windows overlap heavily (0-30, 7-37, ...): the step is much smaller than the window, and since 30 is not an integer multiple of 7, there are few non-overlapping regions annotated. Each signal still comes from a relatively small window, 30, allowing us to pick up localized signals instead of damping them out and losing the local information by aggregating to a global (larger window) mean. The idea is that GC signals are local and their global impact is likely to be minimal, so with window sizes that are too large we would expect local signals to fade into the noise. That said, other than the initial spike, which I note is in the initial 500-bp 5'UTR region and is therefore potentially involved in heavy base pairing for secondary structure in the expected IRES in the UTR, the rest of the signals seem indistinguishable from noise. There is one region around 6,025 bp with the lowest GC content, but it is localized within one of the subunits of the replicase polypeptide, so I am unclear as to its purpose; it isn't near the end, so it isn't a poly-A tail. Honestly, I wouldn't expect viruses to have the kind of
tatatatatatatatat or gcgcgcgcgcgc repeats in long enough strings to generate visible signals above the background variance and noise.
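The windowing scheme for this track can be sketched in a few lines of Python. This is an illustration under assumptions, not the script behind the track: the function name `gc_content` is made up, and windows that would run past the end of the sequence are simply dropped.

```python
def gc_content(seq, window=30, step=7):
    """Percent G+C in each sliding window.

    Returns (window_start, percent_gc) pairs with 0-based starts.
    Assumption: trailing partial windows are skipped.
    """
    seq = seq.upper()
    scores = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        scores.append((start, 100.0 * gc / window))
    return scores
```

With `window=30` and `step=7`, neighboring windows share 23 of their 30 bases, which is what smooths the track while keeping each value local.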
Shannon information would be expected to mirror the column entropy, at least in shape and in the general peaks of intensity, as Shannon information indicates how many bits knowing the nucleotide at a column of the alignment tells us about the other nucleotides in that column. Since column entropy is simply the probability-weighted sum of Shannon information, it seems likely that for sequences as closely related as those NC_002306.2 (TGEV) was aligned to, Shannon entropy would in fact be comparable, ignoring scaling factors, to Shannon information. The variances in troughs and peaks, and the general scaling differences, are largely irrelevant; consider this track an indication that at the very least my parsing and analytic algorithms were consistent given the alignment data set. The same pattern of high conservation (low information, i.e. we gain little from learning of an "a" at column i if we were already highly certain of an "a" at i) occurs around the latter putative domains of the post-processed replicase polypeptide.
One interesting fact is that Shannon information is directly tied to the reference sequence, while column entropy, by nature of its summation, is relative to the alignment. Thus, we have a picture of the information passed by this genome relative to the alignment, and the fact that the two tracks are highly similar tells us that the aligned genomes are in fact similar: both one relative to all and all relative to each other give the same general entropic patterns, however they are scaled.
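To make the distinction concrete, here is a small Python sketch of the per-column Shannon information of the reference base, -log2 p(reference base), computed from the column's nucleotide frequencies. The function name `reference_information`, the `ref_index` parameter, and the choice to score gap or ambiguity positions in the reference as `None` are all assumptions for the example, not a description of the original script.

```python
import math
from collections import Counter


def reference_information(alignment, ref_index=0):
    """Self-information (bits) of the reference base in each alignment column.

    `alignment` is a list of equal-length aligned sequences and `ref_index`
    selects the reference row. Columns where the reference holds a gap or
    ambiguity code get None (an assumption of this sketch).
    """
    scores = []
    for i, ref_base in enumerate(alignment[ref_index].upper()):
        counts = Counter(
            seq[i].upper() for seq in alignment if seq[i].upper() in "ACGTU"
        )
        if ref_base not in counts:
            scores.append(None)
            continue
        total = sum(counts.values())
        # -log2(p) written as log2(1/p) so a fully conserved column gives 0.0.
        scores.append(math.log2(total / counts[ref_base]))
    return scores
```

Column entropy is the average of this quantity over every nucleotide in the column, weighted by frequency, so when the reference carries the majority base in most columns the two tracks should rise and fall together.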
One gene, nsp 3b, was annotated as a pseudogene with regulatory capacity for a noncoding gene nsp 3b. The /pseudo annotation was clear on both it and its misc feature;
I included it purely because it was annotated, but this feature in particular had "no experimental evidence" and is clearly a provisional annotation based on homology modeling with related species.
Proteins are annotated from CDS features; they are not mature peptides, just simple protein annotations yanked from the CDS locus_tag, protein_id, and product fields. In particular, note that while the mature peptides annotated in GenBank refer to post-processed products of the replicase transcript, the CDS annotations parsed with my script denote the regions left out, near the end of the genome: the spike protein, membrane protein, envelope protein, etc.
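For readers who want to reproduce this kind of track, the sketch below shows one way already-parsed CDS qualifiers could be turned into GFF3 lines. It is not the parsing script referred to above: the function names, the dictionary layout, and the coordinates and identifiers in the usage example are all made up for illustration.

```python
def escape(value):
    """Percent-encode the characters GFF3 reserves inside attribute values."""
    for char, code in (("%", "%25"), (";", "%3B"), ("=", "%3D"),
                       ("&", "%26"), (",", "%2C"), ("\t", "%09")):
        value = value.replace(char, code)
    return value


def cds_to_gff3(seqid, features, source="GenBank"):
    """Format CDS feature dicts as GFF3 lines (1-based, inclusive coordinates).

    Each dict is assumed to hold: start, end, strand, locus_tag,
    protein_id, product.
    """
    lines = ["##gff-version 3"]
    for feat in features:
        attributes = ";".join([
            f"ID={escape(feat['locus_tag'])}",
            f"Name={escape(feat['product'])}",
            f"protein_id={escape(feat['protein_id'])}",
        ])
        lines.append("\t".join([
            seqid, source, "CDS",
            str(feat["start"]), str(feat["end"]),
            ".", feat["strand"], "0", attributes,
        ]))
    return lines


# Usage with placeholder values (not real annotations from NC_002306.2):
example = [{"start": 100, "end": 399, "strand": "+",
            "locus_tag": "EXAMPLE_01", "protein_id": "XX_000001.1",
            "product": "example protein"}]
print("\n".join(cds_to_gff3("NC_002306.2", example)))
```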
Protein domains were annotated by hand in .gff3 format from the literature; see the link to Transmissible Gastroenteritis Annotations. These are recognized subunits with classified structure or function that make up certain putative domains from the post-processed replicase polypeptide.
Alignment (window = W, step = S)
Alignments were run with W=100, S=5 because, particularly for column entropies, we cannot get a good picture of overall conservation from nucleotide alignments (vs. amino acid alignments) without taking local distributions into account. Sliding window entropy with larger window sizes (W is still only ~0.3% of the genome) gives a picture of the distribution around each nucleotide position, as opposed to the static representation of entropy from simple per-column models. Even so, simple sliding window entropy is not a perfect picture of conservation, and it is not a perfect scoring metric. While it does give a good picture of highly conserved regions, the 16-35 genomic sequences on which alignments were conducted (the data shown is from a batch of 19 closely related coronaviral genomes) were not nearly enough to reduce noise or highlight intense mutational hotspots. The best we get is a dim view of high conservation (low entropy, low information content) around the 3' end of the replicase gene, at least relative to the higher entropy values near the 3' end of the genome. This suggests that, at least among the species aligned, the membrane, spike, and envelope proteins are less highly conserved than the transcriptional and translational machinery, as well as the insertion machinery, represented by the replicase mature peptides. We would have expected this.
Note on relative Scaling:
One important note to keep in mind is that although the height scaling was kept at 150 for all wig files, the actual numerical scaling differs between Shannon information content and Shannon entropy; more importantly, GC content should be read as a percentage of the full height of the track.
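Since the raw numbers in each wig file determine how the fixed track height is filled, it may help to see how windowed scores could be written out. The sketch below uses the fixedStep wiggle layout; whether the original tracks used fixedStep or variableStep is not stated, so that choice, the function name `to_fixedstep_wig`, and setting the span equal to the step (so that entries do not overlap) are assumptions.

```python
def to_fixedstep_wig(chrom, values, step, start=1, name="track"):
    """Serialize evenly spaced scores as a fixedStep wiggle string.

    Positions are 1-based; each value covers `step` bases (span == step).
    """
    lines = [
        f'track type=wiggle_0 name="{name}"',
        f"fixedStep chrom={chrom} start={start} step={step} span={step}",
    ]
    lines.extend(f"{value:.4f}" for value in values)
    return "\n".join(lines) + "\n"


# Usage with made-up GC percentages:
print(to_fixedstep_wig("NC_002306.2", [50.0, 43.3333], step=7, name="gcContent"))
```

Writing GC content as a 0-100 percentage and entropy in bits, as here, is what produces the different numerical scales noted above even when the display height is the same.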
for the scripts used to generate restriction sites track.
Credit given to script creator.
In retrospect, alignment models and conservation scoring work better when I don't have to consider degeneracy in nucleotide codes. Sliding window models would also best be applied to amino acid alignments, the only caveat being that a more general parser, capable of dealing with multiple disconnected domains and variable coding frames, would be needed to best align the various sequences; who is to say that, with all the overlapping coding frames, a standard alignment truly captures the relative conservation of my viral genome? Nucleotide alignments generate a general picture of conserved domains and, as in the case of the 5'UTR and IRES homology modeling attempt, it is clear that simple sliding window entropy models can locate high-conservation local regions, but RNA structure prediction and amino acid alignment would go a long way towards generating quantitative data. I guess, though, in the end, the point of the JBrowse genome viewer is to point the user in the direction of "the next test" and make clear what information is lacking. Further tracks that should be added, and can be, given the generalizable, reference-sequence-relative scripts I have written and posted, include amino acid alignment and word-length consideration in conservation scoring for local distributions.
In essence, the general conservation picture has been generated by these alignments and basic information-theoretic analyses. What we need to do next is fine-tune the data, reduce noise, and sharpen true signals with confidence. We need a better scoring metric that either takes into account nucleotide degeneracy and word lengths, OR switches to amino acid conservation, OR, in the case of 5' or 3' UTRs and even internal regulatory regions, focuses on RNA structure to find conserved loops and similar elements around which to focus our alignments (i.e. where we would expect entropy troughs, meaning high conservation).
Laude H, Rasschaert D, Delmas B, Godet M, Gelfi J, Charley B. Molecular biology of transmissible gastroenteritis virus.
Vet Microbiol. 1990 Jun;23(1-4):147-54.
Tajima M. Morphology of transmissible gastroenteritis virus of pigs. A possible member of coronaviruses. Brief report.
Arch Gesamte Virusforsch. 1970;29(1):105-8.
Sanchez CM, Izeta A, Sanchez-Morgado JM, Alonso S, Sola I, Balasch M, Plana-Duran J, Enjuanes L. Targeted recombination demonstrates that the spike gene of transmissible gastroenteritis coronavirus is a determinant of its enteric tropism and virulence.
J Virol. 1999 Sep;73(9):7607-18.
Sola I, Alonso S, Zuniga S, Balasch M, Plana-Duran J, Enjuanes L. Engineering the transmissible gastroenteritis virus genome as an expression vector inducing lactogenic immunity.
J Virol. 2003 Apr;77(7):4357-69.
Almazan F, Galan C, Enjuanes L. The nucleoprotein is required for efficient coronavirus genome replication.
J Virol. 2004 Nov;78(22):12683-8.
-- %TEACHINGWEB%.SushantSundaresh - 07 Dec 2009
Copyright © 2008-2013 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.