Andrew's make-windows.pl uses the GFF convention for strand coordinates: the first number (start) is never greater than the second (end), regardless of strand. On the plus strand, the coordinates increase from 5' to 3'. On the minus strand, the coordinates decrease as you move from 5' to 3'. The sequences themselves are always written 5' to 3'.
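As a minimal sketch of what this convention means in practice, the snippet below pulls a feature out of a plus-strand sequence given GFF-style coordinates and returns it 5' to 3'. The function names are hypothetical and this is not taken from make-windows.pl; it assumes 1-based inclusive coordinates, as in GFF.

```python
# Hypothetical helpers, not part of make-windows.pl.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_feature(plus_strand_seq, start, end, strand):
    """Return the feature sequence written 5' to 3'.

    start and end are 1-based and inclusive with start <= end,
    whatever the strand (the GFF convention).
    """
    if start < 1 or start > end:
        raise ValueError("expected 1 <= start <= end")
    sub = plus_strand_seq[start - 1:end]
    if strand == "+":
        return sub
    if strand == "-":
        # On the minus strand the 5' end sits at the larger coordinate,
        # so the plus-strand slice is reverse-complemented.
        return reverse_complement(sub)
    raise ValueError("strand must be '+' or '-'")
```

For a minus-strand feature the same start and end are used for slicing; only the final reverse complement differs.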
Note also that the windows we feed into the RNA predictors are DNA sequences, so the template for the corresponding RNA gene would be on the opposite strand. There is no need to convert the DNA to RNA by replacing T with U, since most programs do this themselves.
Rfam has an interesting paragraph about ncRNA pseudogenes here. Rfam uses '.' for unpaired bases, but this is a wildcard character in xrate, so Andrew changed it to '_'. Rfam splits ncRNAs into 3 classes: gene, cis-reg, and intron.
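The character swap could be sketched as below. This is not Andrew's actual script. It assumes the structure sits on a Stockholm-style `#=GC SS_cons` line and that only that line should be touched, since '.' in the sequence lines means something else; the function name is hypothetical.

```python
import re

# Matches the tag, the structure string, and any trailing whitespace.
SS_CONS = re.compile(r"^(#=GC\s+SS_cons\s+)(\S+)(\s*)$")


def replace_unpaired(line, old=".", new="_"):
    """Swap the unpaired-base character on an SS_cons line only."""
    match = SS_CONS.match(line)
    if match is None:
        return line  # sequence lines and other annotation pass through
    tag, structure, tail = match.groups()
    return tag + structure.replace(old, new) + tail
```

Applying it line by line to an alignment file leaves everything except the consensus structure unchanged.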
Andreas put the dataset he created (see Andreas Heger LJOrthologs For Yuri) in /mnt/nfs/users/aheger/yuri_malis.tgz. There are three fasta files for each cluster:
- raw_xxx.fasta: the alignment of the exons as output by dialign/muscle
- cleaned_xxx.fasta: the alignment of the exons after some cleaning. He used Gblocks to clean up the alignment.
- extended_xxx_1.fasta: the sequences of the mRNAs plus 1kb on either end.
Andreas described his gene prediction pipeline using Exonerate:
- Pairwise orthology assignment using normalized bitscores (verified using dS, the synonymous substitution rate)
- Cluster into orthologous groups
- Building multiple alignments
  - Use dialign -nt for < 50 nt
  - Use muscle for 50-500 nt
  - Don't align for > 500?
- Filter alignments to about 5000. Orthologs and paralogs.
- Phylogenetic tree estimation using Phylip and PAML
- He does three passes for gene prediction, with increasing resolution (full DP mode).
- The terminal exon is the hardest to predict.
- D. willistoni has a much lower GC content in codons than the other species: about 50% rather than 70%.
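The length rule in the alignment step above can be written down as a small selector. This is only a sketch of the thresholds as noted, not code from Andreas's pipeline: the function name and return labels are hypothetical, and the "no alignment above 500 nt" branch carries the same question mark as the note.

```python
def choose_aligner(length_nt):
    """Pick an aligner label from the sequence length in nucleotides."""
    if length_nt < 0:
        raise ValueError("length must be non-negative")
    if length_nt < 50:
        return "dialign"  # run as 'dialign -nt' according to the notes
    if length_nt <= 500:
        return "muscle"
    return None  # above 500 nt: possibly not aligned at all (unconfirmed)
```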
Meeting with Ian.
- Research plan
- ncRNA pipeline
- incorporate discriminative model (algorithm dev)
- Pipeline Models (build for as many elements as possible)
- protein coding genes: exons, introns, splice sites, UTRs (RNA structure inside introns)
- noncoding RNA
- miRNA (stems evolve slower b/c they bind to mRNAs)
- other ncRNA (evolve faster)
- conserved non-genic DNA
- intergenic (neutral model)
- Take into account different GC content
- Splice site model
- Stratasplice = weight matrix with different GC content
- For example, have models with high GC, low GC, and high CpG
- Write a wrapper script that creates one grammar file incorporating the user selected models.
- Mike Eisen's group did protein coding gene alignments. Contact them or Andreas. Also get flanking DNA regions.
- Papers to reference: QRNA, EvoFold, RNAz
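To make the GC-content idea in the plan above concrete, here is a rough sketch of how a window might be assigned to a high GC, low GC, or high CpG model. The thresholds and model labels are hypothetical placeholders, not values agreed in the meeting. The CpG measure is the usual observed/expected ratio, (number of CG dinucleotides x length) / (number of C x number of G).

```python
def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases."""
    seq = seq.upper()
    total = sum(seq.count(base) for base in "ACGT")
    if total == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / total


def cpg_obs_exp(seq):
    """Observed/expected CpG ratio for the sequence."""
    seq = seq.upper()
    c_count, g_count = seq.count("C"), seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c_count * g_count)


def pick_model(seq, high_gc=0.55, high_cpg=0.6):
    """Choose a model label; both thresholds are hypothetical."""
    if cpg_obs_exp(seq) >= high_cpg:
        return "high_cpg"
    return "high_gc" if gc_fraction(seq) >= high_gc else "low_gc"
```

A wrapper script could use labels like these to decide which sub-models go into the combined grammar file.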
Short discussion on CONTRAfold. It uses conditional random fields (CRFs). An HMM computes P(seq, annot) while a CRF computes P(annot|seq). An HMM is a generative model while a CRF is a discriminative model (see the article by Ng and Jordan). In the CRF state diagram the arrows point from the emissions to the hidden states (?), the opposite direction from an HMM.
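The difference between the two quantities can be seen on a toy example. The sketch below uses a made-up two-state HMM (all probabilities are invented for illustration) and gets P(annot|seq) from the joint by brute-force normalization over all annotations. A CRF would instead parameterize the conditional directly, without modelling P(seq).

```python
from itertools import product

# Toy parameters, invented for illustration only.
STATES = ("H", "L")
START = {"H": 0.5, "L": 0.5}
TRANS = {"H": {"H": 0.7, "L": 0.3}, "L": {"H": 0.4, "L": 0.6}}
EMIT = {
    "H": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "L": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def joint(seq, annot):
    """P(seq, annot) under the HMM (the generative quantity)."""
    prob = START[annot[0]] * EMIT[annot[0]][seq[0]]
    for i in range(1, len(seq)):
        prob *= TRANS[annot[i - 1]][annot[i]] * EMIT[annot[i]][seq[i]]
    return prob


def conditional(seq, annot):
    """P(annot | seq) = P(seq, annot) / sum over annot' of P(seq, annot')."""
    total = sum(joint(seq, a) for a in product(STATES, repeat=len(seq)))
    return joint(seq, annot) / total
```

The joint sums to 1 over all (seq, annot) pairs of a given length, whereas the conditional sums to 1 over annotations for each fixed sequence.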
The GTJ-trained grammar tr02.eg now has avg substitution rates of [1.02 0.944 1.06], or [0.959 0.887 1.00] after rescaling. The rates cited in the paper were [1.027 0.775 1.00]. This is much better than before, when the rates were about 0.2; rescaling nullprot3.eg to 1 had a big effect. EM converged in 23 iterations versus ~150 before.
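The rescaling of the per-category rates amounts to dividing by the rate of a reference category. A minimal sketch follows, assuming the third entry is the category pinned to 1 (the notes imply this but do not say so); the function name is hypothetical.

```python
def rescale_rates(rates, reference_index):
    """Divide all rates by the reference category's rate."""
    reference = rates[reference_index]
    if reference <= 0:
        raise ValueError("reference rate must be positive")
    return [rate / reference for rate in rates]


rescaled = rescale_rates([1.02, 0.944, 1.06], reference_index=2)
```

Applied to the rounded rates above, this gives values close to the rescaled figures quoted; small differences are expected because the quoted averages are themselves rounded.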
The poster should focus on the xrate family of programs: xgram, xfold, xprot. xrate uses single-column rate matrices. Andrew will provide content on his ncRNA runs. I'll discuss my xprot and xfold runs. Try to do an RNAz scan for comparison with window=120, overlap=40. Redo the xfold scan with smaller windows if possible.
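For reference, the window layout for such a scan can be sketched as follows. This reads overlap=40 literally, so consecutive windows share 40 columns and the step is window - overlap = 80; that reading is an assumption. Coordinates are 0-based and half-open, and the function name is hypothetical.

```python
def sliding_windows(length, window=120, overlap=40):
    """Yield (start, end) windows, 0-based half-open, covering 0..length."""
    if not 0 <= overlap < window:
        raise ValueError("need 0 <= overlap < window")
    step = window - overlap
    start = 0
    while start < length:
        end = min(start + window, length)
        yield (start, end)
        if end == length:
            break  # last window reached the end of the alignment
        start += step
```

The final window is truncated if the alignment length is not covered exactly.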
The avg substitution rate of nullprot3.eg was about 0.5; Ian rescaled it to 1.0. Need to redo training and data using these rates. Rates from training on the GTJ db should also be rescaled so that the avg substitution rate for the loop category = 1, as cited in the paper.
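For a single rate matrix, rescaling to an average rate of 1 can be sketched as below. It uses the standard definition of the average substitution rate, -sum_i pi_i Q_ii, with pi the equilibrium distribution; whether xrate computes it exactly this way is an assumption. The 4x4 matrix is a toy with average rate 0.5, not the actual nullprot3.eg matrix.

```python
def average_rate(rate_matrix, equilibrium):
    """Expected substitutions per unit time: -sum_i pi_i * Q_ii."""
    return -sum(p * rate_matrix[i][i] for i, p in enumerate(equilibrium))


def rescale_matrix(rate_matrix, equilibrium, target=1.0):
    """Scale every entry so the average rate equals target."""
    factor = target / average_rate(rate_matrix, equilibrium)
    return [[entry * factor for entry in row] for row in rate_matrix]


# Toy symmetric matrix: each row sums to 0, average rate 0.5.
OFF = 0.5 / 3
TOY_Q = [[-0.5 if i == j else OFF for j in range(4)] for i in range(4)]
TOY_PI = [0.25, 0.25, 0.25, 0.25]
RESCALED_Q = rescale_matrix(TOY_Q, TOY_PI)
```

Multiplying the whole matrix by one factor leaves the equilibrium distribution unchanged; it only changes the time unit.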
I was noticing slow download rates of 20-40 kps from babylon. Andrew's suggestions for quicker connections to the cluster: try cp to /tmp on babylon before sftp to avoid NFS traffic (use this to benchmark download speed). The home directory is on lorien, so logging in there could speed up editing files. Also, the TRAMP Emacs module allows local editing over an ssh connection.
log_rate_em_counts = number of mutations observed. If some counts are near 0, more training data is needed.
-- Yuri Bendana - 18 May 2006