Pfold vs. RFAM-trained parameters
Below are bubble plots of the dinucleotide rate matrix estimated from pairwise alignments and used by Pfold, along with the analogous rate matrix estimated from RFAM multiple alignments.
Note that the RFAM-trained parameters assign significantly larger probabilities to non-canonical basepairs, which appears to hurt the performance of the RFAM-trained rates at predicting ncRNAs in multi-genome-alignment screens.
I don't think the difference is an XRATE error; I have reason to think it reflects a real difference between the RFAM structural alignments and the training data that Bjarne Knudsen used (derived from a merge of the Bayreuth tRNA database of Sprinzl et al. and the LSU rRNA database of De Rijk et al.).
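One rough way to check this sort of difference directly on an alignment, independent of any rate estimation, is to count how often the annotated basepair columns hold canonical versus non-canonical pairs. The sketch below is only an illustration: the function names are hypothetical, it assumes a `<`/`>`/`.` consensus structure line with one aligned sequence per string, and it treats G-U wobble pairs as canonical, which is a choice rather than something stated above.

```python
# Hypothetical helper: tally canonical vs. non-canonical basepairs in an
# alignment under a single consensus structure. Not part of XRATE or Pfold.

# Assumption: Watson-Crick pairs plus G-U wobble count as canonical.
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}
NUCLEOTIDES = set("ACGU")


def paired_columns(consensus):
    """Return a list of (i, j) column pairs from a <.> structure string."""
    stack, pairs = [], []
    for col, ch in enumerate(consensus):
        if ch == "<":
            stack.append(col)
        elif ch == ">":
            pairs.append((stack.pop(), col))  # IndexError if unbalanced
    return pairs


def count_pairs(consensus, sequences):
    """Return (canonical, noncanonical) counts over all rows and paired columns.

    Pairs where either position is a gap or an ambiguity code are skipped.
    """
    canonical = noncanonical = 0
    for seq in sequences:
        row = seq.upper().replace("T", "U")
        for i, j in paired_columns(consensus):
            a, b = row[i], row[j]
            if a not in NUCLEOTIDES or b not in NUCLEOTIDES:
                continue
            if a + b in CANONICAL:
                canonical += 1
            else:
                noncanonical += 1
    return canonical, noncanonical


if __name__ == "__main__":
    structure = "<<<....>>>"
    rows = ["GGCAAAAGCC", "GACAAAAGAC"]
    print(count_pairs(structure, rows))
```

Comparing these counts between two training sets would show whether the extra non-canonical mass is already present in the annotated alignments before any parameters are fitted.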
The difference in the rates could be down to a couple of things:
- According to Knudsen & Hein, "RNA secondary structure prediction using stochastic context-free grammars and evolutionary history", Bioinformatics 1999;15:446-54 (p. 449), the Bayreuth tRNA database contained no noncanonical basepairs, so they added them by assuming that single-base symmetric double-strand bulges were actually basepairs, i.e. <.<....>.> would be converted to <<<....>>>
- RFAM imposes a single consensus structure on all members of a family, so the potential for misannotated basepairs in a very large family is rather big.
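The bulge-filling conversion described in the first point can be sketched as follows. This is my own minimal reading of the rule, not Knudsen & Hein's code: the function names are hypothetical, and "single-base symmetric bulge" is taken to mean one unpaired position on each strand sitting directly between two nested basepairs.

```python
# Hypothetical sketch of the <.<....>.> to <<<....>>> conversion.
# Assumption: a bulge is filled only when exactly one unpaired base on each
# strand separates an outer pair (i, j) from an inner pair (i+2, j-2).


def pair_table(structure):
    """Map each paired position to its partner in a <.> structure string."""
    stack, partner = [], {}
    for pos, ch in enumerate(structure):
        if ch == "<":
            stack.append(pos)
        elif ch == ">":
            left = stack.pop()  # IndexError if the structure is unbalanced
            partner[left] = pos
            partner[pos] = left
    return partner


def fill_symmetric_bulges(structure):
    """Turn single-base symmetric bulges into basepairs."""
    partner = pair_table(structure)
    out = list(structure)
    for i, j in partner.items():
        if i >= j:
            continue  # visit each pair once, from its left end
        inner_is_paired = partner.get(i + 2) == j - 2
        gaps_unpaired = structure[i + 1] == "." and structure[j - 1] == "."
        if inner_is_paired and gaps_unpaired:
            out[i + 1] = "<"
            out[j - 1] = ">"
    return "".join(out)


if __name__ == "__main__":
    print(fill_symmetric_bulges("<.<....>.>"))
```

Asymmetric bulges (an unpaired base on one strand only) are left untouched under this reading, since there is no partner position to pair with.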
Pfold probabilities & rates
- Pfold parameters.
XRATE-estimated probabilities & rates (from RFAM)
- RFAM-trained parameters (from alignments annotated as having "published" secondary structures, as opposed to "predicted").
XRATE-estimated probabilities & rates (from CONSAN training set)
- CONSAN mix80-trained parameters
-- Ian Holmes - 31 Oct 2007