My project combines the TKF91 model with xprot's evolutionary HMMs to build and assess multiple alignments. By cutting a branch of the phylogenetic tree to split it into two segments, and then generating a pairwise alignment between them, one can slowly improve the accuracy of the combined multiple alignment. Moves will be generated by randomly sampling an alignment using TKF, then scoring it against xprot. Moves will be accepted or rejected based on the Hastings ratio of the new alignment to the old.
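The accept/reject step described above can be sketched as a standard Metropolis-Hastings test in log space. This is only an illustration: the function name `accept_move` and its parameters are hypothetical and are not part of tkfalign or xprot, and it assumes the xprot score plays the role of the target log-likelihood while the TKF proposal supplies the forward and reverse proposal log-probabilities.

```python
import math
import random


def accept_move(log_score_new, log_score_old,
                log_q_forward=0.0, log_q_reverse=0.0, rng=random):
    """Metropolis-Hastings acceptance test in log space (hypothetical helper).

    log_score_new / log_score_old: log-likelihoods of the proposed and current
        alignments (assumed here to come from the scoring model).
    log_q_forward: log-probability of proposing new from old.
    log_q_reverse: log-probability of proposing old from new.
    With both proposal terms left at 0.0 this reduces to the Metropolis rule.
    """
    log_ratio = (log_score_new - log_score_old) + (log_q_reverse - log_q_forward)
    if log_ratio >= 0.0:
        return True  # Hastings ratio >= 1: always accept
    # Otherwise accept with probability equal to the Hastings ratio.
    return rng.random() < math.exp(log_ratio)
```

Working in log space avoids underflow, since alignment likelihoods are typically far too small to represent directly as probabilities.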
xprot will require training first. I plan to train against a subset of sequences from BaliBase or the HomstradDatabase and then validate the method by testing against new sequences from that same database.
The major goals of the project include determining whether the method is computationally viable and, if so, which parameters work best in the evolutionary HMM and in MCMC move generation and acceptance.
-- Ryan Ritterson - 17 Nov 2005
I separated the HOMSTRAD database into ~800 training sequences and ~200 test sequences in order to test each alignment model separately. After running each alignment under tkfalign only, as well as under tkfalign+xprot combined, I found the sum-of-pairs score for the combined model to be 0.02 lower (indicating that the combined model correctly aligned 2% fewer pairwise residues than the tkfalign-only model).
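For readers unfamiliar with the metric, a sum-of-pairs score can be sketched as the fraction of residue pairs aligned in the reference alignment that are also aligned in the test alignment. This is a minimal illustration under that assumed definition, not the scoring code actually used for these runs; the function names are hypothetical, and both alignments are assumed to list the same sequences in the same order as gapped strings of equal length.

```python
from itertools import combinations


def aligned_pairs(alignment, gap="-"):
    """Return the set of residue pairs that share a column.

    Each residue is identified as (row index, position in the ungapped sequence).
    """
    positions = [0] * len(alignment)  # next ungapped position for each row
    pairs = set()
    for col in range(len(alignment[0])):
        residues = []
        for row, seq in enumerate(alignment):
            if seq[col] != gap:
                residues.append((row, positions[row]))
                positions[row] += 1
        # Every pair of residues in the same column counts as aligned.
        pairs.update(combinations(residues, 2))
    return pairs


def sum_of_pairs_score(test, reference):
    """Fraction of reference-aligned residue pairs reproduced by the test alignment."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        return 0.0
    return len(ref_pairs & aligned_pairs(test)) / len(ref_pairs)
```

Under this definition, a drop of 0.02 means that 2% fewer of the reference residue pairs were recovered.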
I initially suspected that the MCMC move acceptance rate was too low or too high, reducing the effectiveness of the model. However, after re-running samples with more logging (and obtaining similar results), I found that the acceptance rate was approximately 0.5, which is an acceptable rate. Other possible explanations for the score disparity came to light. It is possible that xprot was overtrained by using 80% of the database for training the HMM and thus generalized poorly to the test sequences. It is also possible that sheer random sampling accounted for the score disparity (as 0.02 is a very small score difference). I believe that the result was actually a combination of using too much of the database for training and using too few MC sample steps. However, the computational time required to test both of those hypotheses was beyond the time allotted for the project, and thus I was unable to investigate them (I also began to have several problems with the computational nodes used to run jobs in the first place, which also had an effect on my data-gathering efficiency).
However, I did discover that combining the two models required very little manual human labor; the vast majority of the extra work was in the computational run time. Thus the approach was shown to be computationally viable, if suboptimal for this particular combination of models. The next steps in the project, were it to be investigated further, would be to attempt to optimize the parameters of the combined model to see how much, if any, improvement can be made in the alignments. I firmly believe the approach of using two cooperative models is a good one. It would also be interesting to combine three or more models and use a democratic voting process to decide where to proceed next (based on a consensus likelihood, or on weighting each model's relative power in choosing whether to accept or reject moves, etc.). It would also be interesting to integrate the external likelihood estimator program more fully, such that it is called at every Monte Carlo move instead of every 100 or so samples when tkfalign is storing a total alignment.
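One way the weighted "consensus likelihood" idea might look is sketched below. This is purely hypothetical and was not implemented in the project: each model contributes a log-likelihood for the same alignment, and the consensus is their weighted average, which corresponds to a weighted geometric mean of the likelihoods. The function name and the choice of weights are assumptions.

```python
def consensus_log_likelihood(log_likelihoods, weights):
    """Weighted average of per-model log-likelihoods (hypothetical helper).

    log_likelihoods: one log-likelihood per model for the same alignment.
    weights: non-negative relative voting power of each model.
    """
    if len(log_likelihoods) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * ll for w, ll in zip(weights, log_likelihoods)) / total
```

The consensus value for the new and old alignments could then be compared in the same accept/reject test used for a single model, with equal weights giving each model an equal vote.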
-- Ryan Ritterson - 15 Dec 2005
: as Kaspar suggested in class, information on the move acceptance rate would be useful.