Yb Xgram Paper
Latest revision as of 23:43, 1 January 2017
Status
*Task*  *Description*  *Priority*  *Status*  *Comments*
1  Calc probs and likelihoods  ?  Done (100%)  
2  Training on GTJ DB  ?  Done (99%)  Align models.dat? 
3  GTJ rates  ?  Done (100%)  
4  Document  High  Done 
Task Details
 Compute probs and likelihoods
 Xprot probs (use sum score and max score flags) of the form P(A,D | params_trained_on_T) where
 A={DSSP annotated, unannotated}
 D={GTJ_training_data, GTJ_test_set, Homstrad_training_data, Homstrad_test_set}
 T={GTJ_training_data (actual and derived), Homstrad_training_data}
 Also calculate posterior probability P(A | D,Theta) = P(A,D | Theta)/P(D | Theta)
 NB: record both SC_max and SC_sum for later analysis. -- IH
 Create script to split training database into separate files, calc tree using nullprot.eg, and merge back into one database.
 Remove ghf10 scop family from complete Homstrad db and train on the remainder. Verify using BLAST that the test set has <30% identity with this training set.
 Remove ghf17, ghf5
 Train on Homstrad with minincr=0.0001.
 Create script that calcs sum scores for test cases.
 Get GTJ log likelihood for each run.
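The posterior listed above, P(A | D,Theta) = P(A,D | Theta)/P(D | Theta), is easiest to evaluate in log space, where the ratio becomes a difference. The sketch below assumes the annotated and unannotated scores have already been converted to log2-likelihoods; the function name and the example numbers are hypothetical, and whether xprot reports scores as log-likelihoods or as negated bit costs is not stated on this page, so the sign may need flipping first.

```python
def log2_posterior(log2_joint: float, log2_marginal: float) -> float:
    """Return log2 P(A | D, Theta) = log2 P(A, D | Theta) - log2 P(D | Theta).

    log2_joint    : log2-likelihood of the annotated alignment (assumed input)
    log2_marginal : log2-likelihood of the unannotated alignment (assumed input)
    """
    # A joint probability can never exceed the corresponding marginal.
    if log2_joint > log2_marginal + 1e-9:
        raise ValueError("P(A, D | Theta) cannot exceed P(D | Theta)")
    return log2_joint - log2_marginal


# Hypothetical log2-likelihoods, not taken from the results table.
lp = log2_posterior(-105.0, -100.0)
posterior = 2.0 ** lp  # back to a probability
```

If the recorded scores turn out to be costs (negative log2-likelihoods), negate them before calling the function.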
 Diagnose why xprot training on GTJ DB doesn't give good results.
 Give Ian the parameter set (grammar file) that causes newmat to break.
 Get the EM counts.
 Calculate trees for database.
 Redo training using prot3.eg (or another?) as input.
 Create nullprot3.eg
 Run training with small increment threshold.
 Scale grammar rates so that loop rate = 1.
 Email GTJ and verify that method for reconstructing GTJ db is correct.
 Convert BRKALN.annotated to stockholm and train on this instead.
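The split-and-merge script mentioned in the task list can be sketched as follows, assuming the training database is a single Stockholm file in which each alignment ends with a `//` line. The function names are hypothetical. The tree calculation with nullprot.eg would run on each record between the two steps; it is not shown because the xprot command line is not given on this page.

```python
from typing import List


def split_stockholm(text: str) -> List[str]:
    """Split a multi-alignment Stockholm database into one string per alignment."""
    records: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        current.append(line)
        if line.strip() == "//":  # "//" terminates one alignment
            records.append("\n".join(current) + "\n")
            current = []
    # Tolerate a final alignment that lacks its terminator.
    if any(l.strip() for l in current):
        records.append("\n".join(current) + "\n//\n")
    return records


def merge_stockholm(records: List[str]) -> str:
    """Concatenate per-alignment records back into one database."""
    return "".join(records)


# Toy database with two alignments (made-up sequences).
db = (
    "# STOCKHOLM 1.0\nseq1 ACDE\nseq2 ACDF\n//\n"
    "# STOCKHOLM 1.0\nseq3 GHIK\nseq4 GHIL\n//\n"
)
parts = split_stockholm(db)
merged = merge_stockholm(parts)
```

Keeping the records as whole strings means any per-alignment annotation lines (for example a tree added in the middle step) survive the merge unchanged.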
 Id GTJ rates
 Extract parameters from GTJ code/data and put into xprot grammar file.
 Examine code/data and verify with Nick and Jeffrey the rate matrix calculation.
 Create perl script to extract rates and put in xgram format.
 Scale grammar rates so that loop rate = 1.
 Run xprot with this grammar file and see if results match GTJ.
 Document
Results
$ Hom1: alphabeta barrel class minus ghf10 scop family
$ Hom2: complete db minus ghf10 scop family
$ Hom3: Hom2 with minincr=0.0001
$ Hom4: Hom2 with minincr=0.00001
$ GTJ1: derived GTJ db using models.dat and brkaln directory
$ GTJ2: actual GTJ parameters
$ GTJ3: derived GTJ db using brkaln.annotated
*Run*  *Annot?*  *Data*  *Training*  *SC_max*  *SC_sum*  *%Acc* 
1  N  ghf10  Hom1  5034  4962  68.1 
2  N  Hom1  Hom1  163081  
3  N  psefl  Hom1  4880  4821  63.8 
6  Y  Hom1  Hom1  173966  
9  N  ghf10  GTJ1  5128  5078  42.0 
11  N  psefl  GTJ1  4863  4811  41.7 
12  N  GTJ1  GTJ1  1909113  
16  Y  GTJ1  GTJ1  2146768  
17  N  ghf10  GTJ2  5114  5049  65.4 
19  N  psefl  GTJ2  5077  5018  65.7 
18  N  Hom1  GTJ2  166780  
20  N  GTJ1  GTJ2  2639207  
22  Y  Hom1  GTJ2  177080  
24  Y  GTJ1  GTJ2  2797797  
41  N  Hom2  GTJ2  2562795  
42  Y  Hom2  GTJ2  2719021  
43  N  GTJ3  GTJ2  
44  Y  GTJ3  GTJ2  
25  N  ghf10  Hom2  5084  5018  68.4 
26  N  Hom2  Hom2  2555107  
27  N  psefl  Hom2  5008  4947  64.1 
30  Y  Hom2  Hom2  2702371  
33  N  ghf10  Hom3  5067  4998  60.4 
34  N  Hom3  Hom3  2551162  
35  N  psefl  Hom3  4997  4940  62.1 
36  Y  Hom3  Hom3  2699709  
37  N  ghf10  Hom4  5077  5005  58.5 
38  N  Hom4  Hom4  2557615  
39  N  psefl  Hom4  5045  4988  57.3 
40  Y  Hom4  Hom4  2706292  
47  N  ghf10  GTJ3  ?  
48  N  psefl  GTJ3  ?  
45  N  GTJ3  GTJ3  
46  Y  GTJ3  GTJ3 
Questions
Archived Questions
 Should I drop the gapfiltered test cases since GTJ doesn't remove gappy columns?
 Yes. -- IH
 prot3.eg: In rate matrix, r > n is missing and n > {} is missing 2
 Zero rates are omitted from the grammar file. If you want to prevent this behavior, one workaround is to use a parametric model instead. -- IH 6/23/2006
 Use nullprot.eg to derive tree for test alignment or let xprot use trained grammar instead?
 Use nullprot.eg. Otherwise xprot will ask you for a grammar file for tree estimation (7/13/2006). This supersedes the earlier answer, that xprot would use the first rate matrix of the input grammar to calculate the tree.
 How to do scaling of rate matrices as described by Jeff?
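 Avg rate of mutation for a category = -Sum_i(e(i) * R(i,i)), where e is the equilibrium frequency of amino acids for the category and R is the rate matrix for the category (the diagonal of R is negative, hence the minus sign). Weight the per-category rates by Psi, the equilibrium frequency of the categories, and scale all rate matrices by the same factor so that the overall avg rate = 1.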
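A minimal sketch of this scaling, assuming each rate matrix has rows summing to zero (so the diagonal entries are negative) and that the same factor is applied to every category. The category names, the two-letter alphabet, and all numbers are made up for illustration.

```python
from typing import Dict, List

Matrix = List[List[float]]


def mean_rate(freqs: List[float], rate: Matrix) -> float:
    """Expected substitution rate of one category: -sum_i e(i) * R(i,i)."""
    return -sum(e * rate[i][i] for i, e in enumerate(freqs))


def overall_rate(psi: Dict[str, float],
                 freqs: Dict[str, List[float]],
                 rates: Dict[str, Matrix]) -> float:
    """Category-weighted mean rate: sum_c Psi(c) * mean_rate(c)."""
    return sum(psi[c] * mean_rate(freqs[c], rates[c]) for c in psi)


def scale_to_unit_rate(psi: Dict[str, float],
                       freqs: Dict[str, List[float]],
                       rates: Dict[str, Matrix]) -> Dict[str, Matrix]:
    """Divide every rate matrix by the overall mean rate."""
    factor = overall_rate(psi, freqs, rates)
    return {c: [[x / factor for x in row] for row in rates[c]] for c in psi}


# Toy two-letter alphabet with two hypothetical categories.
psi = {"helix": 0.5, "loop": 0.5}
freqs = {"helix": [0.5, 0.5], "loop": [0.5, 0.5]}
rates = {
    "helix": [[-1.0, 1.0], [1.0, -1.0]],
    "loop": [[-3.0, 3.0], [3.0, -3.0]],
}
scaled = scale_to_unit_rate(psi, freqs, rates)
check = overall_rate(psi, freqs, scaled)
```

Dividing by `mean_rate` of the loop category alone, instead of by the weighted overall rate, would give the "loop rate = 1" variant listed under Task Details.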
 prot3.eg: Where did these rates originally come from?
 Not sure; probably fairly ad hoc, e.g. all simple scalar multiples of some generic AA substitution matrix(?). It's probably not worth using parameters whose provenance is unknown, like these. In fact using them as a seed is even a little suspect: we want our procedure to be as reproducible as possible & to use as little prior information as possible. -- IH
 How to calculate equilibrium distribution of secondary structure categories for GTJ db?
 T(i,j) is the transition probability from phyloHMM state i to state j. Seek the vector q of equilibrium phyloHMM state probabilities, normalized so that Sum_i(q(i)) = 1. q is a left eigenvector of T with eigenvalue = 1.
 Sum_i(q(i)*T(i,j)) = q(j) <=> q*T = q
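One simple way to find q numerically is power iteration: start from a uniform vector and apply q <- q*T until it stops changing. This is a sketch rather than the method actually used for the GTJ db; it assumes T is row-stochastic and that the chain is irreducible and aperiodic, so special start or end states would need to be handled first. The function name and the 2-state matrix are hypothetical.

```python
from typing import List


def stationary_distribution(T: List[List[float]],
                            tol: float = 1e-12,
                            max_iter: int = 100000) -> List[float]:
    """Left eigenvector of T with eigenvalue 1, found by power iteration."""
    n = len(T)
    q = [1.0 / n] * n  # uniform starting guess
    for _ in range(max_iter):
        # One step of q <- q * T (row vector times matrix).
        nxt = [sum(q[i] * T[i][j] for i in range(n)) for j in range(n)]
        total = sum(nxt)
        nxt = [x / total for x in nxt]  # renormalize against rounding drift
        if max(abs(a - b) for a, b in zip(q, nxt)) < tol:
            return nxt
        q = nxt
    raise RuntimeError("power iteration did not converge")


# Hypothetical 2-state transition matrix; rows sum to 1.
T = [[0.9, 0.1],
     [0.5, 0.5]]
q = stationary_distribution(T)
```

For this toy matrix the balance condition 0.1*q(0) = 0.5*q(1) gives q = (5/6, 1/6), which is a quick check on the result.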
 -- Yuri Bendana - 22 Jun 2006