Yb Xgram Paper

From Biowiki


| *Task* | *Description* | *Priority* | *Status* | *Comments* |
| 1 | Calc probs and likelihoods | ? | Done (100%) | |
| 2 | Training on GTJ DB | ? | Done (99%) | Align models.dat? |
| 3 | GTJ rates | ? | Done (100%) | |
| 4 | Document | High | Done | |

Task Details

  1. Compute probs and likelihoods
    • [done] Xprot probs (use sum score and max score flags) of the form P(A,D|params_trained_on_T), where
      • A={DSSP annotated, unannotated}
      • D={GTJ_training_data, GTJ_test_set, Homstrad_training_data, Homstrad_test_set}
      • T={GTJ_training_data (actual and derived), Homstrad_training_data}
      • Also calculate posterior probability P(A|D,Theta) = P(A,D|Theta)/P(D|Theta)

* NB record both SC_max and SC_sum for later analysis - IH

    • [done] Create script to split training database into separate files, calc tree using nullprot.eg, and merge back into one database.
    • [done] Remove ghf10 scop family from complete Homstrad db and train on it. Verify using BLAST that test set has <30% identity with this training set.
      • Remove ghf17, ghf5
    • [done] Train on Homstrad with minincr=0.0001.
    • [done] Create script that calcs sum scores for test cases.
    • Get GTJ log likelihood for each run.
  2. Diagnose why xprot training on GTJ DB doesn't give good results.
    • [done] Give Ian the parameter set (grammar file) that causes newmat to break.
    • [done] Get the EM counts.
    • [done] Calculate trees for database.
    • [done] Redo training using prot3.eg (or another?) as input.
    • [done] Create nullprot3.eg
    • [done] Run training with small increment threshold.
    • [done] Scale grammar rates so that loop rate = 1.
    • [done] Email GTJ and verify that method for reconstructing GTJ db is correct.
    • Convert BRKALN.annotated to stockholm and train on this instead.
  3. Identify GTJ rates
    • [done] Extract parameters from GTJ code/data and put into xprot grammar file.
      • Examine code/data and verify the rate matrix calculation with Nick and Jeffrey.
      • Create perl script to extract rates and put in xgram format.
    • [done] Scale grammar rates so that loop rate = 1.
    • [done] Run xprot with this grammar file and see if results match GTJ.
  4. Document
    • [done] Fold the tables into one table.
    • Condense document text to a couple of paragraphs. Use the terms introduced in the main body of the paper, such as phylo-HMM.
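
The rate-scaling step listed under tasks 2 and 3 could look roughly like the sketch below. It assumes the average substitution rate of a category is -Sum_i(e(i) * R(i,i)), with e the equilibrium frequencies and R the rate matrix, and that all categories are divided by one common factor. The two-state matrices, the weights `psi`, and the function names are hypothetical illustrations, not values or code from xprot or the GTJ data.

```python
def average_rate(eq_freqs, rate_matrix):
    """Expected substitution rate of one category: -sum_i e(i) * R(i,i)."""
    return -sum(e * rate_matrix[i][i] for i, e in enumerate(eq_freqs))


def scale_all(categories, factor):
    """Divide every rate in every category by the same factor."""
    return [(e, [[r / factor for r in row] for row in R]) for e, R in categories]


# Hypothetical toy categories: (equilibrium frequencies, rate matrix).
categories = [
    ([0.5, 0.5], [[-1.0, 1.0], [1.0, -1.0]]),
    ([0.25, 0.75], [[-3.0, 3.0], [1.0, -1.0]]),
]
psi = [0.5, 0.5]  # hypothetical equilibrium frequencies of the categories

# Weighted average rate over categories, then rescale so that it equals 1.
overall = sum(w * average_rate(e, R) for w, (e, R) in zip(psi, categories))
scaled = scale_all(categories, overall)

for (e, R), w in zip(scaled, psi):
    print(f"weight {w}: average rate after scaling = {average_rate(e, R):.3f}")
```

If "loop rate = 1" means that the loop category alone should have unit average rate, the same `scale_all` call would be given `average_rate` of the loop category as its factor instead of `overall`; which normalization was actually used is not stated here.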


$ Hom1: alpha-beta barrel class minus ghf10 scop family
$ Hom2: complete db minus ghf10 scop family
$ Hom3: Hom2 with minincr=0.0001
$ Hom4: Hom2 with minincr=0.00001
$ GTJ1: derived GTJ db using models.dat and brkaln directory
$ GTJ2: actual GTJ parameters
$ GTJ3: derived GTJ db using brkaln.annotated

*Run* *Annot?* *Data* *Training* *SC_max* *SC_sum* *%Acc*
1 N ghf10 Hom1 -5034 -4962 68.1
2 N Hom1 Hom1 -163081
3 N psefl Hom1 -4880 -4821 63.8
6 Y Hom1 Hom1 -173966
9 N ghf10 GTJ1 -5128 -5078 42.0
11 N psefl GTJ1 -4863 -4811 41.7
12 N GTJ1 GTJ1 -1909113
16 Y GTJ1 GTJ1 -2146768
17 N ghf10 GTJ2 -5114 -5049 65.4
19 N psefl GTJ2 -5077 -5018 65.7
18 N Hom1 GTJ2 -166780
20 N GTJ1 GTJ2 -2639207
22 Y Hom1 GTJ2 -177080
24 Y GTJ1 GTJ2 -2797797
41 N Hom2 GTJ2 -2562795
42 Y Hom2 GTJ2 -2719021
43 N GTJ3 GTJ2
44 Y GTJ3 GTJ2
25 N ghf10 Hom2 -5084 -5018 68.4
26 N Hom2 Hom2 -2555107
27 N psefl Hom2 -5008 -4947 64.1
30 Y Hom2 Hom2 -2702371
33 N ghf10 Hom3 -5067 -4998 60.4
34 N Hom3 Hom3 -2551162
35 N psefl Hom3 -4997 -4940 62.1
36 Y Hom3 Hom3 -2699709
37 N ghf10 Hom4 -5077 -5005 58.5
38 N Hom4 Hom4 -2557615
39 N psefl Hom4 -5045 -4988 57.3
40 Y Hom4 Hom4 -2706292
47 N ghf10 GTJ3 ?
48 N psefl GTJ3 ?
45 N GTJ3 GTJ3
46 Y GTJ3 GTJ3
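
As a worked example of the posterior P(A|D,Theta) = P(A,D|Theta)/P(D|Theta) from task 1, the annotated (Annot? = Y) and unannotated (Annot? = N) scores for the same Data/Training pair can be subtracted in log space. The sketch below assumes that both scores in a pair are log probabilities of the same kind and in the same units; the table does not say which score column the single values belong to, and the function name is hypothetical.

```python
def log_posterior(log_joint, log_marginal):
    """Return log P(A|D,Theta) = log P(A,D|Theta) - log P(D|Theta)."""
    if log_joint > log_marginal:
        raise ValueError("joint score cannot exceed the marginal score")
    return log_joint - log_marginal


# (annotated score, unannotated score) copied from the table above.
score_pairs = {
    "Hom1 data, Hom1 training (runs 6, 2)": (-173966, -163081),
    "GTJ1 data, GTJ1 training (runs 16, 12)": (-2146768, -1909113),
    "Hom2 data, Hom2 training (runs 30, 26)": (-2702371, -2555107),
}

log_posteriors = {
    name: log_posterior(joint, marginal)
    for name, (joint, marginal) in score_pairs.items()
}
for name, value in log_posteriors.items():
    print(f"{name}: log P(A|D,Theta) = {value}")
```

The values are totals over a whole database, so they are only comparable between runs that score the same data set.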


Archived Questions

  • Should I drop the gap-filtered test cases since GTJ doesn't remove gappy columns?
    • Yes - IH
  • prot3.eg: In rate matrix, r -> n is missing and n -> {} is missing 2
    • Zero rates are omitted from the grammar file. If you want to prevent this behavior, one workaround is to use a parametric model instead. IH 6/23/2006
  • Use nullprot.eg to derive tree for test alignment or let xprot use trained grammar instead?
    • Use nullprot.eg. Otherwise xprot will ask you for a grammar file for tree estimation, rather than using the first rate matrix of the input grammar to calculate the tree (7/13/2006)
  • How to do scaling of rate matrices as described by Jeff?
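    • Avg rate of mutation for a category = -Sum_i(e(i) * R(i,i)), where e is the equilibrium frequency of amino acids for the category and R is the rate matrix for the category (the diagonal entries of R are negative, hence the minus sign). Weight each category's average rate by Psi, the equilibrium frequency of categories, and scale the rates so that the weighted average rate = 1.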
  • prot3.eg: Where did these rates originally come from?
    • Not sure - probably fairly ad hoc, e.g. all simple scalar multiples of some generic AA substitution matrix(?). It's probably not worth using parameters whose provenance is unknown, like these. In fact using them as a seed is even a little suspect: we want our procedure to be as reproducible as possible & to use as little prior information as possible. IH
  • How to calculate equilibrium distribution of secondary structure categories for GTJ db?
    • T(i,j) is the transition probability between phylo-HMM states i and j. Seek the vector q of equilibrium phylo-HMM state probabilities. q is a left eigenvector of T with eigenvalue 1, normalized so that its entries sum to 1.
      • Sum_i(q(i)*T(i,j)) = q(j) <=> q*T = q
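
The equilibrium condition q*T = q can be solved numerically by repeatedly multiplying a starting distribution by T (power iteration). This is a minimal sketch that assumes T is row-stochastic and the chain is ergodic, so the iteration converges. The three-state matrix is a hypothetical example, not the GTJ transition matrix, and a real phylo-HMM may also have start and end states that need separate handling.

```python
def stationary_distribution(T, tol=1e-12, max_iter=100000):
    """Iterate q <- q*T from the uniform distribution until q stops changing."""
    n = len(T)
    q = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(q[i] * T[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, q)) < tol:
            return nxt
        q = nxt
    raise RuntimeError("power iteration did not converge")


# Hypothetical transition probabilities between three structure states.
states = ["helix", "strand", "loop"]
T = [
    [0.90, 0.00, 0.10],
    [0.00, 0.80, 0.20],
    [0.05, 0.05, 0.90],
]

q = stationary_distribution(T)
for name, prob in zip(states, q):
    print(f"{name}: {prob:.4f}")
```

For this toy matrix the exact solution is q = (2/7, 1/7, 4/7), which can be checked by substituting into Sum_i(q(i)*T(i,j)) = q(j).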

-- Yuri Bendana - 22 Jun 2006