# Yb Xgram Paper

From Biowiki

### Status

| Task | Description | Priority | Status | Comments |
|---|---|---|---|---|
| 1 | Calc probs and likelihoods | ? | Done (100%) | |
| 2 | Training on GTJ DB | ? | Done (99%) | Align models.dat? |
| 3 | GTJ rates | ? | Done (100%) | |
| 4 | Document | High | Done | |

### Task Details

- Compute probs and likelihoods

- Xprot probs (use sum score and max score flags) of the form P(A,D|params_trained_on_T) where
- A={DSSP annotated, unannotated}
- D={GTJ_training_data, GTJ_test_set, Homstrad_training_data, Homstrad_test_set}
- T={GTJ_training_data (actual and derived), Homstrad_training_data}
- Also calculate posterior probability P(A|D,Theta) = P(A,D|Theta)/P(D|Theta)


* NB record both SC_max and SC_sum for later analysis - IH
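The posterior P(A|D,Theta) in the list above is a ratio of two probabilities, so in log space it is simply a difference of two scores. A minimal sketch in Python, assuming both inputs are sum scores (log-likelihoods) computed under the same parameters and in the same log base; the function name and the example numbers are hypothetical, not taken from any run:

```python
def log_posterior(log_joint: float, log_marginal: float) -> float:
    """Return log P(A|D,Theta) = log P(A,D|Theta) - log P(D|Theta).

    log_joint    : sum score of the annotated alignment, log P(A,D|Theta)
    log_marginal : sum score of the unannotated alignment, log P(D|Theta)
    Both are assumed to use the same log base and the same parameter set.
    """
    # The joint of (A, D) can never be more probable than D alone.
    if log_joint > log_marginal:
        raise ValueError("log P(A,D|Theta) cannot exceed log P(D|Theta)")
    return log_joint - log_marginal


# Hypothetical scores, for illustration only.
example = log_posterior(-5100.0, -5018.0)
print(example)
```

The same subtraction with max scores would not give a posterior, which is one reason to record SC_sum and SC_max separately.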

- Create script to split training database into separate files, calc tree using nullprot.eg, and merge back into one database.
- Remove ghf10 scop family from complete Homstrad db and train on it. Verify using Blast that test set has <30% id with this training set.
- Remove ghf17, ghf5
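The split and merge steps of the script described in the first item above could look roughly like the sketch below. It assumes the training database is a multi-alignment Stockholm file in which each alignment ends with a `//` line; the function names and the `aln_NNNNN.stk` file naming are hypothetical. The tree-estimation step with nullprot.eg is deliberately left out, since its exact command line is not given here.

```python
from pathlib import Path


def split_stockholm(db_path, out_dir):
    """Write each alignment of a multi-alignment Stockholm file to its own file.

    Returns the list of per-alignment paths, in database order.
    Any trailing text that is not terminated by '//' is not written out.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths, block = [], []
    with open(db_path) as fh:
        for line in fh:
            block.append(line)
            if line.strip() == "//":  # end-of-alignment marker
                path = out_dir / f"aln_{len(paths):05d}.stk"
                path.write_text("".join(block))
                paths.append(path)
                block = []
    return paths


def merge_stockholm(paths, merged_path):
    """Concatenate per-alignment files back into one database, in the given order."""
    with open(merged_path, "w") as out:
        for p in paths:
            out.write(Path(p).read_text())
```

A tree would be estimated for each split file between the two calls, and the files passed to `merge_stockholm` would then be the tree-annotated versions.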

- Train on Homstrad with minincr=0.0001.
- Create script that calcs sum scores for test cases.
- Get GTJ log likelihood for each run.

- Diagnose why xprot training on GTJ DB doesn't give good results.

- Give Ian the parameter set (grammar file) that causes newmat to break.
- Get the EM counts.
- Calculate trees for database.
- Redo training using prot3.eg (or another?) as input.
- Create nullprot3.eg
- Run training with small increment threshold.
- Scale grammar rates so that loop rate = 1.
- Email GTJ and verify that method for reconstructing GTJ db is correct.
- Convert BRKALN.annotated to stockholm and train on this instead.

- Id GTJ rates

- Extract parameters from GTJ code/data and put into xprot grammar file.
- Examine code/data and verify with Nick and Jeffrey the rate matrix calculation.
- Create perl script to extract rates and put in xgram format.

- Scale grammar rates so that loop rate = 1.
- Run xprot with this grammar file and see if results match GTJ.


- Document

### Results

- Hom1: alpha-beta barrel class minus ghf10 scop family
- Hom2: complete db minus ghf10 scop family
- Hom3: Hom2 with minincr=0.0001
- Hom4: Hom2 with minincr=0.00001
- GTJ1: derived GTJ db using models.dat and brkaln directory
- GTJ2: actual GTJ parameters
- GTJ3: derived GTJ db using brkaln.annotated

| Run | Annot? | Data | Training | SC_max | SC_sum | %Acc |
|---|---|---|---|---|---|---|
| 1 | N | ghf10 | Hom1 | -5034 | -4962 | 68.1 |
| 2 | N | Hom1 | Hom1 | -163081 | | |
| 3 | N | psefl | Hom1 | -4880 | -4821 | 63.8 |
| 6 | Y | Hom1 | Hom1 | -173966 | | |
| 9 | N | ghf10 | GTJ1 | -5128 | -5078 | 42.0 |
| 11 | N | psefl | GTJ1 | -4863 | -4811 | 41.7 |
| 12 | N | GTJ1 | GTJ1 | -1909113 | | |
| 16 | Y | GTJ1 | GTJ1 | -2146768 | | |
| 17 | N | ghf10 | GTJ2 | -5114 | -5049 | 65.4 |
| 19 | N | psefl | GTJ2 | -5077 | -5018 | 65.7 |
| 18 | N | Hom1 | GTJ2 | -166780 | | |
| 20 | N | GTJ1 | GTJ2 | -2639207 | | |
| 22 | Y | Hom1 | GTJ2 | -177080 | | |
| 24 | Y | GTJ1 | GTJ2 | -2797797 | | |
| 41 | N | Hom2 | GTJ2 | -2562795 | | |
| 42 | Y | Hom2 | GTJ2 | -2719021 | | |
| 43 | N | GTJ3 | GTJ2 | | | |
| 44 | Y | GTJ3 | GTJ2 | | | |
| 25 | N | ghf10 | Hom2 | -5084 | -5018 | 68.4 |
| 26 | N | Hom2 | Hom2 | -2555107 | | |
| 27 | N | psefl | Hom2 | -5008 | -4947 | 64.1 |
| 30 | Y | Hom2 | Hom2 | -2702371 | | |
| 33 | N | ghf10 | Hom3 | -5067 | -4998 | 60.4 |
| 34 | N | Hom3 | Hom3 | -2551162 | | |
| 35 | N | psefl | Hom3 | -4997 | -4940 | 62.1 |
| 36 | Y | Hom3 | Hom3 | -2699709 | | |
| 37 | N | ghf10 | Hom4 | -5077 | -5005 | 58.5 |
| 38 | N | Hom4 | Hom4 | -2557615 | | |
| 39 | N | psefl | Hom4 | -5045 | -4988 | 57.3 |
| 40 | Y | Hom4 | Hom4 | -2706292 | | |
| 47 | N | ghf10 | GTJ3 | ? | | |
| 48 | N | psefl | GTJ3 | ? | | |
| 45 | N | GTJ3 | GTJ3 | | | |
| 46 | Y | GTJ3 | GTJ3 | | | |

### Questions

### Archived Questions

- Should I drop the gap-filtered test cases since GTJ doesn't remove gappy columns?
- Yes - IH

- prot3.eg: In rate matrix, r -> n is missing and n -> {} is missing 2
- Zero rates are omitted from the grammar file. If you want to prevent this behavior, one workaround is to use a parametric model instead. IH 6/23/2006

- Use nullprot.eg to derive tree for test alignment or let xprot use trained grammar instead?
- Use nullprot.eg. Otherwise xprot will ~~use first rate matrix of input grammar to calculate tree~~ ask you for a grammar file for tree estimation (7/13/2006)


- How to do scaling of rate matrices as described by Jeff?
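- Avg rate of mutation for a category = -Sum_i(e(i) * R(i,i)), where e is the equilib freq of aa for the category and R is the rate matrix for the category (the diagonal entries of R are negative, hence the minus sign). Weight the per-category rates by Psi, the equilib freq of categories, and scale the matrices so that the overall avg rate = 1.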
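The scaling described above can be sketched with NumPy as follows. This is one reading of the recipe, not a confirmed implementation: it assumes each rate matrix has rows summing to zero (so the diagonal is negative and the mean rate carries a minus sign), and it assumes that "scale by Psi" means dividing every category's matrix by one common factor, the Psi-weighted mean rate. The function names and the toy two-state matrices are hypothetical.

```python
import numpy as np


def mean_rate(R, e):
    """Expected substitution rate of one category: -sum_i e[i] * R[i, i]."""
    return -float(np.dot(e, np.diag(R)))


def rescale(rate_matrices, equilibria, psi):
    """Divide all category matrices by a common factor so that the
    Psi-weighted average rate over categories equals 1."""
    overall = sum(
        p * mean_rate(R, e)
        for R, e, p in zip(rate_matrices, equilibria, psi)
    )
    return [R / overall for R in rate_matrices]


# Toy example: two categories over a two-letter alphabet.
R1 = np.array([[-1.0, 1.0], [1.0, -1.0]])
R2 = np.array([[-3.0, 3.0], [3.0, -3.0]])
e = np.array([0.5, 0.5])
psi = [0.5, 0.5]

scaled = rescale([R1, R2], [e, e], psi)
check = sum(p * mean_rate(R, e) for R, p in zip(scaled, psi))
print(check)  # overall average rate after scaling
```

Using one common factor preserves the relative speeds of the categories; rescaling each category to rate 1 separately would erase them.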

- prot3.eg: Where did these rates originally come from?
- Not sure - probably fairly ad hoc, e.g. all simple scalar multiples of some generic AA substitution matrix(?). It's probably not worth using parameters whose provenance is unknown, like these. In fact using them as a seed is even a little suspect: we want our procedure to be as reproducible as possible & to use as little prior information as possible. IH

- How to calculate equilibrium distribution of secondary structure categories for GTJ db?
- T(i,j) is the transition prob between phylo-HMM states. Seek vector q of equilibrium phylo-HMM state probs. q is a left eigenvector of T with eigenvalue = 1.
- Sum_i(q(i)*T(i,j)) = q(j) <=> q*T = q
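A minimal NumPy sketch of solving q*T = q, assuming T is a row-stochastic matrix with a single eigenvalue equal to 1 (an irreducible chain). The function name and the toy matrix are hypothetical; a left eigenvector of T is obtained as a right eigenvector of the transpose.

```python
import numpy as np


def stationary_distribution(T):
    """Return q with q @ T == q and sum(q) == 1 for a row-stochastic T."""
    T = np.asarray(T, dtype=float)
    # Left eigenvectors of T are right eigenvectors of T transposed.
    eigvals, eigvecs = np.linalg.eig(T.T)
    # Pick the eigenvalue closest to 1.
    k = np.argmin(np.abs(eigvals - 1.0))
    q = np.real(eigvecs[:, k])
    # Normalising by the sum fixes both the scale and the sign.
    return q / q.sum()


# Toy two-state transition matrix, for illustration only.
T = np.array([[0.9, 0.1],
              [0.5, 0.5]])
q = stationary_distribution(T)
print(q)
```

Summing the entries of q over the states that share a secondary structure category then gives the equilibrium distribution of categories.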


-- Yuri Bendana - 22 Jun 2006