Predicting rates for external/internal residues in proteins
Wed Sep 20, 2006 10:06 AM
Preparing the data set
Mapping protein structures onto the fly predictions.
Downloaded the latest list from
for an alternative source.
The pdb list contains 57381 entries; however, they are mapped to only 9842 entries.
For 9362 I can retrieve the sequences, and I get alignments to 3524 of these using
a strict BLASTP E-value cutoff of 1e-20. They match to 6571 transcripts (in 4064 genes)
(out of 19369 transcripts: 34% coverage). I guess I could use gtg to boost these
SUPERFAMILY matches 58% of all sequences (11314, they use the same sequence set), but
I don't think getting the alignments will be easy. The alignment view takes ages. A tab
separated file is downloadable with the coordinates of the domain and the superfamily id
and the best matching structure.
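Note to self: the BLASTP filtering and the coverage figure above could be reproduced with something like the sketch below. It assumes the standard 12-column tabular BLAST output (percent identity in column 3, E-value in column 11) and that the fly transcripts are the queries. The identifiers in the example are made up; the function names are mine.

```python
import csv
import io


def filter_blast_hits(handle, max_evalue=1e-20):
    """Yield (query, subject, pident, evalue) for hits passing the E-value cutoff.

    Assumes 12-column tabular BLAST output; comment lines start with '#'.
    """
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        query, subject = row[0], row[1]
        pident, evalue = float(row[2]), float(row[10])
        if evalue <= max_evalue:
            yield query, subject, pident, evalue


def coverage(hits, n_total):
    """Return (number of distinct queries with a hit, fraction of n_total)."""
    matched = {query for query, _, _, _ in hits}
    return len(matched), len(matched) / n_total


# Made-up example rows (hypothetical transcript and chain ids).
example = (
    "CG0001-RA\t1abc_A\t62.5\t200\t75\t0\t1\t200\t1\t200\t1e-60\t230\n"
    "CG0002-RA\t2xyz_B\t28.0\t150\t108\t2\t5\t154\t3\t150\t1e-05\t45\n"
    "CG0001-RA\t3def_A\t45.0\t180\t99\t1\t10\t189\t1\t180\t1e-30\t140\n"
)
hits = list(filter_blast_hits(io.StringIO(example)))
n_matched, fraction = coverage(hits, n_total=4)

# The coverage quoted in the notes: 6571 of 19369 transcripts.
notebook_coverage = round(100 * 6571 / 19369)
```

With the real output file, `n_total` would be the 19369 transcripts in the set.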
The taxonomy list in CATH (Gene3D) is broken, so no info there.
Anyways, remote matches I will not need, because homology modelling will be difficult.
I want to stay above the 40% identity threshold anyways, so that I can reliably map
surface residues. This is the distribution of the pides of my blast matches:
Even with 1e-20 I am way in the twilight zone of modelling and need to filter more.
- how many are left at 30%/40% threshold?
- how much overlap is there with the 1:1 ortholog set?
- get ASA from HSSP files and annotate the multiple alignments.
- build a grammar
- run xgram
- re-activate the rasmol/python scripts for plotting.
Looks like I need to spend at least a full day on preparing the data.
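For the first item on the list (how many are left at 30%/40%), a rough sketch: keep the best percent identity per transcript and count how many pass each threshold. Whether to take the best hit or some other representative per transcript is an assumption here, and the hits are made up.

```python
def best_identity_per_query(hits):
    """Map each query to the highest percent identity among its hits."""
    best = {}
    for query, pident in hits:
        if pident > best.get(query, -1.0):
            best[query] = pident
    return best


def count_at_thresholds(best, thresholds=(30.0, 40.0)):
    """Count queries whose best percent identity is at or above each threshold."""
    return {t: sum(1 for p in best.values() if p >= t) for t in thresholds}


# Made-up (query, percent identity) pairs.
hits = [("t1", 62.5), ("t1", 45.0), ("t2", 35.0), ("t3", 28.0), ("t4", 40.0)]
best = best_identity_per_query(hits)
counts = count_at_thresholds(best)
```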
Fri Sep 22, 2006 11:01 AM
Preparing the data set
Filtering predictions at the 40% identity threshold: there are 3,577 sequences left. Of these,
1,034 match the full clusters.
Matching residues between the multiple alignments and structures involves three mapping steps:
- pdb -> uniprot
- uniprot -> dmel protein
- dmel protein -> multiple alignment
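The three steps above are just a chain of position maps, so they can be composed into one pdb residue -> alignment column map. A minimal sketch, assuming each step is available as a plain dict from one numbering to the next (the dicts and the function name are hypothetical, and the numbers are made up):

```python
def compose(first, *rest):
    """Chain position maps; a residue is kept only if every step maps it."""
    result = {}
    for key, value in first.items():
        for mapping in rest:
            value = mapping.get(value)
            if value is None:
                break  # residue is lost at this step (unaligned or missing)
        else:
            result[key] = value
    return result


# Made-up maps for illustration.
pdb_to_uniprot = {1: 10, 2: 11, 3: 12}
uniprot_to_dmel = {10: 5, 11: 6}       # uniprot residue 12 is unaligned
dmel_to_column = {5: 0, 6: 3, 7: 4}

pdb_to_column = compose(pdb_to_uniprot, uniprot_to_dmel, dmel_to_column)
```

Residues that drop out at any step simply vanish from the final map, which is exactly where the coverage bias would come from.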
The problem is that the Gblocks multiple alignments are pruned so I can't use those without
reconstructing the removed residues. This is possible and I might have to do this anyways.
Alternatively, I could use the aa alignments, as they preserve all dmel residues, so in
theory the dmel protein sequence should be in it in its entirety.
I want as many residues in the structure covered as possible. Otherwise there would be
a bias: the loop region would be removed.
So it is decided: I use the aa alignments.
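For the last step (dmel protein -> multiple alignment), a sketch of how the map could be built from the dmel row of an aa alignment, with a check that the row really contains the full protein sequence. It assumes gaps are written as '-' or '.', and the sequences are made up.

```python
def residue_to_column(aligned_row, gap_chars="-."):
    """Map 0-based residue index in the ungapped sequence to alignment column."""
    mapping = {}
    residue = 0
    for column, char in enumerate(aligned_row):
        if char not in gap_chars:
            mapping[residue] = column
            residue += 1
    return mapping


def covers_full_sequence(aligned_row, sequence, gap_chars="-."):
    """True if removing gaps from the row gives back the whole sequence."""
    ungapped = "".join(c for c in aligned_row if c not in gap_chars)
    return ungapped.upper() == sequence.upper()


# Made-up alignment row and protein sequence.
row = "MK--TAY-L"
protein = "MKTAYL"

mapping = residue_to_column(row)
is_complete = covers_full_sequence(row, protein)
```

If `covers_full_sequence` fails for a row, that alignment has lost residues and the structure mapping for it would need the reconstruction step after all.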
- 20 Sep 2006