Predicting rates for external/internal residues in protein
Wed Sep 20, 2006 10:06 AM
Preparing the data set
Mapping protein structures onto the fly predictions.
Downloaded the latest list from
http://www.ebi.ac.uk/msd-srv/docs/sifts/ftp.html
See
http://www.bioinf.org.uk/pdbsws/
for an alternative source.
The PDB list contains 57381 entries, but these map to only 9842 entries.
For 9362 of them I can retrieve the sequences, and I get alignments to 3524 of these using
a strict BLASTP E-value cutoff of 1e-20. They match 6571 transcripts (in 4064 genes)
out of 19369 transcripts (34% coverage). Perhaps I could use gtg to boost these
numbers?
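A minimal sketch of the filtering and coverage count described above, assuming the BLASTP hits are available in the standard 12-column tabular format (percent identity in column 3, E-value in column 11) and that the fly transcripts are the queries. The function names and the toy rows are hypothetical and not taken from the actual pipeline.

```python
import csv
import io


def filter_hits(handle, max_evalue=1e-20):
    """Yield (query, subject, pident, evalue) for hits at or below max_evalue.

    Assumes 12-column tabular BLAST output; comment lines start with '#'.
    """
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        query, subject = row[0], row[1]
        pident, evalue = float(row[2]), float(row[10])
        if evalue <= max_evalue:
            yield query, subject, pident, evalue


def coverage(hits, n_total):
    """Return (number of distinct queries with a hit, fraction of n_total)."""
    matched = {query for query, _, _, _ in hits}
    return len(matched), len(matched) / n_total


# Toy input (made-up identifiers and scores, for illustration only).
toy = (
    "tx1\tpdbA\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-50\t300\n"
    "tx2\tpdbB\t35.0\t150\t90\t2\t1\t150\t5\t154\t1e-10\t60\n"
    "tx3\tpdbA\t42.5\t180\t100\t1\t1\t180\t1\t180\t1e-25\t120\n"
)
hits = list(filter_hits(io.StringIO(toy)))
n_matched, fraction = coverage(hits, n_total=4)
```

With the numbers from these notes, the same arithmetic gives 6571 / 19369, which rounds to 34%.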
SUPERFAMILY matches 58% of all sequences (11314; they use the same sequence set), but
I don't think getting the alignments will be easy. The alignment view takes ages. A
tab-separated file is downloadable with the coordinates of the domain, the superfamily id
and the best matching structure.
The taxonomy list in CATH (Gene3D) is broken, so no info there.
In any case, I will not need remote matches, because homology modelling would be difficult
for them. I want to stay above the 40% identity threshold anyway, so that I can reliably map
surface residues. This is the distribution of the percent identities (pide) of my BLAST matches:
Even with an E-value cutoff of 1e-20 I am well into the twilight zone of modelling and need
to filter more.
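A small sketch of how the identity distribution and the counts at a given threshold could be tabulated, assuming the hits have been reduced to (query, percent identity) pairs. Keeping only the best hit per query is my assumption here, and all names and values below are hypothetical.

```python
from collections import Counter


def best_identity_per_query(hits):
    """Keep the highest percent identity seen for each query."""
    best = {}
    for query, pident in hits:
        if pident > best.get(query, -1.0):
            best[query] = pident
    return best


def identity_histogram(pidents, bin_width=10):
    """Count values per bin; keys are lower bin edges, 100% falls in the top bin.

    Assumes bin_width divides 100 evenly.
    """
    top_bin = 100 - bin_width
    counts = Counter(
        min(int(p // bin_width) * bin_width, top_bin) for p in pidents
    )
    return dict(sorted(counts.items()))


def count_at_or_above(pidents, threshold):
    """Number of values at or above the identity threshold."""
    return sum(1 for p in pidents if p >= threshold)


# Toy data (made up, for illustration only).
toy_hits = [("tx1", 85.0), ("tx1", 60.0), ("tx2", 35.0),
            ("tx3", 42.5), ("tx4", 100.0)]
best = best_identity_per_query(toy_hits)
histogram = identity_histogram(best.values())
left_at_40 = count_at_or_above(best.values(), 40)
left_at_30 = count_at_or_above(best.values(), 30)
```

The same two counts answer the 30%/40% question once the real hit list is plugged in.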
Next steps:
- how many are left at 30%/40% threshold?
- how much overlap is there with the 1:1 ortholog set?
- get ASA from HSSP files and annotate the multiple alignments.
- build a grammar
- run xgram
- analysis
- re-activate the rasmol/python scripts for plotting.
Looks like I need to spend at least one full day on preparing the data.
Fri Sep 22, 2006 11:01 AM
Preparing the data set
Filtering predictions at the 40% identity threshold: there are 3,577 sequences left. Of these,
1,034 match the full clusters.
Matching residues between the multiple alignments and structures involves three
steps:
- pdb -> uniprot
- uniprot -> dmel protein
- dmel protein -> multiple alignment
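The three steps above amount to chaining position maps. A minimal sketch, assuming each step is available as a dictionary from residue position to residue position (or to alignment column in the last step); the toy maps are invented for illustration.

```python
def compose(first, second):
    """Chain two position maps: a -> b and b -> c give a -> c.

    Positions that cannot be followed through both maps are dropped.
    """
    return {a: second[b] for a, b in first.items() if b in second}


# Toy maps (hypothetical positions).
pdb_to_uniprot = {1: 10, 2: 11, 3: 12, 4: 13}
uniprot_to_dmel = {10: 5, 11: 6, 13: 8}   # UniProt position 12 is unaligned
dmel_to_column = {5: 7, 6: 9, 8: 12}

pdb_to_column = compose(compose(pdb_to_uniprot, uniprot_to_dmel),
                        dmel_to_column)
```

Any residue lost at one step is lost in the final map, so the number of keys in `pdb_to_column` gives the structure coverage directly.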
The problem is that the Gblocks multiple alignments are pruned, so I can't use those without
reconstructing the removed residues. This is possible and I might have to do it anyway.
Alternatively, I could use the aa alignments, as they preserve all dmel residues, so in
theory the dmel protein sequence should be in them in its entirety.
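A sketch of the last mapping step and of the "in its entirety" check, assuming the dmel row of the alignment is available as a gapped string with '-' or '.' as gap characters. The function names and sequences are hypothetical.

```python
def residue_to_column(aligned_row, gap_chars="-."):
    """Map 1-based residue positions to 1-based alignment columns."""
    mapping = {}
    residue = 0
    for column, char in enumerate(aligned_row, start=1):
        if char not in gap_chars:
            residue += 1
            mapping[residue] = column
    return mapping


def covers_full_sequence(aligned_row, sequence, gap_chars="-."):
    """True if removing gaps from the row gives back the full sequence."""
    ungapped = "".join(c for c in aligned_row if c not in gap_chars)
    return ungapped.upper() == sequence.upper()


# Toy example: a full row versus a pruned row of the same protein.
protein = "MKLVA"
columns = residue_to_column("MK--LV-A")
full_ok = covers_full_sequence("MK--LV-A", protein)
pruned_ok = covers_full_sequence("MKLA", protein)
```

A pruned row fails the check, which is the situation with the Gblocks alignments.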
I want as many residues in the structure covered as possible. Otherwise there would be
a bias: the loop regions would be removed.
So it is decided: I use the aa alignments.
--
AndreasHeger - 20 Sep 2006