Predicting rates for external/internal residues in proteins

Wed Sep 20, 2006 10:06 AM

Preparing the data set

Mapping protein structures onto the fly predictions.

Downloaded the latest list from http://www.ebi.ac.uk/msd-srv/docs/sifts/ftp.html

See http://www.bioinf.org.uk/pdbsws/ for an alternative source.

The PDB list contains 57381 entries; however, they are mapped to only 9842 entries. For 9362 of these I can retrieve the sequences, and I get alignments to 3524 of them using a strict BLASTP cutoff of 1e-20. They match 6571 transcripts (in 4064 genes) out of 19369 transcripts (34% coverage). I guess I could use gtg to boost these numbers?

SUPERFAMILY matches 58% of all sequences (11314; they use the same sequence set), but I don't think getting the alignments will be easy. The alignment view takes ages. A tab-separated file is downloadable with the coordinates of the domain, the superfamily id and the best matching structure.

The taxonomy list in CATH (Gene3D) is broken, so no info there.

Anyway, I will not need remote matches, because homology modelling would be difficult. I want to stay above the 40% identity threshold so that I can reliably map surface residues. This is the distribution of the percent identities (pides) of my BLAST matches:

blast_vs_structures.png

Even with 1e-20 I am way into the twilight zone of modelling and need to filter more.
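A minimal sketch of the extra filtering step, assuming the BLASTP hits are available in the standard 12-column tabular format (query, subject, percent identity, ..., E-value, bit score). The function name, the example identifiers and the example hits are hypothetical; only the 1e-20 cutoff and the 40% identity threshold come from the notes above.

```python
def filter_hits(lines, max_evalue=1e-20, min_pide=40.0):
    """Keep hits with E-value <= max_evalue and percent identity >= min_pide.

    Assumes BLAST tabular output: column 3 is the percent identity and
    column 11 is the E-value (1-based column numbers).
    """
    kept = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        pide = float(fields[2])
        evalue = float(fields[10])
        if evalue <= max_evalue and pide >= min_pide:
            kept.append((query, subject, pide, evalue))
    return kept


# Hypothetical hits, for illustration only.
example = [
    "CG0001-PA\t1abc_A\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-100\t350",
    "CG0002-PA\t2xyz_B\t28.5\t180\t120\t5\t1\t180\t3\t182\t1e-25\t110",
    "CG0003-PA\t3def_A\t55.0\t60\t27\t0\t1\t60\t1\t60\t1e-10\t60",
]
hits = filter_hits(example)
```

The second example hit passes the E-value cutoff but fails on identity, which is the twilight-zone case described above; the third fails on E-value alone.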

Next steps:

  • how many are left at 30%/40% threshold?
  • how much overlap is there with the 1:1 ortholog set?
  • get ASA from HSSP files and annotate the multiple alignments.
    • look for thresholds
  • build a grammar
  • run xgram
  • analysis
    • re-activate the rasmol/python scripts for plotting.

Looks like I need to spend at least one full day preparing the data.

Fri Sep 22, 2006 11:01 AM

Preparing the data set

Filtering predictions at 40% threshold: there are 3,577 sequences left. Of these, 1,034 match the full clusters.

Matching residues between the multiple alignments and structures involves three steps:

  1. pdb -> uniprot
  2. uniprot -> dmel protein
  3. dmel protein -> multiple alignment
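The three steps above amount to composing residue-level maps. A rough sketch, assuming each step is already available as a dictionary from one residue numbering to the next; the helper name `compose` and the toy numbers are hypothetical and only illustrate the chaining.

```python
def compose(first, second):
    """Chain two residue maps: a -> b and b -> c give a -> c.

    Residues without a partner in the second map are dropped.
    """
    return {a: second[b] for a, b in first.items() if b in second}


# Hypothetical toy maps (1-based residue and column numbers).
pdb_to_uniprot = {1: 10, 2: 11, 3: 12}
uniprot_to_dmel = {10: 5, 11: 6}   # uniprot residue 12 has no dmel partner
dmel_to_column = {5: 7, 6: 9}

pdb_to_column = compose(compose(pdb_to_uniprot, uniprot_to_dmel), dmel_to_column)
```

Any structure residue that falls out at one step is lost for all later steps, so the size of `pdb_to_column` relative to `pdb_to_uniprot` gives a quick measure of how much of the structure is still covered.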

The problem is that the Gblocks multiple alignments are pruned, so I can't use them without reconstructing the removed residues. This is possible, and I might have to do it anyway. Alternatively, I could use the aa alignments: they preserve all dmel residues, so in theory the dmel protein sequence should be present in its entirety.

I want as many residues in the structure covered as possible. Otherwise there would be a bias: the loop region would be removed.

So it is decided: I use the aa alignments.
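For the last mapping step, a small sketch of how dmel residue numbers could be mapped to alignment columns, assuming the dmel row of an aa alignment is available as a gapped string. The function name, the gap characters and the toy row are assumptions; the final check reflects the expectation that the ungapped row equals the full dmel protein sequence.

```python
def residue_to_column(aligned_row, gap_chars="-."):
    """Map 1-based residue numbers to 1-based alignment columns."""
    mapping = {}
    residue = 0
    for column, char in enumerate(aligned_row, start=1):
        if char in gap_chars:
            continue  # gap columns carry no residue of this sequence
        residue += 1
        mapping[residue] = column
    return mapping


# Hypothetical aligned row and protein sequence, for illustration only.
row = "MK--LV-A"
protein = "MKLVA"

mapping = residue_to_column(row)

# Sanity check: the unpruned alignment should contain every residue.
ungapped = "".join(c for c in row if c not in "-.")
is_complete = ungapped == protein
```

With a pruned Gblocks alignment the completeness check would fail, which is the reason for preferring the aa alignments here.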

-- AndreasHeger - 20 Sep 2006

Topic revision: r4 - 2006-09-23 - AndreasHeger
 
