-- David Li - 19 Nov 2008
Final Project
Project: METAGENOMICS
We will examine the ability of probabilistic models to correctly identify the source of small DNA sequences taken from a range of organisms found in an environmental sample.
Group Members
- Patrick Harrigan
- David Li
- Jason Pai
- Robert Kuo
Methods
To identify the source of the small DNA segments, we will generate nucleotide order statistics from the known genomes of species believed to be present in the environmental sample. These order statistics can be used to compute the posterior probability that a short sequence came from each genome in the training data. Each short sequence will be assigned to the genome whose order statistics give the highest posterior probability.
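One way to write this decision rule down, as a sketch only: let x = x_1 ... x_L be a short sequence, G a genome in the training data, and n the Markov order. The proposal does not fix the prior P(G) or the treatment of the first n bases, so both are assumptions here (the first n bases are simply dropped from the likelihood, and P(G) could be taken as uniform).

```latex
\[
P(x \mid G) \approx \prod_{i=n+1}^{L} P\bigl(x_i \mid x_{i-n} \cdots x_{i-1},\, G\bigr),
\qquad
P(G \mid x) = \frac{P(x \mid G)\,P(G)}{\sum_{G'} P(x \mid G')\,P(G')}
\]
\[
\hat{G} = \arg\max_{G} \Bigl[ \log P(G) + \sum_{i=n+1}^{L} \log P\bigl(x_i \mid x_{i-n} \cdots x_{i-1},\, G\bigr) \Bigr]
\]
```

Because the denominator of the posterior is the same for every genome, comparing log P(G) plus the log-likelihood is enough to pick the best genome, and working in log space avoids numerical underflow for longer sequences.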
A script will generate order-n Markov statistics for each genome in the training data. By drawing short sequences at random from the known genomes used to train the algorithm, we can measure how often the method correctly identifies the source of these sequences. We will automate this process for Markov statistics of different orders to determine how much higher-order compositional statistics improve the accuracy of the algorithm. Increasing the order will also increase the computing time required, so we will determine whether the gain in accuracy justifies the longer run time.
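The training and benchmarking steps described above could look roughly like the following. This is a minimal sketch, not our final script: the function names, the add-one pseudocount, the uniform prior over genomes, and the read length and trial count are all assumptions chosen for illustration.

```python
import math
import random
import time

ALPHABET = "ACGT"


def train_markov(genome, order):
    """Count how often each base follows each context of length `order`."""
    counts = {}
    for i in range(order, len(genome)):
        context, base = genome[i - order:i], genome[i]
        # Skip windows containing ambiguous bases such as N.
        if base not in ALPHABET or any(c not in ALPHABET for c in context):
            continue
        counts.setdefault(context, dict.fromkeys(ALPHABET, 0))[base] += 1
    return counts


def log_likelihood(read, counts, order, pseudocount=1.0):
    """Log P(read | model), ignoring the first `order` bases (assumption)."""
    total = 0.0
    for i in range(order, len(read)):
        row = counts.get(read[i - order:i], {})
        num = row.get(read[i], 0) + pseudocount
        den = sum(row.values()) + pseudocount * len(ALPHABET)
        total += math.log(num / den)
    return total


def classify(read, models, order):
    """With a uniform prior, the highest posterior is the highest likelihood."""
    return max(models, key=lambda name: log_likelihood(read, models[name], order))


def benchmark(genomes, order, read_len=100, trials=200, seed=0):
    """Fraction of random fragments assigned back to their source genome."""
    rng = random.Random(seed)
    models = {name: train_markov(seq, order) for name, seq in genomes.items()}
    names = sorted(genomes)
    correct = 0
    for _ in range(trials):
        true_name = rng.choice(names)
        seq = genomes[true_name]
        start = rng.randrange(len(seq) - read_len + 1)
        if classify(seq[start:start + read_len], models, order) == true_name:
            correct += 1
    return correct / trials


if __name__ == "__main__":
    # Toy stand-ins for real genomes: one AT-rich, one GC-rich random sequence.
    rng = random.Random(1)
    genomes = {
        "at_rich": "".join(rng.choices(ALPHABET, weights=[4, 1, 1, 4], k=20000)),
        "gc_rich": "".join(rng.choices(ALPHABET, weights=[1, 4, 4, 1], k=20000)),
    }
    for order in range(0, 5):
        t0 = time.perf_counter()
        acc = benchmark(genomes, order)
        print(f"order={order} accuracy={acc:.3f} time={time.perf_counter() - t0:.2f}s")
```

Note that the fragments here are drawn from the same sequences used for training, as in the plan above, so the measured accuracy is likely optimistic compared with reads from unseen strains. The number of contexts grows as 4^n, which is one reason both memory use and run time rise with the order.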
Additionally, we plan to use existing bioinformatics tools such as Glimmer to identify likely coding and non-coding regions of the genomes in the training data. If the order statistics of coding and non-coding regions differ within each genome, computing the posterior probabilities from separate statistics for the two region types might identify the source of the short DNA sequences more accurately.
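Once gene predictions are available, the genome has to be split into coding and non-coding segments before separate statistics can be computed. The sketch below shows one possible way to do that. It does not parse Glimmer output; it assumes the predictions have already been converted into a list of (start, end) intervals that are 0-based and half-open, which is a hypothetical format chosen for illustration. Strand is ignored.

```python
def split_by_regions(genome, coding_intervals):
    """Split a genome into lists of coding and non-coding segments.

    coding_intervals: iterable of (start, end) pairs, 0-based, half-open
    (assumed format; gene finder output would need converting first).
    """
    coding, noncoding = [], []
    pos = 0  # end of the last coding segment emitted so far
    for start, end in sorted(coding_intervals):
        start = max(start, pos)  # trim any overlap with the previous gene
        if start > pos:
            noncoding.append(genome[pos:start])
        if end > start:
            coding.append(genome[start:end])
            pos = end
    if pos < len(genome):
        noncoding.append(genome[pos:])
    return coding, noncoding


if __name__ == "__main__":
    coding, noncoding = split_by_regions("AAACCCGGGTTT", [(3, 6), (9, 12)])
    print(coding, noncoding)
```

Keeping the segments as separate strings, rather than concatenating them, means that order-n contexts never span a boundary between two unrelated regions when the statistics are counted.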
We will run our scripts on our personal laptops as well as the DECF machines.
Schedule
- Nov 23:
- We will have written the script to generate order-n Markov statistics
- We will have written the script to automate the benchmarking
- We will have gathered the training data for our genomes
- Nov 29:
- We will have written the algorithm to generate posterior probabilities and have begun benchmarking
- We will have used Glimmer to identify coding and non-coding regions and have begun benchmarking this process as well
- Past Nov 30:
- We will write our final report and prepare our presentation