Home - this site is powered by TWiki(R)
Fall08 > TWikiUsers > JasonLee > JSFinalReport
TWiki webs: Main | TWiki | Sandbox   Log In or Register

Changes | Index | Search | Go

Final Project

Siddharth Shah and Jason Lee

Introduction:

Metagenomics is the study of genetic material recovered from environmental samples. This is much different than genetic analysis done on cultured samples, where all present organisms are known. Unlike regular genetic sequencing, Metagenomics takes a variety of genomes simultaneously and sequences them. This calls for specialized algorithms that are able to identify the differences between the strands of short-read DNA and bin them into their respective species. There are many different possible algorithms can be implemented and it was decided to test whether an implemented version of compositional analysis in tandem with the (modified) Needleman-Wunsch pairwise alignment algorithm would do give beneficial results.

Methods

We programmed our solution entirely in Perl, as it has become almost a standard in bioinformatics. The first program, hereinafter referred to as the “binner,” sorts through a FASTA file of short read sequences and matches each sequence with a genome in another FASTA file. We added one additional parameter to make this program more robust: a stem for the names of the files to be created. The binner was the primary focus of the project. Initially we used a modified simulator.pl to create random sequences to our specifications. We computed the compositional statistics for each short read sequence and compared it to the compositional statistics for each genome. We opted to employ a root-mean-square method to decide which genome was the best match for that particular short read sequence (if sr is the short read, srB denotes the percentage composition of base B in sr, and gi is a genome in the training data, we took the g that minimized the expression (((srB-gi,A)2 + (srC-gi,C)2 + (srT-gi,G)2 + (srA-gi,T)2)/4)1/2. This generally produced reasonably accurate results, but the accuracy took a significant hit when the compositions of the genomes were very similar; this phenomenon is expounded upon in later sections. In summary, we realized that we should defer to a different method in cases where the short read sequence could reasonably match more than one genome in the training data. We considered Markov models before deciding to take a more novel approach that satisfied our interest in sequence alignment; we used the Needleman-Wunsch algorithm with a 0 gap penalty, but we also modified it to produce scores that rewarded streaks of matches (which indicate a better match overall). Further work might include using the Smith-Waterman algorithm, which highlights high-scoring local alignments. To compare our results with Needleman-Wunsch and without, we had two Perl programs (one for each) and computed their accuracy using a third program that merely counted incorrect binnings.

The benchmark program was the second major program we used. Its purpose was to make random cuts from FASTA files, each containing one genome sequence, in order to simulate a typical metagenomics experiment. It takes four arguments: the length of the short read sequences to produce, the number of short read sequences to make, the output file to which to write the short read sequences, and finally any number (greater than 0) of FASTA files containing genomes to be sampled (i.e., the training data to use to produce the test data). It was far simpler than the binner, both in purpose and in coding complexity.

Results

closely related sequences
# of short reads Length of short reads time (NW) time (no NW) accuracy (NW) accuracy (no NW)
80 100 2179 0.03 91.88 73.56
80 200 4994 0.03 98.74 84.03

randomly distributed sequences
# of short reads Length of short reads time (NW) time (no NW) accuracy (NW) accuracy (no NW)
80 100 259 0.04 88.75 74.37
80 200 621 0.04 96.34 84.11
80 300 1053 0.05 98.87 91.62
100 100 330 0.04 91.48 76.93
100 200 733 0.05 98.92 92.23
100 300 1159 0.05 100 96.73
150 100 419 0.06 97.43 83.31
150 200 801 0.05 98.45 93.12
150 300 1449 0.05 100 98.5

large disparity between sequences
# of short reads Length of short reads time (NW) time (no NW) accuracy (NW) accuracy (no NW)
80 100 0.03 0.03 100 100
80 200 0.04 0.04 100 100
80 300 0.05 0.05 100 100
100 100 0.04 0.04 100 100
100 200 0.05 0.05 100 100
100 300 0.05 0.06 100 100
150 100 0.04 0.04 100 100
150 200 0.05 0.04 100 100
150 300 0.06 0.06 100 100

Different situations were tested in order to fully understand how multiple factors affect the speed, accuracy, and overall efficiency of the binning algorithm. The first set of situations covered was the differences between sequences. Accuracy and runtime of the binning algorithm were tested for closely related sequences (similar compositions), randomly distributed sequences (randomly generated compositions) , and distant sequences (sequences with very different compositions). Also, the time and efficiency compared to different number of short reads, as well as compared to the length of the short reads was tested. And last, the program was run with and without the Needleman-Wunsch algorithm to compare accuracy and runtime of just composition-based analysis versus the modified version with Needleman-Wunsch.

Analysis and Discussion

We found that the longer the length of the short read, the more accurate the program becomes at the expense of program run-time. This is to be expected because we use an adapted Needleman-Wunsch which has a time complexity of O(L*N) where L is the length of the short read and N is the length of the long sequence. Since the length of the training data does not change, changing the length of the short read affects the time by multiples of N. Also, we found that as we added more short reads, the accuracy slightly increased but asymptotically increased towards a finite accuracy. This showed that the actual accuracy was not increasing but the average of short reads made the resulting accuracy closer to the true accuracy of the algorithm. In the end we found that longer short reads were taken, the compositional accuracy became closer to the algorithm with the modified Needleman-Wunsch incorporated. This, taken into account with the horrible time efficiency of Needleman-Wunsch, leads us to conclude that an exact sequence comparison proves too inefficient to calculate when there are thousands of short reads as well as many training sequences to compare.

Contribution Statement

We both decided on the algorithms to use and general approach (we like sequence alignment, so we attempted to incorporate that). Siddharth coded most of the Perl, with occasional input and help from Jason. We both performed data collection and analysis. Jason put together most of the presentation, with occasional input and help from Siddharth. Siddharth did the Methods part of this paper, and Jason did the rest. We each edited the other's part(s) of the paper.
I Attachment Action Size Date Who Comment
Txttxt acctest.pl.txt manage 0.6 K 2008-12-17 - 07:27 JasonLee  
Txttxt benchmark.pl.txt manage 1.5 K 2008-12-17 - 07:28 SiddharthShah  
Txttxt classifyReads.pl.txt manage 3.4 K 2008-12-17 - 07:26 SiddharthShah  
Txttxt classifyReads_NW.pl.txt manage 5.5 K 2008-12-17 - 07:28 SiddharthShah  
Edit | Attach | Print version | History: r15 < r14 < r13 < r12 < r11 | Backlinks | Raw View | Raw edit | More topic actions


Parents: TWikiUsers > JasonLee
This site is powered by the TWiki collaboration platformCopyright © 2008-2013 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback
TWiki Appliance - Powered by TurnKey Linux