Final Project
Siddharth Shah and Jason Lee
Introduction:
Metagenomics is the study of genetic material recovered from environmental samples. This is much different than genetic analysis done on cultured samples, where all present organisms are known. Unlike regular genetic sequencing, Metagenomics takes a variety of genomes simultaneously and sequences them. This calls for specialized algorithms that are able to identify the differences between the strands of short-read DNA and bin them into their respective species. There are many different possible algorithms can be implemented and it was decided to test whether an implemented version of compositional analysis in tandem with the (modified) Needleman-Wunsch pairwise alignment algorithm would do give beneficial results.
Methods
We programmed our solution entirely in Perl, as it has become almost a standard in bioinformatics. The first program, hereinafter referred to as the “binner,” sorts through a FASTA file of short read sequences and matches each sequence with a genome in another FASTA file. We added one additional parameter to make this program more robust: a stem for the names of the files to be created. The binner was the primary focus of the project. Initially we used a modified simulator.pl to create random sequences to our specifications. We computed the compositional statistics for each short read sequence and compared it to the compositional statistics for each genome. We opted to employ a root-mean-square method to decide which genome was the best match for that particular short read sequence (if sr is the short read, sr
B denotes the percentage composition of base B in sr, and g
i is a genome in the training data, we took the g that minimized the expression (((sr
B-g
i,A)
2 + (sr
C-g
i,C)
2 + (sr
T-g
i,G)
2 + (sr
A-g
i,T)
2)/4)
1/2. This generally produced reasonably accurate results, but the accuracy took a significant hit when the compositions of the genomes were very similar; this phenomenon is expounded upon in later sections. In summary, we realized that we should defer to a different method in cases where the short read sequence could reasonably match more than one genome in the training data. We considered Markov models before deciding to take a more novel approach that satisfied our interest in sequence alignment; we used the Needleman-Wunsch algorithm with a 0 gap penalty, but we also modified it to produce scores that rewarded streaks of matches (which indicate a better match overall). Further work might include using the Smith-Waterman algorithm, which highlights high-scoring local alignments. To compare our results with Needleman-Wunsch and without, we had two Perl programs (one for each) and computed their accuracy using a third program that merely counted incorrect binnings.
The benchmark program was the second major program we used. Its purpose was to make random cuts from FASTA files, each containing one genome sequence, in order to simulate a typical metagenomics experiment. It takes four arguments: the length of the short read sequences to produce, the number of short read sequences to make, the output file to which to write the short read sequences, and finally any number (greater than 0) of FASTA files containing genomes to be sampled (i.e., the training data to use to produce the test data). It was far simpler than the binner, both in purpose and in coding complexity.
Results
| closely related sequences |
| # of short reads |
Length of short reads |
time (NW) |
time (no NW) |
accuracy (NW) |
accuracy (no NW) |
| 80 |
100 |
2179 |
0.03 |
91.88 |
73.56 |
| 80 |
200 |
4994 |
0.03 |
98.74 |
84.03 |
| randomly distributed sequences |
| # of short reads |
Length of short reads |
time (NW) |
time (no NW) |
accuracy (NW) |
accuracy (no NW) |
| 80 |
100 |
259 |
0.04 |
88.75 |
74.37 |
| 80 |
200 |
621 |
0.04 |
96.34 |
84.11 |
| 80 |
300 |
1053 |
0.05 |
98.87 |
91.62 |
| 100 |
100 |
330 |
0.04 |
91.48 |
76.93 |
| 100 |
200 |
733 |
0.05 |
98.92 |
92.23 |
| 100 |
300 |
1159 |
0.05 |
100 |
96.73 |
| 150 |
100 |
419 |
0.06 |
97.43 |
83.31 |
| 150 |
200 |
801 |
0.05 |
98.45 |
93.12 |
| 150 |
300 |
1449 |
0.05 |
100 |
98.5 |
| large disparity between sequences |
| # of short reads |
Length of short reads |
time (NW) |
time (no NW) |
accuracy (NW) |
accuracy (no NW) |
| 80 |
100 |
0.03 |
0.03 |
100 |
100 |
| 80 |
200 |
0.04 |
0.04 |
100 |
100 |
| 80 |
300 |
0.05 |
0.05 |
100 |
100 |
| 100 |
100 |
0.04 |
0.04 |
100 |
100 |
| 100 |
200 |
0.05 |
0.05 |
100 |
100 |
| 100 |
300 |
0.05 |
0.06 |
100 |
100 |
| 150 |
100 |
0.04 |
0.04 |
100 |
100 |
| 150 |
200 |
0.05 |
0.04 |
100 |
100 |
| 150 |
300 |
0.06 |
0.06 |
100 |
100 |
Different situations were tested in order to fully understand how multiple factors affect the speed, accuracy, and overall efficiency of the binning algorithm. The first set of situations covered was the differences between sequences. Accuracy and runtime of the binning algorithm were tested for closely related sequences (similar compositions), randomly distributed sequences (randomly generated compositions) , and distant sequences (sequences with very different compositions). Also, the time and efficiency compared to different number of short reads, as well as compared to the length of the short reads was tested. And last, the program was run with and without the Needleman-Wunsch algorithm to compare accuracy and runtime of just composition-based analysis versus the modified version with Needleman-Wunsch.
Analysis and Discussion
We found that the longer the length of the short read, the more accurate the program becomes at the expense of program run-time. This is to be expected because we use an adapted Needleman-Wunsch which has a time complexity of O(L*N) where L is the length of the short read and N is the length of the long sequence. Since the length of the training data does not change, changing the length of the short read affects the time by multiples of N. Also, we found that as we added more short reads, the accuracy slightly increased but asymptotically increased towards a finite accuracy. This showed that the actual accuracy was not increasing but the average of short reads made the resulting accuracy closer to the true accuracy of the algorithm. In the end we found that longer short reads were taken, the compositional accuracy became closer to the algorithm with the modified Needleman-Wunsch incorporated. This, taken into account with the horrible time efficiency of Needleman-Wunsch, leads us to conclude that an exact sequence comparison proves too inefficient to calculate when there are thousands of short reads as well as many training sequences to compare.
Contribution Statement
We both decided on the algorithms to use and general approach (we like sequence alignment, so we attempted to incorporate that). Siddharth coded most of the Perl, with occasional input and help from Jason. We both performed data collection and analysis. Jason put together most of the presentation, with occasional input and help from Siddharth. Siddharth did the Methods part of this paper, and Jason did the rest. We each edited the other's part(s) of the paper.

Copyright © 2008-2013 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki?
Send feedback