to the Sequence Simulator problem.
Perl sequence simulator (problem)
For this assignment you are asked to write a sequence "simulator" that generates random FASTA sequence files with similar statistics to a real (experimentally sequenced) FASTA file.
Your program should be able to
- calculate length & nucleotide composition statistics from a FASTA file;
- save those length & composition statistics to a named parameter file, or
- load length & composition statistics from a previously-saved parameter file;
- generate a random FASTA file with these length & composition parameters
- Make your output satisfy the additional constraints that
- your generated sequence starts with a start codon,
- in this reading frame, your sequence doesn't contain any premature stop codons. (Including a stop codon at the end is optional.)
NB this may slightly change the percent nucleotide composition of your file, i.e., if you change one letter in a stop codon to prevent the sequence from terminating at that point. This is fine; you don't need to go back and change the statistics of your file. This editing of codons is intended to be a clean-up step to generate a coding sequence out of the simulated fasta sequence you've created. When looking for stop codons, you can assume that the first nucleotide of your sequence is the first letter of the first codon, so that your reading frame begins with position 1.
It is optional to ensure that your sequence length is a multiple of 3, although this might actually help your design (e.g. if you write a loop to go through and check for stop codons).
The assignment will test your ability to use random numbers (which you will have to look up or otherwise investigate -- hint: use the
function), parse the @ARGV rray, handle files, and implement (or copy-and-paste) subroutines described in class.
Here are a few notes about the assignment:
1. Where the homework assignment asks for simulated sequence with the "same composition and length statistics" as the named file, we did not intend that you (necessarily) had to reproduce the exact same number of A's, C's, G's and T's as the original file. Rather, you only need to reproduce the approximate composition and length statistics. Some students have done this, and that's OK, but you should take this as a pretty strong clue that it's HARDER to reproduce the exact counts.
If you want to do the exact counts, you should think carefully about whether the program that you've written is producing a truly random sequence -- e.g. does it tend to clump up the most common nucleotide at the end of the simulated sequence? (Google "Fisher-Yates shuffle" for more info on this.)
2. The requirement that your file "doesn't contain any premature stop codons" seems to've thrown a few people. Specific questions included: (a) "what reading frame is the sequence assumed to be in?" (a.k.a. "should I look at every nucleotide triplet, or only every third nucleotide triplet?") (b) "what if the sequence is not a multiple of 3?" (as in the default sequence length of 1000)
Please note that the homework page now clarifies BOTH these questions. Quoting from the page:
When looking for stop codons, you can assume that the first nucleotide of your sequence is the first letter of the first codon, so that your reading frame begins with position 1. It is not necessary to ensure that your sequence length is a multiple of 3 if this will unnecessarily complicate your program, but it might help you in writing a loop to go through and check for stop codons.
3. A few other people had questions about the overall design such as "what happens if you use -load and -save together". The answer is, if we have left a particular point ambiguous or unspecified, then this means that there IS no right answer. The idea is that you will be using this program yourself later in the semester, and you should do whatever makes most sense to you -- and explain your decision! The assignment does stipulate that your program includes a help message; this help message would be a natural place to explain/justify your choice. We deliberately left certain parts of the specification loose, so that there was room for you to make your own design decisions.
Hope this all helps. If you have further questions (and these questions cannot be answered by the rule-of-thumb of "the program should do what makes most sense to you"), then please do email me for clarification.
Suppose your program is called
Then it should implement the following behavior:
Simulating a FASTA file of DNA
If run without any arguments (as shown above), the program should generate & print (on standard output) a FASTA sequence file
of uniform nucleotide composition
(i.e. 25%/25%/25%/25% A/C/G/T) containing a single sequence of length 1 kilobase.
Calculating length & nucleotide composition statistics
simulator.pl -calc seqfile.fasta
If run using the
argument on a named file (here
the program should first calculate nucleotide composition, number of sequences and length statistics
(e.g. average sequence length),
then generate a FASTA sequence file with those same composition and length statistics
Loading & saving parameters
simulator.pl -calc seqfile.fasta -save myparams
simulator.pl -load myparams
simulator.pl -load myparams -save myparams2
The first line should calculate composition and length statistics and generate a random FASTA file as described above.
However it should also
save the composition and length statistics to a file called
using a format that you define.
The specification of this parameter file format should be included in your homework.
The second line should load the parameters from the named parameter file (
before generating the random FASTA file.
The third line should load the parameters from the named parameter file (
and should then save them to another file (
before generating the random FASTA file.
As a test, the two files (
) should be identical.
When invoked with the
option, the program should display a short help message, including all of the above arguments and instructions on their usage.
Graduate level / extra credit
For extra credit, or as a minimum for graduate students, implement one or more of the following options:
- dinucleotide or higher-order compositional statistics (e.g. trinucleotide, tetranucleotide, or k-nucleotide where k is specifiable by the user...)
- probability distributions over sequence lengths (e.g. a geometric distribution, or some other distribution; be sure to estimate the relevant parameters using the
-calc option and include them in the parameter file)
- optionally, allow the user to edit the parameter file so as to be able to insert particular user-specified motifs at random into the simulation data (e.g. transcription-factor binding site sequences)
- You will be graded on correctness (90%) and style (10%). Please see the StyleGuidelines for expectations about your style.
- Grading for correctness will be automated, so if your output format does not match the format described above, you will lose points. It's ok if you have a little extra whitespace at the ends of lines or between different classes of sequences.
- Turn in your program by uploading the .py file to your personal wiki page.
- As mentioned in lecture, you may work with 1 other student if you so choose. (Remember, you can turn in at most 3 other assignments with the same student). If you do work with another student, put "I worked with xxxx" in a comment at the top of your .py file. Each person should turn in code, even if it's the same.
- Information about LATE assignments: You lose 20% of the points of the assignment for every day it's late. Contact Mohammad or Professor Holmes at least 48 hours before the due date if you have extenuating circumstances, if reasonably possible.
Copyright © 2008-2013 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback