Final project (v1.8)
The schedule for the project (due dates, etc.) can be found on the main class page.
Minor change to RankScore equation for clarity. No longer using ΔPID (as it is meaningless in diversity). Now uses PID between 2 sequences and is added to 'RankScore'. Goal is still to maximize diversity. (12-14-11)
Added a basic sample input set of files for students that want to check their 'Rank Score' calculation. Email MohammadAzimi with your 'output.fasta' and 'owneffector.fasta' (if you have one) to get feedback. (12-13-11)
Removed '2(Δ%Gap)' from the RankScore equation since %Gap (Y) is an 'upper limit' and not a 'target', as pointed out by JamesMacaulay
Added clarification about riboswitch design and alternate approach
Updated Scoring Scheme and changed codonfreq.txt to contain RNA codons (instead of DNA) as shown in the example in the FAQ. (12-8-11)
section below. (12-6-11)
v1.2 Effector sequence will be an RNA sequence, not DNA. (12-5-11)
Added an additional variable (percentage gap) to restrict sequence length (11-28-11)
Initial release (11-17-11)
The goal of the project is to implement a computer program, whose specification (including input and output formats) is described below. Programs, along with presentations, will be evaluated competitively in the final exam. Each team will present their implementation and describe what they see as advantages/disadvantages of their approach at the beginning of the final session. At the end of the final you will be asked to rank all other teams’ programs and presentations, and this peer review will form a part of the grade.
Each program must be submitted as a tarball named final.tar.gz containing an executable named final.pl (plus all dependencies: modules, libraries, etc.) that, when run, will look for the named input files in the current working directory and produce the named output file.
Specifically, if your tarball is in the current working directory, the following sequence of shell commands should output your results:
tar -xvzf final.tar.gz
Inputs and outputs
Inputs (use the specified filenames and formats):
- protein.fasta A single protein sequence in FASTA format.
- sites.fasta A set of restriction enzyme sites that must not occur in the output (DNA sequence in FASTA format). Neither a site nor its reverse complement may appear in the eventual output.
- effector.fasta An effector RNA sequence (FASTA format). See below for description and relevance of this.
- codonfreq.txt A codon frequency table (64 rows, 2 columns: “codon”, a 3-character RNA codon, & “frequency”, a non-negative real number)
- params.txt A three-line file containing three numeric parameters.
- Line 1: N (the number of variant DNA sequences your program should generate)
- Line 2: X (the target percentage identity between the protein in protein.fasta, and the protein encoded by your DNA sequences)
- Line 3: Y (percentage of gaps allowed when generated sequence is aligned to original sequence)
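The input formats above can be read with a few lines of code. The following is a minimal Python sketch, not a reference implementation: the function names are hypothetical, and it assumes that params.txt holds one number per line and that codonfreq.txt separates its two columns with whitespace.

```python
from pathlib import Path


def read_params(path="params.txt"):
    """Return (N, X, Y) from the three non-empty lines of params.txt."""
    lines = [ln.strip() for ln in Path(path).read_text().splitlines() if ln.strip()]
    n = int(lines[0])      # number of variant DNA sequences to generate
    x = float(lines[1])    # target percentage identity
    y = float(lines[2])    # maximum percentage of gaps
    return n, x, y


def read_codon_freq(path="codonfreq.txt"):
    """Return {codon: frequency} from a two-column table."""
    freqs = {}
    for ln in Path(path).read_text().splitlines():
        parts = ln.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        try:
            freqs[parts[0].upper()] = float(parts[1])
        except ValueError:
            continue  # skip a header row, if one is present (assumption)
    return freqs


def read_fasta(path):
    """Return a list of (header, sequence) pairs from a FASTA file."""
    records, header, chunks = [], None, []
    for ln in Path(path).read_text().splitlines():
        ln = ln.strip()
        if ln.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = ln[1:], []
        elif ln:
            chunks.append(ln)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

The same `read_fasta` helper would serve for protein.fasta, sites.fasta and effector.fasta, since all three are plain FASTA.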
Outputs (use the specified filename and format):
- output.fasta A file containing N DNA sequences (FASTA format)
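Before a candidate sequence is written to output.fasta, it can be screened against the sites in sites.fasta. The sketch below is one possible way to do this in Python; the function names are hypothetical, and the comparison is case-insensitive because the output mixes lower and upper case.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def find_forbidden_sites(sequence, sites):
    """Return every site (or reverse complement) that occurs in sequence."""
    seq = sequence.upper()
    hits = []
    for site in sites:
        site = site.upper()
        rc = reverse_complement(site)
        # A palindromic site equals its own reverse complement; test it once.
        patterns = [site] if rc == site else [site, rc]
        for pattern in patterns:
            if pattern in seq:
                hits.append(pattern)
    return hits
```

An empty return value means the candidate is free of all listed sites on both strands.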
Each output DNA sequence must satisfy the following criteria:
- It must encode a bacterial protein-coding gene, with a basic bacterial promoter (including a Pribnow box), a Shine-Dalgarno sequence, a start codon, a stop codon, and an intrinsic terminator.
- The DNA sequence must be in lower case, except for the start and stop codons which must be in upper case.
- There must be a riboswitch such that the Shine-Dalgarno sequence will only be exposed if the effector sequence is present.
- If P % is the percentage identity between the protein sequence in protein.fasta and the protein sequences encoded by your DNA output, then P must be as close as possible to X. (So, for example, if X = 80, then all your proteins should be roughly 80% identical to the input protein.)
- The N encoded proteins should be as diverse as possible. (So, the percentage identity between any two of your proteins should be as low as possible.)
- The relative codon usage for each amino acid should match the frequency table as closely as possible, compared to synonymous codons. (For example, given that your protein sequence has 10 glycine residues, the corresponding protein-coding DNA sequence must use the codons GGA, GGC, GGG and GGT to code for those glycines. The relative frequencies of each of these codons should be proportional to the frequencies of the corresponding RNA codons, GGA, GGC, GGG and GGU, in codonfreq.txt.)
- Identical codons, or synonymous codons differing only at the third position, should be spaced apart as much as possible.
- Teams can attempt to design a DNA sequence with a functional riboswitch that uses the supplied effector for full credit (20pts).
- Teams can elect to take a 5pt deduction and define their own effector rather than using the one that will be supplied.
- If students choose this route, they should supply the effector sequence as a separate file 'owneffector.fasta' to be output along with 'output.fasta' in the same directory.
- Note: For the purposes of this project we assume that for an effector to successfully bind, its exact complement or reverse complement has to be present in the sequence. Furthermore, if multiple complement or reverse complement sequences occur throughout the entire length of the sequence (even within the coding region), we assume the effector binds to them and those nucleotides are completely unavailable for base-pairing.
- Note: It's sufficient for a single nucleotide of the Shine-Dalgarno sequence to be base-paired for it to be considered inaccessible. All six nucleotides of the Shine-Dalgarno sequence must be non-base-paired for it to be considered accessible (a non-base-paired SD sequence that is exposed in the loop region of a stem-loop is considered accessible).
- Points will be awarded for fulfilling the project criteria, with an exact distribution to be announced during the project period.
- Points will be deducted for increased computation time. Programs running for longer than about a minute may be killed (depending on time constraints), so make sure your program outputs sequences as soon as it discovers them.
- You may work in teams or individually. The maximum team size is four people. Individual contributions to the team effort must be clearly delineated.
- You will have 4 weeks from the project announcement to the final. Additional project requirements and details of the scoring scheme will be revealed at the end of week 3.
- Scoring distribution: The following features will be checked for in your generated DNA sequences (max available points = 65):
- Check output.fasta is valid FASTA format, if not valid, no further points awarded, go to step 2.
- Check for Promoter region (5pts), if not valid, no further points awarded, go to step 2.
- Check for Shine-Dalgarno sequence (5pts), if not valid, no further points awarded, go to step 2.
- Check for Start Codon (5pts), if not valid, no further points awarded, go to step 2.
- Check for Stop Codon (5pts), if not valid, no further points awarded, go to step 2.
- Check for Terminator (5pts), if not valid, no further points awarded, go to step 2.
- Check for functional Riboswitch (20pts). Go to step 2.
- Check for Restriction sites and their reverse complement (10pts). Go to step 3.
- If steps 1 and 2 were successful, perform ranking against other teams (10pts). This ranking will use the following equation:
- ΔPID is the difference between the PID of your output sequence and the target PID.
- ΔCodonFreq is the difference between a codon's frequency in your output sequence and the target frequency (averaged across all codons).
- (Avg Space between similar codons) / (sequence length) is the average space between identical/synonymous codons, averaged for all codons and divided by sequence length.
- The last term being added is the average PID between all your generated sequences, multiplied by the number of sequences generated.
- If a given sequence exceeds the allowed percentage of gaps, it does not get scored.
- Note: You want to minimize your 'RankScore' to maximize points earned.
- Note: For steps 1 and 2, the points will be split up across the desired number of sequences. Therefore, if the desired number of sequences is 4 (N = 4), and only 3 of your 4 sequences have a functional riboswitch, you will be awarded 15 points.
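The proration in the last note is simple arithmetic. A minimal Python sketch, assuming points scale linearly with the number of passing sequences (the function name is hypothetical):

```python
def prorated_points(max_points, passing, n_requested):
    """Split max_points evenly across the N requested sequences."""
    if n_requested <= 0:
        raise ValueError("n_requested must be positive")
    passing = min(passing, n_requested)  # extra sequences earn nothing
    return max_points * passing / n_requested


# The riboswitch check is worth 20 points; 3 of 4 sequences pass.
print(prorated_points(20, 3, 4))
```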
Structure of the final
- 10 minute presentation per group: (15% of grade)
- Discuss your implementation
- Discuss weakness/strengths of your implementation (accuracy, CPU/memory usage, etc.)
- Peer review of other implementations. (10% of grade)
- Testing of your group’s implementation. (65% of grade)
- Description of individual contributions to group implementations. (10% of grade)
Frequently Asked Questions
These are common questions about the design and function of your programs that you may find to be helpful. Check back occasionally for updates.
1. Should my promoter region contain specific sequences at the -10 and -35 sites?
Although there is natural variability in the sequence composition of prokaryote promoters, you should ensure that your DNA sequence contains a Pribnow box (specifically the sequence TATAAT) starting at the -10 position. The -35 sequence has more flexibility in both composition and position. See example below:
For design purposes, the gene to be transcribed begins at the +1 site, exactly 10 nucleotides downstream of the start of the Pribnow box.
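The positional rule above translates directly into index arithmetic. The following Python sketch assumes that the first occurrence of TATAAT in the sequence is the intended Pribnow box; the function name is hypothetical.

```python
PRIBNOW = "TATAAT"


def transcription_start(dna):
    """Return the 0-based index of the +1 site, or None if it cannot be found."""
    box_start = dna.upper().find(PRIBNOW)
    if box_start == -1:
        return None  # no Pribnow box present
    plus_one = box_start + 10  # +1 is 10 nt downstream of the box start
    if plus_one >= len(dna):
        return None  # sequence ends before the +1 site
    return plus_one
```

For a sequence such as "ggtataatccccgattaca", the box starts at index 2, so the transcript would begin at index 12.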
2. Does my sequence have to produce the hammerhead secondary structure as seen in lab/HW when there is no effector present and the Shine-Dalgarno sequence is base-paired (i.e., the OFF state)?
No, a hairpin structure would be adequate. See illustration below:
3. What are acceptable start codons?
AUG, GUG and UUG
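Because the output is DNA with only the start and stop codons in upper case, the coding region can be located from the case pattern alone. The Python sketch below is one way to self-check an output sequence; it assumes the DNA equivalents of the start codons above (ATG, GTG, TTG), the three standard stop codons, and at least one codon between start and stop. The function name is hypothetical, and the actual grading procedure may differ.

```python
import re

START_CODONS = {"ATG", "GTG", "TTG"}   # DNA forms of AUG, GUG, UUG
STOP_CODONS = {"TAA", "TAG", "TGA"}    # standard stop codons


def find_coding_region(dna):
    """Return the start-to-stop region, or None if the case pattern is invalid."""
    runs = [(m.start(), m.group()) for m in re.finditer(r"[A-Z]+", dna)]
    if len(runs) != 2:
        return None  # expect exactly two upper-case runs: start and stop
    (start_pos, start), (stop_pos, stop) = runs
    if start not in START_CODONS or stop not in STOP_CODONS:
        return None
    if (stop_pos - start_pos) % 3 != 0:
        return None  # stop codon is not in frame with the start codon
    return dna[start_pos:stop_pos + 3]
```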
4. Should the codon frequency determine the overall codon usage or simply the distribution across degenerate codons of individual amino acids?
The codon frequency table will give overall frequency of each of the 64 codons. You should use this information to get a similar relative frequency of codons that code for the same amino acid. Your overall codon frequency will be more directly impacted by the protein sequence input and the percent identity you want to achieve.
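Converting the overall frequencies into per-amino-acid shares is a normalization within each group of synonymous codons. The Python sketch below uses the standard genetic code; the function name is hypothetical and the glycine numbers in the usage example are made up for illustration.

```python
from itertools import product

# Standard genetic code, codons ordered with bases U, C, A, G.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def relative_usage(freqs):
    """Map each codon to its share among codons for the same amino acid."""
    # Accept DNA spellings too by converting T to U.
    normalized = {c.upper().replace("T", "U"): f for c, f in freqs.items()}
    totals = {}
    for codon, freq in normalized.items():
        aa = CODON_TABLE[codon]
        totals[aa] = totals.get(aa, 0.0) + freq
    shares = {}
    for codon, freq in normalized.items():
        total = totals[CODON_TABLE[codon]]
        shares[codon] = freq / total if total > 0 else 0.0
    return shares


# Illustrative (made-up) per-thousand frequencies for the glycine codons.
shares = relative_usage({"GGA": 10, "GGC": 20, "GGG": 30, "GGU": 40})
```

With these numbers, a protein containing 10 glycines would aim for roughly 1, 2, 3 and 4 uses of GGA, GGC, GGG and GGT respectively.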
5. Do codon frequency files contain the overall frequency of all 64 codons or just the frequency of all degenerate codons that code for a specific amino acid?
The codon frequency files will contain the overall frequency of all 64 codons, with each line containing the nucleotide triplet, followed by a single space, followed by the frequency of that triplet per thousand codons (the sum of all the frequencies should be ~1000). See below:
... (56 lines omitted)
6. How will percent identity (PID) and percentage of gaps be calculated?
The PID between two sequences will be calculated by aligning two sequences using the Needleman-Wunsch algorithm with the BLOSUM62 scoring matrix and a fixed gap penalty of -4. See example below:
In the example above, after aligning the sequences, we see that 3/5 residues match perfectly and there is a single gap. Therefore, the PID is 60% and percentage of gaps is 20%.
Note: There are likely to be multiple maximal alignments for two sequences depending on the implementation of the traceback algorithm. For scoring, we will use the non-optimized implementation provided in the HW solutions.
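Once two sequences have been aligned, both percentages follow from counting columns. The Python sketch below assumes that the denominator for both values is the number of alignment columns, which is consistent with the 3/5 and 1/5 figures above; the function name is hypothetical and the aligned pair is made up to reproduce those counts. It does not perform the Needleman-Wunsch alignment itself.

```python
def pid_and_gap_percent(aligned_a, aligned_b, gap="-"):
    """Return (PID, percentage of gaps) for two aligned, equal-length strings."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = len(aligned_a)
    pairs = list(zip(aligned_a, aligned_b))
    matches = sum(1 for a, b in pairs if a == b and a != gap)
    gaps = sum(1 for a, b in pairs if a == gap or b == gap)
    return 100.0 * matches / columns, 100.0 * gaps / columns


# Made-up alignment: 3 matching columns, 1 gap, 1 mismatch.
pid, gap_pct = pid_and_gap_percent("MKVLA", "MKV-G")
```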
7. Can I use 3rd party software packages?
Yes. The test environment will have a few packages installed by default. It's up to you to request installation of additional software packages (by 8AM, 12-14-11). The default software packages are:
- Vienna RNA Package
Use of 3rd party software packages is not required. Contact MohammadAzimi for issues relating to installation and/or testing.
8. There is variability in the prokaryotic terminator sequence; what are the requirements for terminator composition for this project?
You should ensure that your intrinsic terminator is composed of a CG-rich hairpin loop (at least 4 base-pairs) followed by a poly-U tail (composed of at least 8 sequential uracil nucleotides). For the purposes of this project, transcription will be terminated immediately after the end of the 8+ uracil nucleotides.
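A string-level self-check of these terminator requirements might look like the Python sketch below. Several details are assumptions that the text above does not fix: "CG-rich" is taken to mean at least 50% G or C in the stem, the loop is limited to 3 to 8 nucleotides, the hairpin must sit immediately before the tail, and the poly-U tail appears as a run of T at the 3' end of the DNA. It tests only for perfect complementarity and does not predict folding; a package such as Vienna RNA would be needed for that.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def has_intrinsic_terminator(dna, min_stem=4, min_tail=8,
                             loop_range=(3, 8), min_gc=0.5):
    """Check for a GC-rich hairpin directly followed by a terminal T run."""
    seq = dna.upper()
    tail = re.search(r"T+$", seq)
    if tail is None or len(tail.group()) < min_tail:
        return False
    # Note: the greedy T run also absorbs any T at the 3' end of the stem.
    upstream = seq[:tail.start()]
    for stem_len in range(min_stem, len(upstream) // 2 + 1):
        right_arm = upstream[-stem_len:]
        gc_fraction = (right_arm.count("G") + right_arm.count("C")) / stem_len
        if gc_fraction < min_gc:
            continue
        for loop_len in range(loop_range[0], loop_range[1] + 1):
            left_end = len(upstream) - stem_len - loop_len
            left_start = left_end - stem_len
            if left_start < 0:
                break  # not enough sequence for this stem and loop
            if upstream[left_start:left_end] == revcomp(right_arm):
                return True
    return False
```

For example, "acgtgccgccgaaaggcggctttttttt" passes this check, while the same sequence with only seven trailing t's does not.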
Sample Input Files
Copyright © 2008-2013 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.