Final project
This page outlines the final project possibilities for BioE131/231.
Brief summary of rules
The project topics are listed below.
You will write up a report of your project on the wiki page (including a mini-report, three weeks in),
and you'll also give a presentation at the final exam (5 minutes per team member).
Breakdown of credit
Credit will be broken down as follows:
- 20% for the mini-report (wiki page to be reviewed on November 19)
- 50% for the final report (wiki page to be reviewed on December 16)
- 30% for the final presentation (December 16)
Mini-report
The mini-report on your project wiki page is our main way to check that you're on track to complete a project.
It should contain an outline of who your team is (if any), which project you plan to do, the general approach that you plan to take and your approximate schedule.
The description of your approach should not include minutiae, just broad detail: what programs you'll write or use, what experiments you'll do, what computers you'll do them on, etc.
A good mini-report is like a grant proposal: persuasive, reasoned and short.
(We enjoy reading these)
A bad mini-report doesn't guarantee you'll fail; unfortunately, it doesn't guarantee that we can get you back on track, either.
(But we'll try)
Final report
Like the mini-report, the final report should be posted on the wiki.
The final report should follow the standard format for a scientific paper:
IMRAD (Wikipedia)
That is:
Introduction, Methods, Results, Analysis, Discussion.
Optionally you may move the Methods section to the end, after the Discussion.
(This is the way a lot of journals do it nowadays)
Keep each section short: one or two paragraphs is about right
(although in the Methods section you can go into more detail
if a reader would need it to reproduce your results).
Supplemental files, such as program scripts and short (<1MB) output files,
can be included as attachments to the wiki page.
See the following link for a more thorough description of the various sections in the IMRAD format:
Final exam
The time limits on presentations will be strictly enforced: 5 minutes per team member, with two- and one-minute warnings.
You will have full access to the class AV equipment (including speakers) and you can bring your laptop.
The goal of a scientific presentation is to inform and persuade:
- Clearly describe your work and show the results you obtained;
- Convince the audience that...
  - the questions you addressed are interesting and worthwhile;
  - the results are scientifically valid;
  - any claims you make are well-supported.
A little entertainment never hurts either.
Here are a few tips on scientific presentations:
Teams
You are encouraged to form teams, in which case your allotted presentation time will be proportional to the number of people in the team
(5 minutes per person).
You also need to clearly identify the separate contributions of team members.
Examples of clear contribution statements:
- "Malia wrote program A; Sasha wrote program B; Michelle did database searches and most of the presentation; all of us ran simulations and analyzed results"
- "Joey and Tommy did the computer science; Johnny did the biology; Dee-Dee did the math; all contributed equally to the presentation"
- "Dave coded the Perl, Andrew & Martin searched the genome, Vince wrote the report"
Fall08 project topics
There are three projects available. Choose one of these.
Assume that any project details left unspecified are free choices for you to make.
Your choices should be guided by best scientific practice, by consultation with the literature, and by reasoning based on the principles and examples taught in class.
Project Topic #1: species binning for metagenomics
In metagenomics (Wikipedia), short DNA sequence reads are extracted from multiple organisms sharing a single environmental niche.
A common method for dealing with the overwhelming amount of data from such experiments is to attempt to pre-process the short read data
by first "binning" it according to its sequence composition.
Your goal for this project is twofold. First, you should write a program that does the following:
- For input, it takes:
  - a FASTA file containing a large number of short sequences, corresponding to reads from a mixture of genomes (the test data);
  - a FASTA file containing longer sequences, one for each species that is expected either to be present in the mixture, or to be closely related to another species that is present (the training data).
- It uses the training data to gather compositional statistics for the sequences, and then uses any algorithm that you choose to partition the test data according to the most likely species.
- For output, it generates a series of FASTA files, one for each sequence in the training data; each generated file should contain all the short-read sequences that your program decided were most similar to that training sequence.
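As one possible starting point (not a required design), the program described above could be sketched as follows. This sketch assumes a simple k-mer frequency model with add-one smoothing. It treats overlapping k-mers as independent and ignores the reverse strand, both of which are simplifications. The function names (`read_fasta`, `kmer_log_probs`, `bin_reads`, `write_bins`), the default `k=4` and the `bin_` file prefix are all illustrative choices, not part of the assignment.

```python
import math
from collections import Counter
from itertools import product


def read_fasta(path):
    """Return a list of (name, sequence) pairs from a FASTA file."""
    records, name, chunks = [], None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    records.append((name, "".join(chunks)))
                # Use the first word of the header as the record name.
                name, chunks = (line[1:].split() or [""])[0], []
            elif line:
                chunks.append(line.upper())
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def kmer_log_probs(seq, k):
    """Log-probability of each DNA k-mer in seq, with add-one smoothing."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers) + len(kmers)
    return {m: math.log((counts[m] + 1) / total) for m in kmers}


def score(read, log_probs, k):
    """Sum of k-mer log-probabilities for a read under one species model.

    k-mers containing characters other than A, C, G, T are skipped.
    """
    kmers = (read[i:i + k] for i in range(len(read) - k + 1))
    return sum(log_probs[m] for m in kmers if m in log_probs)


def bin_reads(reads, training, k=4):
    """Map each training name to the reads that score highest against it."""
    models = {name: kmer_log_probs(seq, k) for name, seq in training}
    bins = {name: [] for name in models}
    for read_name, read_seq in reads:
        best = max(models, key=lambda name: score(read_seq, models[name], k))
        bins[best].append((read_name, read_seq))
    return bins


def write_bins(bins, prefix="bin_"):
    """Write one FASTA file per training sequence (names must be file-safe)."""
    for name, members in bins.items():
        with open(f"{prefix}{name}.fasta", "w") as out:
            for read_name, read_seq in members:
                out.write(f">{read_name}\n{read_seq}\n")
```

A typical run would be `write_bins(bin_reads(read_fasta("reads.fasta"), read_fasta("training.fasta")))`, where both file names are placeholders. Choosing `k`, handling reverse-complement reads and deciding what to do with low-scoring reads are left as design decisions.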
The second goal is to critically evaluate your program by means of a benchmark. For this you can use simulated data OR real data.
Either way, as part of your benchmark, you should create a program that simulates a metagenomics experiment, as follows:
- it takes as input one or more FASTA files, each containing a genome sequence;
- it generates a large number of simulated "short reads", corresponding to random subsequences of those genome sequences, and records which genome each short read came from.
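A minimal sketch of such a simulator is shown here. It assumes the genomes have already been loaded as (name, sequence) pairs. It also assumes error-free reads of a fixed length and sampling in proportion to genome length, neither of which is realistic for real communities or sequencers. The function name `simulate_reads` and the `read<i>|<source>|<start>` header format are hypothetical.

```python
import random


def simulate_reads(genomes, n_reads, read_len, seed=None):
    """Sample random subsequences from genomes given as (name, sequence) pairs.

    Returns (read_id, read_sequence, source_name) tuples, so the true
    origin of every read is recorded alongside the read itself.
    """
    rng = random.Random(seed)
    usable = [(name, seq) for name, seq in genomes if len(seq) >= read_len]
    if not usable:
        raise ValueError("no genome is at least read_len long")
    # Weight each genome by its number of possible start positions, so that
    # every start position in the pooled genomes is equally likely.
    weights = [len(seq) - read_len + 1 for _, seq in usable]
    reads = []
    for i in range(n_reads):
        name, seq = rng.choices(usable, weights=weights, k=1)[0]
        start = rng.randrange(len(seq) - read_len + 1)
        read_id = f"read{i}|{name}|{start}"
        reads.append((read_id, seq[start:start + read_len], name))
    return reads
```

Passing a fixed `seed` makes a benchmark repeatable. Uneven species abundances, variable read lengths and sequencing errors are natural extensions if you want the simulation to resemble a particular technology.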
You should use this "metagenomics simulator" in your benchmark.
You may also wish to use the PerlSequenceSimulator that you created as part of an earlier homework; or you may use real data.
The design of your binning algorithm and your subsequent evaluation are essentially up to you.
Some aspects of your program's performance that you might want to measure are (a) binning accuracy and (b) speed.
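For illustration, both measurements could be sketched along the following lines. This assumes you hold the true origin of each read (for example, from a simulator) and your program's assignments as two dictionaries keyed by read identifier. The helper names `binning_accuracy`, `confusion_counts` and `timed` are hypothetical.

```python
import time
from collections import Counter


def binning_accuracy(truth, predicted):
    """Fraction of reads whose predicted bin matches the true source.

    truth and predicted are dicts mapping read id -> species name.
    Reads missing from predicted count as incorrect.
    """
    if not truth:
        raise ValueError("no reads to evaluate")
    correct = sum(1 for rid, src in truth.items() if predicted.get(rid) == src)
    return correct / len(truth)


def confusion_counts(truth, predicted):
    """Count (true species, predicted species) pairs across all reads."""
    return Counter((src, predicted.get(rid)) for rid, src in truth.items())


def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start
```

Overall accuracy can hide systematic errors, such as two closely related species being confused with each other, which is why the per-pair counts are worth reporting as well.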
For more suggestions on experimental design, see the following paper:
A few other questions you might want to consider:
- Typically, how many reads are produced by metagenomics experiments? How long are those reads? How can we even talk about "typically" anyway, when the range of metagenomics experiments and experimental equipment is so broad? Damn, am I going to have to focus on a particular sub-area?
- What will be the accuracy (and computational complexity) of an exact approach based on sequence matching? How about an *inexact* approach using composition statistics?
- How many of your design questions are you able to answer a priori, by doing some back-of-the-envelope calculations, or by going to the library? How many require simulation to answer effectively?
Project Topic #2: software for RNA logic gate design
The idea behind this project is to implement a program (or a series of programs) like the one that you sketched for the RNA folding lab & homework.
Each program that you write should generate a candidate sequence for one of the ribozyme-based logic gates described in this paper:
One gate is fine; however, if you are on a roll (or working in a large team), implement as many gates as you are able to do in the time you have.
Refer to the RNA folding lab & homework for more details and background info.
You may optionally include a test for inter-accessibility between the states of your gate (i.e. examining the kinetics of your sequence's unfolding and re-folding pathway).
For example, one tool that you might investigate for this purpose is kinfold.
However, since we have not explicitly used RNA folding kinetics tools in lab, this is not a mandatory extension (even though it was part of Penchovsky and Breaker's software).
Project Topic #3: a phylogenomic survey of the phospholipid synthesis pathway
Inositol derivatives are versatile molecules that are implicated in signaling and membrane trafficking.
Their evolution and phylogenetic dispersal are reviewed in the following paper:
Consider especially Figure 2 of that paper, which describes the metabolic pathway for biosynthesis of inositol derivatives,
and Table 1, which indicates the phylogenetic distribution (at a broad scale) of many of the proteins from Figure 2.
The purpose of this (open-ended) project is a more detailed exploration of the results summarized in Table 1.
You will investigate the presence or absence of proteins involved in inositol biosynthesis in the genomes of selected species listed in the first column of that table.
It is strongly suggested that you focus your search on bacterial and archaeal genomes, since these tend to be the smallest, offering you the most bang/$ for the computational resources you have available.
Other than that, you are free to choose any genomes you wish:
casting your net widely and doing a broad-scale survey of multiple genomes and proteins,
or homing in more narrowly on a particular subset.
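Whichever scope you choose, the results can be summarized as a presence/absence table. The sketch below shows one way to build it. It assumes, purely for illustration, that you searched each genome with BLAST and saved the hits in tabular format (`-outfmt 6`, where the query ID is the first column and the E-value is the eleventh). The `1e-5` E-value cutoff and the function names are hypothetical choices, and the project does not prescribe any particular search tool.

```python
def proteins_with_hits(lines, max_evalue=1e-5):
    """Return the set of query IDs with at least one hit at or below max_evalue.

    lines is an iterable of BLAST tabular (outfmt 6) lines.
    """
    found = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if float(fields[10]) <= max_evalue:
            found.add(fields[0])
    return found


def presence_table(hits_by_genome, proteins, max_evalue=1e-5):
    """Map each genome name to a list of True/False flags, one per protein."""
    table = {}
    for genome, lines in hits_by_genome.items():
        found = proteins_with_hits(lines, max_evalue)
        table[genome] = [p in found for p in proteins]
    return table


def format_table(table, proteins):
    """Render the table as tab-separated text with + and - marks."""
    rows = ["genome\t" + "\t".join(proteins)]
    for genome, flags in table.items():
        marks = "\t".join("+" if flag else "-" for flag in flags)
        rows.append(genome + "\t" + marks)
    return "\n".join(rows)
```

Keep in mind that a significant hit does not by itself establish that the genome encodes a functional ortholog, and a missing hit may reflect sequence divergence or an incomplete genome rather than true absence. Connecting such a table to the underlying biology is part of the analysis.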
You will be graded on:
- the accuracy and cogency of your summary of relevant biological literature (e.g. relevant parts of the review paper by Michell and/or other papers in this area);
- the design and execution of your computational experiments;
- your ability to connect the results of your computational experiments to the underlying biology.