Bayesian inference exercise.
These programming exercises also develop your understanding of Bayesian statistical inference procedures.
First read the Wikipedia article on Wikipedia:Bayesian_inference
Specifically, be sure to read the section titled
Open reading frames - submit as hw8a.py
Suppose that S is a sequence of N codons. Equivalently, S can be thought of as a sequence of 3N nucleotides.
We model S as a sequence of Wikipedia:Independent_and_identically-distributed_random_variables
The random variables can be codons (N of them) or nucleotides (3N of them).
These are two very different models.
In a sequence of 3N I.I.D. nucleotides, we cannot guarantee that the sequence contains no stop codons
(unless we remove T's and/or A's entirely).
However, when generating a sequence of N I.I.D. codons, we can quite easily guarantee that no stop codons are generated.
Consider the following two ways of generating sequence S:
- (G=0) S is an I.I.D. sequence of 3N nucleotides. Each nucleotide (x) is independently sampled from some probability distribution p(x).
- (G=1) S is an I.I.D. sequence of N independently sampled codons. Each codon is sampled as follows:
- Sample three random nucleotides x, y and z, each drawn independently from the same probability distribution p(x).
- If the resultant codon xyz is a stop codon, then go back to the previous step (resampling x, y and z), repeating until you get an xyz that is not a stop codon.
- Use codon xyz.
These two ways of generating S represent two alternative probabilistic models.
Let G denote the model used: if G=0 then S is a random sequence of nucleotides (and may contain stop codons). If G=1 then S is a random coding sequence (and will contain no stop codons).
Assume that both of these models are equally probable, a priori.
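The two generative processes above can be sketched as follows. This is a minimal illustration, not part of the required submission; the uniform distribution P_NUC is a placeholder (in the assignment, p(x) is estimated from the FASTA file by counting), and the function names are our own.

```python
import random

# Placeholder nucleotide distribution p(x); in the assignment this is
# estimated by counting nucleotides in the input file.
P_NUC = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
STOP_CODONS = {"TAA", "TAG", "TGA"}

def sample_nucleotide():
    """Draw one nucleotide from p(x) by inverting the CDF."""
    r = random.random()
    cum = 0.0
    for nt, p in P_NUC.items():
        cum += p
        if r < cum:
            return nt
    return nt  # guard against floating-point round-off

def sample_codon_g1():
    """Model G=1: rejection-sample a codon, resampling until it is not a stop."""
    while True:
        codon = sample_nucleotide() + sample_nucleotide() + sample_nucleotide()
        if codon not in STOP_CODONS:
            return codon

def generate_sequence(n_codons, model):
    """Generate S under G=0 (3N plain I.I.D. nucleotides) or G=1 (N stop-free codons)."""
    if model == 0:
        return "".join(sample_nucleotide() for _ in range(3 * n_codons))
    return "".join(sample_codon_g1() for _ in range(n_codons))
```

Under G=0 every 3N-nucleotide string has positive probability, including strings with in-frame stop codons; under G=1 the rejection step renormalizes the codon distribution over the 61 non-stop codons.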
The goal of this exercise is to write a Python program that does the following:
- Read in a correctly formatted FastaFormat file (perform normal error checking for file existence - this is considered stylistic at this point);
- Estimate a probability distribution over nucleotides, p(x), by counting all nucleotides in the file. (10)
- For each sequence S in the file, calculate and print the following:
- the posterior probability P(G=1|S); (15)
- the log-ratio of the posterior probabilities, log[ P(G=1|S) / P(G=0|S) ]. (15)
Comment on the log-ratio (just add a comment at the top of your .py file). Does it tell you anything that the posterior probability P(G=1|S) does not? Are there any circumstances (i.e. input data) for which it might be a more appropriate to quote the log-ratio rather than the posterior probability? (5)
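One possible shape for the core of hw8a.py is sketched below, under these assumptions: p(x) is estimated by simple counting, both priors are 0.5 as stated above, and P(S|G=1) differs from P(S|G=0) only by the per-codon renormalization 1/(1 - q), where q is the probability that a freely sampled codon is a stop codon. The function names are our own, not prescribed by the assignment.

```python
import math

STOP_CODONS = {"TAA", "TAG", "TGA"}

def estimate_p_nuc(sequences):
    """Estimate p(x) by counting every nucleotide in the input sequences."""
    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    for s in sequences:
        for nt in s:
            counts[nt] += 1
    total = sum(counts.values())
    return {nt: c / total for nt, c in counts.items()}

def posterior_g1(seq, p_nuc, prior_g1=0.5):
    """Return (P(G=1|S), log-ratio log[P(G=1|S)/P(G=0|S)]) for one sequence."""
    n = len(seq) // 3
    codons = [seq[3 * i:3 * i + 3] for i in range(n)]
    # q = probability that a freely sampled codon xyz is a stop codon
    q = sum(p_nuc[c[0]] * p_nuc[c[1]] * p_nuc[c[2]] for c in STOP_CODONS)
    if any(c in STOP_CODONS for c in codons):
        # G=1 cannot generate a stop codon, so P(S|G=1) = 0.
        return 0.0, float("-inf")
    # The product of p(x) over all 3N nucleotides cancels in the ratio;
    # only the (1 - q) renormalization per codon survives.
    log_ratio = -n * math.log1p(-q) + math.log(prior_g1 / (1 - prior_g1))
    post = 1.0 / (1.0 + math.exp(-log_ratio))
    return post, log_ratio
```

Working in log space is what makes the log-ratio easy to report: for long sequences P(G=0|S) underflows toward zero while the log-ratio stays finite and informative.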
Event times - hw8b.py
The goal of this exercise is to write a Python program that reads in three files (E1, E2 & E3).
The files summarize the result of experiments to record spike trains in neurons.
Each file contains a list of time points, measured in seconds, of recorded neural spikes. (These are not durations - they are time points at which a neural spike occurred.)
There is one time measurement per line; the final line of the file represents not a spike measurement but the time at which the experiment was stopped.
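The file format described above can be read with a short helper like the following sketch (the function name is our own); the key point is that the final line is the stopping time, not a spike.

```python
def read_spike_file(path):
    """Read a spike-time file: one time point (in seconds) per line.
    The final line is the time the experiment was stopped, not a spike.
    Returns (number_of_spikes, duration_in_seconds)."""
    with open(path) as f:
        times = [float(line) for line in f if line.strip()]
    return len(times) - 1, times[-1]
```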
The first two files represent experiments over long time periods, under different conditions: in file E1, the neuron is in a resting state; in file E2, the neuron is activated.
File E3 represents an experiment wherein it is not known whether the neuron was activated or not.
(Note that both activated and resting neurons have spikes - they just have different spike rates.)
Assume that, in general (disregarding the specific conditions under which experiments E1 and E2 were conducted), the neuron is activated 1% of the time.
Further assume that the number of spikes in an activated or a resting neuron follows a Wikipedia:Poisson_distribution. You may also assume that your estimates of lambda for the resting and activated states are in fact the true values of lambda for those states (rather than integrating out a prior over lambda; if you don't know what this means, don't worry about it yet).
Write a program to calculate and print the posterior probability that the neuron was activated during experiment E3. (45)
- Hint Pay attention to units - be consistent.
- Hint This follows the same form as above and the examples in class, with just a little more thought about what p( y | x ) means.
- Include comments in your code that indicate what quantities you are calculating. (e.g.:
# P(x | y) = p(x,y) / p(y) ).
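Putting the hints together, the Bayes calculation for hw8b might look like the sketch below. Assumptions: the rates are estimated as spikes per second (lambda = spike count / duration, keeping units consistent per the hint), the spike count in E3 is Poisson with mean lambda * T3, and the prior P(activated) = 0.01 as stated above. The function names are our own.

```python
import math

def log_poisson_pmf(k, mu):
    """log P(k spikes) under a Poisson distribution with mean mu = lambda * T."""
    return k * math.log(mu) - mu - math.lgamma(k + 1)

def posterior_activated(n_rest, t_rest, n_act, t_act, n_obs, t_obs,
                        prior_act=0.01):
    """P(activated | E3), treating the estimated rates as the true lambdas."""
    lam_rest = n_rest / t_rest   # spikes per second, resting (from E1)
    lam_act = n_act / t_act      # spikes per second, activated (from E2)
    # P(E3 | state) = Poisson(n_obs; lambda_state * t_obs)
    log_l_rest = log_poisson_pmf(n_obs, lam_rest * t_obs)
    log_l_act = log_poisson_pmf(n_obs, lam_act * t_obs)
    # Bayes' rule, done in log space for numerical stability:
    # P(act | E3) = P(E3 | act) P(act) / [P(E3 | act) P(act) + P(E3 | rest) P(rest)]
    log_num = math.log(prior_act) + log_l_act
    log_den = math.log(1 - prior_act) + log_l_rest
    return 1.0 / (1.0 + math.exp(log_den - log_num))
```

Note how the 1% prior pulls the answer toward "resting" unless the E3 spike count is clearly more probable under the activated rate.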
Here are the files:
- E1: spike times for resting neuron
- E2: spike times for activated neuron
- E3: spike times for neuron in unknown state
- As always, the remaining 10% of your grade will come from style considerations.
A solution can be found here: BayesianInferenceSolution
- 06 Nov 2008
- 16 Nov 2012 (Python update)
Copyright © 2008-2013 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.