Home - this site is powered by TWiki(R)
Teaching > BioE131 > BacterialGenePrediction > BayesianInferenceHomework
TWiki webs: Main | TWiki | Sandbox   Log In or Register

Bayesian inference exercise.

These programming exercises also develop your understanding of Bayesian statistical inference procedures.

Background reading

First read the Wikipedia article on Wikipedia:Bayesian_inference. Specifically, be sure to read the section titled Examples.

Open reading frames - submit as hw8a.py

Suppose that S is a sequence of N codons. Equivalently, S can be thought of as a sequence of 3N nucleotides.

We model S as a sequence of Wikipedia:Independent_and_identically-distributed_random_variables. The random variables can be codons (N of them) or nucleotides (3N of them).

These are two very different models. In a sequence of 3N I.I.D. nucleotides, we can't guarantee that the sequence won't contain any stop codons (unless we remove T's and/or A's entirely). However, when generating a sequence of N I.I.D. codons, we can guarantee to generate no stop codons quite easily.

Consider the following two ways of generating sequence S:

  • (G=0) S is an I.I.D. sequence of 3N nucleotides. Each nucleotide (x) is independently sampled from some probability distribution p(x).
  • (G=1) S is an I.I.D. sequence of N independently sampled codons. Each codon is sampled as follows:
    1. Sample three random nucleotides (xyz) from the same probability distribution: i.e. sample x from p(x), y from p(y) and z from p(z).
    2. If the resultant codon xyz is a stop codon, then go back to the previous step (resampling x, y and z), repeating until you get an xyz that is not a stop codon.
    3. Use codon xyz.

These two ways of generating S represent two alternate probabilistic models for S. Let G denote the model used: if G=0 then S is a random sequence of nucleotides (and may contain stop codons). If G=1 then S is a random coding sequence (and will contain no stop codons).

Assume that both of these models are equally probable, a priori. Thus .

The goal of this exercise is to write a Python program that does the following:

  1. Read in a correctly formatted FastaFormat file (perform normal error checking for file existence - this is considered stylistic at this point);
  2. Estimate a probability distribution over nucleotides, p(x), by counting all nucleotides in the file. (10)
  3. For each sequence S in the file, calculate and print the following:
    • the posterior probability P(G=1|S); (15)
    • the log-ratio of the posterior probabilities, . (15)

Comment on the log-ratio (just add a comment at the top of your .py file). Does it tell you anything that the posterior probability P(G=1|S) does not? Are there any circumstances (i.e. input data) for which it might be a more appropriate to quote the log-ratio rather than the posterior probability? (5)

Event times - hw8b.py

The goal of this exercise is to write a Python program that reads in three files (E1, E2 & E3). The files summarize the result of experiments to record spike trains in neurons. Each file contains a list of time points, measured in seconds, of recorded neural spikes. (These are not durations - they are time points at which a neural spike occurred.) There is one time measurement per line; the final line of the file represents not a spike measurement but the time at which the experiment was stopped.

The first two files represent experiments over long time periods, under different conditions. In file E1, the neuron is in a resting state; in file E2, the neuron is activated. File E3 represents an experiment wherein it is not known whether the neuron was activated or not.

(Note that both activated and resting neurons have spikes - they just have different spike patterns.)

Assume that, in general (disregarding the specific conditions under which experiments E1 and E2 were conducted), the neuron is activated 1% of the time. Further assume that the number of spikes in an activated or a resting neuron follows a Wikipedia:Poisson_distribution. You may also assume that your estimate of lambda for the resting and activated states are in fact the true values of lambda for those states (rather than integrating out a prior over lambda... if you don't know what this means, don't worry about it yet).

Write a program to calculate and print the posterior probability that the neuron was activated during experiment E3. (45)

  • Hint Pay attention to units - be consistent.
  • Hint This follows the same form as above and the examples in class, with just a little more thought about what p( y | x ) means.
  • Include comments in your code that indicate what quantities you are calculating. (e.g.: # P(x | y) = p(x,y) / p(y) ).

Here are the files:

  • E1: spike times for resting neuron
  • E2: spike times for activated neuron
  • E3: spike times for neuron in unknown state

Other Notes

  • As always, the remaining 10% of your grade will come from style considerations.

A solution can be found here: BayesianInferenceSolution

-- IanHolmes - 06 Nov 2008
-- BenjaminEpstein - 16 Nov 2012 (Python update)

I Attachment Action Size Date Who Comment
E1EXT E1 manage 6.5 K 2007-10-27 - 00:06 IanHolmes spike times for resting neuron
E2EXT E2 manage 6.4 K 2007-10-27 - 00:06 IanHolmes spike times for activated neuron
E3EXT E3 manage 0.2 K 2007-10-27 - 00:06 IanHolmes spike times for neuron in unknown state
Edit | Attach | Print version | History: r12 < r11 < r10 < r9 < r8 | Backlinks | Raw View | Raw edit | More topic actions

This site is powered by the TWiki collaboration platformCopyright © 2008-2013 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback
TWiki Appliance - Powered by TurnKey Linux