Teaching > BioE131 > BacterialGenePrediction > BayesianInferenceHomework

# Bayesian inference exercise.

These programming exercises also develop your understanding of Bayesian statistical inference procedures.

First read the Wikipedia article on Wikipedia:Bayesian_inference. Specifically, be sure to read the section titled Examples.

## Open reading frames - submit as hw8a.py

Suppose that S is a sequence of N codons. Equivalently, S can be thought of as a sequence of 3N nucleotides.

We model S as a sequence of Wikipedia:Independent_and_identically-distributed_random_variables. The random variables can be codons (N of them) or nucleotides (3N of them).

These are two very different models. In a sequence of 3N I.I.D. nucleotides, we can't guarantee that the sequence won't contain any stop codons (unless we remove T's and/or A's entirely). However, when generating a sequence of N I.I.D. codons, we can guarantee to generate no stop codons quite easily.

Consider the following two ways of generating sequence S:

• (G=0) S is an I.I.D. sequence of 3N nucleotides. Each nucleotide (x) is independently sampled from some probability distribution p(x).
• (G=1) S is an I.I.D. sequence of N independently sampled codons. Each codon is sampled as follows:
1. Sample three random nucleotides (xyz) from the same probability distribution: i.e. sample x from p(x), y from p(y) and z from p(z).
2. If the resultant codon xyz is a stop codon, then go back to the previous step (resampling x, y and z), repeating until you get an xyz that is not a stop codon.
3. Use codon xyz.

These two ways of generating S represent two alternate probabilistic models for S. Let G denote the model used: if G=0 then S is a random sequence of nucleotides (and may contain stop codons). If G=1 then S is a random coding sequence (and will contain no stop codons).

Assume that both of these models are equally probable, a priori. Thus .

The goal of this exercise is to write a Python program that does the following:

1. Read in a correctly formatted FastaFormat file (perform normal error checking for file existence - this is considered stylistic at this point);
2. Estimate a probability distribution over nucleotides, p(x), by counting all nucleotides in the file. (10)
3. For each sequence S in the file, calculate and print the following:
• the posterior probability P(G=1|S); (15)
• the log-ratio of the posterior probabilities, . (15)

Comment on the log-ratio (just add a comment at the top of your .py file). Does it tell you anything that the posterior probability P(G=1|S) does not? Are there any circumstances (i.e. input data) for which it might be a more appropriate to quote the log-ratio rather than the posterior probability? (5)

## Event times - hw8b.py

The goal of this exercise is to write a Python program that reads in three files (E1, E2 & E3). The files summarize the result of experiments to record spike trains in neurons. Each file contains a list of time points, measured in seconds, of recorded neural spikes. (These are not durations - they are time points at which a neural spike occurred.) There is one time measurement per line; the final line of the file represents not a spike measurement but the time at which the experiment was stopped.

The first two files represent experiments over long time periods, under different conditions. In file `E1`, the neuron is in a resting state; in file `E2`, the neuron is activated. File `E3` represents an experiment wherein it is not known whether the neuron was activated or not.

(Note that both activated and resting neurons have spikes - they just have different spike patterns.)

Assume that, in general (disregarding the specific conditions under which experiments E1 and E2 were conducted), the neuron is activated 1% of the time. Further assume that the number of spikes in an activated or a resting neuron follows a Wikipedia:Poisson_distribution. You may also assume that your estimate of lambda for the resting and activated states are in fact the true values of lambda for those states (rather than integrating out a prior over lambda... if you don't know what this means, don't worry about it yet).

Write a program to calculate and print the posterior probability that the neuron was activated during experiment E3. (45)

• Hint Pay attention to units - be consistent.
• Hint This follows the same form as above and the examples in class, with just a little more thought about what p( y | x ) means.
• Include comments in your code that indicate what quantities you are calculating. (e.g.: `# P(x | y) = p(x,y) / p(y)` ).

Here are the files:

• E1: spike times for resting neuron
• E2: spike times for activated neuron
• E3: spike times for neuron in unknown state

## Other Notes

• As always, the remaining 10% of your grade will come from style considerations.

A solution can be found here: BayesianInferenceSolution

-- IanHolmes - 06 Nov 2008
-- BenjaminEpstein - 16 Nov 2012 (Python update)

I Attachment Action Size Date Who Comment
EXT E1 manage 6.5 K 2007-10-27 - 00:06 IanHolmes spike times for resting neuron
EXT E2 manage 6.4 K 2007-10-27 - 00:06 IanHolmes spike times for activated neuron
EXT E3 manage 0.2 K 2007-10-27 - 00:06 IanHolmes spike times for neuron in unknown state

Copyright © 2008-2013 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback