click on the Biowiki logo to go to homepage
Edit Raw Print
Links Diffs RSS
About Stats Recent


Research Teaching Blog
Fall10 | Sandbox
Biowiki > Teaching > Bio E 131 > SampleMidterm

Search

Advanced search...

Topics

PageRank Checker

Sample midterm questions

Sample questions for the BioE131 midterm exam.

Multiple-choice questions

(1) Which of the following are all Perl keywords that can be used to manipulate strings?

    (a) foreach, each, for

    (b) reverse, uc, chop

    (c) while, if, unless

    (d) reverse, sort, grep

(2) Which of the following is the most suitable Perl expression for testing if a DNA sequence contains a run of 3 or more "AT"'s?

    (a) /[AT]+/g (one or more AT's, global; will match any run of characters with only the letters A or T whose length is at least 1; i.e. A, T, AT, ATT, TTT, etc all match)

    (b) /(AT)+/i (one or more AT's, case-insensitive; this is the wrong number of AT's, as it will match AT or ATAT as well as ATATAT, ATATATAT etc.)

    (c) /AT+++/g (this is incorrect regular expression syntax)

    (d) /ATATAT/i (three AT's, case-insensitive; this is the correct number of AT's and, additionally, does not assume that the sequence is in upper-case)

(3) Suppose you are writing a Perl program that needs to store a mapping of names to numbers (as in a telephone directory). The most appropriate data structure for this task would be:

    (a) a scalar variable.

    (b) an array variable.

    (c) a hash variable.

    (d) a filehandle.

(4) Which is the most accurate statement of the relationship between the "Gene Ontology" and "Reactome"?

    (a) Reactome is a database of reactions, while Gene Ontology is a database of gene sequences.

    (b) The Gene Ontology is a formal vocabulary for describing gene functions, which was developed using Reactome, a database of cellular reactions and pathways.

    (c) Reactome is a database of cellular reactions and pathways, which are expressed using terms from the Gene Ontology, a formal vocabulary for gene function.

    (d) Reactome is oriented to bacteria, whereas the Gene Ontology is oriented to multicellular organisms.

(5) The "central dogma of molecular biology" is that DNA makes RNA makes protein: DNA is the storage material, RNA is for transport, and proteins are enzymes. This is...

    (a) always true, with no exceptions.

    (b) true with some exceptions; for example, some genes code for RNA enzymes.

    (c) true only for eukaryotes; bacteria, on the other hand, lack the "RNA to protein" step.

    (d) true only if the genome does not contain any stop codons.

Short written-answer questions


Consider the following Perl program:

        my ($total, $n) = (0, 0);
        while (<>) {
            my @f = split;
            $total += $f[0];
            ++$n;
        }
        print $total / $n;
    
Using plain English, give a brief written description of what this program does.
    takes user input from standard input. for each line of numbers the user types in, extracts the first number in the line and add it to a cumulative sum. when the user is done inputting numbers, the average of the numbers is printed.


Describe the scoring scheme for the Needleman-Wunsch algorithm, and explain what properties of this scoring scheme allow the highest-scoring alignment to be calculated for any pair of sequences.

    The score of an alignment is defined to be the sum of the substitution score S(a,b) for each aligned residue pair (a,b), plus a gap penalty G for every gap. The maximum score can be calculated by dynamic programming because "max" is distributive over "+", i.e. \max_{i,j}(x_i+y_j)=\max_i(x_i)+\max_j(y_j), so that the max score for any two sequences can be expressed recursively in terms of the max scores for shorter sequences. If A[i] represents the i'th residue of sequence A, and B[j] the j'th residue of sequence B, then the Needleman-Wunsch recursion is:

      C(i,j) = max { C(i-1,j-1)+S(A[i],B[j]), C(i,j-1)-G, C(i-1,j)-G }

    where C(i,j) denotes the maximal alignment score for the subsequences A[1]..A[i] and B[1]..B[j]. The recursion is terminated by the following "boundary condition":

      C(0,0)=0


A bag contains a four-sided die, three six-sided dice and two eight-sided dice.

    (a) What is the probability that, if you randomly pick a die from the bag and roll it, you will get a five?

      Let D be the number of sides of the dice that you pick and let N be the number that you roll.

       \begin{array}{rll} P(D=4) & = & 1/6 \\ P(D=6) & = & 3/6 \\ & = & 1/2 \\ P(D=8) & = & 2/6 \\ & = & 1/3 \\ P(N=n|D=d) & = & \left\{ \begin{array}{ll} 1/d & (n \leq d) \\ 0 & (n > d) \end{array} \right. \\ P(D=d,N=n) & = & P(D=d)P(N=n|D=d) \\ P(N=n) & = & \sum_d P(D=d,N=n) \\ & = & \sum_d P(D=d)P(N=n|D=d)\\ P(N=5) & = & P(D=4)P(N=5|D=4)+P(D=6)P(N=5|D=6)+P(D=8)P(N=5|D=8) \\ & = & 0+\frac{1}{2}*\frac{1}{6}+\frac{1}{3}*\frac{1}{8} \\ & = & \frac{1}{8} \end{array}

      (b) Ezekiel takes the bag from you, picks a die out of it, and without showing you the roll, announces that he rolled a five. What is the probability that the die Ezekiel rolled was six-sided (assuming that you trust his report of what he rolled)?

        Using Bayes' Theorem:

        P(D=d|N=n)=\frac{P(D=d,N=n)}{P(N=n)}=\frac{P(D=d)P(N=n|D=d)}{P(N=n)} Therefore P(D=6|N=5)=\frac{\frac{1}{2}*\frac{1}{6}}{\frac{1}{8}}=\frac{2}{3}


    Two sequences, each one kilobase long, are aligned using the Smith-Waterman algorithm. The pairwise alignment algorithm takes 5 seconds to run on Maira's PC. Approximately how long might you expect the same PC would take to align two sequences each of length three kilobases? Roughly how much memory would this take?

      Pairwise sequence alignment involves the construction of a dynamic programming matrix that has L^2+2L+1 cells, where L is the sequence length. The 2L+1 is negligible for long sequences, compared to the leading term L^2. Pairwise alignment is therefore said to have time and memory complexity O(L^2). Scaling up the sequence length by a factor of three will scale up the runtime by a factor of (approximately) three squared, i.e. nine. So you'd expect it to take around 45 seconds (as long as it doesn't run out of memory, in which case it would start "swapping" i.e. using the disk drive as extra memory, which would drastically slow things down).

      How much memory is needed? The dynamic programming matrix will have roughly 3,000*3,000 cells; that is, 9 million cells. The actual memory used will depend on how many bytes are used to store the score variable in each cell. If Maira's PC is a 32-bit machine (as are most desktop PCs in 2006) then the default storage required for a variable is 4 bytes (8 bits per byte, 32 bits per variable). So let's assume 4 bytes per cell, requiring about 36 million bytes, or around 36 megabytes of RAM.

      This is a quite modest amount of memory for modern PCs, and we could stop here, but (to really pound the @!?% out of this question) we can imagine squeezing the matrix into less space by programming the algorithm ultra-efficiently. Let's say the score for aligning two nucleotides is +5 for a match, and -4 for a mismatch. Then the maximum possible score for the two 3kb sequences is 15,000 and the minimum possible score is 0 (because this is Smith-Waterman alignment, so the score never gets negative). It is possible to store this range of integer scores in a 16-bit variable (2^16=65,536). Sixteen bits per cell means two bytes per cell. There are nine million cells (as before). So, if you were really starved of memory (e.g. you had an old PC) you could align two three-kilobase sequences in 18 megabytes or so. On the other hand, a modern 64-bit implementation of the algorithm might use 72Mb, if it was lazily coded.

      Answer: About 45 seconds of CPU time, and 36Mb of RAM. Being out by a factor of 2 in either direction is OK for the RAM usage; there is also a small dependence on how you define "kilobase" (1000) and "kilobyte" (1024); so any answer in the range 17-18Mb, 34-36Mb or 68-72Mb is OK.

      See also Binary prefix (Wikipedia) for discussion of "kilo=1000" vs "kibi=1024"...


    Give three examples of the relevance of information theory to bioinformatics. Include formulae for relevant quantities.

      Possible answers include Shannon entropy (can use this for several purposes: as a general measure of the complexity of a probability distribution e.g. DNA or protein composition, to score columns in multiple alignments, to identify highly conserved columns, to scan for low-complexity regions in a sequence...), relative entropy (measures the difference between two probability distributions), mutual information (a measure of how much one random variable tells you about another random variable, used e.g. to detect basepairing between two columns in a multiple alignment of RNA sequences), substitution matrix scores for sequence alignment have an interpretation as the information content of the match (in data compression terms, the number of bits you save by encoding the two nucleotides jointly vs separately), compression of large sequence databases, information content of a motif (provides a way to estimate the expected frequency of that motif in random sequence)... of course we need to give formulae for all these things...


    Discuss the various interpretations of the concept of "evolutionary time" in phylogenetics. Consider, in particular, the following questions: What does it mean for a phylogenetic tree to be "clock-like"? What is the alternative? Is the Jukes-Cantor time estimate a clock-like estimate? What problems arise with the Jukes-Cantor estimate for highly diverged sequences?

      In a clock-like tree, all the branch lengths are consistent with a global reference frame of evolutionary time. In other words, the branch length from the root of the tree to any particular node, X, identifies the time (or date) corresponding to node X. If the leaves are contemporaneous (e.g. in a species tree where each leaf represents a modern-day sequence), the branch length from the root of the tree to any tip is equal, resulting in a symmetric tree. If the leaves represent sequence data sampled at different times (e.g. in a tree of viral RNA, serially sampled from a single patient), then the branch length from root to tip corresponds to the time at which the sequence was sampled. Clock-like behavior is violated if there are different rates of evolution along different lineages, or if the tree contains species with varying generation times. This generally results in an asymmetric tree. The alternative to assuming a clock-like tree is simply to regard the branch length as an abstract "distance", conventionally measured in units of the average number of substitutions per site. In this case, the length of an individual branch may still be proportional to the evolutionary time along that branch, but the constant of proportionality may vary throughout the tree. An individual branch or subtree may, therefore, sometimes still be said to be clock-like (or "based on a molecular clock", particularly if the branch lengths are derived from a probabilistic model that explicitly includes time; see discussion of Jukes-Cantor time estimates, below). However, the tree as a whole is not clock-like, because the overall set of branch lengths does not allow you to correctly pinpoint each node in time. For example, consider a tree containing rodent and primate genes. The subtree corresponding to the rodents may be approximately clock-like, as may be the subtree corresponding to the primates; but, since rodents have shorter generation times than primates, the tree as a whole will probably not be clock-like, since the rodent branches will (in general) be longer.
    "Is the Jukes-Cantor time estimate a clock-like estimate?"
      The Jukes-Cantor model assumes that sequences evolve according to a molecular clock, and yields a time-like estimate of the pairwise divergence time between two sequences. However, use of this time estimate does not, in itself, guarantee clock-like symmetry in the tree that is eventually created. The Jukes-Cantor time estimate is essentially just used to populate a distance matrix; a separate tree-building algorithm is required to build the actual tree. Thus, to ensure a clock-like tree, you need to use a tree-building algorithm that guarantees this property, such as UPGMA.
    "What problems arise with the Jukes-Cantor estimate for highly diverged sequences?"
      As evolutionary time passes, all similarity to the ancestral sequence is lost. That is, two sequences that are related by the Jukes-Cantor model, but are highly diverged, will have low sequence identity (tending towards 25%, the expected value for unrelated sequence). This means that estimates of the divergence time of the two sequences will be extremely unreliable, and will have a large variance. Pretty much the only thing that you can say with any confidence is that the sequences are highly diverged. If, for example, the Jukes-Cantor model averages one substitution per site per million years, then it is very hard to distinguish between a 10Myr alignment and a 100Myr alignment, since in both cases the sequences appear essentially unrelated.


    What kinds of data are in Genbank?

      Sequences, literature references, information about the source of the sequence (organism, contributor, etc.), feature annotations (genes, exons, introns, repeats, binding sites...)


    The following table gives US mortalities for one week in January 2003 for (i) motor vehicle-related causes and (ii) all causes (source: National Center for Health Statistics). What's the probability of being killed in a car crash? What about the probability of dying on a Monday? If someone dies on a Monday, what's the probability that they died in a car crash? If someone dies in a car crash, what's the probability they died on a Monday?

    Day Auto-related All causes
    Mon 96 7456
    Tue 94 7048
    Wed 82 7020
    Thu 88 6987
    Fri 105 7219
    Sat 115 7403
    Sun 112 7162
      Probability of being killed in a car crash: sum(column 1) / sum(column 2)
      Probability of dying on a Monday: 7456 / sum(column 2)
      If someone dies on a Monday, the probability that they died in a car crash is: 96/7456
      If someone dies in a car crash, the probability that they died on a Monday is: 96 / sum(column 1)


    Approximately how frequently would you expect to see the hexamer AGGTCA in a perfectly random DNA sequence?

      Perfectly random means that each base has same probability of 1/4, aka a uniform distribution. So the probability that a random 6-base motif is AGGTCA is p = (1/4)^6. In random sequence, you would expect to see this motif (on average) every 1/p nucleotides, i.e. every 4^6 nucleotides. You can get to the same answer by observing that the motif contains 2*6=12 bits of information and so you expect to see it once every 2^12 positions (2^12 = 4^6).


    Midterm papers past

    2008

    See this page:

    2007

    2006

    2005

Actions: Edit | Attach | New | Ref-By | Printable view | Raw view | Normal view | See diffs | Help | More...