We are investigating methods of inference for large-scale detection of bacteria in complex environmental samples using microarrays. Traditionally, microarrays have been used to interrogate for the presence or absence of specific target genes, each specific to one particular organism. Recently, work has been done on using microarrays to characterize the bacterial composition of an environmental sample. Our collaborators at LBL have constructed custom microarrays using Affymetrix photolithographic microarray technology. These arrays contain ~500,000 25-mer oligonucleotide probes specific to the 16S rRNA gene from a diverse set of bacteria that we wish to detect in our environmental samples. Because the probes are constructed by taking the complementary RNA strand from these targets, and the target sequences are nearly homologous at numerous locations, there is a probe selection problem. Since the array has already been constructed, we want to determine which probes will be good at probing for specific target sequences. Given the results of an experiment, we can then make inference about the resulting expression patterns.
We propose a probabilistic model in which the intensity measured at each probe location is related to the sequences present, formalized in a Bayesian framework. This framework allows us to compute the marginal probability of each sequence being present given experimental data, and thereby to make inference about the composition of a biological sample. We model the dependency between the probe signals and the sequences as a bipartite graph in which each sequence can be connected to multiple probes (the sequences connected to a probe form that probe's parent set). Because we are dealing with ~100,000 sequences and ~500,000 probes, we cannot compute the marginal probabilities exactly: the cost of exact computation scales exponentially in the number of sequences under consideration. Instead, we use loopy sum-product, MCMC, and a variety of simulations to estimate the marginal probabilities. The graph is constructed from physical characteristics of the theoretical probe-target duplexes; an edge is drawn wherever we believe a probe-target duplex could possibly form. We also analyze experimental data to estimate the distribution of intensity measurements as a mixture of a signal distribution and a background distribution.
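As a minimal sketch of the setup described above, the toy model below uses an assumed noisy-OR-style observation model (a probe draws its intensity from a "signal" Gaussian if any of its parent sequences is present, otherwise from a "background" Gaussian); the graph, the Gaussian parameters, and the prior are all illustrative assumptions, not the project's actual likelihood. The brute-force marginal computation makes the exponential scaling concrete: it enumerates all 2^n presence/absence states, which is why it is only feasible on tiny graphs.

```python
import itertools
import math

# Toy bipartite model: hidden binary sequences -> observed probe intensities.
# Assumed observation model (illustrative only): a probe's intensity comes
# from a "signal" Gaussian if any parent sequence is present, otherwise
# from a "background" Gaussian.

parents = {            # probe -> parent sequence set (edges of the bipartite graph)
    "p1": {"s1"},
    "p2": {"s1", "s2"},
    "p3": {"s2", "s3"},
}
prior = 0.5            # assumed prior probability that a sequence is present

def norm_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def likelihood(intensities, state):
    """P(observed intensities | presence/absence state of all sequences)."""
    lik = 1.0
    for probe, pa in parents.items():
        signal = any(state[s] for s in pa)
        mu, sigma = (10.0, 2.0) if signal else (2.0, 1.0)   # assumed parameters
        lik *= norm_pdf(intensities[probe], mu, sigma)
    return lik

def exact_marginals(intensities, sequences):
    """Brute-force posterior marginals -- O(2^n), feasible only for tiny graphs."""
    post = {s: 0.0 for s in sequences}
    Z = 0.0
    for bits in itertools.product([0, 1], repeat=len(sequences)):
        state = dict(zip(sequences, bits))
        p = likelihood(intensities, state)
        p *= math.prod(prior if b else 1 - prior for b in bits)
        Z += p
        for s, b in state.items():
            if b:
                post[s] += p
    return {s: post[s] / Z for s in sequences}

obs = {"p1": 9.5, "p2": 10.2, "p3": 1.8}   # p1, p2 bright; p3 dark
print(exact_marginals(obs, ["s1", "s2", "s3"]))
```

With these observations, s1 gets a high posterior (it alone explains both bright probes), while s2 and s3 are pushed down by the dark probe p3.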
Results and current status
We have implemented the loopy sum-product algorithm and an MCMC sampler to address the inference problem. Using the proposed framework on very small datasets, where exact computations are possible, has shown it to behave reasonably, in the sense that the posterior marginal probabilities correctly capture the essence of the (simulated) true model. Subsequently, we have assessed the computational methodology on small simulated datasets (at most 5,000 probes and 100 sequences), with emphasis on whether the methodology will scale to real data. While the datasets are small, they have been simulated to be heavily interconnected, representing small, densely connected subgraphs of the original graph. We have had problems getting the loopy sum-product algorithm to perform adequately; we suspect this may be improved by better scaling/normalization of the intermediate calculations, although the fact that we are already running into under/overflow at this stage is not promising. The MCMC sampler performs quite well even with a very simple proposal distribution. We are not certain that this simple proposal distribution will scale to the real problem (a suggestion has been made for improving it), although that concern is somewhat hypothetical, as the sampler works fine on the simulated datasets assessed so far. We believe our current approach is promising; the next step is testing it on real data, which we expect to complete within the next 1-2 months. The final report is available upon request as a PDF file.
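A single-flip Metropolis sampler of the kind described above (a very simple proposal: pick one sequence uniformly and flip its presence bit) can be sketched as follows. The graph, observation model, and parameters are the same illustrative assumptions as before, not the project's actual likelihood; working in the log domain also sidesteps the kind of under/overflow encountered in the sum-product intermediates, since products of many small probabilities become sums of logs.

```python
import math
import random

# Single-flip Metropolis sampler for posterior marginals of sequence presence.
# Illustrative only: the Gaussian signal/background observation model and all
# parameters are assumptions, not the project's actual likelihood.

parents = {"p1": {"s1"}, "p2": {"s1", "s2"}, "p3": {"s2", "s3"}}
obs = {"p1": 9.5, "p2": 10.2, "p3": 1.8}
sequences = ["s1", "s2", "s3"]
prior = 0.5

def log_norm_pdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def log_posterior(state):
    """Unnormalized log posterior; the log domain avoids the under/overflow
    that plagues long products of small probabilities."""
    lp = sum(math.log(prior if state[s] else 1 - prior) for s in sequences)
    for probe, pa in parents.items():
        signal = any(state[s] for s in pa)
        mu, sigma = (10.0, 2.0) if signal else (2.0, 1.0)
        lp += log_norm_pdf(obs[probe], mu, sigma)
    return lp

def metropolis_marginals(n_iter=20000, seed=0):
    rng = random.Random(seed)
    state = {s: 0 for s in sequences}
    counts = {s: 0 for s in sequences}
    lp = log_posterior(state)
    for _ in range(n_iter):
        s = rng.choice(sequences)          # simple proposal: flip one sequence
        state[s] ^= 1
        lp_new = log_posterior(state)
        if lp_new >= lp or rng.random() < math.exp(lp_new - lp):
            lp = lp_new                    # accept the flip
        else:
            state[s] ^= 1                  # reject: flip back
        for t in sequences:
            counts[t] += state[t]
    return {s: counts[s] / n_iter for s in sequences}

print(metropolis_marginals())
```

The single-flip proposal is symmetric, so the acceptance ratio reduces to the posterior ratio; on this toy graph the estimates agree with brute-force enumeration.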
- 05 Dec 2005

Q: Since you have a bipartite graph with directed edges (or so it seemed from your presentation), why do you need loopy sum-product? If all edges are directed from sequences to probes, how can cycles arise?

A: We have cycles in the undirected form of the graph; that is what matters. The sum-product algorithm is the distributive law applied to a likelihood with a special algebraic structure represented by a graph. That structure is also present in a directed graph; it just has another interpretation. In fact, since we observe all nodes of one kind, one may claim we do not really have a bipartite graph. This is most easily seen from the factor graph representation.
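To illustrate the point about cycles in the undirected form: two sequences sharing two probes already form the undirected cycle s1 - p1 - s2 - p2 - s1, even though every directed edge points from a sequence to a probe. The snippet below (a toy fragment, not the project's graph) builds the undirected adjacency and checks for a cycle with a DFS.

```python
# Directed edges (sequence -> probe) hide the cycle, but sum-product runs on
# the underlying factor graph, where the cycle is real: s1 - p1 - s2 - p2 - s1.
parents = {"p1": {"s1", "s2"}, "p2": {"s1", "s2"}}   # toy fragment

# Undirected adjacency of the factor graph
adj = {}
for probe, seqs in parents.items():
    for s in seqs:
        adj.setdefault(probe, set()).add(s)
        adj.setdefault(s, set()).add(probe)

def has_cycle(adj):
    """DFS cycle check on an undirected graph."""
    seen = set()
    def dfs(node, parent):
        seen.add(node)
        for nb in adj[node]:
            if nb == parent:
                continue
            if nb in seen or dfs(nb, node):
                return True
        return False
    return any(dfs(n, None) for n in adj if n not in seen)

print(has_cycle(adj))   # True: exact sum-product does not apply; loopy methods do
```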