 TWikiGuest  18 Dec 2005
Repetitive Element Segmentation using CRFs

Abstract

Given the nucleotide sequences of repetitive DNA elements in a particular genome, we address the task of determining the repeated subelements within them and labeling like subrepeats with the same label. Our ultimate goal is to cluster the repetitive DNA elements; the labeling we identify here allows us to assign an edit distance among the sequences, which can then be used for clustering. While our eventual goal is clustering, here we address the first task of identifying and labeling the repeat substructure. The task is complicated by the fact that the repeated subelements are not exact replicates of one another, and the sequences may be interrupted, duplicated, missing parts, or mangled in other ways. Given our relatively sparse understanding of transposable elements and their mechanisms of replication, it has been difficult to make meaningful assumptions about their generation process. It has therefore been difficult to employ probabilistic models that are trained generatively (i.e., using particular assumptions about the generation of transposable elements) to segment and cluster these elements, and good performance on the task of identifying subclasses and subclassifying elements has, in practice, been difficult to achieve. In this work, a model extending the CRF model to the semi-CRF case is used. This conditionally trained version of semi-HMMs is applied to produce the highest-scoring segmentations of the input sequences given the model. A semi-CRF produces a labeled segmentation of a repeat element sequence S: labels are assigned to substrings of S, where an assignment implies a relationship between a substring and all like-labeled substrings in other repeats. One of the advantages of using CRFs is their ability to incorporate sets of arbitrary features into the same framework.
An additional advantage is that the parameters of the model can be trained discriminatively. This makes it possible to use training data to determine the relative importance of different features and, in principle, to induce new informative features dynamically from the labeled data. The framework enables us to utilize a large number of domain-specific features for this segmentation task. The model parameters weight the various local feature functions, and we can learn these relative weights exactly in a discriminative framework using dynamic programming. Inference in this model is also done with dynamic programming. Besides commonly used local features (for instance, looking for start or end codons in a 3-nucleotide window), features that are whole-sequence dependent (like the degree of similarity to other subregions in the sequence) are added to contribute to the overall score of a segmentation and labeling of a subregion.

The Repeat Sub-Segmentation Problem

Repeat elements make up a large fraction of many eukaryotic genomes. Within these regions, the occurrence of transposable elements is rampant. The term Transposable Elements (TEs) groups several subclasses of elements that repeatedly replicate in the genome, either through the reverse transcription of an RNA intermediate (class I elements) or autonomously from DNA to DNA by excision and repair (class II elements). Class I elements are further grouped by the presence (LTR elements) or absence (LINE and SINE elements) of long terminal repeats. Class II elements are largely comprised of elements with terminally inverted repeats (TIR elements). Some sequence structure in transposons is known to be related to the mechanism of replication. For example, class I elements typically contain open reading frames in their interior coding for the retrotransposase enzyme that aids in their retrotransposition.
Beyond these general outlines, the actual mechanisms of replication are poorly understood. Nonetheless, when studying TEs, it is helpful and customary to characterize specific instances (or insertions) by their mechanism of replication and to segregate them into TE families: groups of elements that purportedly evolved from the same transposing sequence. TE family classification is complicated by the fact that TE instances are shuffled and scrambled as they evolve. When copies of an invading TE family are not under selective pressure, mutations mangle the sequences of each insertion, resulting in related elements that are of different lengths, have incomplete structure, and lie beyond recognition by sequence similarity. Furthermore, some autonomously transposing elements have given rise to new classes of non-autonomously transposing elements (Vitte and Panaud, 2003). In such cases, copies of two different families show sequence similarity in substructures but differ in their replication method, or in substructure order and content. TEs also show insertion-site bias for transposing into adjacent positions in the genome or in a nested fashion (Freund and Meselson, 1984; Inouye et al., 1984; Losada et al., 1999). Consequently, many repeat regions are combinations of tandem and nested arrays of complete and partial copies of different TEs, which may then transpose together, leading to complex relationships between originating elements and their copies. These artifacts of TE evolution complicate the definition of element boundaries and cause problems in classifying related elements into subfamilies (Bao and Eddy, 2002). We present a method to identify and label only those subsegments that are integral to the replication mechanism of the repetitive element; we are motivated to identify only those regions and to perform TE family classification based on them.
While mutations occur throughout the entire sequence of the repeat, we hold that as long as the repeat was still 'active' (i.e., capable of mobility), the functional subrepeat portions were better conserved than the remainder of the sequence; therefore, selecting clusters that represent functional subfamilies should be based on these functional subrepeats.

Probabilistic model

We begin with V, the union of two sets of random variables, V = X \cup Y, where X comprises our observed input sequences and Y is the set of output labels we wish to assign to X. For the purposes of classifying repeats, we are most interested in labeling those subregions of sequence (which we call subrepeats) that play some role in repeat element mobility. Such functional subrepeats include coding elements, DNA binding sites, and terminal repeats, as outlined in Table 1. Every variable in X is an observed nucleotide from the discrete set of Watson-Crick base-pairing nucleotides, \Omega = {A, G, C, T}. Every variable from Y takes an outcome from the discrete set of labels, S, which is defined in Table 1.

_Table 1_ The set S of class (or state) types in our model
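The labeling itself supports the clustering goal described in the abstract: once subrepeats are labeled, two repeat elements can be compared by an edit distance over their label sequences rather than their raw nucleotides. The following is a hypothetical sketch; the labels ("LTR", "ORF") and the choice of unit-cost Levenshtein distance are illustrative assumptions, not the project's final metric.

```python
# Hypothetical sketch: compare two repeat elements by the edit distance
# between their segment-label sequences (labels here are made up).

def label_edit_distance(a, b):
    """Classic Levenshtein distance over two sequences of segment labels."""
    m, n = len(a), len(b)
    # dp[i][j] = edit distance between a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete a label
                           dp[i][j - 1] + 1,         # insert a label
                           dp[i - 1][j - 1] + cost)  # substitute a label
    return dp[m][n]

# A complete LTR element vs. a copy missing its 3' terminal repeat:
full = ["LTR", "ORF", "LTR"]
partial = ["LTR", "ORF"]
print(label_edit_distance(full, partial))  # 1
```

A distance of 1 between a full element and a truncated copy illustrates why label-level comparison is robust to the partial copies common in TE evolution.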
Additionally, we define a set of feature functions F = {f_k}. Each such real-valued function associates a particular label assignment with a real value. We can group all such feature functions into an exponentiated weighted sum to produce a positive, real-valued 'score' for a particular assignment. We call the resulting functions 'compatibility functions' because they can be viewed as a score indicating the degree of compatibility of a particular assignment given the weights of the features:

\Phi(x, y) = \exp{ \sum_k \theta_k f_k(x, y) }

We are left with a set of n positive, real-valued compatibility functions {\Phi_i} that map the combinatorial product space X x Y to positive, real-valued vectors in R^n. We would like to use this scoring mechanism to induce a probability distribution on the space of all possible assignments, in order to evaluate the goodness of a particular assignment. Graphical models are particularly useful for such a task, since a graphical model describes a family of probability distributions that factorize according to an underlying graph. We must therefore choose an underlying graph and use it in conjunction with our compatibility functions to perform our classification task in a probabilistic framework. Classifiers like support vector machines assume that the output variables in Y are independent and identically distributed, with a direct dependency on X. This type of directed graphical model factorizes as:

p(y, x) = \prod_{(x, y) \in V} p(x | y)

In our application, however, the elements of x are ordered sequentially, and there is an obvious dependency among them. We could then arrange the output variables in a linear chain, as in a hidden Markov model (HMM). An HMM consists of the state set S (as in Table 1) and an observation set, \Omega. Having linearized the input, X = {x_t}_{t=1...T}, we have the related linear output, Y = {y_t}_{t=1...T}.
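The compatibility function above is just an exponentiated, weighted feature sum. A minimal sketch, in which the two feature functions and their weights are invented for illustration:

```python
import math

# Sketch of a compatibility function Phi(x, y) = exp(sum_k theta_k * f_k(x, y)).
# The features and weights below are toy examples, not the model's real features.

def compatibility(x, y, features, theta):
    """Exponentiated weighted sum of feature functions for one assignment."""
    return math.exp(sum(t * f(x, y) for f, t in zip(features, theta)))

# Two toy binary features over a single (nucleotide, label) assignment:
features = [
    lambda x, y: 1.0 if (x == "A" and y == "LTR") else 0.0,
    lambda x, y: 1.0 if y == "ORF" else 0.0,
]
theta = [0.5, 1.2]

print(compatibility("A", "LTR", features, theta))  # exp(0.5), about 1.6487
```

Because the sum is exponentiated, the score is always positive, which is what lets these functions be normalized into a probability distribution below.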
This dependency structure is still 'directed', and factorizes as:

p(y, x) = \prod_{t=1...T} p(y_t | y_{t-1}) p(x_t | y_t) = \frac{1}{Z} \exp{ \sum_{t=1...T} \sum_{k=1...K} \theta_k f_k(y_t, y_{t-1}, x_t) }

where the second equality rewrites the distribution, which factorizes according to the underlying first-order Markov chain graph, using the normalization factor Z and the feature functions introduced above. Z, often referred to as the partition function, is defined by

Z = \sum_{x, y} \exp{ \sum_{t=1...T} \sum_{k=1...K} \theta_k f_k(y_t, y_{t-1}, x_t) }

In effect, Z sums the compatibility scores of all possible assignments of Y to the observations in X. In a linear-chain graph, Z is tractable thanks to dynamic programming. In most graph configurations, however, Z is intractable, and this is the reason inference in many models is computationally infeasible. Although this model no longer treats each assignment of Y to an element of X as an independent and identically distributed event, it still makes the 'naive Bayes' assumption that the observations are independent given the state sequence. This assumption often leads to decreased performance, because we are still attempting to model p(x), the distribution that generated the observations. However, we do not know enough about the embedded structure of the states in S, nor do we have enough examples, to formulate a generative model (as in the HMM) that characterizes the emission sequence for each label. One solution is to model the conditional probability p(y | x) directly, which is all that is necessary for the classification task. We keep the same linear-chain graph topology but remove the graph directionality (i.e., a linear-chain undirected model). This yields a linear-chain Conditional Random Field (linear-chain CRF).
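The claim that Z is tractable on a linear chain can be made concrete with the forward recursion: alpha[t][y] = sum over y' of alpha[t-1][y'] * Phi(y', y, x_t), which sums the scores of all |S|^T label sequences in O(T |S|^2) time. A sketch with made-up labels and an arbitrary positive scoring function, checked against brute-force enumeration:

```python
import itertools

# Sketch of the forward recursion that makes the partition function Z
# tractable on a linear chain. Labels and scores below are made up.

def partition_function(T, states, phi):
    """phi(y_prev, y, t) -> positive compatibility score; y_prev is None at t=0."""
    alpha = {y: phi(None, y, 0) for y in states}
    for t in range(1, T):
        alpha = {y: sum(alpha[yp] * phi(yp, y, t) for yp in states)
                 for y in states}
    return sum(alpha.values())

def brute_force_Z(T, states, phi):
    """Enumerate all |states|^T label sequences explicitly (for checking only)."""
    total = 0.0
    for seq in itertools.product(states, repeat=T):
        score, prev = 1.0, None
        for t, y in enumerate(seq):
            score *= phi(prev, y, t)
            prev = y
        total += score
    return total

states = ["LTR", "ORF", "TIR"]
phi = lambda yp, y, t: 1.0 + 0.1 * len(y) + (0.05 if yp == y else 0.0)
print(abs(partition_function(4, states, phi) - brute_force_Z(4, states, phi)) < 1e-9)  # True
```

The dynamic program touches T * |S|^2 terms, while the brute-force sum has |S|^T terms; the two agree exactly, which is the point of the recursion.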
This can be written in the form

p(y | x) = \frac{1}{Z(x)} \exp{ \sum_{t=1...T} \sum_{k=1...K} \theta_k f_k(y_t, y_{t-1}, x_t) }

In this case, Z(x) is dependent on the observations in X and sums over all possible label sequences:

Z(x) = \sum_{y} \exp{ \sum_{t=1...T} \sum_{k=1...K} \theta_k f_k(y_t, y_{t-1}, x_t) }

By modeling the conditional probability distribution directly, we need make no assumptions about the complex dependencies among x, or about the specific form of p(x). It is also important to note that the summation in Z(x) over all possible outcomes of Y has an exponentially large number of terms, yet it is not intractable: it can be computed efficiently by the forward-backward algorithm, as shown by Lafferty et al. (2001). This formulation allows us to use a much richer set of features that can model long-distance dependencies and extract more information about the structured data we observe. However, we are still assigning one label y to each x, while in our problem domain of repeated elements the same label (for instance, an open reading frame) extends sequentially for some length. Additionally, the subsequences underlying these labels have various collective sequence attributes (for instance, "sequence similarity to other known entities") that cannot be described by aggregates of features over individual observed nucleotides. These features tend to characterize the structure of the label entities in S, and we would therefore like to use them. The semi-Conditional Random Field (semi-CRF) formulation is a natural choice for performing joint segmentation and collective labeling, because it allows us to easily construct label-based features (for instance, "similarity to other known entities" or "sequence length"). We now let s = {s_j}_{j=1...p} denote a segmentation of x, where each segment is defined by a start position x_s, an end position x_e, and a label y_s \in S. This implies that the subsequence x_{s...e} is entirely labeled by y_s.
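Inference over such segmentations can be sketched as a segment-level Viterbi recursion in the style of Sarawagi and Cohen (2004): the dynamic program extends the segmentation one whole labeled segment at a time. The scoring function, labels, and toy sequence below are stand-ins for the real weighted feature sum, not the project's actual model.

```python
# Sketch of segment-level Viterbi for a semi-CRF. V[i][y] holds the best
# score of a segmentation of x[:i] whose last segment has label y.
# The scorer is a toy stand-in for the real weighted feature sum.

def semicrf_viterbi(x, labels, score, max_len):
    """score(x, start, end, y, y_prev) -> additive score of one labeled segment."""
    n = len(x)
    V = [{} for _ in range(n + 1)]
    back = [{} for _ in range(n + 1)]
    V[0] = {None: 0.0}
    for i in range(1, n + 1):
        for y in labels:
            best, arg = float("-inf"), None
            for d in range(1, min(max_len, i) + 1):  # candidate segment length
                j = i - d
                for y_prev, prev_score in V[j].items():
                    s = prev_score + score(x, j, i, y, y_prev)
                    if s > best:
                        best, arg = s, (j, y_prev)
            V[i][y], back[i][y] = best, arg
    # Trace back the (start, end, label) segments of the best segmentation.
    y = max(V[n], key=V[n].get)
    segments, i = [], n
    while i > 0:
        j, y_prev = back[i][y]
        segments.append((j, i, y))
        i, y = j, y_prev
    return list(reversed(segments))

# Toy scorer: +1 per base matching its label ('R' for purines A/G,
# 'Y' for pyrimidines C/T), -1 per mismatching base.
def score(x, j, i, y, y_prev):
    hits = sum((c in "AG") == (y == "R") for c in x[j:i])
    return 2 * hits - (i - j)

segments = semicrf_viterbi("AAGTTC", ["R", "Y"], score, max_len=6)
total = sum(score("AAGTTC", j, i, y, None) for j, i, y in segments)
print(total)  # 6: every base is covered by its matching label
```

Because the scorer sees a whole segment at once, segment-level attributes such as length or similarity to other entities can be folded directly into `score`, which is precisely what the per-nucleotide linear-chain CRF cannot do.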
This allows us to describe a vector of segment feature functions that map a segment-label association to a positive, real-valued score. Once again, the graph factorizes as a linear chain, only now we sum over all possible segmentations and label associations. In this way, when we perform inference, we simultaneously segment and label the subsequences. Inference in a semi-CRF requires finding the best segmentation-and-labeling through

argmax_s P(s | x, \Theta) = argmax_s \frac{1}{Z(x)} \exp{ \sum_{j} \theta_j f_j(x, s) }

which can be done efficiently by the Viterbi algorithm, as described by Sarawagi and Cohen (2004). For parameter estimation, the goal is to maximize the log-likelihood over the given training set T, which also defines the maximum-entropy model for the given data (Sarawagi and Cohen, 2004). Since the optimization is convex, it can be performed either by direct optimization using conjugate gradient or by approaches similar to expectation maximization (McCallum et al., 2005).

Results

Experimental results were evaluated on the hand-curated data set of transposable elements in D. melanogaster. This set has been used in the past to evaluate other common methods for TE segmentation and clustering (Bao and Eddy, 2002). The set is referred to as BDGP-Trans and contains 1572 elements from D. melanogaster release 4.0. These elements are further subdivided into 92 families, of which 43.4% are LTR elements, 10.9% are LINE-like elements, 23.7% are TIR elements, and 2% are FB elements. Using the EMBL-format sequences available for the BDGP-Trans set at http://www.fruitfly.org/p_disrupt/TE.html, our strings were further segmented into fields such as "ORF for pol gene" or "three-prime long terminal repeat." Each sequence is treated as a single unsegmented character sequence. We present our results on this data set with cross-validation, as well as predicted segmentations on segments recently found using a comparative-genomics-based method (Caspi and Pachter, 2005).
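The discriminative parameter estimation discussed above (maximizing a convex conditional log-likelihood) can be illustrated on a tiny log-linear model. Everything below is invented for illustration, and plain gradient ascent stands in for the conjugate-gradient optimizer; the gradient has the usual "observed features minus expected features" form.

```python
import math

# Toy illustration of discriminative training: maximize the conditional
# log-likelihood of a two-label log-linear model by gradient ascent.
# (The real model uses conjugate gradient over chain-structured features.)

def p_y_given_x(theta, feats):
    """feats[y] is the feature vector for label y; returns [p(0|x), p(1|x)]."""
    scores = [math.exp(sum(t * f for t, f in zip(theta, feats[y]))) for y in (0, 1)]
    Z = sum(scores)
    return [s / Z for s in scores]

def train(data, dim, lr=0.5, iters=200):
    theta = [0.0] * dim
    for _ in range(iters):
        grad = [0.0] * dim
        for feats, y in data:
            p = p_y_given_x(theta, feats)
            for k in range(dim):
                # log-likelihood gradient: observed feature - expected feature
                grad[k] += feats[y][k] - sum(p[c] * feats[c][k] for c in (0, 1))
        theta = [t + lr * g / len(data) for t, g in zip(theta, grad)]
    return theta

# Made-up training pairs: the correct label always carries feature vector [1, 0].
data = [(([1.0, 0.0], [0.0, 1.0]), 0), (([0.0, 1.0], [1.0, 0.0]), 1)]
theta = train(data, dim=2)
probs = p_y_given_x(theta, data[0][0])
print(probs[0] > 0.9)  # True: the correct label ends up with high probability
```

Convexity is what makes this reliable: from any starting point, following the gradient converges toward the unique optimum, which is why both conjugate gradient and EM-like schemes are viable in the full model.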
Cross-Family Validation

A leave-family-out cross-validation was performed on the large class I families of the BDGP-Trans set. For each fold, elements in one class I family were placed in the test set while the remainder were placed in the training set.

Method

In our experiments, we trained a semi-CRF to mark entity segments with labels from those listed in Table 1. Since this is a multi-class model, we did not introduce a separate "non-entity" label. Below we describe the features used to construct the semi-CRF for the application of repeat element sub-segmentation.

Classes

We have general knowledge about the requirements for a repeat to be functional, and we have encoded each of these into types of regions for labeling, as described in Table 1. However, since we are often dealing with mangled or rearranged sequences, we did not impose a finite-state-machine representation of the model that would restrict certain transitions in the label sequence. By not defining such restrictions, we allow the possible transitions to be inferred from the training data.
Features

As noted, we gain a great deal of flexibility in our model by including non-independent input features and removing the constraint of adhering to pre-specified edit operations. The input features in our experiments include features that capture local dependencies among bases (Table 2), features that capture long-range dependencies among bases (Table 3), and homology features obtained from outside sources (Table 4). We note that for in-frame coding sequences, the most useful features were those computed over windows of 3, 4, and 5 bases. The window of bases, combined with the predicted label, allows the model to capture the amino-acid structure of the genes commonly coded for by transposable elements.

_Table 2_ Local dependency features
_Table 3_ Long-range dependency features
_Table 4_ BLASTZ similarity features
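The window-based local features described above (cf. Table 2) could be extracted along the following lines; this is an illustrative sketch, and the feature names, the fixed 3-base window, and the GC threshold are invented rather than taken from the actual feature tables.

```python
# Illustrative sketch of local window features: boolean indicators computed
# over a short window of bases at a position. Names and thresholds are made up.

START_CODON, STOP_CODONS = "ATG", {"TAA", "TAG", "TGA"}

def window_features(seq, t):
    """Hypothetical local features for position t, over the window seq[t:t+3]."""
    codon = seq[t:t + 3]
    return {
        "start_codon": codon == START_CODON,
        "stop_codon": codon in STOP_CODONS,
        "gc_rich_window": sum(c in "GC" for c in codon) >= 2,
    }

seq = "CCATGAAATAA"
print(window_features(seq, 2)["start_codon"])  # True: seq[2:5] == "ATG"
print(window_features(seq, 8)["stop_codon"])   # True: seq[8:11] == "TAA"
```

Each indicator becomes one f_k in the weighted sum, so the trained weights decide how strongly, say, a start codon in the window pulls a position toward a coding label.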
Due to prohibitive training time, the last set of features (Table 4) was removed from the final implementation, and the number of windows checked in Table 3 was reduced to three (the values used were 15, 30, and 45). The following table lists the per-base sensitivity recorded on the cross-validation set. In all, five cross-validation experiments were performed.

_Table 5_ Sensitivity results for each class, averaged over 5 cross-validation experiments.
Computational considerations

While both inference and learning in CRFs are tractable and rely on dynamic programming (generally O(T^2 L^2), where T is the number of sequences and L is the length of a sequence), it is important to note that in practice, training is computationally expensive. Using all 1572 transposable element sequences in the D. melanogaster euchromatin, with a limited set of 100 features, training took over 7 hours. The more latent variables there are, and the more outcomes they take (in our case, 10 output labels), the greater the computational burden of performing inference repeatedly during training.
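One common way to tame this cost in semi-Markov models is to bound the maximum segment length. The following back-of-envelope sketch uses an assumed cost model (roughly L * D * |S|^2 dynamic-programming updates per sequence, where D is the maximum segment length), not a measured benchmark of this project's implementation:

```python
# Back-of-envelope sketch (assumed cost model, not a benchmark): counting the
# (position, label, segment-length, previous-label) updates in semi-Markov
# dynamic programming as the maximum segment length D varies.

def dp_updates(L, num_labels, max_seg_len):
    """Roughly L * D * |S|^2 updates for one sequence of length L."""
    return L * max_seg_len * num_labels ** 2

L, S = 5000, 10  # an assumed TE sequence length, and our 10 output labels
unbounded = dp_updates(L, S, L)   # segments may span the whole sequence
bounded = dp_updates(L, S, 50)    # segments capped at 50 bases
print(unbounded // bounded)  # 100: capping segment length cuts updates 100-fold
```

The cap trades away only segmentations with implausibly long segments, so for labels with known length ranges it is usually a cheap trade.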
Further Validation

The next step would be to compare our segmentation with those produced by other programs for the same purpose. Volfovsky et al. (2001) and Bao and Eddy (2002) recently developed new heuristic algorithms for de novo repeat classification that perform well in practice (RepeatFinder and RECON). Additionally, Pevzner et al. (2004) employed a graph optimization scheme using approximate de Bruijn graphs that represent repeats as mosaics of subrepeats (FragmentGluer).
Future Work and Discussion

The problem of repeat element sub-segmentation and labeling was addressed using semi-CRFs. The use of semi-CRFs improves upon the graphical models currently used for the task (namely, the approximate de Bruijn graphs of FragmentGluer) because semi-CRFs can model long-distance dependencies, which can be used to obtain better accuracy in joint segmentation and labeling tasks. In the de Bruijn graph, segments can be missed as a result of insufficient sequence similarity between the relevant subsegments. In the semi-CRF model, however, additional features and attributes of the entity bolster candidates that have other appropriate attributes but have mutated far beyond the threshold set for sequence similarity. In effect, using our model removes the need to set threshold parameters, and allows a richer set of observed features of entities. Additionally, segmentation and labeling are performed simultaneously, so that the dependencies between the two tasks are leveraged. In future experiments, a refined set of labels ought to be used in place of those in Table 1. Specifically, the number of labels ought to be limited to allow for faster training. Additionally, I believe the start codon should be a state of its own, allowing the addition of features that pertain to the remaining coding sequence being 'in-frame', and enforcing features that disallow the appearance of the start codon in-frame in the remainder of the coding region. Splice-site boundary conditions ought to be similarly modeled.
Bibliography


Teaching.AnatCaspiRepetitiveElementSegmentationUsingConditionalRandomFieldsGraduateProject, r49, 2007-06-18, IanHolmes. Biowiki content is in the public domain.