-- TWikiGuest - 18 Dec 2005

Repetitive Element Segmentation using CRFs


Given nucleotide base sequences of repetitive DNA elements in a particular genome, we address the task of determining the repeated sub-elements within them and assigning the same label to recurring sub-repeats. Our ultimate goal is to cluster the repetitive DNA elements: the labeling we identify here allows us to assign an edit distance among the sequences, which can then be used for clustering. While our eventual goal is clustering, here we address the first task of identifying and labeling the repeat sub-structure. The task is complicated by the fact that the repeated sub-elements are not exact replicates of one another, and the sequences may be interrupted, duplicated, have some parts missing, or be mangled in other ways. Given our relatively sparse understanding of transposable elements and their mechanism of replication, it has been difficult to make meaningful assumptions about their generation process. It has therefore been difficult to employ probabilistic models that are trained generatively (i.e., using particular assumptions about the generation of transposable elements) to segment and cluster these elements. Good performance on the task of identifying subclasses and element subclassification has, in practice, been difficult to achieve.

In this work, a model extending the CRF model to the semi-CRF case is used. This conditionally trained version of semi-HMMs is applied to produce the highest scoring sequence segmentations of the input sequences given the model. A semi-CRF produces a labeled segmentation of a repeat element sequence, S. Labels are assigned to substrings of S, where an assignment implies a relationship between this substring and all like-labeled subsequences in other repeats.

One of the advantages of using CRFs is their ability to incorporate into the same framework sets of arbitrary features. An additional advantage is that the parameters of the model can be trained with discriminative methods. This results in the ability to use training data to determine the relative importance of different features, and in principle, could induce new informative features dynamically from the labeled data.

The framework enables us to utilize a large number of domain-specific features for the segmentation task. The model parameters weight the various local feature functions. We can learn the relative weights for this model exactly, in a discriminative framework, using dynamic programming. Inference in this model is also done with dynamic programming. Besides commonly used local features (for instance, looking for start or stop codons in a 3-nucleotide window), features that are whole-sequence dependent (like the degree of similarity to other subregions in the sequence) are added to contribute to the overall score of a segmentation and labeling of a subregion.

The Repeat Sub-Segmentation Problem

Repeat elements make up a large fraction of many eukaryotic genomes. Within these regions, the occurrence of Transposable Elements is rampant. The term Transposable Elements (TEs) groups several subclasses of elements that repeatedly replicate in the genome, either through the reverse transcription of an RNA intermediate (class I elements), or autonomously from DNA to DNA by excision and repair (class II elements). Class I elements are further grouped by the presence (LTR elements) or absence (LINE and SINE elements) of long terminal repeats. Class II elements are largely comprised of elements with terminally inverted repeats (TIR elements).

Some sequence structure in transposons is known to be related to their mechanism of replication. For example, class I elements typically contain open reading frames in their interior, coding for the retro-transposase enzyme that aids in their retrotransposition. Beyond these general outlines, the actual mechanisms of replication are poorly understood. Nonetheless, when studying TEs, it is helpful and customary to characterize specific instances (or insertions) by their mechanism of replication and segregate them into TE families: groups of elements that purportedly evolved from the same transposing sequence.

TE family classification is complicated by the fact that TE instances are shuffled and scrambled as they evolve. When copies of an invading TE family are not under selective pressure, mutations mangle the sequences of each insertion, resulting in related elements that are of different lengths, have incomplete structure, and lie beyond recognition by sequence similarity. Furthermore, some autonomously transposing elements have spawned new classes of non-autonomously transposing elements (Vitte and Panaud, 2003). In such cases, copies of two different families show sequence similarity in sub-structures, but differ in their replication method, or in substructure order and content. TEs also show insertion-site bias for transposing into adjacent positions in the genome or in a nested fashion (Freund and Meselson, 1984; Inouye et al., 1984; Losada et al., 1999). Consequently, many repeat regions are combinations of tandem and nested arrays of complete and partial copies of different TEs, which may then transpose together, leading to complex relationships between originating elements and their replicates. These artifacts of TE evolution complicate the definition of element boundaries and cause problems in classifying related elements into subfamilies (Bao and Eddy, 2002).

We present a method to identify and label only those sub-segments that are integral to the replication mechanism of the repetitive element. We are motivated to identify only those regions and perform TE family classification based on them. While mutations occur throughout the entire sequence of the repeat, we hold that so long as the repeat was still 'active' (i.e., capable of mobility) the functional sub-repeat portions were better conserved than the remainder of the sequence; therefore, selecting clusters that represent functional subfamilies should be based on these functional sub-repeats.

Probabilistic model

We begin with V, the union of two sets of random variables,

V = X \cup Y,

where X comprises our observed input sequences and Y is the set of output labels we wish to assign to X. For the purposes of classifying repeats, we are most interested in labeling those subregions of sequence (which we call sub-repeats) that play some role in repeat element mobility. Such functional sub-repeats include coding elements, DNA binding sites, and terminal repeats, as outlined in Table 1. Every variable in X is an observed nucleotide from the discrete set of Watson-Crick base-pairing nucleotides, \Omega = {A,G,C,T}. Every variable from Y takes an outcome from the discrete set of labels, S, which is defined in Table 1.

_Table 1_ Set S: Class (or state) types in our model

Class Type
general ORF
envelope gene
pol gene
gag gene
five-prime-LTR
three-prime-LTR
terminal_inverted_repeat
primer_binding_site
poly-A signal sequence
TATA box

Additionally, we can define a set of feature functions, F = {f_k}. Each such real-valued function associates a particular label assignment with a real value. We can group all such feature functions into an exponentiated weighted sum to produce a positive, real-valued 'score' for a particular assignment. We call the resulting functions 'compatibility functions' because they can be viewed as a score indicating the degree of compatibility of a particular assignment given the weights for the features.

\Phi(x, y) = exp { \Sum_{k=1...K} \theta_k f_k(x, y) }

We are left with a set of positive, real-valued compatibility functions {\Phi} that map the combinatorial product space X x Y to positive real scores.
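As a concrete illustration, the compatibility score for a single (observation, label) assignment can be sketched as follows. The two feature functions here are hypothetical toy examples, not the features used in our experiments:

```python
import math

def compatibility(theta, features, x, y):
    """Exponentiated weighted sum of feature values for one candidate
    assignment (x, y); `theta` and `features` are parallel lists of
    weights and feature functions f_k(x, y)."""
    score = sum(t * f(x, y) for t, f in zip(theta, features))
    return math.exp(score)  # positive and real-valued, but unnormalized

# Hypothetical toy features over a (nucleotide, label) pair.
f_is_A = lambda x, y: 1.0 if x == "A" else 0.0
f_is_orf = lambda x, y: 1.0 if y == "ORF" else 0.0

phi = compatibility([0.5, 1.0], [f_is_A, f_is_orf], "A", "ORF")  # exp(0.5 + 1.0)
```

Because the score is an exponential of a sum, it is always positive, and it only becomes a probability after normalization by the partition function introduced below.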

We would like to use this scoring mechanism to induce a probability distribution on the space of all possible assignments in order to be able to evaluate the goodness of a particular assignment. Graphical models are particularly useful for a task like this since a graphical model describes a family of probability distributions that factorize according to an underlying graph. We must therefore choose an underlying graph, and use it in conjunction with our compatibility function to be able to perform our classification task in a probabilistic framework.

Classifiers like support vector machines assume that the output variables in Y are independent and identically distributed, with a direct dependency on X. The analogous directed graphical model factorizes as:

p(y,x) = \Product_{t} p(y_t) p(x_t \| y_t)

In our application, however, the elements in x are ordered sequentially, and there is an obvious dependency among them. We could then arrange the output variables in a linear chain, as in a hidden Markov Model (HMM). An HMM consists of the state set S (as in Table 1) and an observation set, \Omega. Having linearized the input, X = {x_t}_{t=1...T}, we have the related linear output, Y = {y_t}_{t=1...T}. This dependency structure is still 'directed', and factorizes as:

p(y,x) = \Product_{t=1...T} p( y_t \| y_{t-1}) p(x_t \| y_t) = \frac{1}{Z} exp { \Sum_{t=1...T} \Sum_{k=1...K} \theta_k f_k(y_t, y_{t-1}, x_t) }

where the leftmost equivalence induces a probability distribution that factorizes according to the underlying first-order Markov chain graph by using the normalization factor Z and the feature functions introduced above. Z, often referred to as the partition function, is defined by

Z = \Sum_{x,y} exp { \Sum_{t=1...T} \Sum_{k=1...K} \theta_k f_k(y_t, y_{t-1}, x_t) }

In effect, Z sums the compatibility score values for all possible assignments of Y to the observations in X. In a linear chain graph, Z is tractable due to dynamic programming. In most graph configurations, however, Z is intractable and is the reason inference in many models is computationally infeasible.
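The dynamic programming that makes Z tractable on a linear chain is the forward recursion. A minimal sketch, where `score` stands in for the log-potential \Sum_k \theta_k f_k(y_t, y_{t-1}, x_t):

```python
import math

def partition(x, labels, score):
    """Forward recursion: alpha_t(y) accumulates the summed exponentiated
    scores of all label prefixes ending in state y, so the final sum over
    alpha covers every one of the L^T label sequences in O(T * L^2) time.
    `score(y, y_prev, x, t)` is the local log-potential; y_prev is None
    at the first position."""
    alpha = {y: math.exp(score(y, None, x, 0)) for y in labels}
    for t in range(1, len(x)):
        alpha = {y: sum(alpha[yp] * math.exp(score(y, yp, x, t))
                        for yp in labels)
                 for y in labels}
    return sum(alpha.values())

# With all potentials zero, Z simply counts the label sequences: 2^3 = 8.
Z = partition("ACG", ["a", "b"], lambda y, yp, x, t: 0.0)
```

For graphs with cycles no such chain-ordered recursion exists, which is why Z is intractable in most other graph configurations.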

Although this model no longer treats each assignment of a label in Y to an element in X as an independent and identically distributed event, it still makes the 'naive Bayes' assumption that the observations are independent given the state sequence. This assumption often leads to decreased performance, because we are still attempting to model p(x), the distribution that generated the observations. However, we neither know enough about the embedded structure of the states in S, nor have enough examples, to formulate a generative model (as in the HMM) that characterizes the emission sequence for each label.

One solution is to model the conditional probability p(y\|x) directly, which is all that is necessary for the classification task. We retain the same linear-chain graph topology, but remove the graph directionality (i.e., a linear-chain undirected model). This yields a linear-chain Conditional Random Field (linear-CRF), which can be written in the form

p(y\|x) = \frac{1}{Z(x)} exp { \Sum_{t=1...T} \Sum_{k=1...K} \theta_k f_k(y_t, y_{t-1}, x_t) }

In this case, Z(x) is dependent on the observations in X:

Z(x) = \Sum_{y} exp { \Sum_{t=1...T} \Sum_{k=1...K} \theta_k f_k(y_t, y_{t-1}, x_t) }

By modeling the conditional probability distribution directly, we need to make no assumptions about the complex dependencies among the elements of x, or about the specific form of p(x). It is also important to note that although the summation in Z(x) over all possible outcomes of Y has an exponentially large number of terms, it is not intractable: it can be computed efficiently by the forward-backward algorithm, as shown by Lafferty et al. (2001). This formulation allows us to use a much richer set of features that can model long-distance dependencies and extract more information about the structured data we observe.

However, we are still considering each label Y for each X, while in our problem domain of repeated elements, the same label (for instance, an open reading frame) extends sequentially for some length. Additionally, the sub-sequences underlying these labels have various collective sequence attributes (for instance, "sequence similarity to other known entities") that cannot be described by aggregates of features for each individual observed nucleotide. These features tend to characterize the structure of the label entities in S, and we would therefore like to use them.

The semi-Conditional Random Field (semi-CRF) formulation is a natural choice for performing joint segmentation and collective labeling because it allows us to easily construct segment-level features (for instance, "similarity to other known entities" or "segment length"). We now let s = (s_1, ..., s_p) denote a segmentation of x, where each segment s_j is defined by a start position, an end position, and a label y_j \in S; this implies that the subsequence between the start and end positions is entirely labeled by y_j. This allows us to describe a vector of segment feature functions which map a segment-label association to a positive, real-valued score. Once again, the graph factorizes as a linear chain, only now we sum over all possible segmentations and label associations. In this way, when we perform inference, we are simultaneously segmenting and labeling the sub-sequences.

Inference in a semi-CRF requires finding the best segmentation-and-labeling through

argmax_s P(s\|x,\Theta) = argmax_s \frac{1}{Z(x)} exp { \Sum_{j=1...p} \Sum_{k=1...K} \theta_k f_k(x, s_j) }

which can be done efficiently by the Viterbi algorithm as described by Sarawagi and Cohen (2004).
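A minimal sketch of the semi-Markov Viterbi recursion (simplified from Sarawagi and Cohen's formulation; `seg_score` is a stand-in for the weighted segment features, and the quadratic toy scorer below is purely illustrative):

```python
def semi_viterbi(x, labels, max_len, seg_score):
    """Highest-scoring segmentation-and-labeling of x. seg_score(y, yp, x, s, e)
    is the log-score of labeling x[s:e] with y when the previous segment
    carried label yp (None at the start). Runs in O(T * max_len * L^2)."""
    T = len(x)
    V = [dict() for _ in range(T + 1)]
    V[0][None] = (0.0, [])  # empty prefix: score 0, no segments yet
    for e in range(1, T + 1):
        for y in labels:
            # Best way to end a segment labeled y exactly at position e.
            V[e][y] = max((sc + seg_score(y, yp, x, s, e), segs + [(s, e, y)])
                          for s in range(max(0, e - max_len), e)
                          for yp, (sc, segs) in V[s].items())
    return max(V[T].values())

def match_score(y, yp, x, s, e):
    # Toy scorer (illustrative only): quadratic reward for a uniform
    # segment whose bases match its label, so longer runs win.
    seg = x[s:e]
    return (e - s) ** 2 if seg == y * len(seg) else -1e9

score, segs = semi_viterbi("AAABBB", ["A", "B"], 6, match_score)
```

Unlike plain Viterbi, the inner maximization ranges over candidate segment start positions as well as previous labels, which is exactly what lets whole-segment features participate in the score.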

For parameter estimation, the goal is to maximize the log-likelihood over the given training set T, which also defines the maximum entropy model for the given data (Sarawagi and Cohen, 2004). Since the optimization problem is convex, it can be solved either by direct optimization using conjugate gradient or by approaches similar to expectation maximization (McCallum et al., 2005).
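The structure of this convex objective can be illustrated on a flat multiclass log-linear model, where the gradient is the familiar difference between empirical and expected feature counts. A semi-CRF maximizes the same kind of objective, computing the expectations by dynamic programming over segmentations rather than by the direct enumeration used in this sketch:

```python
import math

def log_likelihood_and_grad(theta, data, features, labels):
    """Conditional log-likelihood sum_i log p(y_i | x_i) and its gradient
    (empirical minus expected feature counts) for a flat multiclass
    log-linear model, enumerating the label set directly."""
    ll, grad = 0.0, [0.0] * len(theta)
    for x, y in data:
        scores = {yy: sum(t * f(x, yy) for t, f in zip(theta, features))
                  for yy in labels}
        log_z = math.log(sum(math.exp(s) for s in scores.values()))
        ll += scores[y] - log_z
        for k, f in enumerate(features):
            expected = sum(math.exp(scores[yy] - log_z) * f(x, yy)
                           for yy in labels)
            grad[k] += f(x, y) - expected
    return ll, grad

# Toy check: with zero weights, p(y|x) is uniform over the two labels.
f0 = lambda x, y: 1.0 if y == "a" else 0.0
ll, grad = log_likelihood_and_grad([0.0], [("x", "a")], [f0], ["a", "b"])
```

Any gradient-based method (conjugate gradient, L-BFGS) can then ascend this concave surface to a global optimum.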


Experimental results

Experimental results were evaluated on the hand-curated data set of transposable elements in D. melanogaster. This set has been used in the past to evaluate other common methods for TE segmentation and clustering (Bao and Eddy, 2002). The set, referred to as BDGPTrans, contains 1572 elements from D. melanogaster release 4.0. These elements are further subdivided into 92 families, of which 43.4% are LTR elements, 10.9% are LINE-like elements, 23.7% are TIR elements and 2% are FB elements. Using the EMBL-format sequences available for the BDGPTrans set at http://www.fruitfly.org/p_disrupt/TE.html, our strings were further segmented into fields such as "ORF for pol gene" or "three-prime long terminal repeat." Each sequence is treated as a single unsegmented character sequence. We present our results on this data set with cross-validation, as well as predicted segmentations on segments recently found using a comparative-genomic-based method (Caspi and Pachter, 2005).

Cross-Family Validation

A leave-family-out cross-validation was performed on the large Class I families of the BDGPTrans set. For each cross, elements in one Class I family were placed in the test set while the remainder were placed in the training set.


In our experiments, we trained a semi-CRF to mark entity segments with labels from those listed in Table 1. Since this is a multi-class model, we did not introduce a separate "non-entity" label. Below we describe the features used to construct the semi-CRF for the application of repeat element sub-segmentation.


We have general knowledge about the requirements for a repeat to be functional, and we have encoded each of these into the region types for labeling described in Table 1. However, since we are often dealing with mangled or rearranged sequences, we did not impose a finite-state machine representation of the model that would restrict certain transitions in the label sequence. By not defining such restrictions, we allow the possible transitions to be inferred from the training data.


As noted, we gain a lot of flexibility in our model by including non-independent input features and removing the constraint of adhering to pre-specified edit operations. The input features in our experiments include features that capture local dependencies among bases (Table 2), those that capture long-range dependencies among bases (Table 3), and homology features obtained from outside sources (Table 4).

We note that for in-frame coding sequences, the most useful features were those computed over windows of 3, 4, and 5 bases. The windowed bases, combined with the predicted label, allow the model to capture the amino-acid structure of the genes commonly coded for by transposable elements.

_Table 2_ Local dependency features

Features computed Window ranges
Is_A {i-4 ... i} {i-3 ... i+1} {i-2 ... i+2}
Is_G {i-1 ... i+3} {i ... i+4}
Is_C {i-3 ... i} {i-2 ... i+1}
Is_T {i-1 ... i+2} {i ... i+3}
Is_Purine {i-2 .. i} {i-1 .. i+1} {i.. i+2}
Is_Pyrimidine {i-1 ..i} {i..i+1}
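A hypothetical encoding of such windowed base-composition features is sketched below (the window offsets follow Table 2, but the exact encoding used in our experiments may differ):

```python
def window_features(seq, i):
    """Fraction of each base inside the sliding windows of Table 2,
    evaluated at position i (windows are clipped at sequence ends).
    A hypothetical encoding; the feature set actually used is richer."""
    windows = {"A": [(-4, 0), (-3, 1), (-2, 2)],
               "G": [(-1, 3), (0, 4)],
               "C": [(-3, 0), (-2, 1)],
               "T": [(-1, 2), (0, 3)]}
    feats = {}
    for base, ranges in windows.items():
        for lo, hi in ranges:
            # Window spans positions i+lo .. i+hi, clipped to the sequence.
            win = seq[max(0, i + lo): i + hi + 1]
            feats["is_%s_%d_%d" % (base, lo, hi)] = win.count(base) / len(win)
    return feats

feats = window_features("AAAAA", 2)  # e.g. the {i-4 ... i} window for A
```

Each feature fires as a fraction rather than a 0/1 indicator, so partial matches inside a window still contribute to the segment score.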

_Table 3_ Long-range dependency features

Features computed Window ranges
A_count windows of sizes that are multiples of 3, from 3 to 48
G_count windows of sizes that are multiples of 3, from 3 to 48

_Table 4_ blastz similarity features

Features computed
number of hits in same sequence
number of inverted hits in same sequence
number of hits in all other training sequences
number of inverted hits in all other training sequences
score of maximal hsp
length of hsp in present sequence
difference in length of query and hsp

Due to the prohibitive training time, the last set of features (Table 4) was removed from the final implementation, and the number of windows checked for Table 3 was reduced to three (the values used were 15, 30, and 45). The following table lists the per-base sensitivity recorded on the cross-validation set. In all, five cross-validation experiments were performed.

_Table 5_ Sensitivity results for each class, averaged over five cross-validation experiments

Class Type Nucleotide Sensitivity (percent)
general ORF 70.3
envelope gene 15.4
pol gene 8.2
gag gene 4.1
five-prime-LTR 20.3
three-prime-LTR 25.6
terminal_inverted_repeat 12.6
poly-A signal sequence 89.3
primer_binding_site 2.1
TATA box 65.6

Computational considerations

While both inference and learning in CRFs are tractable and rely on dynamic programming (for a linear chain, generally O(TL^2), where T is the length of the sequence and L is the number of labels; the semi-Markov case adds a factor for the maximum segment length), it is important to note that in practice, training is computationally expensive. Using all 1572 transposable element sequences in the D. melanogaster euchromatin, with a limited set of 100 features, training took over 7 hours. The more latent variables and the more label outcomes there are (in our case, 10 output labels), the greater the computational burden of performing inference repeatedly during training.

Further Validation

The next steps would be to compare our segmentation with that performed by other programs for the same purpose. Volfovsky et al. (2001) and Bao and Eddy (2002) recently developed new heuristic algorithms for de novo repeat classification that perform well in practice (RepeatFinder and RECON). Additionally, Pevzner et al (2004) employed a graph optimization scheme using approximate de Bruijn graphs that represent repeats as mosaics of sub-repeats (FragmentGluer).

Future Work and Discussion

The problem of repeat element sub-segmentation and labeling was addressed using semi-CRFs. The use of semi-CRFs improves upon graphical models currently used for the task (namely, the approximate de Bruijn graphs of FragmentGluer) because semi-CRFs can model long-distance dependencies that can be used to obtain better accuracy in joint segmentation and labeling tasks. In the de Bruijn graph, segments can be missed as a result of insufficient sequence similarity between the relevant sub-segments. In the semi-CRF model, however, additional features and attributes of the entity bolster candidates that have other appropriate attributes but have mutated far beyond the threshold set for sequence similarity. In effect, using our model removes the need to set threshold parameters, and allows use of a richer set of observed features of entities. Additionally, segmentation and labeling are done simultaneously, so that the dependencies between the two tasks are leveraged.

In future experiments, a refined version of the label set in Table 1 ought to be used. Specifically, the number of labels ought to be limited to allow for faster training. Additionally, I believe the start codon should be a state of its own, allowing the addition of features that pertain to the remaining coding sequence being 'in-frame', and enforcing features that disallow the appearance of a start codon 'in-frame' in the remainder of the coding region. Splice-site boundary conditions ought to be similarly modeled.


References

  • Z. Bao and S.R. Eddy. Automated de novo identification of repeat sequence families in sequenced genomes. Genome Research, 12: 1269-1276, 2002.

  • A. Caspi and L. Pachter. Identification of transposable elements using multiple alignments of related genomes. Genome Research, doi:10.1101/gr.4361206, 2005.

  • Aron Culotta, David Kulp, and Andrew McCallum. Gene Prediction with Conditional Random Fields. Technical Report UM-CS-2005-028, University of Massachusetts, Amherst, April 2005.

  • R. Freund and M. Meselson. Long terminal repeat nucleotide sequence and specific insertion of the gypsy transposon. Proc. National Academy of Sciences USA, 81: 4462-4464, 1984.

  • S. Inouye, S. Yuki, and K. Saigo. Sequence-specific insertion of the Drosophila transposable genetic element 17.6. Nature, 310: 332-333, 1984.

  • John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. 18th International Conf. on Machine Learning (ICML-2001), pages 282-289. Morgan Kaufmann, San Francisco, CA, 2001.

  • A. Losada, J.P. Abad, M. Agudo, and A. Villasante. The analysis of Circe, an LTR retrotransposon of Drosophila melanogaster, suggests that an insertion of non-LTR retrotransposons into LTR elements can create chimeric retroelements. Molecular Biology and Evolution, 16: 1341-1346, 1999.

  • Andrew McCallum. Efficiently inducing features of conditional random fields. In Nineteenth Conference on Uncertainty in Artificial Intelligence (UAI-2003), 2003.

  • Andrew McCallum, Kedar Bellare, and Fernando Pereira. A conditional random field for discriminatively-trained finite-state string edit distance. In Conference on Uncertainty in Artificial Intelligence (UAI), 2005.

  • P.A. Pevzner, H. Tang, and G. Tesler. De novo repeat classification and fragment assembly. Genome Research, 14: 1786-1796, 2004.

  • L.R. Rabiner. A tutorial on hidden Markov models. Proceedings of the IEEE, 77: 257-286, 1989.

  • Sunita Sarawagi and William W. Cohen. Semi-Markov conditional random fields for information extraction. In Advances in Neural Information Processing Systems (NIPS 17), 2004.

  • G.D. Stormo. Consensus patterns in DNA. Methods in Enzymology, 183: 211-221, 1990.

  • C. Vitte and O. Panaud. Formation of solo-LTRs through unequal homoeologous recombination counterbalances amplifications of LTR retrotransposons in rice, Oryza sativa L. Molecular Biology and Evolution, 20: 528-540, 2003.

  • N. Volfovsky, B. Haas, and S. Salzberg. A clustering method for repeat analysis in DNA sequences. Genome Biology, 2: RESEARCH0027, 2001.
