This site is powered by the TWiki collaboration platform Powered by PerlBiowiki content is in the public domain.
Comments on this site? SushantSundareshHomeWork4">Send feedback

Biowiki . Fall09 . SushantSundareshHomeWork4

Biowiki . Fall09 . SushantSundareshHomeWork4

-- SushantSundaresh - 16 Oct 2009

Sushant Sundaresh (undergraduate)
login: be131-10 
Date submitted: 10.16.2009
HW4, RNA Folding Lab

Table of Contents

Verifying Properties of YES-1 Hammerhead Ribozyme Yes Gate by Breaker and Penchovsky

First off, new link to paper, so citations are all in order. I am not going to copy/post a picture of figure 2.a from the paper, as the pdf is copy protected and I'm not sure that's allowable with the rather lax citation style (though with full credit given, just not MLA format or anything) I've used below.

Penchovsky R, Breaker RR.  Computational design and experimental validation of oligonucleotide-sensing allosteric ribozymes.  Nat Biotechnol. 2005 Nov;23(11):1424-33. Epub 2005 Oct 23.

The YES-1 gate referenced is in figure 2, and the nucleotide sequence used was found later in the paper, under 2.1 in Methods in a sequence of the form





with xxxxxx... representing a 16-22nt sequence that is quasi-randomizable, as long as that sequence maintains a specified bound secondary structure that masks the active ribozyme site.

the sequence structure noted is supposed to be "similar" but obviously it isn't even the right length. i've checked this a few times, visually, so i think either i made a typographical mistake or its really just a rough specification for the future structure of the randomized sequence. it doesnt need to be as exact.. and since the xxxx is a 16-22 nt region, it makes sense that the length wouldn't be exact. its not a big deal for this hw, persay.

as I want to verify that when DNA-1 binds to those 16-22nt X's it will in fact generate the correct stem-loop structure to "open" the active ribozyme cleavage site (which basically means RNAfold -C with those X's forced non-bped, or x'ed) I ended up visually/physically copying down the 22nt XXX... sequence that is complementary to DNA-1 for this particular YES-1 gate as I wanted the sequence as exact as possible for the folding analysis.

that said, here are the sequences/files that are relevant, with the added note that the sequence is 80nts long and therefore is in proper fasta format.

file yes1.fasta

> yes1 hammerhead ribozyme apatamer (figure 2.a breaker paper)


file yes1_constrained.txt


so in the 'on' position (stem I, II, III properly formed) the entire sequence except "AGCUCGUCACUGUCCAGGUUCA," the OBS sequence binding to DNA-1, or DNA-1's marker, is unconstrained, with "AGCUCGUCACUGUCCAGGUUCA" being constrained to be unbound to anything else in this sequence (which in our case means bound to the DNA-1 oligonucleotide, and therefore theoretically in the "on" position)

in the 'off' position, the entire sequence is allowed to fold as it will, with, theoretically, the minimum free energy (MFE) solution to the folding problem corresponding to a secondary structure that blocks the formation of the II stem with the IV stem (which has the AGCUCGUCACUGUCCAGGUUCA sequence along with other nts bound up in a stem that prevents the proper stem II formation, which in turn blocks the self-cleaving activity

I pasted these into a notepad file named yes1.fasta and yes1_constrained.txt in my ~/hw/hw4/ directory, and (after installing the DART package and after Oscar wrote a .bashrc file referencing the ihh/...install directory path to my main PATH variables) ran the bash commands that follow ($ represents command line prompt from my hw/hw4 directory):

verifying "OFF position"

no constraints imposed upon AGCUCGUCACUGUCCAGGUUCA OBS sequence in RNA sequence when calculating MFE secondary structure.

.fold files are just .txt files that I tagged with the name .fold to let me know they contained secondary structures.

$ cat yes1.fasta | RNAfold > yes137C.fold

$ convert yes1_ss.ps yes1_37C_off.pdf

which created yes1_ss.ps and yes137C.fold files without constraints on binding, at default 37C temperature, which seemed reasonable, as the paper's methods section specifies running their sorting function for folds at 37C and ensuring with RNAheat that the structure held for the range of 20-40C, specifically also at 23C. also, the second command converted (using an image converter in the DART package) the yes1_ss.ps file to .pdf format, for later retrieval and visual comparison.

since RNAheat generates plots of specific heat and in this hw we care about structure, I'm just testing at 23C and 30C as proof of concept.

$ cat yes1.fasta | RNAfold -T 23 > yes123C.fold

$ cat yes1.fasta | RNAfold -T 30 > yes130C.fold

$ cat yes1.fasta | RNAfold

and here are the relevant structures. no need for .ps or .pdf images here, those come later. this is just to show that the structures are not significantly impacted in the given temperature range, 20-40C, as stated in the specs of the paper methodology. the last line ensures the yes1_ss.ps file points to the default 37C representation.

Temperature Stability of OFF configuration



((((((((((((((((((((.(..(((.......))).)))))))).))))).....(((((....))))).)))))))) (-47.29)



((((((((((((((((((((.(..(((.......))).)))))))).))))).....(((((....))))).)))))))) (-41.06)



((((((((((((((((((((.(..(((.......))).)))))))).))))).....(((((....))))).)))))))) (-35.10)

and a quick visual screen confirms that there is no difference in the folded structure (unconstrained) at the three temperatures, spanning approximately the 20-40C range specified in the paper.

alright, so now that temperature is no longer an issue for the off state, lets see what it actually looks like, its MFE visual secondary structure:

YES-1 37C OFF configuration

yes1-off-37C secondary structure hw4 rna folding

where in this image,

furthermore, a bp-probability plot (dot plot, for all we care) would be useful in comparison to figure 2.a in the paper.

$ cat yes1.fasta | RNAfold -p -d2

generates a yes1_dp.ps dotplot file of the 37C off configuration for yes-1, ensuring that mfe and -p use the same energy partitioning function, which we convert to pdf and view.

$ convert yes1_dp.ps yes1_37C_off_dotplot.pdf

YES-1 Dotplot 37C OFF configuration

yes1 off 37C dotplot

the "ghost" pixels represent lower probability base pairing events, presumably. the ones we care about are above the topleft-bottomright line demarcation, in strong black.

all right, so these two together pretty much verify the actual structure predicted/verified in the P&B paper. let's go through the structural elements one by one..

verifying "ON position"

same deal as before, just now as described initially, there is the added constraint that the OBS sequence complementary to DNA-1 (in the RNA sequence for the ribozyme, this sequence is AGCUCGUCACUGUCCAGGUUCA) cannot be bound in the resulting structure. by specifying that the resultant structure leave those nts unbound, we approximate this RNA binding to DNA-1, and therefore unable to utilize those 22nts in the creation of, say, a stem IV loop. that said.. plots! and bash commands. and analysis....

the following bash commands are basically word for word repeats of those described above, just with piped filenames changed. i'm not repeating the usage descriptions. see above if necessary.

$ cat yes1.fasta yes1_constrained.txt | RNAfold -C > yes137C_ON.fold

pass constraint text file i.e. ...xxx...xx.. to RNAfold. constraint function described previously, above.

$ convert yes1_ss.ps yes1_37C_on.pdf

$ cat yes1.fasta yes1_constrained.txt | RNAfold -C -T 23 > yes123C_ON.fold

$ cat yes1.fasta yes1_constrained.txt | RNAfold -C -T 30 > yes130C_ON.fold

$ cat yes1.fasta yes1_constrained.txt | RNAfold -C -p -d2

$ convert yes1_dp.ps yes1_37C_on_dotplot.pdf

Temperature Stability of ON configuration



((((((((.......((((((...........................))))))...(((((....))))).)))))))) (-36.98)



((((((((.......((((((...........................))))))...(((((....))))).)))))))) (-32.67)



((((((((.......((((((...........................))))))...(((((....))))).)))))))) (-28.53)

visual check confirms that there are no structural differences for the minimum free energy (MFE) secondary structure predicted by RNAfold under the given constraints (for the OBS sequence complementary to DNA-1) in the approximate range of 20-40C, as given the by P&B paper specs under Methods 2.1

see, this is even clearer:

((((((((.......((((((...........................))))))...(((((....))))).)))))))) (-36.98)

((((((((.......((((((...........................))))))...(((((....))))).)))))))) (-32.67)

((((((((.......((((((...........................))))))...(((((....))))).)))))))) (-28.53)

alright, so temperature dependence is not a problem in this case either. lets go through and check stems I, II, and III (don't want IV, but check also) in the predicted MFE secondary structures. dotplots and .ps images ahead:

YES-1 ON configuration 37C Secondary Structure


this image is obviously different-looking than the one presented in P&B's paper, as we just told RNAfold to constrain the OBS to NOT be bound, while they either simulated its binding to DNA-1, the complement to the OBS described numerous times in this answer, OR just made a pretty image where they showed DNA-1 bound reverse-complement style, linearly, to the 22nt OBS. in any case, semantics aside,

notice that, as described in the P&B paper, stems I and III seem largely untouched by the stemIV or stemII changes, and binding to the OBS in and around stem II/IV and the core. stems I and III are conserved regardless, in the given temperature ranges.

these stems are exactly as predicted for the given T and -C constraints in the paper. so far, so good. dotplot analysis next.

YES-1 ON configuration 37C Dotplot


Final Analysis of YES-1 Gate

The YES-1 secondary RNA structures in both the OFF (OBS unconstrained) and ON (OBS "unbound" or "bound to DNA-1, not YES-1 RNA") configurations are thermostable in the temperature range from 20C-40C approximately. In the OFF configuration, dotplot analysis and visual structural screen confirm that the stem IV substructure inhibits the formation of the stem II core-loop structurally-stabilizing element. In the ON configuration, the OBS that largely makes up stem IV "binds" to DNA-1 in our simulation, and actually so in in vivo studies, meaning stem IV is not favored and stem II is the most probable substructure to form in that same region, allowing for core-loop stabilization via stem II and, as per lecture and in the P&B paper, cleavage at the now-exposed (in the further opened core-loop) cleavage site, "GUAG," between "A" and "G."

One more note on temperature stability. Obviously, to be truly precise I should have used RNAheat in conjunction with randomly sampled temperatures that underwent the same analysis as the 37C off/on configurations above. I think that the point of this homework question was for me to show you that I know how to use RNAfold and its basic options. In the actual paper, from what I can gather from the methodology, the whole "sample multiple temperatures rigorously" thing is exactly what they did. Consider my 3 temperature samples a proof of concept.

In conclusion, the ON and OFF configurations and their respective temperature-dependent structural stabilities correspond to the specifications outlined in the methodology and results sections of the P&B paper.

Software pseudocode & comments for verifying YES gate

Description of Problem:

Outline the design for a Perl program that verifies the correctness of a YES gate of the form given in the previous part of the homework. You don't need to actually write the program, but you should describe all the key steps, giving code snippets where appropriate. The program should take as inputs:

  1. an RNA sequence (corresponding to the YES-1 sequence from the Penchovsky-Breaker example);
  2. the co-ordinates of the oligonucleotide binding site (the OBS subsequence from the P-B example).

The output of the program should be the truth table of the logic gate.


Well, first a note on my interpretation of the problem specifications.

_FIRST AND FOREMOST: oscar, sorry that you won't be seeing much color in the lines that follow. I had to choose between using < > every time I wanted a filehandle (which I needed a lot of) and making things pretty. In the interest of readability while writing the code, I decided to make code segments enclosed in <verbatim> blocks to allow the filehandles to show through, but this means my nice formatting turned into actual text. On the bright side, the important (previously highlighted parts) are still set off by those markers._

Next, when you say "corresponding to" or "from the .. example," I obviously assume that you don't mean those sequences exactly, as then the problem would boil down to checking the argument strings against known "correct" answers. That would be trivial, and wouldn't help anyone.

I assume instead that you intend for me to lay out a basic, top level design for a program that can take

and verify that before binding of some oligonucleotide to this OBS (read: unconstrained OBS), the RNA secondary structure is in an OFF configuration (no stem II structure formed, in a 20-40C temperature range). Verify also that after binding to the OBS (read: OBS constrained as 'unbound' upon RNAfold -C call) the secondary structure contains a clear stem I, II, and III, and that the GUAC region is unbound, in the same temperature range.

This strategy, while missing a few details, is incredibly similar to the algorithm that P&B implemented in and described in their Methods section for optimizing random OBS sequence-based hammerhead ribozyme apatamer generation. So I am effectively running a GIVEN sequence against their algorithm to ensure it matches their standards. That is the point of this question as I see it.

I use the same stemI, II, III implies YES gate functionality (and therefore cleavage site accessibility) (verified via some cleavage-product isolation based study) rationale because your problem specifications describe a "YES gate of the form given in the ... homework" which till this point has meant the YES-1 gate from the P&B paper, in figure 2.a.

Finally, the output: the truth table. The reason the P&B algorithm restarts its procedural generation of data each time a screen fails is because the output from that particular YES gate is not predictable to within a certain degree of tolerance, over a given temperature range and with respect to a given OBS. with that in mind, there is no reason that errors (input errors, etc.) OR failed screens should print truth tables. These would be meaningless.. all I could say is that 0-> 0,1 but without certainty as to when 0 or 1 is produced as Yes(0) or Yes(1). Therefore, I have opted to only print the truth table upon successful verification of the yes gate; see 1.11.

Solution; Code Snippets:

#!/usr/bin/perl -w

Administrative details

The script should take as user input 3 things, an RNA sequence (argument 1) and the coordinates [low, high] (two values, two arguments) inclusive bounds on the nucleotides that make up the OBS sequence.

obviously, depending on the length of the RNA sequence, the user might or might not want to type them into the STDIN, and from what I see a lot of these RNAfold Vienna package programs just take STDIN to allow people to pipe in data from their own text files. This program will act similarly; the user should be able to pipe in all the relevant input, as one RNA sequence file and two numerical integer values representing the OBS bounds.

Appropriate exceptions (die) calls will be made if, say, the RNA file is non-existent or unreadable, and if the bounds for the OBS are invalid (outside sequence length, below zero, etc.). Since I don't have to write this, let's just say this program takes DNA or RNA sequences of arbitrary length (though the longer it is, the longer this will take) in FASTA format sequences (parsers can be nicked from BioPerl and Cpan) and that sequences with thymines and no uracils will be transcribed into RNA (reverse complemented, lc($_) and =~ tr/t/u/ globally). If that isn't acceptable, I could also write this to simply do the transliteration after lower-casing, to preserve sequence order.

The program should also print a help message when called from the shell with a '-h' tag, providing a help message with valid user input before terminating.

All right, the code for this is would not be helpful in the slightest, its just a bunch of if, elsif calls and die calls when pre-conditions are not met. Administrative stuff out of the way. Now onto the hard part.. actually verifying, now that we know the input data is accurate.

Implementing the P&B Paper Methods (which describe algorithm for screening RNA structures of YES-gate form)

At some point I have to set basic bounds on the variation that is allowed. From what I learned in the first part of the homework, the P&B YES-gate generation algorithm only allowed for variation in the OBS sequence. The easiest test, therefore, to see if the given RNA sequence is a valid YES-gate of the form described in this homework, is to check if it matches against the following regex pattern; random whitespace and \n newline characters are not acceptable between true input in my basic implementation, but such functionality can be trivially added if necessary.

valid YES-gate structure, as per P&B Methods

GGGCGACCCUGAUGAGCUUGAGUUU                         AUCAGGCGAAACGGUGAAAGCCGUAGGUUGCCC <-- ribozyme stem I, II precursor, III, core loop, &cleavage site
                          AGCUCGUCACUGUCCAGGUUCA <-- the specific sequence used in YES-1 for P&B
                          NNNNNNNNNNNNNNNN[NNNNNN] <-- 16-22 nt randomized, given certain pre-conditions outlined below

I assume also that when items are piped through the standard input STDIN, if there are multiple files, the value of <> in each successive call is not 'undef' until all files have been exhausted. so I will use one while loop, with the obvious potential modification that if <> returns undef at the end of each file, I need to introduce trivial checks for that.

Basic Parsing

# a large part of the trickiness in parsing is due to the fact that the user will pipe in 1 argument file and 2 integer arguments to STDIN. how to #deal with that? i have to make a few assumptions about how such data is piped in, in lieu of actually writing the code, which I assume was not #intended for an exercise dealing with "code snippets"

#the first 74-80 nt (16-22 nt OBS + 58 nt ribozyme stems + core elements) must be RNA elements. The reverse-complement case is a detail that is best left to actual implementation, it adds nothing to write in in here with the bigger-picture regex parsing design, etc.

#the next non \s or \n element must be a number, as must the one after it. all other commands are ignored (or i could terminate. either way).

my ($rnaSeq, $lowerOBSBound, $upperOBSBound);
# boundTag lets us know whether there is invalid input of the form agagagaga 2 agagag
$boundTag = 0;
$startSequenceTag = 0;
my $description;

while (<>) {
 # want to maintain newlines, ensure proper, rigid formatting in this limited implementation. don't chomp.

 if (m/(>.*\n)/) {
   if ($startSequenceTag) {die "multiple fasta format files inputted on STDIN";}
   $startSequenceTag = 1;
   $description = $1;

 # again, only dealing with RNA here, not DNA. that is a special case that can be implemented later if necessary. also not dealing with random 
 # whitespace formatting. its easier to show the snippet this way, to not get bogged down in details.
 # for this small scale psuedo implementation, at least, the sequence is restricted to exactly what the YES gate is defined as in the P&B ==Methods==
  # no description line or >.. invalid format. such error checks can be extended
  if (! $startSequenceTag) {die "invalid fasta format";} 
  if ($rnaSeq) {die "multiple sequences in input. invalid. error. terminated.";}
  $rnaSeq .= $1; 
 } else {die "invalid YES-gate; incorrect number of nucleotides";)
 #obviously I havent checked for the agagagaga2 case, but again, details. input errors are to be checked, but for the program outline it is useless.
 # also, I am assuming that piping opens files into the standard input with separating newlines or spaces, so this will work. the general idea is to 
 # check for two integer arguments
 if (m/(\d+)[\s\n]*(\d+)/) {
  # if either are defined, then there were multiple bounds passed and we have an error. 
  if ($lowerOBSBound || $upperOBSbound) {die "numerical arguments passed already.";}
  # if neither is defined but there is only one number.. we have a problem
  if ($1 || $2 && (! ($1 && $2))) {die "incorrect number of integer arguments passed.";}
  # now we're fine, basic errors checked
  $lowerOBSBound = $1;
  $upperOBSBound = $2;
 # more error checking... etc.

# so now we have an rnaSeq scalar string with the give RNA sequence and the necessary OBS bounds.
if ($lowerOBSBound > $upperOBSBound || 
    $lowerOBSBound > length($rnaSeq) || 
    $upperOBSBound > length($rnaSeq)) {die "invalid bounds";}

# generate on state constraint strings, make actual method calling easier.

$onStateConstraint = "";
for (my $i = 0; $i < $lowerOBSBound; $i+=1) {
 $onStateConstraint .= ".";

# i'm pretty sure these bounds are right.. 
for (my $i = 0; $i < $upperOBSBound-$lowerOBSBound; $i+=1) {
 $onStateConstraint .= "x";
# again, pretty sure about my bounds, might need to add/subtract 1. basic idea gets through.
for (my $i = 0; $i < length ($rnaSeq) - $upperOBSBound; $i += 1) {
 $onStateConstraint .= ".";

# internal error checking
if (length ($onStateConstraint) != length ($offStateConstraint) &&
    length ($onStateConstraint) != length ($rnaSeq)) {die "bounds wrong on constraint generation for loops.";}

# now we can actually start following the rest of the P&B "check if a sequence is a valid YES-gate" algorithm


# 1.1 & 1.2 have been done for us as we are given a sequence with an OBS user-specified.

#1.3 in Methods describes calculating the free energy at 37C of the proper SETSEQUENCE[16-20 nt OBS]SETSEQUENCE RNA using the partition function -p.

#open a filehandle to an RNAfold program execution with tag -p (partition function & free energy calculation) with consistent energy functions (-d2)
open RNAFOLD, "| RNAfold -p -d2 |";

$fastaRNA = ">myprog\n" . $rnaSeq;
print RNAFOLD $fastaRNA;


# 1.4 make sure that nt 3-9 (numbering referenced to another paper, reference #52, in figure 1.a) 
# of the "hammerhead core" are base-paired in the off configuration. 

# now we have a filehandle to RNAfold which is printing to STDOUT that we can read via <RNAFOLD> and we can therefore parse the RNAFOLD output
# to find the partition function's returned free energy value and store it to a special variable, $free energy. 

# furthermore, we have access to the dominant structures from the dot matrix for probabilities of pairing (the .ps file of the form myprog_dp.ps) 
# which we have to parse out of the STDOUT, as the -p -d2 option prints the dominant forms ((..))... form secondary structure to the STDOUT.
# if we want to find out whether the necessary base-pairs are formed dominantly in the off state... we can easily parse this out of STDOUT.

while (<RNAFOLD>) {
  if (m/>.*\n/) {break;}
my $energy;
while (<RNAFOLD>) {
  # match to current structure (of the form ((((((...))) .... [energy value]. the digits it matches to are approximate form. for example, 2.3 matches, 
  # but .3 doesnt. that is also a reasonably easy fix though, just more | or cases. 
  $_ =~ m/(.*)\[([+-]\d+\.\d*)\]/i
  if (! $energy || $2 && $energy > $2) {
   $energy = $2;
   $mostProbableStructure = $1; # next line was one of most probable structures, or dominant OFF states

# so just found the minimum energy, or dominant state, MFE secondary structure for the OFF configuration, now 
# 1.4 parse the necessary nts from $mostProbableStructure, 3-9 via the referenced numbering system, and see if they are ( < or { as opposed to . if 
# they are base-paired, move on. otherwise quit, the given sequence is not valid. (this code is not included.. but consider this pseudo-code)

close RNAFOLD;


#1.5 use my $onStateConstraint string to set up ON configuration
open RNAFOLD_On, "| RNAfold -p -C -d2 |";

# not entirely sure I can do this. but it waits for me to enter first a sequence then a constraint in the normal stdin prompt.. so maybe it works.
print RNAFOLD_On $fastaRNA;
print RNAFOLD_On $onStateConstraint;

1.6, 1.7

# 1.6,1.7 then the same code above applies, with small changes as we are now looking for stems, so more than just 3-9 bps. 
# I won't rewrite the code, but just as we found the lowest energy structure and made sure that certain elements were
# base paired in the final structure, in the ON configuration, we want to make sure that stem I, stem II, stem III are all basepaired accurately, and 
# furthermore we want to make sure that the cleavage site is NOT base paired. then we can move on. in the end, it boils down to the same type of 
# "extract energies, find minimum, extract dominant sequence, then check specific locations of known sequence for basepairing, related to stems, etc."

close RNAFOLD_On;


# 1.8 now if the sequence has gotten this far we need to make sure that in the OFF state that if <30% or >70% of the OBS nts bp without an effector,
# again, in the OFF state, then we have problems. assuming I saved the dominant, lowest energy structure, and since I still have the bounds on the OBS,
# i can check the percentage that form base-pairs of some sort in the off state. lets still call it $mostProbableStructure.. assume in the 1.6,1.7 
# setup I called the variable something different.

$OBS_Struc_String = substr $mostProbableStructure, $lowerOBSBound, $upperOBSBound - $lowerOBSBound;
@OBS_Structure = split (//, $OBS_Struc_String);
$OBS_Length = @OBS_Structure + 0;
$OBS_BPs = 0;
for (my $i = 0; $i < $OBS_Length; $i+=1) {
 #i'm not including secondary base pair notation like { } | etc. though | is a type of constraint too... hm. well, again, this is just proof of  
 if ($OBS_Structure[$i] =~ m/[\(\)]/) {$OBS_BPs++;}

$fract_OBSbps = $OBS_BPs / $OBS_Length;
if ($fract_OBSbps < 0.3 || $fract_OBSbps > 0.7) {die "invalid YES-gate structure, too many OBS base pairs.";}


# 1.9 if the energy gap calculated between OFF and ON states ($energy and .. whatever I set it to in the referenced over 1.6,1.7 section) is 
# not between -6, -10 kcal / mol the sequence is invalid. the actual algorithm suggests going back to 1.1, but we just want to verify, not generate.
# lets just assume I've set $energy_Off and $energy_On to the respective values. see the partial implementation of 1.4 for what a given $energy is 
# (dominant structure energy from -p -d2 RNAfold stdoutput. 

# assuming energy values from RNAfold given in standard units, kcal / mol
if ($energy_On - $energy_Off < -10 || $energy_On - $energy_Off > -6) {die "energy gaps for OBS & RNA sequence are too high between ON and OFF";}


# 1.10 suggests running the program on RNAheat in 20-40C range, ensure that dominant off/on structures preserved in that range. 
# only problem is that RNAheat returns specific heats, not energys that I've previously calculated, and it doesn't as far as the 
# man page shows, have a -C function to allow me to specify the ON configuration. so I must have an older version than the one they used.
# anyways, the other way to test the same stability vs. temperature is to iterate through and calculate/compare manually. 
# i am going to make use of $mostProbableStructure_Off and $mostProbableStructure_On which refer the respective variants of the $mostProbableStructure
# generated in 1.4

$temp = 20;
for (; $temp < 41; $temp +=1) {

 open RNA_1_10_Off, "| RNAfold -T $temp |";

 # this is straight from 1.4, but in a final implementation, this would be properly modularized as a general "get most probable structure" subroutine
 print RNA_1_10_Off $fastaRNA;
 while (<RNA_1_10_Off>) {
   if (m/>.*\n/) {break;}
 my $energy_Off;
 my $tempProb_Off;
 while (<RNA_1_10_Off>) {
  $_ =~ m/(.*)\[([+-]\d+\.\d*)\]/i
  if (! $energy_Off || $2 && $energy_Off > $2) {
   $energy_Off = $2;
   $tempProb_Off = $1; 
 if ($tempProb_Off ne $mostProbableStructure_Off) {die "instable OFF state in varying temperature";}

 close RNA_1_10_Off;

 open RNA_1_10_On, "| RNAfold -C -T $temp |";
 print RNA_1_10_On $fastaRNA;
 print RNA_1_10_On $onStateConstraint;

 while (<RNA_1_10_On>) {
   if (m/>.*\n/) {break;}
 my $energy_On;
 my $tempProb_On;
 while (<RNA_1_10_On>) {
  $_ =~ m/(.*)\[([+-]\d+\.\d*)\]/i
  if (! $energy_On || $2 && $energy_On > $2) {
   $energy_On = $2;
   $tempProb_On = $1; 
 if ($tempProb_On ne $mostProbableStructure_On) {die "instable ON state in varying temperature";}
 close RNA_1_10_On;


# 1.11 if the ensemble diversity does not exceed 9 units for the OFF and ON states, it is a valid YES-gate structure. otherwise die, there is too
# much secondary structure variability. 
# need -p and -d2 to get standard energy partition and to print ensemble diversity to stdout, so it can be parsed.

open RNA_1_11_On, "| RNAfold -C -p -d2|";
print RNA_1_11_On $fastaRNA;
print RNA_1_11_On $onStateConstraint;

# we want to keep parsing till we get to a stage where we match to "ensemble diversity x.xx" where x are numbers. 
# i'm skewing the matching towards 2.3ish things again, vs. .3, which won't be caught, but again, its a matter of minutes
# to add in the relevant other cases (i.e. 2.3, .3, 2., and such if Perl doubles are anything like Java doubles).
# here it would unnecessary clutter the code, i just want the point to get across. 

while (<RNA_1_11_On>) {
 if ($_ =~ m/ensemble\sdiversity\s(\d+\.\d*)/) {
   $ens_On = $1;
close RNA_1_11_On;

open RNA_1_11_Off, "| RNAfold -p -d2|";
print RNA_1_11_Off $fastaRNA;
while (<RNA_1_11_Off>) {
 if ($_ =~ m/ensemble\sdiversity\s(\d+\.\d*)/) {
   $ens_Off = $1;
close RNA_1_11_Off;
# does not exceed is inclusive of the upper bound
if (! ($ens_On <= 9 && $ens_Off <= 9)) {die "ensemble diversity too high, sequence too structurally variable.";}

# else success, FINALLY print truth table (if there are errors, what is the use in printing? it isn't a truth table, and if the 
# rna structure has less than specified stability/specificity, there is no telling if there will be quantifiable, reproducible results
# every time. 

# print to STDOUT
$truthTable = "YES Gate Verified. For input A, YES(A) = A. Logic Function Confirmed\n";
$truthTable .= "A | YES(A)\n";
$truthTable .= "0 | 0\n";
$truthTable .= "1 | 1\n";
$truthTable .= "\n";
$truthTable .= "Program Terminated Successfully. Exit Code (0)\n";
print $truthTable; 

Stage 2: Reasons Implementation is Unnecessary

# stage 2 screening as described in ==Methods== by P&B
# is meant for generating distinct subsequences of OBS to generate variants of each OBS-YES-1 gate. if our sequence passes stage 1 testing, 
# as described in P&B ==Methods== then there are no more problems; we are not trying to procedurally generate new variants, we just wanted to verify 
# it was a valid Yes-Gate structure. If it passes the stage1 screens, for our purposes, it is. 

Program Skeleton & Pseudo-code & Comments Complete

# thus at this point, the problem is complete. in following 1.1 -> 1.11 of ==Methods== in the paper, if every time those
# methods call for "returning to step 1.x," we ensure program termination, then we are effectively screening a given sequence from the user
# against the criteria for valid Yes-Gate structure & stability -- verifying the structure of the sequence, with previous assumptions (stated above)
# holding. 

Hammerhead ribozyme structure

Problem Description:

Predict the MFE structure of the AF404053.1 hammerhead ribozyme sequence (nt 70-186) and compare to the structure given in Rfam (link given to type 1 hammerhead enzyme family in Rfam).


First things first: the sequences at Rfam do not include the species we were given as a homework problem. The closest to our particular species is 
a type I hammerhead ribozyme from the ==cf59== variety of the avocado viroid we are studying (cf39). therefore, this will be more comparative than
definitive. i will try and get a range for how this particular variety folds in different temperatures, pick a somewhat representative one, 
and compare against the "structure" given in Rfam where here I take structure to mean the composite, aggregate structures composed in Rfam. Keep in mind that since cf39 and cf59 are different species, it is possible that there are significant structural differences.

actually doing the structure prediction is exactly the same as in the first problem. i've copied the commands for ease
of view. i won't bother with constraints, these aren't engineered apatamers.. and RNAheat is not useful here, I don't have any specific goal in mind. I just want a  structure and a dotplot.

i saved the hammerhead 70-186 nt fragment to a .fasta file straight from Pubmed, but changed the description tag so my filenames turned out nice.
furthermore, since Rfold requires that even long sequences maintain single lines for a given sequence, without linebreaks, i needed to
change the file from fasta formatting to single-line sequence. i didnt have to do this before since it was a (max) 80 nt file for the last few problems.
file hammerheadcf39.fasta

$ cat hammerheadcf39.fasta | RNAfold -p -d2 > hhcf39_37.txt

$ cat hammerheadcf39.fasta | RNAfold -p -d2 -T 23 > hh23.txt

$ cat hammerheadcf39.fasta | RNAfold -p -d2 -T 30 > hh30.txt

so.. first off. structures at different temperatures, ranging from 20-40C. is the predicted structure stable?



...........((((.((((.((....(((.(((.....(.(((((((...((((......)))))))))))).)))))).))))))))))....(((((.........)))))... (-38.65)

frequency of mfe structure in ensemble 0.00443165; ensemble diversity 26.38



...........((((.((((.((((....))))((.((((.(((((((...((((......)))))))))))..))..)).))))))))))....(((((.........)))))... (-31.28)

frequency of mfe structure in ensemble 0.00271294; ensemble diversity 24.74



...........((((.(((((((((....)))))....((.(((((((...((((......)))))))))))..)).......))))))))....(((((.........)))))... (-24.80)

frequency of mfe structure in ensemble 0.00435267; ensemble diversity 23.79

ensemble diversity seems to drop as we increase Temp from 23->37, so structural variability decreases over that temperature span, to judge by how P&B described the term ensemble diversity.

...........((((.((((.((....(((.(((.....(.(((((((...((((......)))))))))))).)))))).))))))))))....(((((.........)))))... (-38.65)

...........((((.((((.((((....))))((.((((.(((((((...((((......)))))))))))..))..)).))))))))))....(((((.........)))))... (-31.28)

...........((((.(((((((((....)))))....((.(((((((...((((......)))))))))))..)).......))))))))....(((((.........)))))... (-24.80)

staring at these for a little while lets you see pretty quickly that the structure of the loops changes significantly with temperature.
the general structure is preserved but I would estimate at least 7% of bp changes between 23->37C. Therefore, I think we should go with
Temp = 23C just because the it is apparently the most stable of the lot (lowest energy kcal/mol) and in any case, this is comparative. 
the fact that there is such high ensemble diversity and the fact that the MFE structure changes so visibly over a 20C delta 
is weird but there's no reason that the structure hasn't evolved to be flexible. 
if it turns out that the 23C version has little in common with those secondary structures on Rfam, I can always re-analyze.
_$ cat hammerheadcf39.fasta | RNAfold -p -d2 -T 23

$ convert hhcf39_ss.ps hhcf39.pdf

$ convert hhcf39_dp.ps hhcf39_dp.pdf

In order from Left to Right: Hammerhead Ribozyme from Avocado Viroid genome nt 70-186, RNAfold -p -d2 -T [10, 16, 23, 30, 37, 47]

hh10C.JPG hh16C.JPG hh23C.JPG hh30C.JPG hh37C.JPG hh47C.JPG

i did this many because i noticed the variability (extreme, in some cases) due to temperature. before we start comparing, we need to choose a particular temperature as our "baseline" structural value. 

I would say that as this is a plant viroid, its natural temperature of expression is around 10-25C tops. so something in the range of 16-23C would be acceptable. 

that said, to match up to anything Rfam has to offer we would also expect some sort of stem-based 5' 3' loose end interaction, but interestingly what we have here is more reminiscent of a quasi-loop closure.. which really means that the stem is farther up, and the 5' 3' non-paired ends are just unpaired, excess nucleotides. the other alternative is that this is a new structure that we haven't seen before but I just notice that the structure gets farther and farther from the ideal 3stem/coreloop/cleavage site set up the more extreme the temperatures get, and that there are recognizable elements of the ribozyme structure in this sequence, given appropriate temperature ranges. 

t = 16 has elements of an acceptable 3stem structure, though the core loop is tight, but one of its stems has a region (as do pretty much all the others) with many overlapping bases, which seems unlikely.. though it was clearly a low energy structure in comparison. the issue is that I feel like I am stretching the physical bounds to get a structural image in the form of that which we viewed in the earlier parts of the homework. 

one interesting note on reverse complemented nonsense RNA (as far as I know):

in order from left to right: reverse complemented DNA sequence for the rev-com RNA sequence; Temp = 23C. [Structure, Dotplot]

hh_revcom_23C.JPG hh_revcom_dp_23C.JPG

these in particular are interesting because I notice that at a reasonable temperature there are the exact three stems, core loop, etc. 
that I've grown to know and love. Of course, I know from the online pubmed citation that the 117bp sequence is from a viral RNA genome, which means reverse-complementing it is meaningless in this case. still.. its interesting, that a meaningless sequence would have an appropriate structure.

picking a test structure for final comparison to Rfam

the 37C normal (not rev-commed) loop, though a little on the high end of temperatures for a plant viroid, seems to have the proper structure for a type-1 hammerhead ribozyme if we rotate the image across the axis pointing up through its plane.

specifically, consider the following structure & dotplot.

hh37C.JPG hh37C_dp.JPG

rotate the structure in your mind such that the leftmost stem out of the central core loop is pointing to the right. then the bottommost stem 
is pointing at an angle to the left, and the top stem, which branches on one line (once the stem stops being a double helix) to another stem/loop
structure, is now actually on top of the core loop. this structure then corresponds to the stemI, II, III around the core loop that we've come 
to know and expect from standard hammerhead type1 ribozymes. 

i say type 1 specifically (though I have just shown through images that the loop numbering comes out right) because we now have a case where
the ribozyme is composed of three stems, the top one of which (Stem I) is not closed, with 5' and 3' ends unpaired, in this case with one end extending
to further structures of its own. 

lets take a look at what t=37C gives us in terms of structure and dotplot... get a better feel for these values. 
the fact that the ensemble diversity readings are so much higher than engineered varieties indicates that we should see either various strong overlapping enclosure diagonals or many weak ones. some setup where there are multiple close-MFE energy values. 

notice that in the dotplot there are way, way more ghost pixels than we've seen previously in the engineered ribozymes. this might in part be because
of the higher temperatures, but I do notice that even with the reverse complemented version, which forms a similar structure around 23C, there is still a lot of noise.

perhaps in part this is due to unknown ligands. perhaps although the structure we see is the most stable structure, there are more stable or more
predictably stable (less ghost pixels, less alternative probable conformations altogether) due to interactions with unknown ligands. 

in any case, though there are varying spreads of ghost pixels it is clear from the bottom-triangular region (below grey line) that those most probable
have in fact been selected for the resultant structure. 

comparison to Rfam consensus structures

I am looking through Rfam Hammerhead Type-1 secondary structures' consensus sequences and secondary structure plots. I am not really considering
specifically the closest relative on the list (assumed the cf59 species) because Rfam posts the secondary structure descriptions for the ribozyme family as a whole.

Notice that the 'bpcons' notes only the 3 stems circling the core loop as having significantly conserved base pairing. in our particular case, in the 
rotated structure for cf39, stem II and stem I both have far more base pairs than noted in the consensus, but 
this might be because the bp_consensus is more of a minimal guideline than an upper bound. as another note, consider that the 'cov' plot shows that, again, the three stems are the most conserved sequences/base pairing regions of the entire structure, with "red" as 
a high covariance scaling color, all the stems are blue/green.

the 'norm' view just confirms that type-1 hammerhead ribozymes have stem 2 and 3 closed, with stem 1 open, with dangling 5' and 3' ends. again, this
seems more like a minimum bound than an upper bound, and it seems likely that the stem 1 structure in cf39, our sample, with the elongated end with its own stem/loop structure dangling, is legitimate, if not the most conserved structure. 

finally, we get to the 'cons' plot, describing sequence conservation. this comparison in general should be taken with a grain of salt; the cons plot 
is of a consensus sequence for the hammerhead type-1 ribozyme taken from the documented sequences in the RFam database. sequences do not need to have 
identity with the consensus to fall within the family declaration; structure determines function, and sequences are allowed to vary. what I do notice
though is that from the sequences Rfam has noted, the core loop sequence is colored red, for the most part, indicating very little room or flexibility
with respect to that core sequence. comparison to our predicted structure at 37C indicates that CUGA GAGC GAAAU, the darkest red/orange of the 
conserved sequences in the core loop ringed by the three stems, CUGC GC GAA are present in the loop structure, indicating one of two things: either

   1 the loop structure as folded is in an inactive state, as its core loop is in part deviating from the identity consensus core loop 
   2 the core loop structure is in an active state as folded, it requires no extra ligand/it is not an apatamer setup, and the consensus is not as strict for functionality as I would have assumed.

the first seems more likely; if structure determines function, and function is preserved evolutionarily, there should be a good reason why variants of this ribozyme across multiple species have the same consensus core loop. 

if our predicted structure at 37C (and mind you, the others are worse, without a core loop at all)
does not generate a close enough (who defines close enough?) match to the consensus identity core loop sequence when its MFE structure is predicted 
alone, then maybe an external ligand is not being taken into account, preventing certain residues from base pairing and significantly altering the 
core loop structure. it seems unlikely that the stem structures would be significantly modified, but stem II is clearly capable of being played with, 
and it is quite a bit longer than expected, to judge by consensus minimum bounds. so the same OBS situation as in the first parts of the hw could 
apply, with stemII being stabilized by some external oligonucleotide or ligand, allowing a better core loop structure to form. 

Conclusions for Comparative Analysis

The cf39 avocado viroid hammerhead type-1 ribozyme deviates from the consensus identity sequences for all three stems and the core loop structure
in its minimum free energy state of folding as predicted by RNAfold with partition option -p -d2 at the default 37C. 

However, at 37C (which might not be a reasonable physical temperature for a plant outside a greenhouse) it is clear from a pictorial representation of 
the secondary structure and the dotplot (with ghost pixels prevalent but not nearing the probability levels of the MFE structures) that the 
cf39 ribozyme is composed of three distinct stem loops, with the topmost one (read: stem I) containing the dangling 5' and 3' ends of the RNA sequence ,as required for a type-1 hammerhead ribozyme. 

Thus, the general structure is we would expect from comparison to the consensus structures on RFam are preserved without the actual consensus sequences
being conserved, even partially in some stem cases, though with degeneracy obviously certain residues were conserved. 

The general feeling I get from this structure is that stem II, the bottommost stem in the picture above, pre-mental-rotation, is long, long enough
to participate in the same sort of ligand/OBS response function as described in the Breaker paper. Perhaps the predicted RNAfold MFE structure is true
in the "off" state, but for the cf39 type-1 hammerhead ribozyme to achieve cleavage, I feel like the core loop sequences must be better preserved,
and that perhaps with additional constraints on the binding of the stemII structure it could be intuited which OBS/ligand binding induces proper 
stem II stability & core loop formation. 

the alternative is that perfect identity to the consensus core loop is not required for function and that this particular viroid ribozyme is just
adapted to cleave with a longer stem II. perhaps the cleavage site is even different..

The conclusion is effectively that through consensus sequence/structure/bp models we can intuit that the general structure 
of cf39 preserves the type-1 hammerhead ribozyme prerequisites, but without further analysis of the situation it is unclear 
whether our particular MFE is in the OFF or ON configuration, or whether there is a difference between the two.. whether 
there are more favorable conditions that our particular phage has adapted to.

End of RNA folding Homework

I Attachment History Action Size Date Who Comment
JpgJPG hh10C.JPG r1 manage 27.7 K 2009-10-16 - 16:52 SushantSundaresh  
JpgJPG hh16C.JPG r1 manage 30.4 K 2009-10-16 - 17:04 SushantSundaresh  
JpgJPG hh23C.JPG r1 manage 30.8 K 2009-10-16 - 15:27 SushantSundaresh  
JpgJPG hh30C.JPG r1 manage 27.2 K 2009-10-16 - 16:54 SushantSundaresh  
JpgJPG hh37C.JPG r1 manage 36.1 K 2009-10-16 - 16:55 SushantSundaresh  
JpgJPG hh37C_dp.JPG r1 manage 118.2 K 2009-10-16 - 22:04 SushantSundaresh  
JpgJPG hh47C.JPG r1 manage 37.0 K 2009-10-16 - 16:56 SushantSundaresh  
JpgJPG hh_revcom_23C.JPG r1 manage 32.5 K 2009-10-16 - 21:42 SushantSundaresh  
JpgJPG hh_revcom_dp_23C.JPG r1 manage 115.2 K 2009-10-16 - 21:43 SushantSundaresh  
JpgJPG yes1_37C_off.JPG r1 manage 51.3 K 2009-10-16 - 06:45 SushantSundaresh yes1-off-37C secondary structure hw4 rna folding
JpgJPG yes1_37C_off_dotplot.JPG r2 r1 manage 97.2 K 2009-10-16 - 07:14 SushantSundaresh yes1 off 37C dotplot
JpgJPG yes1_37C_on.JPG r1 manage 32.7 K 2009-10-16 - 08:14 SushantSundaresh  
JpgJPG yes1_37C_on_dotplot.JPG r1 manage 88.6 K 2009-10-16 - 08:14 SushantSundaresh  

----- Revision r80 - 2009-10-20 - 04:48:50 - SushantSundaresh
Biowiki content is in the public domain.
Comments on this site? SushantSundareshHomeWork4">Send feedback