Research    Teaching    Fall10 


Sean McFarland: Homework 5 (Nussinov Algorithm)
Nussinov Algorithm ImplementationThe program I wrote to implement the Nussinov algorithm for basepair reporting is shown below (and attached as nussinov.pl). Please note that it takes its sequence argument from the command line:
#!/usr/bin/perl # Get the start time. $start = time(); use List::Util qw(max); # Check the user submitted a sequence. if (!exists($ARGV[0])) { print "There was no sequence submitted. To use this program, enter the \nsequence as an argument, for example:\n\n\t\$ nussinov.pl AAAAUUUU\n\n"; exit; } # Read in the sequence and ensure it is only A, C, G, or U. $seq = $ARGV[0]; chomp($seq); $seq = uc($seq); if ($seq !~ m/[A,C,G,U]+$/) { print "The sequence you entered had characters other than A, C, G, or U. \nPlease enter a sequence containing only these chars (upper or lower).\n\n"; exit; } # Calculate the length, then generate a list of each nucleotide. $length = length($seq); @sequence = split("", $seq); # Initiate the Nussinov 2d matrix with zeroes on the main diagonal and the one below it. for ($i = 0; $i < $length  1; $i++) { $array[$i][$i] = 0; $array[$i+1][$i] = 0; } $array[$length1][$length1] = 0; # Loop through the remaining positions diagonally and calculate the maximum pairings. # Thus, at each position, generate a list of possible scores and pick the max. # For each diagonal. for ($d = 0; $d < $length  1; $d++) { # For position in the respective diagonal. for ($p = 0; $p < $length  1  $d; $p++) { $i = 0 + $p; $j = 1 + $d + $p; # Clear the scores for that position. @scores = (); # Generate the S(i + 1, j  1) + w(i, j) score. push (@scores, $array[$i+1][$j1] + paircheck()); # Generate the S(i + 1, j) score. push (@scores, $array[$i+1][$j]); # Generate the S(i, j  1) score. push (@scores, $array[$i][$j1]); # Generate the max (S(i,k) + S(k + 1, j)) score for each k such that i<k<j. if (($j  $i) > 2) { for ($k = $i + 1; $k < $j1; $k++) { push (@scores, $array[$i][$k] + $array[$k+1][$j]); } } # print "$i, $j = @scores\n"; # Pick the max and store for that position. $array[$i][$j] = max (@scores); } } # Subroutine to check for base pairing between two bases. sub paircheck { if (($sequence[$i] eq "A") && ($sequence[$j] eq "U")) { return 1; } elsif (($sequence[$i] eq "C") && ($sequence[$j] eq "G")) { return 1; } elsif (($sequence[$i] eq "G") && ($sequence[$j] eq "C")) { return 1; } elsif (($sequence[$i] eq "U") && ($sequence[$j] eq "A")) { return 1; } else { return 0; } } # Get the end time and calculate the elapsed. $end = time(); $elapsed = $end  $start; print "The max # of bp's = $array[0][$length1]\n"; print "The elapsed time = $elapsed seconds.\n\n"; There's not really a whole lot to say about it, as I didn't do anything fancy. To my knowledge with the tests I've done, it accepts A,C,G,U (or a,c,g,u) input off of the command line, runs through the sequence using Nussinov, and reports the maximum number of potential base pairs. I also have it print the elapsed time at the end.
Benchmarking the ProgramFor testing the performance of nussinov.pl, I wrote a quick random sequence generator (randseq.pl):
$length = $ARGV[0]; for ($i = 0; $i < $length; $i++) { $number = rand(3); if ($number < 1) { push (@seq, "A"); } elsif ($number < 2) { push (@seq, "C"); } elsif ($number < 3) { push (@seq, "G"); } else { push (@seq, "U"); } } open (FILEOUT, ">>tempseq.txt"); $sequence = join ("", @seq); print FILEOUT "$length\n$sequence\n\n"; close(FILEOUT); From this, I generated sequences for 8, 16, 32, 64, 128, 256, 512, 1024, and 2048 nucleotides (see attached file, tempseq.txt); These were then fed into nussinov.pl, and the times were recorded. I'm not sure if it matters too much, but I ran these on my laptop with 4GB RAM and ~2.4 GHz quad core (though the program obviously only uses one core at a time). Below, find a graph of the results, followed by a table of the raw data:
Clearly the number of base pairs definitely adds to the calculation time. Which leads to perhaps the most interesting part about this; it almost exactly mirrors the predicted trend. That is, the complexity of the calculation increases as N^3 (note the equation given on the graph is a bread and butter 3rd order polynomial, and that the R^2 has 4 9's!).
Generating Nonfolding SequencesFor generating an optimized nonfolding sequence, I mocked up a simple geneticesque algorithm (see attached genetic.pl) that accepts the length of the desired sequence as its argument (I used 100, must be at least 80), and uses a population of 5 parents. Parents were generated by adding 20 A's, C's, G's, and U's, then random nuc's for the rest and FisherYates shuffling. Their fitness was gauged based upon which possessed the lowest Nussinov score. The relative fitnesses were then used to choose parents for generating each of the 5 children in the next generation (with the chosen parents having an equal chance to contribute their nucleotide for any given position). In the interest of time and simplicity, I didn't allow for any mutation events (or anything else fancy, like the possibility of crossgenerational breeding). For the termination condition of 10 consecutive generations with no improvement in fitness, I got the following results:
I then tried it for 5 and 20 nonfitter generations, respectively:
It would appear I either got lucky just randomly picking 10 the first time around, or got lucky with the random sequences, as that condition seemed to generate the most consistently optimized outcome. They are all probably within the margin of error of each other, however, so the generations of nonfitter individuals may be able to be reduced beyond 5 without much loss in optimization, but with significant pickup in time. Though I would have thought increased generation time would be more consistently optimized, it may be the case that for a 100 nucleotide sequence the best achievable is somewhere on the order of 2022, though of this I cannot be sure. At least in terms of the set generation termination number, it would appear complexity increases as N. I did not examine how changing sequence length would affect calculation times, but it would be easy to do by varying the command line argument. I'd imagine that the Nussinov calculation is still the most ratelimiting step, so it would probably go ~ as N^3.
 SeanMcFarland  21 Oct 2010 
 

Fall10.SeanMcFarlandHomework5 r13  20101025  02:32:28  SeanMcFarland  Biowiki content is in the public domain. Comments on this site? SeanMcFarlandHomework5">Send feedback 