click on the Biowiki logo to go to homepage Home Home | EditEdit | Attach Attach | New New | Site Map Site Map | Help Help
Research | Teaching | Fall10
Biowiki > Fall10 > TWikiUsers > SeanMcFarland > SeanMcFarlandHomework5

Search

Advanced search...

Topics
BioE131
BioE241
%SESSION_IF_AUTHENTICATED% %GET_SESSION_VARIABLE{AUTHUSER}% %SESSION_ELSE% Login
Register %SESSION_ENDIF%
Changes
Help

Sean McFarland: Homework 5 (Nussinov Algorithm)



Nussinov Algorithm Implementation

The program I wrote to implement the Nussinov algorithm for base-pair reporting is shown below (and attached as nussinov.pl). Please note that it takes its sequence argument from the command line:

#!/usr/bin/perl

# Get the start time.
$start = time();

use List::Util qw(max);

# Check the user submitted a sequence.
if (!exists($ARGV[0])) {
   print "There was no sequence submitted. To use this program, enter the
               \nsequence as an argument, for example:\n\n\t\$ nussinov.pl AAAAUUUU\n\n";
   exit; 
}

# Read in the sequence and ensure it is only A, C, G, or U.
$seq = $ARGV[0];
chomp($seq);
$seq = uc($seq);

if ($seq !~ m/[A,C,G,U]+$/) {
   print "The sequence you entered had characters other than A, C, G, or U.
               \nPlease enter a sequence containing only these chars (upper or lower).\n\n";
   exit;
}

# Calculate the length, then generate a list of each nucleotide.
$length = length($seq);
@sequence = split("", $seq);

# Initiate the Nussinov 2d matrix with zeroes on the main diagonal and the one below it.
for ($i = 0; $i < $length - 1; $i++) {
   $array[$i][$i] = 0;
   $array[$i+1][$i] = 0;
}
$array[$length-1][$length-1] = 0;

# Loop through the remaining positions diagonally and calculate the maximum pairings.
# Thus, at each position, generate a list of possible scores and pick the max.
# For each diagonal.
for ($d = 0; $d < $length - 1; $d++) {
   # For position in the respective diagonal.
   for ($p = 0; $p < $length - 1 - $d; $p++) {
      
      $i = 0 + $p;
      $j = 1 + $d + $p;

      # Clear the scores for that position.
      @scores = ();

      # Generate the S(i + 1, j - 1) + w(i, j) score.
      push (@scores, $array[$i+1][$j-1] + paircheck());

      # Generate the S(i + 1, j) score.
      push (@scores, $array[$i+1][$j]);

      # Generate the S(i, j - 1) score.
      push (@scores, $array[$i][$j-1]);

      # Generate the max (S(i,k) + S(k + 1, j)) score for each k such that i<k<j.
      if (($j - $i) > 2) {
         for ($k = $i + 1; $k < $j-1; $k++) {
            push (@scores, $array[$i][$k] + $array[$k+1][$j]);
         }
      }

      # print "$i, $j = @scores\n";

      # Pick the max and store for that position.
      $array[$i][$j] = max (@scores);
   }
}

# Subroutine to check for base pairing between two bases.
sub paircheck {
   if (($sequence[$i] eq "A") && ($sequence[$j] eq "U")) {
      return 1;
   }
   elsif (($sequence[$i] eq "C") && ($sequence[$j] eq "G")) {
      return 1;
   }
   elsif (($sequence[$i] eq "G") && ($sequence[$j] eq "C")) {
      return 1;
   }
   elsif (($sequence[$i] eq "U") && ($sequence[$j] eq "A")) {
      return 1;
   }
   else {
      return 0;
   }
}

# Get the end time and calculate the elapsed.
$end = time();
$elapsed = $end - $start;

print "The max # of bp's = $array[0][$length-1]\n";
print "The elapsed time = $elapsed seconds.\n\n";

There's not really a whole lot to say about it, as I didn't do anything fancy. To my knowledge with the tests I've done, it accepts A,C,G,U (or a,c,g,u) input off of the command line, runs through the sequence using Nussinov, and reports the maximum number of potential base pairs. I also have it print the elapsed time at the end.


Benchmarking the Program

For testing the performance of nussinov.pl, I wrote a quick random sequence generator (randseq.pl):


$length = $ARGV[0];

for ($i = 0; $i < $length; $i++) {
   $number = rand(3);
   if ($number < 1) {
      push (@seq, "A");
   }
   elsif ($number < 2) {
      push (@seq, "C");
   }
   elsif ($number < 3) {
      push (@seq, "G");
   }
   else {
      push (@seq, "U");
   }
}

open (FILEOUT, ">>tempseq.txt");

$sequence = join ("", @seq);

print FILEOUT "$length\n$sequence\n\n";

close(FILEOUT);

From this, I generated sequences for 8, 16, 32, 64, 128, 256, 512, 1024, and 2048 nucleotides (see attached file, tempseq.txt);

These were then fed into nussinov.pl, and the times were recorded. I'm not sure if it matters too much, but I ran these on my laptop with 4GB RAM and ~2.4 GHz quad core (though the program obviously only uses one core at a time). Below, find a graph of the results, followed by a table of the raw data:

Graph.png

# of Nucs Reported BPs Elapsed Time (s)
8 1 0
16 5 0
32 10 0
64 24 0
128 45 0
256 83 2
512 159 18
1024 321 150
2048 665 1257

Clearly the number of base pairs definitely adds to the calculation time. Which leads to perhaps the most interesting part about this; it almost exactly mirrors the predicted trend. That is, the complexity of the calculation increases as N^3 (note the equation given on the graph is a bread and butter 3rd order polynomial, and that the R^2 has 4 9's!).


Generating Non-folding Sequences

For generating an optimized non-folding sequence, I mocked up a simple genetic-esque algorithm (see attached genetic.pl) that accepts the length of the desired sequence as its argument (I used 100, must be at least 80), and uses a population of 5 parents. Parents were generated by adding 20 A's, C's, G's, and U's, then random nuc's for the rest and Fisher-Yates shuffling. Their fitness was gauged based upon which possessed the lowest Nussinov score. The relative fitnesses were then used to choose parents for generating each of the 5 children in the next generation (with the chosen parents having an equal chance to contribute their nucleotide for any given position). In the interest of time and simplicity, I didn't allow for any mutation events (or anything else fancy, like the possibility of cross-generational breeding). For the termination condition of 10 consecutive generations with no improvement in fitness, I got the following results:

# of bp's Time (s)
23 8
22 9
22 8
25 8
21 9

I then tried it for 5 and 20 non-fitter generations, respectively:

# of bp's Time (s)
24 4
23 5
25 4
28 4
23 4

# of bp's Time (s)
26 15
23 15
22 16
24 15
25 16

It would appear I either got lucky just randomly picking 10 the first time around, or got lucky with the random sequences, as that condition seemed to generate the most consistently optimized outcome. They are all probably within the margin of error of each other, however, so the generations of non-fitter individuals may be able to be reduced beyond 5 without much loss in optimization, but with significant pickup in time. Though I would have thought increased generation time would be more consistently optimized, it may be the case that for a 100 nucleotide sequence the best achievable is somewhere on the order of 20-22, though of this I cannot be sure.

At least in terms of the set generation termination number, it would appear complexity increases as N. I did not examine how changing sequence length would affect calculation times, but it would be easy to do by varying the command line argument. I'd imagine that the Nussinov calculation is still the most rate-limiting step, so it would probably go ~ as N^3.


-- SeanMcFarland - 21 Oct 2010

I Attachment History Action SizeSorted ascending Date Who Comment
Txttxt nussinov.pl.txt r1 manage 2.4 K 2010-10-21 - 15:43 SeanMcFarland  
Txttxt tempseq.txt r1 manage 4.0 K 2010-10-21 - 15:44 SeanMcFarland  
Txttxt genetic.pl.txt r1 manage 6.7 K 2010-10-21 - 17:32 SeanMcFarland  
Pngpng Graph.png r1 manage 31.3 K 2010-10-21 - 03:31 SeanMcFarland  
Actions: Edit | Attach | New | Ref-By | Printable view | Raw view | Normal view | See diffs | Help | More...