Home - this site is powered by TWiki(R)
Fall10 > TWikiUsers > SeanMcFarland > SeanMcFarlandHomework5
TWiki webs: Main | TWiki | Sandbox   Log In or Register

Changes | Index | Search | Go

Sean McFarland: Homework 5 (Nussinov Algorithm)



Nussinov Algorithm Implementation

The program I wrote to implement the Nussinov algorithm for base-pair reporting is shown below (and attached as nussinov.pl). Please note that it takes its sequence argument from the command line:

#!/usr/bin/perl

# Get the start time.
$start = time();

use List::Util qw(max);

# Check the user submitted a sequence.
if (!exists($ARGV[0])) {
   print "There was no sequence submitted. To use this program, enter the
               \nsequence as an argument, for example:\n\n\t\$ nussinov.pl AAAAUUUU\n\n";
   exit; 
}

# Read in the sequence and ensure it is only A, C, G, or U.
$seq = $ARGV[0];
chomp($seq);
$seq = uc($seq);

if ($seq !~ m/[A,C,G,U]+$/) {
   print "The sequence you entered had characters other than A, C, G, or U.
               \nPlease enter a sequence containing only these chars (upper or lower).\n\n";
   exit;
}

# Calculate the length, then generate a list of each nucleotide.
$length = length($seq);
@sequence = split("", $seq);

# Initiate the Nussinov 2d matrix with zeroes on the main diagonal and the one below it.
for ($i = 0; $i < $length - 1; $i++) {
   $array[$i][$i] = 0;
   $array[$i+1][$i] = 0;
}
$array[$length-1][$length-1] = 0;

# Loop through the remaining positions diagonally and calculate the maximum pairings.
# Thus, at each position, generate a list of possible scores and pick the max.
# For each diagonal.
for ($d = 0; $d < $length - 1; $d++) {
   # For position in the respective diagonal.
   for ($p = 0; $p < $length - 1 - $d; $p++) {
      
      $i = 0 + $p;
      $j = 1 + $d + $p;

      # Clear the scores for that position.
      @scores = ();

      # Generate the S(i + 1, j - 1) + w(i, j) score.
      push (@scores, $array[$i+1][$j-1] + paircheck());

      # Generate the S(i + 1, j) score.
      push (@scores, $array[$i+1][$j]);

      # Generate the S(i, j - 1) score.
      push (@scores, $array[$i][$j-1]);

      # Generate the max (S(i,k) + S(k + 1, j)) score for each k such that i<k<j.
      if (($j - $i) > 2) {
         for ($k = $i + 1; $k < $j-1; $k++) {
            push (@scores, $array[$i][$k] + $array[$k+1][$j]);
         }
      }

      # print "$i, $j = @scores\n";

      # Pick the max and store for that position.
      $array[$i][$j] = max (@scores);
   }
}

# Subroutine to check for base pairing between two bases.
sub paircheck {
   if (($sequence[$i] eq "A") && ($sequence[$j] eq "U")) {
      return 1;
   }
   elsif (($sequence[$i] eq "C") && ($sequence[$j] eq "G")) {
      return 1;
   }
   elsif (($sequence[$i] eq "G") && ($sequence[$j] eq "C")) {
      return 1;
   }
   elsif (($sequence[$i] eq "U") && ($sequence[$j] eq "A")) {
      return 1;
   }
   else {
      return 0;
   }
}

# Get the end time and calculate the elapsed.
$end = time();
$elapsed = $end - $start;

print "The max # of bp's = $array[0][$length-1]\n";
print "The elapsed time = $elapsed seconds.\n\n";

There's not really a whole lot to say about it, as I didn't do anything fancy. To my knowledge with the tests I've done, it accepts A,C,G,U (or a,c,g,u) input off of the command line, runs through the sequence using Nussinov, and reports the maximum number of potential base pairs. I also have it print the elapsed time at the end.


Benchmarking the Program

For testing the performance of nussinov.pl, I wrote a quick random sequence generator (randseq.pl):


$length = $ARGV[0];

for ($i = 0; $i < $length; $i++) {
   $number = rand(3);
   if ($number < 1) {
      push (@seq, "A");
   }
   elsif ($number < 2) {
      push (@seq, "C");
   }
   elsif ($number < 3) {
      push (@seq, "G");
   }
   else {
      push (@seq, "U");
   }
}

open (FILEOUT, ">>tempseq.txt");

$sequence = join ("", @seq);

print FILEOUT "$length\n$sequence\n\n";

close(FILEOUT);

From this, I generated sequences for 8, 16, 32, 64, 128, 256, 512, 1024, and 2048 nucleotides (see attached file, tempseq.txt);

These were then fed into nussinov.pl, and the times were recorded. I'm not sure if it matters too much, but I ran these on my laptop with 4GB RAM and ~2.4 GHz quad core (though the program obviously only uses one core at a time). Below, find a graph of the results, followed by a table of the raw data:

Graph.png

# of Nucs Reported BPs Elapsed Time (s)
8 1 0
16 5 0
32 10 0
64 24 0
128 45 0
256 83 2
512 159 18
1024 321 150
2048 665 1257

Clearly the number of base pairs definitely adds to the calculation time. Which leads to perhaps the most interesting part about this; it almost exactly mirrors the predicted trend. That is, the complexity of the calculation increases as N^3 (note the equation given on the graph is a bread and butter 3rd order polynomial, and that the R^2 has 4 9's!).


Generating Non-folding Sequences

For generating an optimized non-folding sequence, I mocked up a simple genetic-esque algorithm (see attached genetic.pl) that accepts the length of the desired sequence as its argument (I used 100, must be at least 80), and uses a population of 5 parents. Parents were generated by adding 20 A's, C's, G's, and U's, then random nuc's for the rest and Fisher-Yates shuffling. Their fitness was gauged based upon which possessed the lowest Nussinov score. The relative fitnesses were then used to choose parents for generating each of the 5 children in the next generation (with the chosen parents having an equal chance to contribute their nucleotide for any given position). In the interest of time and simplicity, I didn't allow for any mutation events (or anything else fancy, like the possibility of cross-generational breeding). For the termination condition of 10 consecutive generations with no improvement in fitness, I got the following results:

# of bp's Time (s)
23 8
22 9
22 8
25 8
21 9

I then tried it for 5 and 20 non-fitter generations, respectively:

# of bp's Time (s)
24 4
23 5
25 4
28 4
23 4

# of bp's Time (s)
26 15
23 15
22 16
24 15
25 16

It would appear I either got lucky just randomly picking 10 the first time around, or got lucky with the random sequences, as that condition seemed to generate the most consistently optimized outcome. They are all probably within the margin of error of each other, however, so the generations of non-fitter individuals may be able to be reduced beyond 5 without much loss in optimization, but with significant pickup in time. Though I would have thought increased generation time would be more consistently optimized, it may be the case that for a 100 nucleotide sequence the best achievable is somewhere on the order of 20-22, though of this I cannot be sure.

At least in terms of the set generation termination number, it would appear complexity increases as N. I did not examine how changing sequence length would affect calculation times, but it would be easy to do by varying the command line argument. I'd imagine that the Nussinov calculation is still the most rate-limiting step, so it would probably go ~ as N^3.


-- %TEACHINGWEB%.SeanMcFarland - 21 Oct 2010

I Attachment Action Size DateSorted ascending Who Comment
Pngpng Graph.png manage 31.3 K 2010-10-21 - 03:31 SeanMcFarland  
Txttxt genetic.pl.txt manage 6.7 K 2010-10-21 - 17:32 SeanMcFarland  
Txttxt nussinov.pl.txt manage 2.4 K 2010-10-21 - 15:43 SeanMcFarland  
Txttxt tempseq.txt manage 4.0 K 2010-10-21 - 15:44 SeanMcFarland  
Edit | Attach | Print version | History: r13 < r12 < r11 < r10 < r9 | Backlinks | Raw View | Raw edit | More topic actions

This site is powered by the TWiki collaboration platformCopyright © 2008-2014 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback
TWiki Appliance - Powered by TurnKey Linux