click on the Biowiki logo to go to homepage
Edit Raw Print
Links Diffs RSS
About Stats Recent


Research Teaching Blog
Fall09 | Sandbox
Biowiki > Teaching > Undergraduate Class > PerlExercise5Oct2005

Search

Advanced search...

Topics

PageRank Checker

... no changes ... no changes ... no changes ... no changes ... no changes ... no changes ... no changes ... no changes ... no changes ... no changes ... no changes ... [Back to UndergraduateClass]

Calculate CG frequencies

Write a Perl program that does all of the following:

  • Read a FastaFormat file of DNA sequences, taking the first command-line argument to be the FASTA filename. That is, if the program is called myprog and the FASTA file is myfile, then the program can be invoked by typing myprog myfile
  • For each sequence, do the following. Suppose first that
    • N is the number of nucleotides in the sequence;
    • n(x) is the frequency of nucleotide x;
    • n(xy) is the frequency of dinucleotide xy.
  • By calculating all the above quantities for all nucleotides x and y, or otherwise, compute and print the following quantity: n(CG)*N/(n(C)*n(G))
    • Run your program on this FASTA file containing human and bacterial gene sequences, and comment on the results.

#! /usr/bin/perl -w

if (@ARGV < 1) {
        die "usage: gc.pl <filename>";
}

%seqs = readFasta($ARGV[0]);

# for each sequence, calculate
#       n = number of nucleotides
#       nx = freq of nucleotide x
#       nxy = freq of dinucleotide xy

foreach $id (keys %seqs) {

        $seq = $seqs{$id};
        $n = length($seq);

        # i use two hash tables to collect counts for nucleotide/dinucleotide frequencies
        %nx = %nxy = ();   # reset the counts in the hashes
        for ($i=0; $i<$n; $i++) {
                $nx{lc(substr($seq, $i, 1))}++;  # use the nucleotide as keys into hash
                if ($i < $n-1) {    # skip the very last base 
                        $nxy{lc(substr($seq, $i, 2))}++;  # use dinuc as keys into hash
                }
        }
     
        # print out counts 
        print "$id:
   N (# of nucleotides) = $n

";
        foreach $nuc (keys %nx) {
                print "   $nuc --> $nx{$nuc}
";
        }
        print "
";
        foreach $dinuc (keys %nxy) {
                print "   $dinuc --> $nxy{$dinuc}
";
        }
        print "
";

        # print out requested quantity
        $freq = ($nxy{"cg"}*$n)/($nx{"c"})/($nx{"g"});
        print "cg frequencies = $freq
";
}

# ------
# subroutine readFasta
#       input : filename
#       output : hashtable containing all the sequences

sub readFasta {
        my %sequences;    # make this a local variable
        my $filename = $_[0];

        open (FILE, $filename) or die "Cannot open $filename";

        while (<FILE>) {
                chomp;
                if (/^>s*(.+)/) {   # if it's the name
                        $name = $1;
                } elsif ($_ ne "") {
                        $sequences{$name} .= $_;
                }
        }
        return %sequences;
}

The output of the script on the given FASTA file is:

AY672772 Bacillus subtilis sporulation protein:
        N (# of nucleotides) = 773
        cg frequency = 1.05656079146056
NM_199420 Homo sapiens polymerase:
        N (# of nucleotides) = 8790
        cg frequency = 0.284026597867252

The quantity n(CG)*N/(n(C)*n(G)) can be understood in the following way:

  • To convert absolute counts into frequencies, we divide all n(x) and n(xy) values by N. Note that n(xy)/N and n(x)/N then approximates the probabilities of finding xy (or x) in the sequence.

  • On the denominator, we have n(C) multiplied by n(G). From above, we can also view this as p(C)*p(G). When you multiply two probabilities, you're making an independence assumption, i.e. when you see a C in a position, it does not make it more or less likely to see a G in the next position.

  • The numerator is the actual probability of observing the pair CG in the sequence.

  • THe ratio then will tell us how the actual probability of CG compares with the probability of CG assuming independence. So if the ratio is close to 1, then we can say that our observation is close to what we expect. If the ratio is less than 1, we see CG less than we would expect. If the ratio is greater than 1, we see CG more than we would expect.

So in the given FASTA file, we can see that while the Bacillus protein has a ratio close to 1, the human polymerase actually contains much less CG occurrences than we would expect. In fact, this is a documented phenomenon. From the Sanger Institute website:

    The Cs of most CpG dinucleotides in the human genome are methylated. Methyl-C tends to mutate to T, and so CpG dinucleotides tend to decay to TpG / CpA. This is believed to account for the fact that in bulk human DNA CpG dinucleotides occur about five times less frequently than expected (Bird, 1980, Jones et al 1992).

The methylation of CpG dinucleotides is a form of epigenetic regulation of gene expression, absent in bacteria.

Actions: Edit | Attach | New | Ref-By | Printable view | Raw view | Normal view | See diffs | Help | More...