|
| ... no changes ... no changes ... no changes ... no changes ... no changes ... no changes ... no changes ... no changes ... no changes ... no changes ... no changes ... [Back to UndergraduateClass]
Calculate CG frequencies
Write a Perl program that does all of the following:
- Read a FastaFormat file of DNA sequences, taking the first command-line argument to be the FASTA filename. That is, if the program is called
myprog and the FASTA file is myfile, then the program can be invoked by typing myprog myfile
- For each sequence, do the following. Suppose first that
- N is the number of nucleotides in the sequence;
- n(x) is the frequency of nucleotide x;
- n(xy) is the frequency of dinucleotide xy.
- By calculating all the above quantities for all nucleotides x and y, or otherwise, compute and print the following quantity: n(CG)*N/(n(C)*n(G))
- Run your program on this FASTA file containing human and bacterial gene sequences, and comment on the results.
#! /usr/bin/perl -w
if (@ARGV < 1) {
die "usage: gc.pl <filename>";
}
%seqs = readFasta($ARGV[0]);
# for each sequence, calculate
# n = number of nucleotides
# nx = freq of nucleotide x
# nxy = freq of dinucleotide xy
foreach $id (keys %seqs) {
$seq = $seqs{$id};
$n = length($seq);
# i use two hash tables to collect counts for nucleotide/dinucleotide frequencies
%nx = %nxy = (); # reset the counts in the hashes
for ($i=0; $i<$n; $i++) {
$nx{lc(substr($seq, $i, 1))}++; # use the nucleotide as keys into hash
if ($i < $n-1) { # skip the very last base
$nxy{lc(substr($seq, $i, 2))}++; # use dinuc as keys into hash
}
}
# print out counts
print "$id:
N (# of nucleotides) = $n
";
foreach $nuc (keys %nx) {
print " $nuc --> $nx{$nuc}
";
}
print "
";
foreach $dinuc (keys %nxy) {
print " $dinuc --> $nxy{$dinuc}
";
}
print "
";
# print out requested quantity
$freq = ($nxy{"cg"}*$n)/($nx{"c"})/($nx{"g"});
print "cg frequencies = $freq
";
}
# ------
# subroutine readFasta
# input : filename
# output : hashtable containing all the sequences
sub readFasta {
my %sequences; # make this a local variable
my $filename = $_[0];
open (FILE, $filename) or die "Cannot open $filename";
while (<FILE>) {
chomp;
if (/^>s*(.+)/) { # if it's the name
$name = $1;
} elsif ($_ ne "") {
$sequences{$name} .= $_;
}
}
return %sequences;
}
The output of the script on the given FASTA file is:
AY672772 Bacillus subtilis sporulation protein:
N (# of nucleotides) = 773
cg frequency = 1.05656079146056
NM_199420 Homo sapiens polymerase:
N (# of nucleotides) = 8790
cg frequency = 0.284026597867252
The quantity n(CG)*N/(n(C)*n(G)) can be understood in the following way:
- To convert absolute counts into frequencies, we divide all n(x) and n(xy) values by N. Note that n(xy)/N and n(x)/N then approximates the probabilities of finding xy (or x) in the sequence.
- On the denominator, we have n(C) multiplied by n(G). From above, we can also view this as p(C)*p(G). When you multiply two probabilities, you're making an independence assumption, i.e. when you see a C in a position, it does not make it more or less likely to see a G in the next position.
- The numerator is the actual probability of observing the pair CG in the sequence.
- THe ratio then will tell us how the actual probability of CG compares with the probability of CG assuming independence. So if the ratio is close to 1, then we can say that our observation is close to what we expect. If the ratio is less than 1, we see CG less than we would expect. If the ratio is greater than 1, we see CG more than we would expect.
So in the given FASTA file, we can see that while the Bacillus protein has a ratio close to 1, the human polymerase actually contains much less CG occurrences than we would expect. In fact, this is a documented phenomenon. From the Sanger Institute website:
The Cs of most CpG dinucleotides in the human genome are methylated. Methyl-C tends to mutate to T, and so CpG dinucleotides tend to decay to TpG / CpA. This is believed to account for the fact that in bulk human DNA CpG dinucleotides occur about five times less frequently than expected (Bird, 1980, Jones et al 1992).
The methylation of CpG dinucleotides is a form of epigenetic regulation of gene expression, absent in bacteria. |