[back to UndergraduateClass]
Remember that there are many ways to write a program in Perl, so this is just one possible way to write it
Compare two FASTA files
Write a Perl program to test if two FASTA files contain exactly the same set of sequences (and sequence names). If the files are equivalent, the program should return without error; if not, it should return with
an error.
The program should not care if the sequences are in a different order in each file, and it should not be case-sensitive with respect to the sequences; e.g. if one file contains sequences in upper-case, and the other contains sequences in lower-case, the program should not complain. The program should be invoked by typing something like the following:
programname filename1 filename2
Solution
In the following solution, I read the first FASTA file into a hash. Then as I'm reading the 2nd file, I check whether the same entry (name and sequence pair) exists in the hash for the first file. By using the last command, I stop checking as soon as I find a mismatch, since there's no need to go further.
Also, this program prints out more messages than stated in the problem, but this should help you see what are some useful messages to place in your code as you're writing it. It's also easier to debug your code if you have some idea what's going on in your program. To finish the final program, you can simply comment out or remove these debugging messages.
#! /usr/bin/perl -w
# bioe 131/231:
#
# this script compares two fasta files to see if they're the same
# this program expects a filename as an argument, so throw an error if there is no filename
# provided
if (@ARGV < 2) {
die "usage: compare.pl <filename1> <filename2>\n";
}
$file1 = $ARGV[0];
$file2 = $ARGV[1];
open (ONE, "<$file1") || die "Cannot open $file1\n";
open (TWO, "<$file2") || die "Cannot open $file2\n";
print "\n\n";
# -------
# read in the seqs from the first file
while ($line = <ONE>) {
chomp $line;
if ($line =~ /^>\s*(.+)/) { #id line - just grab the name w/o the >
if (defined $seq) { #skip this the first time
$list1{$id} = $seq;
}
$seq = "";
$id = $1;
print "READING SEQUENCE FOR $id ... \n";
} else {
$seq = $seq . $line;
}
}
$list1{$id} = $seq;
print "\n\n";
close ONE;
# -----
# read each sequence in second file and compare w/ first file
$flag = 1; #start out as true
$numSeqs = 0;
while ($line = <TWO>) {
chomp $line;
if ($line =~ /^>\s*(.+)/) { #id line - copy w/o change
if (defined $seq2) { #skip this the first time
if (!defined $list1{$id}) {
$flag = 0;
print "COULD NOT FIND SEQ FOR $id\n";
last;
} elsif ($list1{$id} ne $seq2) {
$flag=0;
print "SEQ DIFFERENCES FOR $id\n";
last;
} else {
print "FOUND SEQ FOR $id \n";
}
}
$seq2 = "";
$id = $1;
$numSeqs++;
} else {
$seq2 = $seq2 . $line;
}
}
if ($flag == 1) { # only do this if everything else so far has matched
if (!defined $list1{$id}) {
$flag = 0;
print "COULD NOT FIND SEQ FOR $id\n";
} elsif ($list1{$id} ne $seq2) {
$flag=0;
print "SEQ DIFFERENCES FOR $id\n";
} else {
print "FOUND SEQ FOR $id\n";
}
# there's a chance that file2 has less sequences than file1, so should check for that
if ($numSeqs != keys(%list1)) {
$flag = 0;
print "DIFF NUM OF SEQS\n";
}
}
close TWO;
print "\n\n-----------------\n";
if ($flag) {
print "The two files are equal\n";
} else {
print "The two files are not equal\n";
}
-- AngiChau - 07 Oct 2005 |