Home - this site is powered by TWiki(R)
Teaching > PerlExercise12Oct2005
TWiki webs: Main | TWiki | Sandbox   Log In or Register

[Back to UndergraduateClass]

Part one

Write a Perl program that does all of the following:

  • Read in a FastaFormat file whose filename is given by the first command-line argument, as above.
  • For each sequence, test if the following three conditions are true:
    • The sequence contains only DNA/RNA characters (A C G T or U);
    • The sequence length is a multiple of three;
    • The sequence contains no in-frame stop codons.
  • On two separate lines, print lists of
    1. the names of all sequences in the FASTA file that satisfy all three of the above conditions;
    2. the names of all sequences that end with a run of 3 or more A's (even if they do not satisfy those three conditions)

Note that (for example) the sequence AAATAG does contain an in-frame stop codon (TAG), but the sequence AATAGA does not, since the TAG is not "in-frame" (i.e. the first T is not offset from the beginning of the sequence by a multiple of 3 nucleotides).

Solution:

    if (@ARGV < 1) {
            die "usage: seqtest.pl <filename>";
    }
    
    %seqs = readFasta($ARGV[0]);
    
    @firstList = ();
    @secondList = ();
    foreach $name (keys %seqs) {
            $seq = uc($seqs{$name});
    
            # i will actually translate all the Us to Ts to make it easier to look for stop codons
            $seq =~ tr/U/T/;
    
            # check if sequence satisfies all three conditions
            $ok = 1;  
            if ($seq =~ /[^ACGT]/) {   # this will look for any non-DNA/RNA characters
                    $ ok = 0;
            } elsif (length($seq)%3 != 0) {  # using elsif causes this to only be tested if ok=1
                    $ ok  = 0;
            } elsif (@codons = $seq =~ /[A-Z]{3}/g) {  #split into an array of in-frame codons
                    @stops = grep((($_ eq "TAG") or ($_ eq "TAA") or ($_ eq "TGA")), @codons);
                    if (@stops > 0) {
                            $ok = 0;
                    }
            }
            if ($ok == 1) {
                    push @firstList, $name;
            }
    
    
            # check if sequence ends in 3+ As
            if ($seq =~ /A{3,}$/) {
                    push @secondList, $name;
            }
    }
    
    
    print "matches all three conditions: @firstList\n\n";
    print "polyA tails: @secondList\n\n";
    
    
    # ------
    # subroutine readFasta
    #       input : filename
    #       output : hashtable containing all the sequences
    
    sub readFasta {
            my %sequences = (); 
            my $name;    # make this a local variable
            my $filename = $_[0];
    
            open (FILE, $filename) or die "Cannot open $filename";
    
            while (<FILE>) {
                    chomp;
                    if (/^>\s*(.+)/) {   # if it's the name
                            $name = $1;
                    } elsif ($_ ne "") {
                            $sequences{$name} .= $_;
                    }
            }
            return %sequences;
    }
    
    

Part two

  • Write a Perl program that reads a GenBankFormat file from the standard input and prints the corresponding FastaFormat to the standard ouput (printing the sequence starting at the ORIGIN keyword, and using the Genbank accession number (ACCESSION field) as the sequence name).
    • GenBankFormat uses // as the record separator, i.e. you can store more than one sequence in a Genbank file by separating them with //, just as FastaFormat allows more than one sequence per file if separated by >. Ensure that your program can cope with multiple Genbank records.

Solution:

    #! /usr/bin/perl -w
    
    $readSeq = 0;
    while (<>) {
            if (/ACCESSION\s+(.+)/) {
                    print ">$1\n";
            } 
            if (/ORIGIN/) {
                    $readSeq = 1;
            }
            if ($readSeq==1 and /^\s+[0-9]+\s+(.+)$/) {
                    $currSection = $1;
                    $currSection =~ s/\s//g;
                    print "$currSection\n";
            }
            if (/^\/\//) {
                    $readSeq = 0;
            }
    }
    
Edit | Attach | Print version | History: r9 < r8 < r7 < r6 < r5 | Backlinks | Raw View | Raw edit | More topic actions

This site is powered by the TWiki collaboration platformCopyright © 2008-2013 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback
TWiki Appliance - Powered by TurnKey Linux