click on the Biowiki logo to go to homepage
Edit Raw Print
Links Diffs RSS
About Stats Recent


Research Teaching Blog
Fall10 | Sandbox
Biowiki > Teaching > LabArraysHashesF05

Search

Advanced search...

Topics

PageRank Checker

[Back to UndergraduateClass]

Lab 4: Arrays and Hashes in Perl (9/28/2005)

By the end of this lab, you should know:

  • how to use arrays and hashes in Perl
  • one way to write the reverse complement exercise
  • some things to keep in mind when doing error-checking

This lab is a bit long because I want to go through the solutions to the reverse complement exercise in a lot of details, given that it's the first Perl script many of you have written. I try to go through how I would go about writing this exercise step-by-step. (This won't be done for every future exercise.)


0. Before we get started...

  • In the following lab, I will use a $ symbol to signify a UNIX prompt. The prompt you see in your terminal window may/may not end in a $, but just remember that you don't actually type in the $, just the stuff after it.

  • This lab assumes that you are now comfortable with using a text editor to write a Perl script, the proper format of a Perl script, and the UNIX commands you have to issue to make your script executable. In addition, you should already know how to read arguments from the command line, read data from user input (i.e. standard input), and read from files. If you need a refresher/reminders along the way, please refer back to Lab 3: Perl Basics.

1. Arrays

By now, you already know the basics of writing and executing Perl scripts, and you even wrote one for reverse complementing sequences in a FASTA file. (If you had lots of problems with the reverse complement exercise and/or error-handling, I've included in detail how I would write that script and some tips on error-checking in the next two sections below).

As you begin to write more complex programs, data structures like arrays and hashes will become increasingly useful. These list structures are great for storing a set of items and providing easy access to them. At a very simplistic level, the major difference between arrays and hashes is just the way you access the elements within them - in an array, you access elements using integer indices and in a hash, you access elements using keys.

The word "key" is ambiguous on purpose, because nearly anything can be a key. What happens on a very low level when you ask for the element with the key X in a hash is to convert X into some numerical value using some built-in hash function and then use that numerical value to access an internal array. Why then, you might ask, wouldn't I write my own array and my own hash function? Well, you could and in some programming languages, you have to if you want to use such a structure. But, the purpose of hashes (aka hash tables) is to provide very quick access to its elements and this all comes down to a good hash function. In computer science classes, you can spend weeks talking about what's a good hash function, so for most of us, using built-in hashes means not having to worry about any of this.

So with that in mind, let's start with arrays. Arrays are useful for storing a list of similar items, e.g. a list of numbers, a list of strings, and even a list of lists. You've already used arrays in the last lab - the @ARGV you used to retrieve command line arguments is an array of strings that Perl creates for you. Perl has a lot of built-in functions for working with arrays and lists in general and we're definitely not going to go through all of them. So if you're looking for something, I suggest referring to Prof Holmes' lecture notes, the "Learning Perl" textbook or some online Perl references.

  • Creating/Using Arrays - In lecture, we talked about numerous ways to create arrays and how to access array elements. Let's briefly review:
    # all of the following are valid ways to create an array
    
    @a = (1,2,3,4,5); 
    @b = ('a','c','g','t'); 
    @c = 1..5;
    @d = qw(a c g t);
    
    # to access an element in an array, you use the square brackets []. and since each element
    # is a scalar, you precede it with a $ instead of the @ used for arrays
    
    print "$a[0]\n";    # remember that Perl array indices start at 0
    
    $i = 2;
    print "$a[$i]\n";   # the index you use to access the array element can even be stored in another variable
    

    Hopefully, this all sounds vaguely familiar. OK, let's try making our own array! We'll work again with our favorite file format (FASTA). Let's try to read in all the sequence names from a FASTA file and store them in an array:

    #! /usr/bin/perl -w
    
    open FASTAFILE, $ARGV[0];    # the user will enter the filename from the command line
    $index = 0;
    while (<FASTAFILE>) {
            if (/^>/) {
                    chomp;       # what happens if you leave out this chomp?
                    $names[$index++] = $_;    # notice how you can increment the index at the same time
            }
    }
    print "@names\n";       # check out how easy it is to print out an array in Perl!
    close FASTAFILE;
    

    (If you tried to test out the above program with the hemoglobin.fasta in the ~be131/fasta_files/ directory, it'll print out a pretty big mess of names. Try using fakegenes.fasta instead, which contains, to no one's surprise, fake genes.)

    When you try the above script, you'll notice that it saves the > into the array as well, which is not that nice. How do we get rid of that first character? There are a couple of ways to do it, the simplest of which involves using regular expressions. But since we won't be talking too much about regular expressions until next week's lab, let's do it a different way using the substr function, which extracts a portion of a string, starting at a specified offset:

                    $names[$index++] = substr($_, 1);
    

  • List context vs. scalar context - Prof Holmes mentioned this briefly in lecture and there's a whole section on 'Context' in the 'Learning Perl' book (also here, here, or on this newsgroup posting). I would recommend reading more about context if you're still confused after this lab since it's a pretty important concept and can cause seemingly mysterious bugs in your programs otherwise.
    Without going into too much detail, every expression in Perl is evaluated in a particular context, scalar and list being the two major ones. For example, contrast these two statements:
    @array = EXPRESSION;        # EXPRESSION is expected to produce a list (list context)
    $scalar = EXPRESSION;       # EXPRESSION is expected to produce a scalar (scalar context)
    

    Simple enough. But here is when it gets a bit confusing. An expression like @names can be evaluated in either contexts and in each, it produces a different value. Let's add onto our script and try this out:

    @array = @names;
    $scalar = @names;
    print "list context: @array\n";
    print "scalar context: $scalar\n";
    

    Notice how the expression @names in a scalar context returns the length of the array! This is actually pretty useful, so it's a handy tip to remember. Check out some of the links above for more complicated examples on context.

  • Iterating through arrays - Once you've stored data into an array, you probably want some way to get the data back out. We already saw how to print out the whole array, because the print statement is smart and does this for us automatically. But there will be many other situations when you will need to go through each element of the array one-by-one (aka iterate through an array). Let's say in our FASTA example, we want to check if we have a sequence for 'protease' in the file. We already read in the file and stored the names in an array, so we just need to check each element in the array and see if any one of them says 'protease'. An easy way to iterate through an array is using the foreach loop:
    $found = 0;
    foreach $name (@names) {      # this will keep setting $name to a different element in the array
            if ($name eq "protease") {
                    $found = 1;
            }
    }
    print "found protease in file!\n" if $found;
    

    In fact, this is not a great way to look for something because you have to look through the entire array before you find what you're looking for. Even if we were clever and we added some code to stop going through the array when you've found "protease", you may still have to go through the entire array if "protease" is at the end. There are lots of ways to improve this algorithm but they are mostly outside the scope of this class; some of these have to do how you perform the search and some have to do with how you store the data in the first place. When we talk about hashes next, you'll see one way to improve the performance of this search.

  • Doing something to the whole array - Sometimes, you want to do the same thing to everything in the array. Obviously, you can iterate through the array and do what you want to each element of the array, but Perl provides an even simpler way to do this, via a function map. Say in our case, we want to turn all of our names to uppercases.
    @uc_names = map(uc($_), @names);
    print "@uc_names\n";
    

    What the map function does is essentially iterate through the @names array for you, setting each element to the default variable $_ and then applying the function you specified, i.e. uc, on it. The map function returns the results in another array, so it doesn't actually modify the original array you passed in.

    Another useful function is grep, which is very similar to map. Instead of applying a function to each element of an array, it checks whether each element satisfies a specified condition and returns only those elements that satisfied it. For example, if we want only those names that are longer than 10 characters,

    @long_names = grep(length($_) > 10, @names);
    print "@long_names\n";
    

2. Hashes

Now that we've played around with arrays, let's move on to hashes. As mentioned before, hashes are pretty similar to arrays, except that instead of using integers to access them, you use keys.

  • Creating/Using hashes - There are a couple of ways you can directly assign values into a hash. Specifically, in lecture, we talked about
    %comp = ('Cyp12a5' => 'Mitochondrion', 
             'MRG15' => 'Nucleus', 
             'Cop' => 'Golgi', 
             'bor' => 'Cytoplasm', 
             'Bx42' => 'Nucleus');
    

    Working with the same FASTA file we were using for our arrays example, let's now read the data into a hash. This time, we'll read both the names and the sequences and store them into the hash accordingly, with the names used as keys. Rememeber that elements in a hash are accessed using a set of curly braces { } instead of the square brackets for arrays. (You may also notice that the structure of this script is quite similar to how you wrote the reverse complement exercise - if you don't understand why the program is structured this way, read the next section for an explanation.)

    #! /usr/bin/perl -w
    
    open (INPUT, $ARGV[0]);
    while (<INPUT>) {
            chomp;
            if (/^>/) {
                    if (defined($seq)) {
                            $sequences{$name} = $seq;
                            $seq = "";
                    }
                    $name = substr($_, 1);
            } else {
                    $seq .= $_;
            }
    }
    $sequences{$name} = $seq;
    close INPUT;
    

    Unfortunately, unlike arrays, if you try to put %sequences into a print statement, it won't actually print out the hash. One way to check whether you put in all the data is to use the built-in functions keys and values for hashes, which return arrays that you can just print out:

    @seq_keys = keys %sequences;
    print "@seq_keys\n";
    

  • Iterating through hashes - Like arrays, sometimes you will need to go through each element in your hash one-by-one. The easiest way to do this is via the keys and values functions, which return an array that you can iterate through using foreach. Say we want to see if there's a sequence for 'protease' in the file again:
    $found = 0;
    foreach $name (keys %sequences) {
            if ($name eq "dna polymerase") {
                    $found = 1;
            }
    }
    print "found protease in file!\n" if $found;
    

  • But why iterate when you can... - So now we come to one of the nice things about hashes. In the above example, we wanted to find out if there is a protease in the sequence file. We did this by reading the file into a hash and then going through each element of the hash and seeing if its key (name) matches "protease". But because hash elements are accessed by their keys, we can actually just try to access the value associated with "protease". If there is no element associated with a particular key, we'll get an undef value returned to us.
    print "there is a protease in file\n" if defined($sequences{"protease"});
    print "there is no kinase in file\n" if !defined($sequences{"kinase"});
    
    Nice, huh? Not only is the code you write shorter, but the time it took to find out whether something exists in the hash is much much shorter than it took to look through an array. This doesn't matter so much for this short example but when you have massive amounts of data, you'll appreciate this performance difference.

3. Reverse Complement a FASTA file

Last week, you were given the task to write a Perl script that would read a FASTA file, specified by the user, and print to the screen the reverse complement of every sequence in the file in FASTA format. Now that you've tried your hand at this, let's go through one possible way to write this script together.

Let's start off simple and try to write such a program assuming there is only one sequence in the FASTA file first. As many of you found out, to reverse complement a sequence that spans multiple lines does not mean reverse complementing each line individually. So we need to store each line of the sequence and then when we've reached the end, do the reverse complement.

So let's describe what we want to do in English: "We want to open the file specified by the user and read lines from it. If we see a line that begins with >, we don't need to process it and can just print it out (remember, this is because our output has to also be in FASTA format). If a line doesn't begin with >, it's part of the sequence and we want to store it. Once we're done with reading the sequence (in this case, when we're done with the file because we're assuming 1 sequence per file), we will do the reverse complement."

Describing what we want to do in English helps us plan out our program before we start worrying about the specifics of Perl syntax. This is a really good habit to get into. You can even go one step further and plan out the structure of your program by writing "pseudo-code":

    open file specified by user
    for each line in the file {
         if starts with > {
              print line to screen w/o processing
         } else {
              save the line as part of the sequence
         }
    }
    reverse complement the sequence
    

Why would you go through the trouble to write this psuedo-code? Well, first it makes you plan out the program as a whole. And second, it breaks down your program into small tasks - if you start by writing this out in your text editor, all you have to do next is to replace each line of the pseudo-code with the set of Perl commands that will accomplish each small task. (Some programmers even like to save the pseudo-code as comments in a program, because pseudo-code is much easier to read than actual code.)

So now let's go through each line of the pseudo-code:

  • open file specified by user : No problem! You did this during the last lab

      open FASTAFILE, $ARGV[0]

  • for each line in the file : This can be accomplished using a while loop and with the < > operators. You can save each line to either a named variable or the default variable of $_. We'll use the default variable method.

      while (<FASTAFILE>)

  • if starts with > : We can do this by pattern matching the >

      if (/^>/)

  • print line w/o processing : That's a simple print statement

      print;

  • save the line as part of the sequence : This task is a little trickier. Since we don't know how long the sequence will be, we basically need to keep adding onto the sequence until the file ends. Fortunately, Perl provides an easy way to do this with the string concatenation operator . Also, each line we read using the < > brackets will end in a newline character, which we don't want to keep in the sequence.

      chomp;
      $seq .= $_;

    You may have noticed that we introduced a new variable $seq here. Since Perl doesn't require that we declare variables ahead of time, this should be OK. Until we use this variable the first time, i.e. the first time we read a sequence line, $seq will simply be undefined.

  • reverse complement the sequence : OK, at this point in the program, we've stored the full sequence in $seq and we can do the actual reverse complement

      $rev = reverse(uc($seq));
      $rev =~ tr/ACTG/TGAC/;
      print "$rev\n";

So the full script so far is:

    #! /usr/bin/perl
    
    open FASTAFILE, $ARGV[0];
    while (<FASTAFILE>) {
            if (/^>/) {
                    print;
            } else {
                    chomp;
                    $seq .= $_;
            }
    }
    $rev = reverse(uc($seq));
    $rev =~ tr/ACTG/TGAC/;
    print "$rev\n";
    

Now that we can take care of one sequence in the file, we can extend the program to handle multiple sequences. What needs to change? Well, in a file with multiple sequences, we can't just reverse complement when we reach the end of the file; we need to do it each time we're done reading in one whole sequence. How do we know when we're done reading in one whole sequence? We know when we either reach the end of the file or encounter another line that starts with >! So we can add to the if statement:

    #! /usr/bin/perl
    
    open FASTAFILE, $ARGV[0];
    while (<FASTAFILE>) {
            if (/^>/) {
                    $rev = reverse(uc($seq));
                    $rev =~ tr/ACTG/TGAC/;
                    print "$rev\n";
                    $seq = "";
                    print;
            } else {
                    chomp;
                    $seq .= $_;
            }
    }
    $rev = reverse(uc($seq));
    $rev =~ tr/ACTG/TGAC/;
    print "$rev\n";
    

One last problem: With the program above, the first time we encounter a line starting with >, the script will try to reverse complement $seq. But we haven't even read a sequence in yet! So, we should put in a check, to make sure we've actually saved a sequence before we try to process it. So the final program is

    #! /usr/bin/perl
    
    open FILE, $ARGV[0];
    while (<FILE>) {
            if (/^>/) {
                    if (defined($seq)) {
                            $rev = reverse(uc($seq));
                            $rev =~ tr/ACTG/TGAC/;
                            print "$rev\n";
                            $seq = "";
                    }
                    print;
            } else {
                    chomp;
                    $seq .= $_;
            }
    }
    $rev = reverse(uc($seq));
    $rev =~ tr/ACTG/TGAC/;
    print "$rev\n";
    

4. Checking for errors

In last week's Perl exercise, you were also asked to think of a list of errors that can occur when a user runs your program (aka runtime errors) and how you might handle some of these errors. What kinds of runtime errors can occur? How do you think of possible errors? There's no easy way to answer these questions, and as I mentioned in the last lab, learning how to handle errors is a skill that you learn and get better at through experience.

When you first start thinking about the errors that can happen in your program, you should think about the assumptions you make in your program and what a user can do to make these assumptions be wrong and cause the program to behave unexpectedly (e.g. crashing). One obvious way the user can cause errors in your program is through the input s/he passes to your program. When I think about error-handling, I like to ask myself lots of 'What if' questions. In the case of our reverse complement script, I might ask myself:

  • What if the filename the user specifies is not a real file? (My program assumes that it can open the file using the filename from the command line)
  • What if the file the user told me to reverse complement is not a FASTA file? (My program assumes that it'll be processing a FASTA file)
  • What if the file is in FASTA format, but it doesn't contain DNA sequences? (My program assumes that the sequences will only contain ACGTs but a user could pass in protein sequences or random, non-sensical sequences)
  • What if the sequences are in all lowercase, all uppercase, or mixed cases? (In the program we wrote above, this won't be a problem because we convert everything to uppercase before we reverse complement, but this could be an issue if the program we write assumes all lowercase, for example)
  • What if the user calls the program without any arguments? (My program assumes that $ARGV[0] contains the filename)
  • What if the user calls the program with too many arguments? (Usually this case won't break your program, but you might still want to handle it by telling the user what's the proper way to use your program)

The lists of errors can become very long for complex programs and usually, it's pretty hard to think of every single thing the user can do to break your program. In this class, we won't be too harsh on making you do exhaustive error-handling. But you should handle very obvious errors (e.g. not passing the right number of arguments into the program, file operation errors, file format errors, etc).

Note that runtime errors are different from bugs in the program. Bugs are caused by code that's not written correctly and when a user passes in a valid input, your program freaks out and crashes or does something weird. In this class, by the time a user uses your program, it should be bug-free. In the real world, software companies also try to make bug-free programs but when a program is so large that it's written by hundreds of people, that gets very difficult to do. In large and complex programs, runtime errors and bugs are usually discovered during testing (remember how I said testing is important in the last lab?). Software companies employ whole groups of people, "testers" or "QA engineers" or whatever, to try to find bugs/errors in programs. Any of you who have ever used a product in alpha or beta has helped out in the testing process (well, if you reported the bugs you found back to the software developers, that is). Video game companies, too, employ large groups of testers to play video games all day to test out their latest games. (It's not actually as fun as it sounds) Even with all the time and money spent on testing, bugs still get through - I'm sure all of you have had programs on your home computer crash on you before.

But back to our list of errors...Once you have a list of errors, you need to think about how to handle them gracefully. For runtime errors, a very common error-handling strategy is to print out some sort of useful error message. For example, when a user passes incorrect number of arguments to your program, it might be nice to tell him/her what's a proper way to use your program. If a user gives you a filename that's not the name of a real file, it would be good to let him/her know so s/he can check if there was a typo. Obviously, a useful error message should not be something like "The program crashed," which won't help the user figure out what s/he did wrong.

An example: For the reverse complement program, if the user gives you a filename that's not a real file, you can just print out an error message "The file you specified does not exist." This can be done very simply using the || (or) operator, because of the way these expressions are evaluated. As Prof Holmes mentioned in lecture, in a statement like (A) || (B), if (A) evaluates to true, then Perl will not bother evaluating the second part of the expression (B) since that expression is already guaranteed to be true. So to handle nonexistent file errors:

    open(FASTAFILE, $ARGV[0]) || die "The file you specified does not exist";

5. Time to do Perl exercise 2 (comparing two FASTA files)

Now, you get to try writing the second Perl exercise, for comparing two FASTA files and determining whether they contain the same set of sequences. You should submit? your scripts by next Thursday 10/6 by 5pm. Some things to keep in mind:

  • The sequences need not be in the same order in the two files.

  • We will expect some basic error-handling in scripts from now on. So be sure to code that into your scripts.

  • You can have two (or more) files opened in a Perl script at a time, as long as you name their file handles differently in the open command. The file handle is the first argument following open and is how you refer to the file throughout the script, e.g. in

      open INPUT, "sequence.txt"

    INPUT is the name of the file handle for the corresponding file "sequence.txt".

  • You will need to make your own test files to test out your programs. The files in ~be131/fasta_files/ are still available to you but probably won't be enough to make sure that your program works according to the specifications. Again, the NCBI website is a great place to find sequences, specifically nucleotide sequences in this case. Search for some protein you know of to get a long listing of results, then select a couple sequences from the list (again, avoid the 'whole genome' ones and stick to the 'mRNA' ones so you don't end up trying to process ridiculously huge files). Then on the dropdown boxes near the top, you have the option to show the selected sequences in FASTA format and also to save them in a file.

-- AngiChau - 25 Sep 2005

Actions: Edit | Attach | New | Ref-By | Printable view | Raw view | Normal view | See diffs | Help | More...