
Arrays and Hashes in Perl

By the end of this lab, you should know:

  • how to use arrays and hashes in Perl
  • one way to write the reverse complement exercise
  • some things to keep in mind when doing error-checking


0. Before we get started...

  • In the following lab, we will use a $ symbol to signify a UNIX prompt. The prompt you see in your terminal window may/may not end in a $, but just remember that you don't actually type in the $, just the stuff after it.

  • This lab assumes that you are now comfortable with using a text editor to write a Perl script, the proper format of a Perl script, and the UNIX commands you have to issue to make your script executable. In addition, you should already know how to read arguments from the command line, read data from user input (i.e. standard input), and read from files. If you need a refresher/reminders along the way, please refer back to Perl Basics.

1. Arrays

By now, you already know the basics of writing and executing Perl scripts, and you even wrote one for reverse complementing sequences in a FASTA file.

As you begin to write more complex programs, data structures like arrays and hashes will become increasingly useful. These list structures are great for storing a set of items and providing easy access to them. At a very simplistic level, the major difference between arrays and hashes is just the way you access the elements within them - in an array, you access elements using integer indices and in a hash, you access elements using keys.

The word "key" is ambiguous on purpose, because nearly anything can be a key. At a very low level, when you ask for the element with the key X in a hash, Perl converts X into a numerical value using a built-in hash function and then uses that numerical value to access an internal array. Why then, you might ask, wouldn't I write my own array and my own hash function? Well, you could, and in some programming languages you have to if you want to use such a structure. But the purpose of hashes (aka hash tables) is to provide very quick access to their elements, and this all comes down to a good hash function. In computer science classes, you can spend weeks talking about what makes a good hash function, so for most of us, using built-in hashes means not having to worry about any of this.
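
To make the idea above concrete, here is a toy sketch of a hash function feeding an internal array. It is only an illustration and not how Perl's built-in hashes actually work: the names toy_hash, $num_buckets and @buckets are hypothetical, the character-sum function is deliberately simplistic, and the sketch ignores what happens when two keys land in the same slot.

```perl
use strict;
use warnings;

# Toy illustration only: Perl's real hash function is more sophisticated.
# toy_hash (a hypothetical name) turns a string key into an array index.
sub toy_hash {
    my ($key, $num_buckets) = @_;
    my $sum = 0;
    foreach my $char (split //, $key) {
        $sum += ord($char);          # add up the character codes
    }
    return $sum % $num_buckets;      # fold the sum into the range 0..$num_buckets-1
}

my $num_buckets = 8;
my @buckets;                         # plays the role of the "internal array"

# Store a value: the key decides which slot of the array is used.
$buckets[ toy_hash("MRG15", $num_buckets) ] = "Nucleus";

# Look it up again: the same key always hashes to the same slot.
my $location = $buckets[ toy_hash("MRG15", $num_buckets) ];
print "MRG15 is found in: $location\n";
```

A real hash table also has to cope with collisions and with growing the internal array, which is exactly the work the built-in hashes save you.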

So with that in mind, let's start with arrays. Arrays are useful for storing a list of similar items, e.g. a list of numbers, a list of strings, and even a list of lists. You've already used arrays in the last lab - the @ARGV you used to retrieve command line arguments is an array of strings that Perl creates for you. Perl has a lot of built-in functions for working with arrays and lists in general and we're definitely not going to go through all of them. See the perldoc at perl.org for a complete listing.

  • Creating/Using Arrays - In lecture, we talked about numerous ways to create arrays and how to access array elements. Let's briefly review:
    # all of the following are valid ways to create an array
    
    @a = (1,2,3,4,5); 
    @b = ('a','c','g','t'); 
    @c = 1..5;
    @d = qw(a c g t);
    
    # to access an element in an array, you use the square brackets []. and since each element
    # is a scalar, you precede it with a $ instead of the @ used for arrays
    
    print "$a[0]\n";    # remember that Perl array indices start at 0
    
    $i = 2;
    print "$a[$i]\n";   # the index you use to access the array element can even be stored in another variable
    

    Hopefully, this all sounds vaguely familiar. OK, let's try making our own array! We'll work again with our favorite file format (FASTA). Let's try to read in all the sequence names from a FASTA file and store them in an array (we'll refer to this script as seqNamesArray.pl):

    #! /usr/bin/perl -w
    
    open (FASTAFILE, $ARGV[0]) or die "Cannot open $ARGV[0]: $!\n";    # the user will enter the filename from the command line; stop with a message if it can't be opened
    $index = 0;
    while ($line = <FASTAFILE>) {
            if ($line =~ /^>/) {
                    chomp($line);       # what happens if you leave out this chomp?
                    $names[$index++] = $line;    # notice how you can increment the index at the same time.  Also, we never defined @names; what does perl assume we mean?
            }
    }
    print "@names\n";       # check out how easy it is to print out an array in Perl!
    close (FASTAFILE);
    

    Test this script out with hemoglobin.fasta. It will print out a pretty big mess of sequence names, one right after the other, separated only by a space. You'll also notice that it saves the > character into the array, which is not that nice. How do we get rid of that first character? There are a couple of ways to do it, the simplest of which involves using regular expressions. But since we won't be talking too much about regular expressions until next week's lab, let's do it a different way using the substr function, which extracts a portion of a string, starting at a specified offset:

                    $names[$index++] = substr($line, 1);
    
    Try this out. The first character of the string is at offset 0, so substr will return everything after the > character. We'll go over a better way of printing out this array shortly.
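
    As an aside, Perl's built-in push function appends to the end of an array, so you don't have to keep track of $index yourself. The sketch below is a variation on the script and not the version used in lecture; to keep it self-contained it reads from a made-up list of lines (@lines) instead of a real FASTA file.

```perl
use strict;
use warnings;

# Stand-in for lines read from a FASTA file (made-up data for illustration).
my @lines = (">seq1 first example\n", "ACGT\n", ">seq2 second example\n", "GGCC\n");

my @names;
foreach my $line (@lines) {
    if ($line =~ /^>/) {
        chomp($line);
        push(@names, substr($line, 1));   # append the name (minus the >) to the end of @names
    }
}
print "@names\n";
```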

  • List context vs. scalar context - Prof Holmes mentioned this briefly in lecture and there's a whole section on 'Context' in the 'Learning Perl' book (also here, here, or on this newsgroup posting). We recommend reading more about the concept of Perl context if you're still confused after this lab since it's pretty important and can cause seemingly mysterious bugs in your programs otherwise.
    Without going into too much detail, every expression in Perl is evaluated in a particular context; scalar and list being the two major ones. For example, contrast these two statements:
    @array = EXPRESSION;        # EXPRESSION is expected to produce a list (list context)
    $scalar = EXPRESSION;       # EXPRESSION is expected to produce a scalar (scalar context)
    

    Here, the placeholder "EXPRESSION" stands for anything that can be a valid Perl expression. For example, an expression like @names can be evaluated in either list or scalar contexts. Depending on which context it finds itself in, this expression produces a different value. Let's add onto our script and try this out:

    @array = @names;
    $scalar = @names;
    print "list context: @array\n";
    print "scalar context: $scalar\n";
    

    Notice how the expression @names in a scalar context returns the length of the array! This is actually pretty useful, so it's a handy tip to remember. Check out some of the links above for more complicated examples on context.
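
    Here are a few more ways the same array can end up in scalar or list context. This is a small sketch with made-up names; the variable names are just for illustration.

```perl
use strict;
use warnings;

my @names = ("alpha", "beta", "gamma");   # made-up names for illustration

my $count      = @names;           # scalar context: the number of elements
my ($first)    = @names;           # parentheses on the left give list context: $first gets the first element
my $also_count = scalar(@names);   # scalar() forces scalar context explicitly

print "count: $count, first: $first\n";
print "There are " . @names . " names\n";   # string concatenation also imposes scalar context
```

    The difference between `$count = @names` and `($first) = @names` is only a pair of parentheses, which is why context mistakes can be hard to spot.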

  • Iterating through arrays - Once you've stored data into an array, you probably want some way to get the data back out. We already saw how to print out the whole array: when an array appears inside a double-quoted string, Perl joins its elements with spaces for us. But there will be many other situations when you will need to go through each element of the array one-by-one (aka iterate through an array). Let's say in our FASTA example, we want to check if we have a sequence for 'protease' in the file. We already read in the file and stored the names in an array, so we just need to check each element in the array and see if any one of them is 'protease'. An easy way to iterate through an array is using the foreach loop:
    $found = 0;
    foreach $name (@names) {      # this will keep setting $name to a different element in the array
            if ($name eq "protease") {
                    $found = 1;
            }
    }
    print "found protease in file!\n" if $found; # Note that here we are using Perl's alternative if statement syntax
    

    In fact, this is not a great way to look for something, because the loop keeps going through the entire array even after it has found what you're looking for. Even if we were clever and added some code to stop going through the array once we've found "protease", we may still have to go through the entire array if "protease" is at the end. There are lots of ways to improve this algorithm but they are mostly outside the scope of this class; some of these have to do with how you perform the search and some have to do with how you store the data in the first place. When we talk about hashes next, you'll see one way to improve the performance of this search.
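
    The "stop when you've found it" idea can be sketched with Perl's last statement, which leaves the enclosing loop immediately. The names below are made up, and $checked is a hypothetical counter added only to show how many elements were examined.

```perl
use strict;
use warnings;

my @names = ("kinase", "protease", "globin");   # made-up names for illustration

my $found   = 0;
my $checked = 0;                  # counts how many elements we looked at
foreach my $name (@names) {
    $checked++;
    if ($name eq "protease") {
        $found = 1;
        last;                     # leave the loop right away; no need to look further
    }
}
print "found protease after checking $checked of " . @names . " names\n" if $found;
```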

    Now modify seqNamesArray.pl to use foreach to print out the name of each sequence on a separate line.

  • Doing something to the whole array - Sometimes, you want to do the same thing to everything in the array. Obviously, you can iterate through the array and do what you want to each element, but Perl provides an even simpler way to do this, via the map function. Say in our case, we want to turn all of our names to uppercase.
    @uc_names = map(uc($_), @names);
    print "@uc_names\n";
    

    What the map function does is essentially iterate through the @names array for you, setting the default variable $_ to each element in turn and then applying the function you specified, i.e. uc, to it. The map function returns the results as a new list, so it doesn't actually modify the original array you passed in.

    Another useful function is grep, which is very similar to map. Instead of applying a function to each element of an array, it checks whether each element satisfies a specified condition and returns only those elements that satisfy it. For example, if we want only those names that are longer than 60 characters,

    @long_names = grep(length($_) > 60, @names);
    print "Long Names: @long_names\n";
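
    Because map and grep both take a list and return a list, they can be chained. The sketch below uses the block form (curly braces instead of a first argument), which is another standard way to write them; the names and the length cutoff of 12 are made up for illustration.

```perl
use strict;
use warnings;

my @names = ("alpha globin", "beta globin", "myoglobin");   # made-up names for illustration

# Read right to left: keep the names shorter than 12 characters, then uppercase them.
my @short_uc = map { uc($_) } grep { length($_) < 12 } @names;

print "Short names, uppercased: @short_uc\n";
print "Original array is unchanged: @names\n";
```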
    

2. Hashes

Now that we've played around with arrays, let's move on to hashes. As mentioned before, hashes are pretty similar to arrays, except that instead of using integers to access them, you use keys.

  • Creating/Using hashes - There are a couple of ways you can directly assign values into a hash. Specifically, in lecture, we talked about
    %comp = ('Cyp12a5' => 'Mitochondrion', 
             'MRG15' => 'Nucleus', 
             'Cop' => 'Golgi', 
             'bor' => 'Cytoplasm', 
             'Bx42' => 'Nucleus');
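
    As a quick sketch of how such a hash is then used (with a shortened copy of the hash above), you look up, add, and count entries like this. The variable $num_entries is just an illustrative name.

```perl
use strict;
use warnings;

my %comp = ('Cyp12a5' => 'Mitochondrion',
            'MRG15'   => 'Nucleus');

# Look up a value: $ because a single element is a scalar, {} because it's a hash.
print "MRG15 is in the $comp{'MRG15'}\n";

# Assigning to a key that isn't there yet adds a new entry.
$comp{'Cop'} = 'Golgi';

# keys in scalar context gives the number of entries.
my $num_entries = keys %comp;
print "The hash now has $num_entries entries\n";
```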
    

    Working with the same FASTA file we were using for seqNamesArray.pl, let's now read the data into a hash. This time, we'll read both the names and the sequences and store them into the hash accordingly, with the names used as keys. Remember that elements in a hash are accessed using a set of curly braces { } instead of the square brackets for arrays. (You may also notice that the structure of this script is quite similar to the solution for the reverse complement exercise -- if you don't understand why the program is structured this way, read the next section for an explanation.)

    #! /usr/bin/perl -w
    
    open (INPUT, $ARGV[0]) or die "Cannot open $ARGV[0]: $!\n";    # stop with a message if the file can't be opened
    while ($line = <INPUT>) 
    {
            chomp $line;
            if ($line =~ /^>/) 
            {
                    if (defined($seq)) 
                    {
                            $sequences{$name} = $seq;
                            $seq = "";
                    }
                    $name = substr($line, 1);
            } 
            else 
            {
                    $seq .= $line;
            }
    }
    $sequences{$name} = $seq;
    close (INPUT);
    

    Unfortunately, unlike arrays, if you try to put %sequences into a print statement, it doesn't print out the results in a pretty way (just each key and corresponding value mashed into each other). One way to check whether you put in all the data is to use the built-in functions keys and values for hashes, which return arrays that you can just print out nicely:

    @seq_keys = keys %sequences;
    print "@seq_keys\n";
    

  • Iterating through hashes - Like arrays, sometimes you will need to go through each element in your hash one-by-one. The easiest way to do this is via the keys and values functions, which return an array that you can iterate through using foreach. Say we want to see if there's a sequence for 'protease' in the file again:
    $found = 0;
    foreach $name (keys %sequences) {
            if ($name eq "protease") {
                    $found = 1;
            }
    }
    print "found protease in file!\n" if $found;
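
    The same kind of loop can print every name together with its sequence. Perl returns hash keys in no particular order, so this sketch sorts them first; the two records are made up for illustration.

```perl
use strict;
use warnings;

my %sequences = ("seq2" => "GGCC", "seq1" => "ACGT");   # made-up data for illustration

my @sorted_names = sort keys %sequences;    # sort so the output order is predictable
foreach my $name (@sorted_names) {
    print "$name\t$sequences{$name}\n";     # the key gives us access to its value
}
```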
    

  • But why iterate when you can... - So now we come to one of the nice things about hashes. In the above example, we wanted to find out if there is a protease in the sequence file. We did this by reading the file into a hash and then going through each element of the hash and seeing if its key (name) matches "protease". But because hash elements are accessed by their keys, we can actually just try to access the value associated with "protease". If there is no element associated with a particular key, we'll get an undef value returned to us.
    print "there is a protease in file\n" if defined($sequences{"protease"});
    print "there is no kinase in file\n" if !defined($sequences{"kinase"});
    
    Nice, huh? Not only is the code you write shorter, but the time it took to find out whether something exists in the hash is much much shorter than it took to look through an entire array. This doesn't matter so much for this short example but when you have massive amounts of data, you'll appreciate this performance difference.
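
    One caveat worth knowing: defined asks whether the value stored under a key is defined, while Perl's built-in exists asks whether the key is in the hash at all. The two differ only when a key has been stored with an undef value, as in this sketch with made-up data (the key "unsequenced" and the $has_* variables are hypothetical names).

```perl
use strict;
use warnings;

# Made-up data: "unsequenced" is a key whose value was never filled in.
my %sequences = ("protease" => "MKVL", "unsequenced" => undef);

my $has_key    = exists($sequences{"unsequenced"})  ? 1 : 0;   # the key is there...
my $has_value  = defined($sequences{"unsequenced"}) ? 1 : 0;   # ...but its value is undef
my $has_kinase = exists($sequences{"kinase"})       ? 1 : 0;   # this key was never stored

print "unsequenced: key present = $has_key, value defined = $has_value\n";
print "kinase: key present = $has_kinase\n";
```

    For the FASTA script above, every stored name has a sequence string, so either test gives the same answer.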

Time to do a Perl exercise (Homework)

PerlSequenceSimulator

Administrative Concerns:

  • You will be graded on correctness (90%) and style (10%). Please see the StyleGuidelines for expectations about your style.

  • Grading for correctness will be automated, so if your output format does not match the format described above, you will lose points. It's ok if you have a little extra whitespace at the ends of lines or between different classes of sequences.

  • Turn in your program by uploading the .pl file to your personal wiki page.
  • As mentioned in lecture, you may work with 1 other student if you so choose. (Remember, you can turn in at most 3 other assignments with the same student). If you do work with another student, put "I worked with xxxx" in a comment at the top of your .pl file. Each person should turn in code, even if it's the same.

  • Information about LATE assignments: You lose 20% of the points of the assignment for every day it's late. Contact Mohammad or Professor Holmes at least 48 hours before the due date if you have extenuating circumstances, if reasonably possible.

-- AngiChau - 25 Sep 2005


  • Add discussion of split and join to array section.


Copyright © 2008-2014 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.