
Perl Basics

By the end of this lab, you should know how to:

  • write and execute simple Perl scripts in the UNIX environment
  • use arguments passed in from the command line in your Perl scripts
  • read data from text files and user input
  • use the NCBI website to create FASTA files


0. Before we get started...

  • The prompt you get depends on the UNIX shell you're running and other things like preference files. In this lab, I will use a $ symbol to signify a UNIX prompt. The prompt you see in your terminal window may or may not end in a $; either way, you don't type the $ itself, just the text after it.

  • We won't be going over every single thing on Perl that Prof. Holmes talked about in lecture. We just don't have enough time to do that in lab. We'll try to concentrate on some of the trickier concepts, but you cannot rely on just the things we use in the lab examples for the exercises/homework assignments. A big part of becoming a good programmer is knowing when and how to look up documentation and help on commands and expressions you may not have used before - so when you're unsure about something, google to see if you can find some help online or better yet, just try it! For example:
    • Do you know what the difference is when you use the print command and include a \n as opposed to not?
    • What happens when you use double quotes "" as opposed to single quotes '' or backticks ``? Do you know why you get the different behaviors?
    • When you have strings with numbers e.g. "1", "56", what's the difference when you compare them using the numeric comparison operators as opposed to the string comparison operators?
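    If you'd like a starting point for the "just try it" experiments above, here is a small sketch. The variable names and the echo command are my own choices for illustration, and I've added use strict, which the lab scripts themselves don't use. Run it, then change things and see what happens.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $name = "world";

# print does not add a newline for you; "\n" inside double quotes does.
print "Hello, ";
print "$name\n";              # double quotes interpolate $name and \n

# Single quotes are literal: $name is not filled in, and \n stays as two characters.
print 'Hello, $name\n';
print "\n";

# Backticks run a shell command and give back whatever it printed.
my $captured = `echo hi`;
print "backticks captured: $captured";

# Strings that look like numbers: numeric versus string comparison.
my ($x, $y) = ("10", "9");
my $numeric = ($x > $y)  ? "yes" : "no";   # compares 10 and 9 as numbers
my $string  = ($x gt $y) ? "yes" : "no";   # compares "10" and "9" character by character
print "Is 10 >  9 (numeric)? $numeric\n";
print "Is 10 gt 9 (string)?  $string\n";
```

    The last two lines give different answers because the string comparison looks at "1" versus "9" first, just like alphabetizing words.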

  • If you don't have a Perl book, look up some websites with command references and bookmark them. You can refer to these when you're unsure of the syntax of a command or you forgot exactly how to write a foreach loop or something. Here are a few:
    • Rex Swain's HTML Perl guide <-- This is great to have open while you code!
    • Perldoc
    • perl.com. In addition to this documentation, they also have a six part Perl tutorial accessible from the home page. The first three parts are highly relevant to this course. If you find yourself struggling with just the examples we've provided, I suggest you go through the tutorial.
    • Learning Perl

1. Adding numbers

Let's start off with a simple program, just to get familiar with the process of writing and executing a Perl script. We want to write a program that's basically a calculator and actually, not even a very good one. It can only add (i.e. don't throw away your current calculator because you probably won't be using this program to replace your calculator).

Here are the specifications: When a user calls our program from the command line, s/he will also need to tell us how many numbers we will be adding. Then we will ask the user to type in that many numbers, and we will print out the sum at the end.

  • OK, so let's get started. First, fire up your favorite text editor.

    Aside: If you're using emacs in the lab, it'll bring up a new window, and the terminal in which you called the program will not be usable until you exit emacs. That's pretty annoying, so to prevent it, run emacs "in the background." Just put a & at the end of the command, which tells UNIX to "fork a new process" for this program and give you your prompt back right away -- here's more about UNIX processes and fork.

    $ emacs &

    Alternatively, if you prefer something simpler like pico, which doesn't pop up in a new window, don't put the & or you won't actually be able to use the pico program you just started. If you accidentally did it, you'll need to kill the process using the kill command at the prompt with the process ID (PID) of the pico process you started. You can look up PIDs using the ps command.

    $ ps
    $ kill XXXX

  • Now that you have a space in which to type your script, it's time to designate this as a Perl script. By convention, you save Perl scripts with the extension .pl, but that's not actually what tells the UNIX system that this is a Perl script. There are two ways to get your .pl file treated as a perl file. You can:

1) Tell the operating system how to execute the script. Set the very first line of your Perl script to (Note: the character after the pound sign (#) is an exclamation point (!) not the number one (1). If this text is hard to read, increase your font size using your browser's View->Text Size menu option. )

#! /usr/bin/perl -w

This line must be line 1 of your script (i.e. you can't even put in a blank line before it - it's worth trying to put a blank line before it and seeing the error it gives later on when you try to run the program, just so you can recognize this error if you see it later on). What's the -w for? That turns on the warnings for the Perl interpreter, so you get more useful error messages if your program fails to run. It's a good idea to turn this on when you're starting off writing a script, so the Perl interpreter can help you find bugs in your scripts.

2) Directly use the perl command to run your program. Instead of executing your perl file, you execute the perl command, and pass your file as an argument. In a close analogy to the case above, you can type into the console (from the directory your perl file is in):

$ perl -w myPerlFile.pl

  • One good way to learn programming is to do things step-by-step, i.e. get one small part working before adding onto it. Unless you've done a lot of programming in a particular language, it's usually not a good idea (in my opinion) to try to write the full program in one go, and then figure out everything that is wrong with it. In the spirit of this method, let's just start off with reading an argument from the command line and we'll just print it out. Not too useful of a program but it's a good start.

    So how do we read in arguments from the command line? This requires the use of a special array variable called @ARGV. When a user calls your program from the command line and puts arguments after it, Perl automatically saves these into the array @ARGV, so all you have to do is read from this array. Remember that arrays in Perl have a starting index of 0.

    $number = $ARGV[0];
    print "you typed in $number\n";


    This will save the thing the user typed in into the variable called $number and then print it out to the screen. Aside: Do you know why there's a $ in front of ARGV[0] instead of a @, which is the usual symbol for arrays?
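    To poke at the aside above, here is a small sketch that looks at an argument list both as a whole array and one element at a time. The helper name describe_args is made up for this example; it is not part of the lab's add.pl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: describe a list of arguments the way a script sees @ARGV.
sub describe_args {
    my @args  = @_;
    my $count = scalar(@args);     # an array in scalar context gives its length
    return "no arguments given" if $count == 0;
    my $first = $args[0];          # a single element is a scalar, so it is written with $
    return "$count argument(s); the first one is '$first'";
}

print describe_args(@ARGV), "\n";
```

    Try running it with no arguments, with one, and with several to see how the output changes.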

  • So now, we want to test out this little part of the program. First save our script - say, add.pl. (You may want to open up another Terminal window, if you're using pico, so you don't have to keep quitting the text editor every time you want to test your script.) Before we can run the program (assuming we want to execute the file, rather than passing the file to perl), though, we need to change its file permissions because UNIX doesn't know that this is an executable program - currently it just thinks it's a text file that you can read from and write to. We just need to tell UNIX that you can actually execute this file because it's a program. You do this using the command chmod:

    $ chmod +x add.pl
    or
    $ chmod 755 add.pl

    Aside: If you're wondering what those numbers after chmod mean, check out this tutorial on chmod. Basically the three numbers correspond to the three categories of users defined in Unix: user, group, and world. Each one of these categories can have read (r), write (w), or execute (x) permission on a file. If we write 7 in binary, it will be 111. This is equivalent to "rwx", full permissions for the user. The permissions for the group and world are 5=101="rx", so users in those categories will only have read/execute permissions.
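    The binary arithmetic in the aside above can be checked with a few lines of Perl. This is only an illustration of how the digits map to letters; the helper name mode_to_string is invented here and has nothing to do with how chmod itself is implemented.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn a three-digit mode such as "755" into "rwxr-xr-x".
sub mode_to_string {
    my ($mode) = @_;
    my $out = '';
    for my $digit (split //, $mode) {
        my $bits = $digit + 0;              # force a number so & is a numeric AND
        $out .= ($bits & 4) ? 'r' : '-';    # 4 = binary 100 = read
        $out .= ($bits & 2) ? 'w' : '-';    # 2 = binary 010 = write
        $out .= ($bits & 1) ? 'x' : '-';    # 1 = binary 001 = execute
    }
    return $out;
}

print "755 means ", mode_to_string("755"), "\n";   # user, group, world
print "644 means ", mode_to_string("644"), "\n";
```

    The output has the same letter layout as the permissions column of ls -l (without the leading file-type character).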

    OK, so now UNIX knows it's an executable file. Let's execute it!

    $ add.pl 5

    Did it do what you expect?

In some systems, we would have had to have typed ./add.pl. Why might we have to put the ./ in front of add.pl in some cases? This has to do with something called your PATH variable in UNIX. When you type in a command in UNIX, the system searches all the directories in your PATH to locate this program. On our systems, the current directory is in your PATH. If it weren't, you'd tell UNIX where it is, i.e. ./add.pl, "please execute the add.pl program that is in the . directory aka my current directory".
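If you're curious what is actually in your PATH, this sketch prints each directory and checks for an explicit . entry. The helper name path_has_current_dir is made up, and the check is deliberately simple: it only looks for a literal "." entry.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true if the colon-separated path list contains "." as an entry.
sub path_has_current_dir {
    my ($path) = @_;
    for my $dir (split /:/, $path) {
        return 1 if $dir eq '.';
    }
    return 0;
}

my $path = defined $ENV{PATH} ? $ENV{PATH} : '';
print "Directories searched for commands, in order:\n";
for my $dir (split /:/, $path) {
    print "  $dir\n";
}

if (path_has_current_dir($path)) {
    print "The current directory is listed, so typing add.pl should work.\n";
} else {
    print "The current directory is not listed, so type ./add.pl instead.\n";
}
```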

  • Congrats! You now know how to read arguments from the command line! But now how do we make the actual adder? Well, first, from the command line argument, we know how many numbers the user will type in. Since we know how many times we need to ask the user for a number, a good control structure to use is a for loop:
    for ($i=0; $i<$number; $i++) { 
         print "Please enter a number: ";    # do you know why we don't put a \n here?
         $userNum = <STDIN>;
    }
    

  • Great, we can read in the numbers. But now we need to add them up. As Prof. Holmes mentioned in class, in Perl, there are many different ways to do something. Here, for example, we could save each of the numbers the user types in somewhere (maybe an array?) and then, once we're done asking the user for numbers, go through each number in the array and add them up. But why don't we just keep a cumulative running sum of everything the user types in? So let's try that:
    $sum = 0;    # why is this necessary?
    for ($i=0; $i<$number; $i++) { 
         print "Please enter a number: "; 
         $userNum = <STDIN>;
         $sum += $userNum;
    }
    

  • Finally, we want to print out our result. So our full program is
    $number = $ARGV[0];       # read in the number of numbers to add up from the command line
    $sum = 0;  
    for ($i=0; $i<$number; $i++) { 
         print "Please enter a number: "; 
         $userNum = <STDIN>;
         $sum += $userNum;    # keep a running sum
    }
    print "The sum is $sum\n";
    
    Notice the # comments, which Perl ignores when it's reading through this program. Commenting your code is good programming style. Comments explain what the code is doing to other programmers, and also to you when you come back to this program a year later and have forgotten why you did things a certain way. Comments are discussed further in the StyleGuidelines.

  • Congrats! You've just written an adder! Test it out to make sure it adds numbers like you expect. But... What would happen if someone typed in $ add.pl hello from the command line? What if they don't enter numbers when you ask them to? To find out, you can always test out your programs with "weird" input to see what your program does. Remember that not all users are informed about what your add.pl is supposed to do. Right now, when something unexpected is entered by the user, you should get some complicated looking message (if you used the -w option in your script). That's not very nice - or in computer speak, your program is not handling errors "gracefully". What would be more useful?
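    One possible way to be more graceful, sketched under my own assumptions: check the input before using it and print a short usage message when it isn't a number. The helper name parse_number and the wording of the messages are invented for this example. looks_like_number comes from the Scalar::Util module that ships with Perl.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# Hypothetical helper: return the value as a number, or undef if it is not one.
sub parse_number {
    my ($text) = @_;
    return undef unless defined $text;   # nothing was typed at all
    chomp $text;                         # drop the newline that <STDIN> leaves on
    return looks_like_number($text) ? $text + 0 : undef;
}

my $count = parse_number($ARGV[0]);
if (!defined $count) {
    print "usage: add.pl <how many numbers to add>\n";
} else {
    print "OK, I will ask for $count number(s)\n";
}
```

    This sketch still accepts values like -2 or 3.5 as a count, so a real version of add.pl would probably want a stricter check.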

  • A little more practice ... Extend your program by making a special feature if the user decides to enter 3 numbers. In addition to printing the sum, print out the numbers in descending order.
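    A hint for the ordering part, as a sketch with made-up numbers: Perl's sort compares values as strings unless you tell it otherwise, so numbers need a comparison block using <=>.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @numbers = (9, 100, 10);    # example values, not real user input

my @as_strings = sort @numbers;                  # default: string order
my @descending = sort { $b <=> $a } @numbers;    # numeric order, largest first

print "default sort:    @as_strings\n";
print "descending sort: @descending\n";
```

    Compare the two output lines and make sure you can explain why the default order looks odd.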

2. A slightly better calculator

Let's try something slightly different. Let's write a very similar program, but instead of making someone type in the numbers one by one, we'll read the numbers from a text file. The name of the text file will be passed in through a command line argument. And since we're the designers of this program, we can be a little annoying and impose the following rules:

  1. Each number should be on a separate line.
  2. We'll give a couple more functions to our calculator, specifically add, subtract, multiply, and divide.
  3. Each line should start with one of +, -, *, /, followed by whitespace, followed by a number. This will tell our program to perform that operation on the running total and that number (for now, we're not worrying about order of operations).
  4. If a line does not start with one of the above symbols, we'll ignore that line

Here we go...

  • Make a new file in the text editor and give it a descriptive name, e.g. calculate.pl

  • One of the first things we need to figure out is how to read from a file. Prof. Holmes talked about this in class. To refresh your memory, we first need to open a file handle, then use the < and > symbols around the file handle to read lines from the file, and close the file when we're done reading. Again, in the spirit of starting simple, let's just start with a script that reads lines from a file and prints them on the screen.
    #! /usr/bin/perl -w
    
    $filename = $ARGV[0];
    open(INPUT, $filename);
    while($line=<INPUT>) {
         print "$line";     # do you know why we don't put a \n here? what happens if you do?
    }
    
    Here's a version using the default variable $_, which will automatically contain the line read in from the INPUT filehandle. Then calling print without an argument will output $_.
    #! /usr/bin/perl -w
    
    $filename = $ARGV[0];
    open(INPUT, $filename);
    while(<INPUT>) {
         print;               
    }
    

While this may be a little shorter, it's a little less clear exactly what's being done. For the sake of clarity, the first version is preferred to the second, at least in this course.
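The two versions above never check whether open worked and never close the file. Here is a sketch of the same loop with both added. It differs from the lab code in a few ways that are my own choices: it uses the three-argument form of open with a lexical file handle ($in) instead of INPUT, and it wraps the loop in a helper called read_lines, a name invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return every line of a file, or stop with a readable message.
sub read_lines {
    my ($filename) = @_;
    open(my $in, '<', $filename)
        or die "Could not open '$filename' for reading: $!\n";
    my @lines;
    while (my $line = <$in>) {
        push @lines, $line;    # each line still ends with its newline
    }
    close($in);
    return @lines;
}

if (@ARGV) {
    my @lines = read_lines($ARGV[0]);
    print @lines;
} else {
    print "usage: calculate.pl <filename>\n";
}
```

The special variable $! holds the system's reason for the failure, such as a missing file or a permissions problem.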

  • In order to try out this file-reading program, we have to talk about testing. Part of being a programmer is knowing how and when to test out your programs, which includes designing test cases. A good programmer constructs enough test cases to make sure that every part of the code works (i.e. making sure every condition in the program is tested and trying out as many unexpected cases as s/he can think of so even really weird inputs won't crash the program but give useful error messages, etc). Learning how to test code is a skill gained through time and experience, so we don't expect you to be experts right now. But it's good to keep in mind that a programmer's responsibility includes testing.

    So how do we test out our script so far? Obviously, we need to have some test files, since our program reads from a file. I won't walk you through the steps in making a test file, but go ahead and make one now and test out your script so far, e.g. if I named my test file test.txt:

    $ ./calculate.pl test.txt

    Does your program do what you expect?

  • Ok, so we can read from the file....now, how do we actually do calculations with it? Let's think about the steps involved in English (this is a really good habit to get into and especially important for larger programs. Planning out your program in regular English helps you organize your thoughts before you start worrying about the actual syntax of a programming language. It's also helpful to then write an outline in pseudocode, so that you can see which parts of what you want to do are going to be dependent on one another, and get an idea of the eventual structure of your program.)

    Every time I read a line from the file, I want to look at the first character to figure out what mathematical operation to perform on the following number. If I don't see a mathematical operator, I ignore the line. If I do see one, I read in the number and then do the appropriate mathematical operation using that number and the current total. Hey, this seems like a great place for some of that pattern matching we talked about in lecture (here's more info on regular expressions and Perl regex examples):
    #! /usr/bin/perl -w
    
    open(INPUT, $ARGV[0]);
    while($line = <INPUT>) 
    {
         if ($line=~ /^\+\s([0-9]+)/ ) # do you know why this regex can match what we want?  What's with the \+?
         {        
               print "add $1\n";        # for now, just print out text to check that we're matching stuff right
         } 
         elsif ... # fill in the cases for the other math symbols
         {                  
               
         } 
         else 
         {
               print "ignored line $line";
         }
    }
    

Can you think of any other ways of organizing the logic? Maybe using [\+\-\*\/] somewhere? Is the output exactly what you expect? Could you fix the newlines if you wanted to (think chomp!)?
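As one way to explore the character-class idea, here is a sketch that captures the operator and the number in a single match. It only recognizes lines; the arithmetic is still left for you. The helper name parse_line and the sample lines are invented, and I use \s+ (one or more whitespace characters) where the lab's regex uses a single \s.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return (operator, number) for a valid line, or an empty list.
sub parse_line {
    my ($line) = @_;
    chomp $line;                                   # remove the trailing newline, if any
    if ($line =~ /^([-+*\/])\s+([0-9]+)/) {        # $1 is the symbol, $2 is the digits
        return ($1, $2);
    }
    return;
}

for my $line ("+ 5\n", "* 12\n", "hello\n") {
    my ($op, $num) = parse_line($line);
    if (defined $op) {
        print "operator '$op', number $num\n";
    } else {
        chomp(my $shown = $line);
        print "ignored line '$shown'\n";
    }
}
```

Inside the square brackets, putting the - first means it is taken literally rather than as a range.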

  • Now that we can read the lines in the file, the only step left is to actually perform the calculations and print out the results. Finish this on your own, and show me when you're finished.

3. Time to do a Perl exercise (Homework)!

It's time for you to try writing a script on your own because, as Prof. Holmes said in class, the best way to learn programming is by doing it. Your task is to reverse complement a FASTA file. Part of the solution has already been described in the Perl lecture notes. Here are the requirements for your script:

  • Open a file, whose filename is specified by the user as a command-line argument. That is, if the name of your Perl script is programname and the name of the file is filename, then the script should be run by typing the following at the Unix command line: programname filename
  • Do some basic error handling: verify that the user entered a filename and that the file can be opened. If not, print informative error messages and exit.
  • Read the contents of the file, assuming it is a FASTA file of DNA sequences, and as you're doing so, print the name and reverse-complement of every sequence on the standard output, in FASTA format. This means that you have to output no more than 80 characters per line! (You only have to worry about this for the actual sequence, not the description line).
  • Enable a command line argument and the program logic to output the complement as an RNA sequence. That is, if you use the command line to type:
    $ perl -w programname filename rna
    the program should output the complement as RNA instead of DNA (U's instead of T's).
  • You should add the sequence length L in base pairs to the end of the sequence label line that starts with ">", using a format of ", L bp". If the line was originally "> GFP, mut3" and the DNA sequence had a length of 450, it should now read "> GFP, mut3, 450 bp".
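
To make the two formatting requirements above concrete (80 characters per line, and the ", L bp" label), here is a small sketch. The helper names wrap_sequence and add_length and the 100-base sequence are invented for illustration; this shows one way to handle the formatting only, not the whole assignment.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: cut a sequence into pieces of at most $width characters.
sub wrap_sequence {
    my ($seq, $width) = @_;
    my @chunks;
    for (my $pos = 0; $pos < length($seq); $pos += $width) {
        push @chunks, substr($seq, $pos, $width);   # the last piece may be shorter
    }
    return @chunks;
}

# Hypothetical helper: append ", L bp" to a label line.
sub add_length {
    my ($label, $seq) = @_;
    return $label . ", " . length($seq) . " bp";
}

my $label = "> GFP, mut3";
my $seq   = "ACGT" x 25;        # a made-up 100-base sequence

print add_length($label, $seq), "\n";
for my $chunk (wrap_sequence($seq, 80)) {
    print "$chunk\n";
}
```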

Administrative Concerns:

  • You will be graded on correctness (90%) and style (10%). Please see the StyleGuidelines for expectations about your style. You are not expected to use subroutines for this exercise, though you certainly may, so the "no redundancy" requirement is relaxed.
  • Turn in your program by e-mailing the .pl file to Oscar or uploading it to your individual wiki page. Be sure to include your name and university e-mail address in your e-mail!
  • As mentioned in lecture, you may work with 1 other student if you so choose. (Remember, you can turn in at most 3 other assignments with the same student). If you do work with another student, put "I worked with xxxx" in a comment at the top of your .pl file. Each person should turn in code, even if it's the same.
  • Information about LATE assignments: You lose 20% of the points of the assignment for every day it's late. Contact Oscar or Professor Holmes at least 48 hours before the due date if you have extenuating circumstances, if reasonably possible.

Here are some hints/things to think about:

  • Create your own test files by visiting the NCBI website to find nucleotide sequences. Search for some protein you know of (e.g., hemoglobin) to get a long listing of results, then select a couple of sequences from the list (avoid the 'whole genome' sequences and stick to the 'mRNA' sequences so you don't end up trying to process ridiculously huge files). Then, using the dropdown boxes near the top, you have the option to show the selected sequences in FASTA format and also to save them in a file.

  • Keep in mind that a valid FASTA file can contain 1 or more sequences. Test your script first with one sequence and then add more to your test file.

  • If a sequence is longer than a line, you cannot just do a line-by-line reverse complement.

  • Don't get frustrated if your program doesn't work the way you want in the beginning. Even with many years of programming, a program rarely works on the first try!
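
To see why the hint about sequences longer than one line matters, here is a tiny experiment with an invented 8-base sequence split over two lines. The revcomp helper is a sketch of the reverse-then-translate idea, not a complete solution to the exercise.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: reverse complement of one DNA string.
sub revcomp {
    my ($dna) = @_;
    my $rc = reverse $dna;              # scalar context: reverses the characters
    $rc =~ tr/ACGTacgt/TGCAtgca/;       # swap each base for its complement
    return $rc;
}

my @lines = ("AAAC", "GGGT");           # one made-up sequence, split over two lines

my $whole        = revcomp(join('', @lines));                # join first, then reverse
my $line_by_line = join('', map { revcomp($_) } @lines);     # reverse each line separately

print "joined first: $whole\n";
print "line by line: $line_by_line\n";
```

The two answers differ because reversing each line on its own leaves the lines themselves in their original order.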

-- AngiChau - 18 Sep 2005
