Lab 3: Perl Basics Revisited (9/21/2005)
By the end of this lab, you should know how to:
- write and execute simple Perl scripts in the UNIX environment
- use arguments passed in from the command line in your Perl scripts
- read data from text files and user input
- use the NCBI website to create FASTA files
0. Before we get started...
- The prompt you get depends on the UNIX shell you're running and on other things such as preference files. In this lab, I will use a
$ symbol to signify a UNIX prompt. The prompt you see in your terminal window may or may not end in a $. Either way, you don't actually type in the $, just the stuff after it.
- We won't be going over every single thing on Perl that Prof. Holmes talked about in lecture. We just don't have enough time to do that in lab. We'll try to concentrate on some of the trickier concepts, but you cannot rely on just the things we use in the lab examples for the exercises/homework assignments. A big part of becoming a good programmer is knowing when and how to look up documentation and help on commands and expressions you may not have used before - so when you're unsure about something, google to see if you can find some help online or better yet, just try it! For example:
- Do you know what the difference is when you use the
print command and include a \n as opposed to not?
- What happens when you use double quotes
"" as opposed to single quotes '' or backticks ``? Do you know why you get the different behaviors?
- When you have strings with numbers e.g.
"1", "56", what's the difference when you compare them using the numeric comparison operators as opposed to the string comparison operators?
- If you don't have/want to buy a Perl book, look up some websites with command references and bookmark them. You can refer to these when you're unsure of the syntax of a command or you forgot exactly how to write a
foreach loop or something. Here are a couple I use often: perl.com, Rex Swain's HTML Perl guide, A brief guide from CMU.
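The three questions above (newlines, quoting, and numeric versus string comparison) can be explored with a short script like the one below. This is only a sketch for experimenting: `use strict` and `my` are additions that the lab scripts themselves don't use, and the backtick example assumes a UNIX `echo` command is available.

```perl
#! /usr/bin/perl -w
use strict;

# 1. print with and without a trailing newline
print "no newline here";
print " ... so this continues on the same line\n";

# 2. double quotes interpolate variables and escapes; single quotes do not
my $name   = "Perl";
my $double = "Hello, $name";   # $name is replaced by its value
my $single = 'Hello, $name';   # stays literally as $name
print "$double\n";
print $single, "\n";

# backticks run a shell command and capture what it prints
my $echo = `echo hello`;       # assumes a UNIX echo command
print "echo printed: $echo";   # the captured text already ends in a newline

# 3. numeric vs. string comparison of the same two strings
my ($x, $y) = ("10", "9");
print "numerically, $x is greater than $y\n" if $x > $y;   # compared as numbers
print "as strings, '$x' sorts before '$y'\n" if $x lt $y;  # compared character by character
```

Try changing the values of `$x` and `$y`, or swapping the quote styles, to see which lines still print.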
1. Adding numbers
Let's start off with a simple program, just to get familiar with the process of writing and executing a Perl script. We want to write a program that's basically a calculator and actually, not even a very good one. It can only add (i.e. don't throw away your current calculator because you probably won't be using this program to replace your calculator).
Here are the specifications: When a user calls our program from the command line, s/he will also need to tell us how many numbers we will be adding. Then we will ask the user to type in that many numbers, and we will print out the sum at the end.
- OK, so let's get started. First, fire up your favorite text editor.
Aside: If you're using emacs in the lab, it'll bring up a new window, and the terminal from which you started it will not be usable until you exit emacs. That's pretty annoying, so to prevent it, run emacs "in the background." Just put a & at the end of the command; this tells the shell to start the program as a separate process and give you the prompt back right away instead of waiting for the program to finish - you can read more than you probably care to know about UNIX processes here.
$ emacs &
Alternatively, if you prefer something simpler like pico, which doesn't pop up in a new window, don't put the & or you won't actually be able to use the pico program you just started. If you accidentally did it, you'll need to kill the process using the kill command at the prompt with the process ID (PID) of the pico process you started. You can look up PIDs using the ps command.
$ ps
$ kill XXXX
- Now that you have a space in which to type your script, it's time to designate this as a Perl script. By convention, you save Perl scripts with the extension
.pl, but that's not actually what tells the UNIX system that this is a Perl script. What does it is the very very first line of your Perl script, which must read
#! /usr/bin/perl -w
This line must be line 1 of your script (i.e. you can't even put in a blank line before it - it's worth trying to put a blank line before it and seeing the error it gives later on when you try to run the program, just so you can recognize this error if you see it later on). What's the -w for? That turns on warnings in the Perl interpreter, so it tells you about likely mistakes (for example, using a variable that was never given a value) even when the program still runs. It's a good idea to turn this on when you're starting off writing a script, so the Perl interpreter can help you find bugs in your scripts.
- One good way to learn programming is to do things step-by-step, i.e. get one small part working before adding onto it. Unless you've done a lot of programming in a particular language, it's usually not a good idea (in my opinion) to try to write the full program in one go, and then figure out everything that is wrong with it. In the spirit of this method, let's just start off with reading an argument from the command line and we'll just print it out. Not too useful of a program but it's a good start.
So how do we read in arguments from the command line? This requires the use of a special array variable called @ARGV. When a user calls your program from the command line and puts arguments after it, Perl automatically saves these into the @ARGV array, so all you have to do is read from this array. Remember that arrays in Perl have a starting index of 0.
$number = $ARGV[0];
print "you typed in $number\n";
This will save the thing the user typed in into the variable called $number and then print it out to the screen. Aside: Do you know why there's a $ in front of ARGV[0] instead of a @, which is the usual symbol for arrays?
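To experiment with the aside above, here is a small sketch that looks at @ARGV both as a whole array and one element at a time. The stand-in arguments and the use of `use strict` and `my` are assumptions made for this example only; they are not part of add.pl.

```perl
#! /usr/bin/perl -w
use strict;

# stand-in arguments so the sketch also runs when called without any
@ARGV = ("5", "hello") unless @ARGV;

my $count = scalar(@ARGV);   # @ARGV is the whole array; in scalar context it gives its length
my $first = $ARGV[0];        # $ARGV[0] is one element of it, a single scalar value

print "got $count argument(s)\n";
print "the first one is $first\n";

# walk through every argument together with its index
for (my $i = 0; $i < $count; $i++) {
    print "argument $i: $ARGV[$i]\n";
}
```

Run it with different numbers of arguments and compare what `@ARGV` and `$ARGV[0]` give you.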
- So now, we want to test out this little part of the program. First save our script - say,
add.pl. (You may want to open up another Terminal window, if you're using pico, so you don't have to keep quitting the text editor every time you want to test your script.) Before we can run the program, though, we need to change its file permissions because UNIX doesn't know that this is an executable program - currently it just thinks it's a text file that you can read from and write to. We just need to tell UNIX that you can actually execute this file because it's a program. You do this using the command chmod:
$ chmod +x add.pl or
$ chmod 755 add.pl
Aside: If you're wondering what those numbers after chmod mean, check out this tutorial on chmod.
OK, so now UNIX knows it's an executable file. Let's execute it!
$ ./add.pl 5
Did it do what you expect? Why do we have to put the ./ in front of add.pl? This has to do with something called your PATH variable in UNIX. When you type in a command in UNIX, the system searches all the directories in your PATH to locate this program. But most likely, the directory where you have placed add.pl is not in your PATH, so that's why you're telling UNIX where it is, i.e. ./add.pl, "please execute the add.pl program that is in the . directory aka my current directory".
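If you're curious which directories are in your PATH, Perl can show you through its built-in %ENV hash. The sketch below assumes a UNIX-style PATH, where directories are separated by colons, and it only looks for a literal "." entry.

```perl
#! /usr/bin/perl -w
use strict;

# %ENV holds the environment variables of the shell that started the script
my $path = defined $ENV{PATH} ? $ENV{PATH} : "";
my @dirs = split /:/, $path;   # PATH is a colon-separated list on UNIX

print "Directories searched for commands:\n";
foreach my $dir (@dirs) {
    print "  $dir\n";
}

# count how many entries are exactly "." (the current directory)
my $has_dot = grep { $_ eq "." } @dirs;
if ($has_dot) {
    print "the current directory is in your PATH\n";
} else {
    print "the current directory is not listed, so you need ./add.pl\n";
}
```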
- Congrats! You now know how to read arguments from the command line! But now how do we make the actual adder? Well, first, from the command line argument, we know how many numbers the user will type in. Since we know how many times we need to ask the user for a number, a good control structure to use is a
for loop:
for ($i=0; $i<$number; $i++) {
    print "Please enter a number"; # do you know why we don't put a \n here?
    $userNum = <STDIN>;
}
- Great, we can read in the numbers. But now we need to add them up. As Prof. Holmes mentioned in class, in Perl, there are many different ways to do something. Here, for example, we could save each of the numbers the user types in somewhere (maybe an array?) and then once we're done asking the user for numbers, we can just go through each number in the array and add them up. But, why don't we just keep a cumulative running sum of everything the user types in? So let's try that:
$sum = 0; # why is this necessary?
for ($i=0; $i<$number; $i++) {
    print "Please enter a number";
    $userNum = <STDIN>;
    $sum += $userNum;
}
- Finally, we want to print out our result. So our full program is
#! /usr/bin/perl -w
$number = $ARGV[0]; # read in the number of numbers to add up from the command line
$sum = 0;
for ($i=0; $i<$number; $i++) {
    print "Please enter a number";
    $userNum = <STDIN>;
    $sum += $userNum; # keep a running sum
}
print "The sum is $sum\n";
Notice the
# comments, which Perl ignores when it's reading through this program. Commenting your code is good programming style and helps to explain what you're doing in the code to other programmers and also to you later on, when you come back to this program a year later and may have forgotten why you did things a certain way.
- Congrats! You've just written an adder! Test it out to make sure it adds numbers like you expect. But... What would happen if someone typed in
$ ./add.pl hello from the command line? What if they don't enter numbers when you ask them to? To find out, you can always test out your programs with "weird" input to see what your program does. Remember that not all users are informed about what your add.pl is supposed to do. Right now, when something unexpected is entered by the user, you should get some complicated looking message (if you used the -w option in your script). That's not very nice - or in computer speak, your program is not handling errors "gracefully". What would be more useful? (We'll talk about this later, when you've thought about it for your first assignment.)
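As one possible starting point for thinking about "graceful" handling, the sketch below checks whether a piece of input looks like a number before it is used. The helper name `is_number` and the exact pattern are assumptions made for illustration; they are not part of add.pl, and you may decide on different rules (for example, whether decimals are allowed).

```perl
#! /usr/bin/perl -w
use strict;

# is_number is a hypothetical helper, not part of the lab's add.pl.
# It accepts an optional sign, digits, and an optional decimal part.
sub is_number {
    my ($text) = @_;
    return 0 unless defined $text;
    return ($text =~ /^[+-]?\d+(\.\d+)?$/) ? 1 : 0;
}

# check a few sample inputs, including the "hello" case from above
foreach my $input ("5", "-3.25", "hello", "") {
    if (is_number($input)) {
        print "'$input' looks like a number\n";
    } else {
        print "'$input' is not a number; a friendlier message could go here\n";
    }
}
```

In add.pl, a check like this could be applied to `$ARGV[0]` before the loop starts and to each line read from `<STDIN>`.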
2. A slightly better calculator
Let's try something slightly different. Let's write a very similar program, but instead of making someone type in the numbers one by one, we'll read the numbers from a text file. The name of the text file will be passed in through a command line argument. And since we're the designers of this program, we can be a little annoying and impose the following rules:
- Each number should be on a separate line.
- We'll give a couple more functions to our calculator, specifically add, subtract, multiply, and divide.
- Each line should start with one of
+, -, *, /, followed by whitespace, followed by a number. This will tell our program to perform that operation on the running total and that number (for now, we're not worrying about order of operations)
- If a line does not start with one of the above symbols, we'll ignore that line
Here we go...
- Make a new file in the text editor and give it a descriptive name, e.g.
calculate.pl
- One of the first things we need to figure out how to do is to read from a file. Prof. Holmes talked about this in class. To refresh your memory, we need to first open a file handle, then put the file handle between the < and > symbols to read lines from the file, and close the file when we're done reading. Again, in the spirit of starting simple, let's just start with a script that reads lines from a file and prints them on the screen.
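#! /usr/bin/perl -w
open(INPUT, $ARGV[0]) or die "Cannot open $ARGV[0]: $!\n"; # stop with a message if the file can't be opened
while($line=<INPUT>) {
    print "$line"; # do you know why we don't put a \n here? what happens if you do?
}
close(INPUT); # close the file when we're done reading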
or if you prefer to use the default variable
$_
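#! /usr/bin/perl -w
open(INPUT, $ARGV[0]) or die "Cannot open $ARGV[0]: $!\n"; # stop with a message if the file can't be opened
while(<INPUT>) {
    print;
}
close(INPUT); # close the file when we're done reading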
- In order to try out this file-reading program, we have to talk about testing. Part of being a programmer is knowing how and when to test out your programs, which includes designing test cases. A good programmer constructs enough test cases to make sure that every part of his/her code works (i.e. s/he tests out every condition in the program and doesn't leave an
else statement untested, and tries as many unexpected cases as s/he can think of, so that even really weird inputs produce useful error messages instead of crashing the program, etc). Learning how to test code is a skill gained through time and experience, so we don't expect you to be experts right now. But it's good to keep in mind that a programmer's responsibility includes testing.
So how do we test out our script so far? Obviously, we need to have some test files, since our program reads from a file. I won't walk you through the steps in making a test file, but go ahead and make one now and test out your script so far, e.g. if I named my test file test.txt:
$ ./calculate.pl test.txt
Does your program do what you expect?
- Ok, so we can read from the file....now, how do we actually do calculations with it? Let's think about the steps involved in English (this is a really good habit to get into and especially important for larger programs. Planning out your program in regular English helps you organize your thoughts before you start worrying about the actual syntax of a programming language.)
Every time I read a line from the file, I want to look at the first character to figure out what mathematical operation to perform on the following number. If I don't see a mathematical operator, I ignore the line. If I do see one, I read in the number and then do the appropriate mathematical operation using that number and the current total. Hey, this seems like a great place for some of that pattern matching we talked about in lecture:
#! /usr/bin/perl -w
open(INPUT, $ARGV[0]) or die "Cannot open $ARGV[0]: $!\n"; # stop with a message if the file can't be opened
while(<INPUT>) {
    if ( /^\+\s([0-9]+)/ ) { # do you know why this regex can match what we want?
        print "add $1\n"; # for now, just print out text to check that we're matching stuff right
    # } elsif ... {   # fill in the cases for the other math symbols
    } else {
        print "ignored line $_";
    }
}
close(INPUT);
- Now that we can read the lines in the file, the only step left is to actually perform the calculations and print out the results. I'll let you guys finish this on your own.
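If you get stuck on the last step, here is one possible way to organize it, offered as a sketch and not as the official solution. The helper `apply_op`, the sample lines standing in for a test file, and the choice to skip division by zero are all assumptions made for this example.

```perl
#! /usr/bin/perl -w
use strict;

# apply_op is a hypothetical helper: it combines the running total with
# one number according to the operator symbol and returns the new total.
sub apply_op {
    my ($total, $op, $num) = @_;
    if    ($op eq "+") { return $total + $num; }
    elsif ($op eq "-") { return $total - $num; }
    elsif ($op eq "*") { return $total * $num; }
    elsif ($op eq "/") {
        if ($num == 0) {
            print "skipping division by zero\n";
            return $total;
        }
        return $total / $num;
    }
    return $total;   # unknown operator: leave the total unchanged
}

# sample lines standing in for the contents of a test file
my @lines = ("+ 10\n", "* 3\n", "comment line\n", "- 4\n", "/ 2\n");

my $total = 0;
foreach my $line (@lines) {
    # operator symbol, whitespace, then a whole number
    if ($line =~ /^([-+*\/])\s+([0-9]+)/) {
        $total = apply_op($total, $1, $2);
    } else {
        print "ignored line $line";
    }
}
print "The total is $total\n";
```

With the sample lines above, the running total goes 10, 30, 26, 13. In calculate.pl the loop would read from the INPUT file handle instead of a fixed list.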
3. Time to do Perl exercise 1 (reverse complement a FASTA file)!
It's time for you to try out writing a script on your own -- specifically, the first Perl exercise for reverse complementing a FASTA file. You should
submit your scripts next Tuesday 9/27 by 5pm, because we'll be going over one way to write this program during next week's lab. For now, here are some hints/things to think about:
- Test files: There are some FASTA files in the
~be131/fasta_files/ directory that you are free to copy/use to test your scripts. I would avoid the files ending in .fna because those contain full genome sequences for organisms, so they tend to be large. The other two .fasta files should be a good starting point.
- More test files: Two test files aren't that many, so you probably want to create your own. The NCBI website is a great place to find sequences, specifically nucleotide sequences in this case. Search for some protein you know of to get a long listing of results, then select a couple sequences from the list (again, avoid the 'whole genome' ones and stick to the 'mRNA' ones so you don't end up trying to process ridiculously huge files). Then on the dropdown boxes near the top, you have the option to show the selected sequences in FASTA format and also to save them in a file.
- Review the specifications for a valid FASTA file and make sure your program can process all valid FASTA files (e.g. a FASTA file can contain 1 or more sequences)
- If a sequence is longer than a line, you cannot just do a line-by-line reverse complement
- Don't get frustrated if your program doesn't work the way you want in the beginning. Even with all the years I've been programming, I rarely write a program that works on the first try!
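To illustrate the hint about sequences that span more than one line, here is a sketch of the string-level step only. The helper name `revcomp` and the sample lines are made up for this example, it handles only the bases A, C, G and T, and reading and writing the FASTA file itself is left to you.

```perl
#! /usr/bin/perl -w
use strict;

# revcomp is a hypothetical helper: reverse the string, then swap each base
# for its complement. Only A, C, G, T (upper or lower case) are handled.
sub revcomp {
    my ($seq) = @_;
    my $rc = reverse $seq;            # scalar context reverses the characters
    $rc =~ tr/ACGTacgt/TGCAtgca/;     # A<->T and C<->G
    return $rc;
}

# one sequence split over two lines, as it might appear in a FASTA file
my @lines = ("AACCG\n", "GGT\n");

# join the lines into one string (dropping newlines) before reversing
my $seq = "";
foreach my $line (@lines) {
    chomp $line;
    $seq .= $line;
}

print "whole sequence: ", revcomp($seq), "\n";
print "line by line:   ", revcomp($lines[0]), revcomp($lines[1]), "\n";
```

The first line of output is ACCCGGTT and the second is CGGTTACC, which shows why a line-by-line reverse complement gives the wrong answer.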
--
AngiChau - 18 Sep 2005
