Lab 5: Regular Expressions in Perl (10/5/2005)

By the end of this lab, you should know:

  • how to write some simple regular expressions in Perl
  • how to read/interpret a regular expression you see in Perl
  • how to "capture" information from regular expressions


0. Before we get started...

  • In the following lab, I will use a $ symbol to signify a UNIX prompt. The prompt you see in your terminal window may/may not end in a $, but just remember that you don't actually type in the $, just the stuff after it.

  • This lab assumes that you are now comfortable with using a text editor to write a Perl script, the proper format of a Perl script, and the UNIX commands you have to issue to make your script executable. In addition, you should already know how to read arguments from the command line, read data from user input (i.e. standard input), and read from files. If you need a refresher/reminders along the way, please refer back to Lab 3: Perl Basics.

1. An interactive program

In order to learn about regular expressions in Perl, it's easiest to first write a short interactive program that will allow us to type in text and print out messages informing us whether the text we typed matches patterns in our regular expressions. Then as we learn more and more regular expressions, we can add them into the script and see how the matching process works.

The interactive program should continually read lines from standard input (usually the keyboard, although you can also use a file via input redirection in UNIX) and, for now, just print each one back out. So basically, this is just an echo program, i.e. I want to be able to do this:

$ ./interactive.pl
hello me!
hello me!
are you there?
are you there?

Try writing this script yourself and if you're running into problems, let me know. (Remember that to exit out of such a program, you will need to use Ctrl-D.)
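
If you get stuck, here is one possible shape for the echo loop. It is only a sketch: the subroutine name echo_lines and the in-memory "typed" text are made up for illustration so that the example runs on its own without a keyboard.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Copy every line from the input handle to the output handle, unchanged.
sub echo_lines {
    my ($in, $out) = @_;
    while (my $line = <$in>) {   # the loop ends at end of input (Ctrl-D at the keyboard)
        print {$out} $line;      # $line still ends in "\n", so no extra newline is needed
    }
}

# Stand-in for the keyboard (an assumption of this sketch).
# In interactive.pl you would call echo_lines(\*STDIN, \*STDOUT) instead.
my $typed = "hello me!\nare you there?\n";
open(my $in, '<', \$typed) or die "cannot open in-memory input: $!";
echo_lines($in, \*STDOUT);
close($in);
```

Keeping the loop inside a subroutine is just one design choice; a bare while loop over STDIN works equally well for this lab.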

1. Matching basics

Now that we have our "testbed" set up, let's try to write some simple regular expressions for matching numbers. Before we get down to the details, let's review a couple of points from lecture first. (By the way, from now on, I will refer to regular expressions as 'regexps')

  • Why do we use regexps? : Regexps are one of the things that make Perl such a powerful language for text processing. In languages that don't support regexps as seamlessly as Perl, you might need to write a huge program just to parse a simple FASTA file.

  • When do we use regexps? : To put it simply, we use them when we want to test if some text matches some pattern. So you would use them when you want to parse text, when you want to do some sort of search-and-replace on chunks of text, etc. It's similar to the 'Find...' function in a lot of word processors, except much, much more powerful.

  • But what's a pattern? : Well, that's just some way to specify how things are formatted, and it varies with each program you write. But we actually define patterns every day, although not necessarily out loud. For example, when I want to figure out if a small chunk of text I'm looking at is a temperature, how would I do it? Maybe I would ask myself "is it some numbers followed by a tiny circle followed by either a C or an F?" That's it! I just defined a pattern (in English). Another example: I want to see if a small chunk of text is a valid protein sequence, so I might want to use a pattern like "a string of characters that contains only the 20 amino acid codes". The great thing about Perl is that once you define a pattern to yourself in English, it's a relatively straightforward translation into Perl syntax. So no need to be intimidated by those lines of cryptic symbols you see in Perl - they all translate directly back to English.

  • Pattern binding operator : You've already seen this operator =~ in some of the previous labs/exercises but we didn't really point it out. =~ is similar to other operators in Perl (e.g. || for or, eq for string comparison, etc.) except you use it when you want to ask whether the thing on the left-hand side of the operator matches the pattern on the right-hand side of the operator. Many times, you will see this sort of statement:
    if ($someString =~ /somePattern/) {
       do something;
    }
    
    What this means is that if $someString matches somePattern successfully, the stuff inside the parentheses will evaluate to true and you will enter the block containing do something;. If you look back at some of the old labs and exercises, you can see that we used exactly this operator to read a FASTA file and figure out whether the current line starts with a > or not.

    One additional note about this operator is that it is implied when you use it with the default variable $_. For example,

    if (/somePattern/) {
       do something;
    }
    
    Since you don't have to specifically write out the default variable, you also don't have to write out the =~ operator. If you find this confusing or just aesthetically unpleasing (I do), just remember that you never have to use the default variable if you don't want to.
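
To make the two forms concrete, here is a minimal sketch that checks for FASTA header lines both ways. The subroutine names and the sample lines are made up for this example; only the =~ operator and the default variable $_ come from the discussion above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Explicit form: bind a named variable to the pattern with =~.
sub starts_with_gt {
    my ($line) = @_;
    return $line =~ /^>/ ? 1 : 0;
}

# Implicit form: with no =~, the pattern is tested against $_.
sub count_headers {
    my @lines = @_;
    my $count = 0;
    for (@lines) {          # each element is placed in $_ in turn
        $count++ if /^>/;   # same as: if ($_ =~ /^>/)
    }
    return $count;
}

# Made-up FASTA-like lines, for illustration only.
my @lines = (">seq1", "ACGTACGT", ">seq2", "GGCCTTAA");
for my $line (@lines) {
    print "$line\n" if starts_with_gt($line);
}
print count_headers(@lines), " header lines\n";
```

Both subroutines ask the same question of each line; the only difference is whether the variable being tested is written out.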

OK, with those points fresh in your mind, let's have a go at building our first regexp for matching different types of numbers, starting off with decimals. How would you define a pattern for a decimal? How about "some digits, followed by a decimal point, followed by more digits"? That seems like a good start. So now to translate that into Perl, we need to remember a couple of technical details:

  • character classes : In a regexp, you can define character classes using square brackets [ ], where you put all the characters you want to group into a class inside the brackets. (Fortunately, Perl is smart and won't ever make you type in all 26 letters of the alphabet if you want a class containing all uppercase letters.) A character class basically says "any character within this class is OK for this part of the pattern". You can also put a caret ^ right after the first bracket if you want to say which characters are not OK for this part of the pattern. Here are some examples of character classes:
    • [A-Z] : a class containing all uppercase letters
    • [A-Za-z0-9] : a class containing all upper- and lowercase letters, plus all the numerical digits
    • [ACTG] : a class containing only 4 letters representing the allowed DNA bases
    • [10] : a class containing only 2 digits, probably used to find binary numbers
    • [^AEIOUaeiou] : a class containing every character except the vowels (so the consonants, but also digits, punctuation, and whitespace)
    In fact, for some of the more commonly used classes, Perl provides shortcuts. For example, if you want to match "word" characters (letters, digits, and the underscore), you can use \w instead of [a-zA-Z0-9_], and to match whitespace characters, you can use \s. Pretty handy!

  • quantifiers : When we defined patterns in English, we used a lot of "some", "one", etc. Well, in Perl, these would be translated into quantifiers. A quantifier applies to the element immediately preceding it. The three most commonly used quantifiers are:
    • * : match the preceding element 0 or more times
    • + : match the preceding element 1 or more times
    • ? : match the preceding element 0 or 1 time
    (If you're still unsure what quantifiers mean, keep reading on to see them being used in the decimal example and it might make more sense)

  • escaping special characters : Some non-alphanumeric characters in Perl carry special meanings when they're used in regexps. For example, the . actually means "any character is ok, except for a newline". So in our case, when we want to actually match the decimal point, we need to escape the . by adding a \ in front of it.
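
Here is a small sketch that exercises each of these pieces on a few strings. The subroutine name describe, the labels, and the sample strings are all invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return a list of labels saying which illustrative patterns the text matches.
sub describe {
    my ($text) = @_;
    my @found;
    push @found, 'uppercase letter' if $text =~ /[A-Z]/;          # character class
    push @found, 'non-vowel char'   if $text =~ /[^AEIOUaeiou]/;  # negated class
    push @found, 'run of digits'    if $text =~ /[0-9]+/;         # + : one or more
    push @found, 'color/colour'     if $text =~ /colou?r/;        # ? : zero or one "u"
    push @found, 'literal dot'      if $text =~ /\./;             # escaped .
    return @found;
}

# Sample strings chosen for this sketch.
for my $text ('ACTG', 'colour 42', 'aeiou', 'v1.0') {
    print "'$text': ", join(', ', describe($text)), "\n";
}
```

Try changing /\./ to /./ and see how many more strings pick up the last label; that is the difference between a literal decimal point and "any character".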

Those are the only syntax points we need to remember to translate our decimal pattern (some digits, decimal point, more digits) into Perl:

/[0-9]+\.[0-9]+/

Now, try incorporating this into your testbed using an if statement and the pattern binding =~ operator. If the pattern matches, you can just do something simple like print out a message saying you just saw a decimal or something.
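
As a rough sketch of what that check might look like, the example below wraps the pattern in a subroutine (has_decimal is a name made up here) and runs it over a few sample strings instead of typed input, so that it runs on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# True if the text contains digits, a literal decimal point, then more digits.
sub has_decimal {
    my ($text) = @_;
    return $text =~ /[0-9]+\.[0-9]+/ ? 1 : 0;
}

# Sample inputs chosen for this sketch; in the testbed these would be typed lines.
for my $text ('3.14', '42', '.2', '1.', 'pi is 3.14159') {
    if (has_decimal($text)) {
        print "'$text': saw a decimal\n";
    } else {
        print "'$text': no decimal here\n";
    }
}
```

In your own testbed, the same if statement would sit inside the loop that reads lines from standard input.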

When you try to test out the pattern above, you may notice that it's actually quite limited in what it considers a decimal. For example, ".2" and "1." won't be recognized as decimals. How can you change the regexp to allow these other decimal formats?

What about other number formats? Try adding regexps to match things like integers, fractions, etc. to your testbed... Remember, if you don't know where to begin, always start by figuring out a pattern in English and then worry about translating it to Perl as the next step. It's also helpful to keep a short cheat sheet for regexp syntax on hand (I still use one when I'm writing regexps).

2. More regexp practice

Hopefully, by now, you feel more comfortable with writing some simple regular expressions in Perl. You're probably tired of hearing me say this, but again, the more you practice, the more you'll figure out how to write regexps. And although I can keep showing you how I would write regexps, it doesn't really help until you try to write them yourself. I think writing regexps is a bit like solving puzzles, so it's hard to learn unless you actually do it.

In the spirit of this, how would you match

  • DNA/RNA/protein sequences?
  • time? dates?
  • email addresses? URLs?

One more important syntax note: Perl regexps have two location "anchors": ^ and $. You may remember that ^ inside of the square brackets negates a character class. Outside of the square brackets, it has another usage. If you add it to the beginning of the pattern, it means "the string must start with ...". For example, /^T/ will only match strings starting with a capital T. Similarly, the $ matches the end of the string (or just before a newline at the very end of the string, which is handy because lines read from standard input still carry their newline). (With this new syntax, how can you modify the decimal pattern to be more specific? e.g. if you just used the basic pattern above, the lines '0.678aaa' and 'hello6.57' will actually match. Can you stop that from happening?)
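
The sketch below shows the two anchors at work on DNA-like strings rather than on decimals. The subroutine names and the sample strings are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# ^ anchors the match to the start of the string.
sub starts_with_T {
    my ($text) = @_;
    return $text =~ /^T/ ? 1 : 0;
}

# Both anchors together: the whole string must fit the pattern.
# $ also matches just before a trailing newline, so "ACTG\n" passes too.
sub is_all_dna {
    my ($text) = @_;
    return $text =~ /^[ACTG]+$/ ? 1 : 0;
}

# Sample strings chosen for this sketch.
for my $text ('TATA', 'ATAT', 'ACTGxyz') {
    print "'$text': starts with T = ", starts_with_T($text),
          ", all DNA = ", is_all_dna($text), "\n";
}
```

Without the anchors, /[ACTG]+/ would happily match 'ACTGxyz', because it only needs to find the pattern somewhere in the string.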

There are a lot more syntax rules for regexps that you might find useful, but we won't be going over most of them. If you're curious or are looking for more ways to make up patterns, I would suggest looking at the 'Learning Perl' textbook and/or some of the online Perl references.

3. Capturing

Being able to match patterns is great but most of the time, we actually want to extract information from the text when a pattern matches. For example, if something matches an email address, maybe we would like to know what the person's username is (i.e. the stuff before the @ symbol).

Let's use a more compbio-esque example to figure out how to retrieve information using regexps. Suppose you are working on some computational biology project that requires you to figure out which organisms have homologous sequences to some gene you discovered. You send the sequence of your gene to your friend, who's an expert at finding homologous sequences but you forgot to mention that all you wanted were the names of the organisms with homologous genes (don't ask why). Your friend sends you the results before going on a month-long vacation and now you're stuck with this huge file containing all the homologous sequences to your gene.

Luckily, you realize that the results are actually in FASTA format and in particular, the lines containing the names of the sequences actually contain the names of the organisms and they all seem to be in a specific format:

    >gi|{id number}|{organism name}

So now you have two choices: one is to sit there and copy down all the names of the organisms in the file (not fun) and the other is to write a Perl script to do it for you. I hope all of you decide to go for the latter.

So the first step is to come up with a pattern to match the format above. Mine looks like this:

    /^>gi\|[0-9]+\|.+/   # do you understand why each of these symbols is here?

In the above regexp, the part we're interested in, i.e. the organism name, is actually being matched by the .+, so that's the part we want to 'capture'. Capturing is done using round parentheses ( ) inside the regexp:

    /^>gi\|[0-9]+\|(.+)/

What does it mean to capture? It means that Perl will grab the part that matched the .+ and save it to some scalar variable, ready for you to use. These scalar variables are easy to remember because they're called $1, $2, $3, and so on. In other words, the first set of ( ) captures the stuff surrounded by the parentheses into $1, the second set into $2, ...

So to print out the list of organism names, we can just use an if block that looks like this inside of a script that reads the file (which you should know how to write by now):

    if (/^>gi\|[0-9]+\|(.+)/) {
           print "$1\n";
    }
    

(To test out this program, you can use the FASTA file ~be131/fasta_files/homologous.fasta.)
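
To see more than one capture at a time, here is a sketch that adds a second set of parentheses around the id number, so the id lands in $1 and the organism name in $2. The subroutine name parse_header and the sample header lines are invented for illustration; they are not taken from homologous.fasta.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return (id, organism) for a ">gi|id|organism" header line,
# or an empty list if the line is not in that format.
sub parse_header {
    my ($line) = @_;
    if ($line =~ /^>gi\|([0-9]+)\|(.+)/) {
        return ($1, $2);   # first ( ) pair is $1, second is $2
    }
    return;
}

# Made-up FASTA lines standing in for the real file.
my @lines = (
    ">gi|12345|Homo sapiens",
    "ACGTACGTAC",
    ">gi|67890|Mus musculus",
    "TTGACCGTAA",
);

for my $line (@lines) {
    my ($id, $organism) = parse_header($line);
    print "$organism (id $id)\n" if defined $organism;
}
```

Because . never matches a newline by default, the captured organism name does not include the line's trailing newline even if you forget to chomp.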

-- AngiChau - 03 Oct 2005
