
Pattern Matching in Perl

By the end of this lab, you should know:

  • how to write some simple regular expressions in Perl
  • how to read/interpret a regular expression you see in Perl
  • how to "capture" information from regular expressions


Before we get started...

  • In this lab, we use a $ symbol to signify a UNIX prompt. The prompt in your terminal window may or may not end in a $; either way, you don't type the $ itself, just the text after it.

  • This lab assumes that you are now comfortable with using a text editor to write a Perl script, the proper format of a Perl script, and the UNIX commands you have to issue to make your script executable. In addition, you should already know how to read arguments from the command line, read data from user input (i.e. standard input), and read from files. If you need a refresher/reminders along the way, please refer back to: Perl Basics.

  • The solution to lab 4's hw will appear in next week's lab.

An interactive program

In order to learn about regular expressions in Perl, it's easiest to first write a short interactive program that will allow us to type in text and print out messages informing us whether the text we typed matches patterns in our regular expressions. Then as we learn more and more regular expressions, we can add them into the script and see how the matching process works.

The interactive program should continually read lines from standard input (usually the keyboard, although in UNIX you can also redirect input from a file) and, for now, just print each line back out. So basically, this is just an echo program, i.e. your script should be able to do this:

$ interactive.pl
hello me!
hello me!
are you there?
are you there?

Try writing this script yourself. Remember that to exit such a program, you need to press Ctrl-D to signal End Of File (EOF) (on Windows, press Ctrl-Z followed by Enter).
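If you get stuck, here is one possible shape for the echo loop. It is only a sketch: the sub name echo_lines is made up for illustration, and it takes its input and output filehandles as arguments so the same loop can be run on STDIN or on a test string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the lab): copy every line from the
# input filehandle to the output filehandle and return the line count.
sub echo_lines {
    my ($in, $out) = @_;
    my $count = 0;
    while (my $line = <$in>) {   # the loop ends at EOF (Ctrl-D at the keyboard)
        print {$out} $line;      # $line still ends in "\n", so no extra newline
        $count++;
    }
    return $count;
}

# Demonstration on an in-memory "file" so the sketch runs without typing.
my $typed = "hello me!\nare you there?\n";
open(my $fake_in, '<', \$typed) or die "cannot open in-memory input: $!";
my $echoed = '';
open(my $fake_out, '>', \$echoed) or die "cannot open in-memory output: $!";
my $lines_seen = echo_lines($fake_in, $fake_out);
close($fake_out);
print $echoed;

# In interactive.pl you would call it on the real handles instead:
#   echo_lines(\*STDIN, \*STDOUT);
```

Later in the lab, the print statement inside the while loop is the natural place to add your if statements with regexps.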

Matching basics

Now that we have our "testbed" set up, let's try to write some simple regular expressions for matching numbers. Before we get down to the details, let's review a couple of points from lecture first. (By the way, from now on, we will refer to regular expressions as 'regexps'.)

  • Why do we use regexps? : Regexps are one of the things that make Perl such a powerful language for text processing. In languages that don't support regexps as seamlessly as Perl, you might need to write a huge program just to parse a simple FASTA file.

  • When do we use regexps? : To put it simply, we use them when we want to test whether some text matches some pattern. So you would use them when you want to parse text, when you want to do some sort of search-and-replace on chunks of text, etc. It's similar to the 'Find...' function in a lot of word processors, except much, much more powerful.

  • But what's a pattern? : Well, that's just some way to specify how things are formatted and it varies with each program you write. But we actually define patterns every day, although not necessarily out loud. For example, when we figure out if a small chunk of text we're looking at is a temperature, how do we do it? Maybe we tell ourselves: "Is it a number followed by a tiny circle followed by either a C or an F?" We have just defined a pattern (in English). Another example: we want to see if a small chunk of text is a valid protein sequence, so we might want to use a pattern like "a string of characters that contain only the 20 amino acid codes". The great thing about Perl is that once you define a pattern to yourself in English, it's a relatively straightforward translation into Perl syntax. So no need to be intimidated by those lines of cryptic symbols you see in Perl - they all translate directly back to English.

  • Pattern binding operator : You've already seen the operator =~ in some of the previous labs/exercises, but we didn't really point it out. =~ is similar to other operators in Perl (e.g. || for or, eq for string comparison, etc.) except that you use it when you want to ask whether the thing on the left-hand side of the operator matches the pattern on the right-hand side. You will often see this sort of statement:
    if ($someString =~ /somePattern/) {
       do something;
    }
    
    What this means is that if $someString matches somePattern successfully, the stuff inside the parentheses will evaluate to true and you will enter the block containing do something;. If you look back at some of the old labs and exercises, you can see that we used exactly this operator to read a FASTA file and figure out whether the current line starts with a > or not.

    Note that when you match against the default variable $_, both the variable and the =~ operator can be left out. For example,

    if (/somePattern/) {
       do something;
    }
    
    Since you don't have to specifically write out the default variable, you also don't have to write out the =~ operator. Although we want you to recognize it when other people use this syntax, remember that for assignments it's preferred that you avoid the default variable.

    Note also that the negation of =~ is !~. This operator returns true if the string does not match the pattern.
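To see the three forms side by side, here is a small sketch. The FASTA-style lines are made up for illustration, and the variable names are not from any earlier lab.

```perl
use strict;
use warnings;

# Made-up FASTA-style lines, for illustration only.
my @lines = (">seq1 example header", "ACGTACGT", ">seq2 another header", "TTGACA");

my (@headers, @sequence_lines);
foreach my $line (@lines) {
    if ($line =~ /^>/) {          # true when $line starts with ">"
        push @headers, $line;
    }
    if ($line !~ /^>/) {          # true when $line does NOT start with ">"
        push @sequence_lines, $line;
    }
}

# The same header test using the default variable $_ (no =~ written out).
my $implicit_count = 0;
foreach (@lines) {
    $implicit_count++ if /^>/;
}

print "headers: ", scalar(@headers), "\n";
print "sequence lines: ", scalar(@sequence_lines), "\n";
print "headers found via \$_: $implicit_count\n";
```

The explicit and implicit versions find the same header lines; the explicit one just makes it obvious which variable is being tested.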

OK, with those points fresh in your mind, let's have a go at building our first regexp for matching different types of numbers, starting off with decimals. How would you define a pattern for a decimal? How about "some digits, followed by a decimal point, followed by more digits"? That seems like a good start. So now, to translate that into Perl, we need to remember a couple of technical details:

  • character classes : In a regexp, you can define character classes using square brackets [ ], where you put all the characters you want to group into a class inside the brackets. (Fortunately, Perl is smart and won't ever make you type in all 26 letters of the alphabet if you want a class containing all uppercase letters.) A character class basically says "any character within this class is OK for this part of the pattern". You can also put a caret ^ right after the first bracket if you want to say which characters are not OK for this part of the pattern. Here are some examples of character classes:
    • [A-Z] : a class containing all uppercase letters
    • [A-Za-z0-9] : a class containing all upper- and lowercase letters, plus all the numerical digits
    • [ACTG] : a class containing only 4 letters representing the allowed DNA bases
    • [10] : a class containing only 2 digits, probably used to find binary numbers
    • [^AEIOUaeiou] : a class matching any single character that is not a vowel (this includes digits, spaces, and punctuation)
    In fact, for some of the more commonly used classes, Perl provides shortcuts. For example, if you want to match word characters, you can use \w instead of [a-zA-Z0-9_] (note that \w includes the underscore), and to match whitespace characters, you can use \s. Pretty handy!

  • quantifiers : When we defined patterns in English, we used a lot of "some", "one", etc. Well, in Perl, these would be translated into quantifiers. Quantifiers apply to the element preceding the quantifier. The more commonly used quantifiers are:
    • * : match the preceding element 0 or more times
    • + : match the preceding element 1 or more times
    • ? : match the preceding element 0 or 1 time
    • {N} : match the preceding element exactly N times
    (If you're still unsure what quantifiers mean, keep reading on to see them being used in the decimal example and it might make more sense)

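  • metacharacters : A metacharacter is a character (or backslash sequence) that has a special meaning inside a regexp. Most of the ones below stand for a whole class of characters; for example, a single dot (.) matches any character except the newline '\n'. The two exceptions are ^ and $, which match a position rather than a character. Here are a few common metacharacters (note that a backslash metacharacter and its complement differ only by capitalization):
    • . : any character except newline
    • ^ : the beginning of the string
    • $ : the end of the string (or just before a newline at the very end)
    • \w : any word character (letters, digits, and the underscore)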
    • \W : any nonword character
    • \s : whitespace (spaces, tabs, newline)
    • \S : nonwhitespace
    • \d : any digit
    • \D : any nondigit

    For example, here are two ways we can match a 9-digit ZIP code:
      $address =~ /\d\d\d\d\d-\d\d\d\d/;
      or
      $address =~ /\d{5}-\d{4}/;
      

  • escaping special characters : Some non-alphanumeric characters in Perl carry special meanings when they're used in regexps. So, for example, when we want to actually match the decimal point, we need to escape the . by adding a \ in front of it.

  • alternative patterns : To have Perl search for more than one pattern, separate them with a vertical bar (|). For example, we can search for two types of restriction sites in DNA like this:
    $dna =~ /GAATTC|AAGCTT/;
    

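The | here acts like the logical or we've discussed before. At each position in the string, Perl tries the alternatives from left to right and stops at the first one that matches, so if the pattern on the left of the | matches there, Perl won't even bother trying the pattern on the right.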
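The building blocks above (character classes, quantifiers, metacharacters, escaping, and alternatives) can be tried out together in a sketch like the following. The sub name describe and all of the test strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: list which of a few sample patterns a string matches.
sub describe {
    my ($text) = @_;
    my @found;
    push @found, 'DNA bases only'   if $text =~ /^[ACGT]+$/;      # character class + quantifier
    push @found, 'ZIP code'         if $text =~ /\d{5}-\d{4}/;    # \d with {N}
    push @found, 'literal dot'      if $text =~ /\./;             # escaped special character
    push @found, 'restriction site' if $text =~ /GAATTC|AAGCTT/;  # alternative patterns
    return join(', ', @found);
}

foreach my $text ('GATTACA', 'TTGAATTCAA', 'Somewhere, CA 12345-6789', '3.14', 'hello') {
    my $what = describe($text);
    $what = 'nothing' if $what eq '';
    print "'$text' matches: $what\n";
}
```

Notice that one string can match several patterns at once, and that /\./ only matches a real dot, whereas an unescaped /./ would match almost anything.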

Here's how we can translate our decimal pattern (some digits, decimal point, more digits) into Perl:

/[0-9]+\.[0-9]+/  # How could we use a metacharacter here instead?

Now, try incorporating this into your testbed using an if statement and the pattern binding =~ operator. If the pattern matches, you can just do something simple like print out a message saying you just saw a decimal or something. Does this work when the line entered is only a number? If the number is in a sentence? In the middle of a word?

When you try to test out the pattern above, you may notice that it's actually quite limited in what it considers a decimal. For example, ".2" and "1." won't be recognized as decimals. How can you change the regexp to allow these other decimal formats?
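If you want to check your answers, here is a sketch that runs the basic pattern (written with \d instead of [0-9]) against a few inputs, next to one possible broader pattern. The broader pattern is just an assumption about what "other decimal formats" should include, not the official answer, and both sub names are made up.

```perl
use strict;
use warnings;

# The lab's basic pattern, using the \d shortcut in place of [0-9].
sub matches_basic {
    my ($text) = @_;
    return $text =~ /\d+\.\d+/ ? 1 : 0;
}

# One possible broader pattern (an assumption, not the official answer):
# digits, a dot, optional digits  OR  a dot followed by digits.
sub matches_broader {
    my ($text) = @_;
    return $text =~ /\d+\.\d*|\.\d+/ ? 1 : 0;
}

foreach my $text ('3.14', 'pi is 3.14 roughly', 'abc1.5def', '.2', '1.', 'no number') {
    printf "%-22s basic=%d broader=%d\n", "'$text'", matches_basic($text), matches_broader($text);
}
```

Both patterns match a decimal anywhere in the line, even in the middle of a word. The broader pattern will also match a digit followed by a sentence-ending period, so whether it is "better" depends on your data.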

What about other number formats? Try adding regexps to match things like integers, fractions, etc to your testbed...remember, if you don't know where to begin, always start by figuring out a pattern in English and then worry about translating it to Perl as the next step. It's also helpful to have a short cheat sheet like this for regexp syntax. See also Perl regular expression examples.

More regexp practice

Hopefully, by now, you feel more comfortable with writing some simple regexps in Perl. As always, the more you practice, the more you'll figure out how to write regexps. And although we can keep showing you how we would write regexps, it doesn't really help until you try to write them yourself. Writing regexps is a bit like solving puzzles, so it's hard to learn unless you actually do it.

In the spirit of this, how would you match

  • DNA/RNA/protein sequences?
  • time? dates?
  • email addresses? URLs?

Anchors

Perl regexps have two location "anchors": ^ and $. You may remember that ^ inside of the square brackets negates a character class. Outside of the square brackets, it has another use: if you put it at the beginning of the pattern, it means "the string must start with ...". For example, /^T/ will only match strings starting with a capital T. Similarly, the $ can be placed at the end of a pattern, where it matches the end of the string (or just before a newline at the very end of the string). (With this new syntax, how can you modify the decimal pattern to be more specific? For example, with the basic pattern above, the lines '0.678aaa' and 'hello6.57' will actually match. Can you stop that from happening?)
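One way to answer the question above is to anchor the decimal pattern at both ends. The sketch below assumes you want the whole line to be a decimal and nothing else; the sub name is_whole_decimal is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true only when the WHOLE string is a decimal.
sub is_whole_decimal {
    my ($text) = @_;
    return $text =~ /^\d+\.\d+$/ ? 1 : 0;
}

foreach my $text ('6.57', '0.678aaa', 'hello6.57', "6.57\n") {
    (my $shown = $text) =~ s/\n/\\n/;     # make the newline visible when printing
    printf "%-12s %s\n", $shown, is_whole_decimal($text) ? 'match' : 'no match';
}
```

The last test string ends in a newline, as a line read from standard input would, and it still matches because $ is allowed to match just before a final newline.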

Capturing

Being able to match patterns is great but most of the time, we actually want to extract information from the text when a pattern matches. For example, if something matches an email address, maybe we would like to know what the person's username is (i.e. the stuff before the @ symbol).

Let's use a more compbio-esque example to figure out how to retrieve information using regexps. Suppose you are working on some computational biology project that requires you to figure out which organisms have homologous sequences to some gene you discovered. You send the sequence of your gene to your friend, who's an expert at finding homologous sequences but you forgot to mention that all you wanted were the names of the organisms with homologous genes (don't ask why). Your friend sends you the results before going on a month-long vacation and now you're stuck with this huge file containing all the homologous sequences to your gene.

Luckily, you realize that the results are actually in FASTA format and in particular, the lines containing the names of the sequences actually contain the names of the organisms and they all seem to be in a specific format:

    >gi|{id number}|{organism name}

So now you have two choices: one is to sit there and copy down all the names of the organisms in the file (not fun) and the other is to write a Perl script to do it for you. We hope all of you decide to go for the latter.

So the first step is to come up with a pattern to match the format above:

    /^>gi\|[0-9]+\|.+/   # do you understand why each of these symbols is here?

In the above regexp, the part we're interested in, i.e. the organism name, is actually being matched by the .+, so that's the part we want to 'capture'. Capturing is done using round parentheses ( ) inside the regexp:

    /^>gi\|[0-9]+\|(.+)/

What does it mean to capture? It means that Perl will grab the part that matched the .+ and save it to some scalar variable, ready for you to use. These scalar variables are easy to remember because they're called $1, $2, $3, and so on. In other words, the first set of ( ) captures the stuff surrounded by the parentheses into $1, the second set into $2, ...
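As a quick illustration of $1 and $2 together, here is a sketch using the email example mentioned earlier. The address is made up, and the pattern is deliberately simple; it is not meant to be a complete email validator.

```perl
use strict;
use warnings;

# Made-up address, for illustration only.
my $email = 'jdoe@example.org';

my ($user, $domain);
if ($email =~ /^([^\@]+)\@(.+)$/) {
    # Copy the captures right away: the next successful match overwrites $1 and $2.
    ($user, $domain) = ($1, $2);
    print "username: $user\n";
    print "domain:   $domain\n";
}
```

The first set of parentheses grabs everything before the @ into $1, and the second set grabs everything after it into $2. The @ is written as \@ so that Perl doesn't try to treat it as the start of an array name.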

So to print out the list of organism names, we can just use an if block that looks like this inside of a script that reads the file (go ahead and write one now):

    if ($line =~ /^>gi\|[0-9]+\|(.+)/) {
           print "$1\n";
    }
    

To test out this program, you can use this FASTA file. When you do this, you'll notice that there's still some part of the sequence id left (it starts with "ref"). What we want to match instead is anything that comes after the last vertical bar followed by a space. Try this out:

    if ($line =~ /^>gi\|.*\| (.+)/) {
           print "$1\n";
    }
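To see the difference between the two patterns without the real file, you can run both on a single header. The header below is made up to follow the layout described above; it is not a line from the actual FASTA file.

```perl
use strict;
use warnings;

# Made-up header in the layout described above (not from the real file).
my $line = ">gi|12345|ref|XX_000000.1| example receptor [Example organism]";

my ($first_try, $second_try);
if ($line =~ /^>gi\|[0-9]+\|(.+)/) {
    $first_try = $1;     # everything after the id number, "ref|..." included
}
if ($line =~ /^>gi\|.*\| (.+)/) {
    $second_try = $1;    # greedy .* runs up to the LAST "| ", so only the description is left
}
print "first:  $first_try\n";
print "second: $second_try\n";
```

The second pattern works because .* is greedy: it grabs as much as it can while still leaving a vertical bar and a space for the rest of the pattern to match.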
    

Pattern Modifiers

We saw an example of these in homework 1. There are several modifiers which can be added to a pattern after its ending delimiter (/). The two modifiers we focused on are:

  • /PATTERN/i = make pattern be case insensitive
  • /PATTERN/g = globally find all matches. In list context, it will return a list of all matches found.

Note that we can combine the pattern modifiers, e.g. /PATTERN/gi.
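Here is a small sketch of both modifiers in action. The sequence and the coordinate string are made up for illustration.

```perl
use strict;
use warnings;

my $dna = 'atgGAATTCccgaattcTTT';   # made-up sequence

# /i alone: a yes/no test that ignores case.
my $has_site = ($dna =~ /gaattc/i) ? 1 : 0;

# /g in list context returns every match; here it is combined with /i.
my @sites = ($dna =~ /gaattc/gi);

# If the pattern contains parentheses, /g returns the captured pieces instead.
my @digits = ('chr7:1234-5678' =~ /(\d+)/g);

print "has site: $has_site\n";
print "sites found: @sites\n";
print "numbers: @digits\n";
```

Assigning the match to an array is what puts it in list context; in an if condition the same match would just return true or false.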

Other Pattern Operators

Up to now, when we type something like:

    if ($someString =~ /somePattern/) {
       do something;
    }
    

we are actually implicitly calling the pattern match operator, which is designated by m. That is, the above is equivalent to writing:

    if ($someString =~ m/somePattern/) {
       do something;
    }
    

There are a couple of other pattern operators that we've seen before in the lecture notes and exercises.

Substitution operator

We use this operator when we want to replace text that matches a pattern.

    s/Pattern/ReplacementText/
    

For example, here's how we can get rid of the ">" character in the sequence name:

    $sequenceName =~ s/>//; # The replacement text in this case is the null string.
    

We can use the pattern modifiers here too. Here's an example that replaces all occurrences of "ta[acgt]g" in a DNA sequence (in any mix of upper and lower case, since the pattern is case insensitive) with the text "stop".

    $dna =~ s/ta[acgt]g/stop/gi;
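Both substitutions can be tried on sample data as in the sketch below. The header and sequence are made up, and the example restricts the third base to [acgt] so that it matches the description above exactly.

```perl
use strict;
use warnings;

my $sequenceName = '>seq1 made-up header';
$sequenceName =~ s/^>//;            # remove only a leading ">"

my $dna = 'ccTAAGggtatgaa';         # made-up sequence
# With /g, s/// returns the number of substitutions it made.
my $replaced = ($dna =~ s/ta[acgt]g/stop/gi);

print "$sequenceName\n";
print "$dna ($replaced replaced)\n";
```

Remember that s/// changes the variable on the left of =~ in place, so keep a copy first if you still need the original text.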
    

Transliteration operator

We use this operator when we want to exchange one character for another.

    tr/SearchList/ReplacementList/
    

We saw this one in action in the reverse complement exercise. Every character of the search list (A, C, T, G) will get replaced by the corresponding character in the replacement list (T, G, A, C):

    $rev =~ tr/ACTG/TGAC/;
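Put together with reverse, the transliteration gives a reverse complement. The sketch below assumes an uppercase DNA string; the sub name reverse_complement and the sequence are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: reverse complement of an uppercase DNA string.
sub reverse_complement {
    my ($dna) = @_;
    my $rev = reverse $dna;        # reverse the string first (scalar context)
    $rev =~ tr/ACTG/TGAC/;         # then swap A<->T and C<->G, one character at a time
    return $rev;
}

my $seq = 'GAATTCA';               # made-up sequence
my $gc  = ($seq =~ tr/GC//);       # with an empty replacement list, tr just counts
print reverse_complement($seq), "\n";
print "G/C count: $gc\n";
```

Unlike s///, tr does not use regexps at all: it works on a plain list of characters, which is why it is a good fit for base-by-base complementing and counting.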
    

Reference

There's a lot more to regular expressions and pattern matching than we can cover here. For a thorough description of Perl pattern matching, see the chapter in Programming Perl (ISBN:0596000278).

-- AngiChau - 03 Oct 2005

Attachment: receptors6.fasta (13.5 K, 2007-10-09, JoshKittleson)