Matching & Regular Expressions


Perl = Practical Extraction and Report Language


Almost all of bioinformatics is text processing with a little bit of math mixed in. While many other languages have Perl beat in terms of speed in number crunching, Perl is hands-down the clear choice for text processing. This is largely due to Perl's pattern matching and extraction capabilities. Thus most bioinformatics is done in Perl.

Matching


Some Vocabulary


If you are trying to match part of a string using a pattern, that pattern is called a regular expression. Regular expression = pattern.

Perl is very fast at finding matches to regular expressions. It does this using what is called its regular expression engine. Engine = mechanism by which Perl finds a match.

We won't really talk about the engine but we do need to know how to write regular expressions properly so the engine can use them to find what you are looking for.

Basic Syntax


The syntax for matching is m/REGEX/ where REGEX is your regular expression.

You're probably thinking that something is missing. What is this regular expression being matched against?

So the complete matching syntax is:

$target_string =~ m/REGEX/;

This still might feel incomplete. You have a target string and you are trying to match your regular expression to it. How do you know if it did in fact match?

You know if a regular expression successfully matched if the whole statement above evaluates to TRUE. Likewise if there isn't a match it evaluates to FALSE. Thus you usually use the matching syntax in the context of control structures like "if" statements and loops.

Let's look at some examples to understand the syntax of the actual regular expression pattern.

#!/usr/bin/perl
use strict;
use warnings;
 
# make a target string
my $target = 'Dan Jaime Lenny';
 
# use an if-else statement to see if something matches the target string
# match Jaime
if ($target =~ m/Jaime/) {
    print "we got a match!\n";
}
else {
    print "no match bud.\n";
}
 
__END__

At first we set up a target string that is just your instructors' names listed out with spaces separating them. We can then use matching syntax inside an "if" statement to see if a regular expression matches the target. In this case the regular expression is "Jaime". Those exact characters do appear in that order in the target so this evaluates to TRUE and the print statement says "we got a match!".

Notice I said "those exact characters in that order". What if the regular expression was "Danny"? Those characters are present in that order.

#!/usr/bin/perl
use strict;
use warnings;
 
# make a target string
my $target = 'Dan Jaime Lenny';
 
# use an if-else statement to see if something matches the target string
# match Danny
if ($target =~ m/Danny/) {
    print "we got a match!\n";
}
else {
    print "no match bud.\n";
}
 
__END__

So that didn't match. Contiguous characters in the regular expression must be contiguous in the target.

What about trying to match "len"?

#!/usr/bin/perl
use strict;
use warnings;
 
# make a target string
my $target = 'Dan Jaime Lenny';
 
# use an if-else statement to see if something matches the target string
# match len
if ($target =~ m/len/) {
    print "we got a match!\n";
}
else {
    print "no match bud.\n";
}
 
__END__

Pattern Modifiers


This also doesn't match. Regular expressions are case-sensitive. This restriction can be lifted using the pattern modifier "//i".

#!/usr/bin/perl
use strict;
use warnings;
 
# make a target string
my $target = 'Dan Jaime Lenny';
 
# use an if-else statement to see if something matches the target string
# match len regardless of case
if ($target =~ m/len/i) {
    print "we got a match!\n";
}
else {
    print "no match bud.\n";
}
 
__END__

This did match, now that we used the "i" pattern modifier.

Storing Patterns


Patterns can actually be stored in a variable and used multiple times.

#!/usr/bin/perl
use strict;
use warnings;
 
# make two target strings
my $target1 = 'Dan Jaime Lenny';
my $target2 = 'Billy Jody Ksenia';
 
# make a query string
# match n
my $query = 'n';
 
# use an if-else statement to see if something matches target1
if ($target1 =~ m/$query/) {
    print "we got a match in target1! $target1\n";
}
else {
    print "no match in target1 bud.\n";
}
 
# use an if-else statement to see if something matches target2
if ($target2 =~ m/$query/) {
    print "we got a match in target2! $target2\n";
}
else {
    print "no match in target2 bud.\n";
}
 
__END__

Now we have two targets - instructors and teaching assistants. We set the variable $query equal to "n" and then check both of the targets to see if they match it. Both target strings contain at least one n so both evaluate to TRUE.
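
As a hedged aside, Perl also has a qr// operator that stores a pattern together with its modifiers, which a plain string cannot do. The sketch below is not part of the original lecture script; it reuses the two target strings from above, and the counter $matches is a name made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# make two target strings
my $target1 = 'Dan Jaime Lenny';
my $target2 = 'Billy Jody Ksenia';

# store the pattern with qr// so the i modifier travels with it
my $query = qr/len/i;

# check each target against the stored pattern and count the matches
my $matches = 0;
foreach my $target ($target1, $target2) {
    if ($target =~ m/$query/) {
        print "we got a match! $target\n";
        $matches++;
    }
    else {
        print "no match bud. $target\n";
    }
}

print "number of targets that matched: $matches\n";
```

Only the first target contains "len" in any mix of upper and lower case, so only it matches.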

Logic In Patterns


What if we want to be less specific and match one thing or another? You don't use "or"; instead you use the regular expression syntax for "or", which is "|" (pipe).

#!/usr/bin/perl
use strict;
use warnings;
 
# make two target strings
my $target1 = 'Dan Jaime Lenny';
my $target2 = 'Billy Jody Ksenia';
 
# make a query string
# match Dan or Ksenia
my $query = 'Dan|Ksenia';
 
# use an if-else statement to see if something matches target1
if ($target1 =~ m/$query/) {
    print "we got a match in target1! $target1\n";
}
else {
    print "no match in target1 bud.\n";
}
 
# use an if-else statement to see if something matches target2
if ($target2 =~ m/$query/) {
    print "we got a match in target2! $target2\n";
}
else {
    print "no match in target2 bud.\n";
}
 
__END__

So here, either everything to the left of the pipe has to match or everything to the right of the pipe has to match. You can have as many pipes in the regular expression as you want.
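
A minimal sketch of a pattern with more than one pipe, using the same two targets. "Bob" is a made-up name that appears in neither target, and the array @matched is only there for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# make two target strings
my $target1 = 'Dan Jaime Lenny';
my $target2 = 'Billy Jody Ksenia';

# make a query string with two pipes
# match Lenny or Jody or Bob
my $query = 'Lenny|Jody|Bob';

# check each target and remember the ones that matched
my @matched;
foreach my $target ($target1, $target2) {
    if ($target =~ m/$query/) {
        print "we got a match! $target\n";
        push @matched, $target;
    }
    else {
        print "no match bud. $target\n";
    }
}
```

Each target only needs to satisfy one of the three alternatives: the first target matches because of "Lenny" and the second because of "Jody".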

File Line Matching


Matching regular expressions against sequences in a file is a common task in bioinformatics.

Say you have a file called sequences.txt with the following sequences in it:

GGACATCATTTC
GGACATTTC
GGATTC

You could use the following Perl script to search for matches to the motif "ACATC".

#!/usr/bin/perl
use strict;
use warnings;
 
# declare the input file name
my $file = 'sequences.txt';
 
# open filehandle to file
open (my $fh, '<', $file) or die "can't open $file";
 
# loop through file looking for lines that match my regex
while (my $line = <$fh>) {
    chomp($line);
    print $line;
    # match ACATC
    if ($line =~ m/ACATC/) {
        print " matched\n";
    }
    else {
        print " no match\n";
    }
}
 
__END__

The script opens a filehandle to the file or dies if it can't. It then loops through each line in the file, chomps the line, prints it and then using an if statement checks to see if "ACATC" matches it or not. In this case, "ACATC" only matches the first sequence in the file.

Quantifiers & Grouping


Thus far, all of our regular expressions have explicitly written out a string to match. You can write more encoded regular expressions though. For the rest of today's lecture we'll get into some kinds of encoded regular expressions and then tomorrow we'll get into it much deeper.

So what if you wanted to control how many times a pattern is matched? For starters, let's consider matching something one or more times. To do this you use the "+" quantifier.

#!/usr/bin/perl
use strict;
use warnings;
 
# declare the input file name
my $file = 'sequences.txt';
 
# open filehandle to file
open (my $fh, '<', $file) or die "can't open $file";
 
# loop through file looking for lines that match my regex
while (my $line = <$fh>) {
    chomp($line);
    print $line;
    # match ACA then T one or more times and then C
    if ($line =~ m/ACAT+C/) {
        print " matched\n";
    }
    else {
        print " no match\n";
    }
}
 
__END__

This script is identical to the above script except now there is a plus sign after the "T" in the regular expression. That means match a "T" one or more times. The result is that now both the first and the second sequence match. Even though the second sequence does not contain "ACATC", it does contain "ACATTTC", which this regular expression can also match.

Another important thing to know is that quantifiers can operate on more than one character at a time. To do this group the characters together using parentheses "()". The next script demonstrates this using another quantifier, the "*" quantifier, which means match zero or more times.

#!/usr/bin/perl
use strict;
use warnings;
 
# declare the input file name
my $file = 'sequences.txt';
 
# open filehandle to file
open (my $fh, '<', $file) or die "can't open $file";
 
# loop through file looking for lines that match my regex
while (my $line = <$fh>) {
    chomp($line);
    print $line;
    # match ACAT then CA zero or more times and then T
    if ($line =~ m/ACAT(CA)*T/) {
        print " matched\n";
    }
    else {
        print " no match\n";
    }
}
 
__END__

So this regular expression matches "ACAT", then "CA" zero or more times, and then "T". This matches both the first and second sequences. The first sequence has the whole motif "ACATCAT" in it so that obviously matches. The second sequence does not have this motif in it but it does have "ACATT". This has the "ACAT" in it. And it has the "T" in it. But no "CA". That's OK because the "*" quantifier indicated that we can match "(CA)", grouped in parentheses, zero or more times.

Here's a script to show a second example of grouping with yet another quantifier, the "?" quantifier, which means match zero times or once.

#!/usr/bin/perl
use strict;
use warnings;
 
# declare the input file name
my $file = 'sequences.txt';
 
# open filehandle to file
open (my $fh, '<', $file) or die "can't open $file";
 
# loop through file looking for lines that match my regex
while (my $line = <$fh>) {
    chomp($line);
    print $line;
    # match GGA then CAT zero or one time and then TTC
    if ($line =~ m/GGA(CAT)?TTC/) {
        print " matched\n";
    }
    else {
        print " no match\n";
    }
}
 
__END__

This regular expression matches "GGA", then "CAT" either zero times or once, and then "TTC". The second and third sequences match this pattern but the first sequence does not. Why? Even though it has the right start and end, it has a repeat of "CAT", so the zero times or once quantifier won't match it.

OK, so we have three flavors of quantifiers but what about matching exactly two times? This can be done using curly braces, "{}".

#!/usr/bin/perl
use strict;
use warnings;
 
# declare the input file name
my $file = 'sequences.txt';
 
# open filehandle to file
open (my $fh, '<', $file) or die "can't open $file";
 
# loop through file looking for lines that match my regex
while (my $line = <$fh>) {
    chomp($line);
    print $line;
    # match GGA then CAT exactly twice and then TTC
    if ($line =~ m/GGA(CAT){2}TTC/) {
        print " matched\n";
    }
    else {
        print " no match\n";
    }
}
 
__END__

This script is identical to the one directly above except that the "?" quantifier has been replaced by "{2}", which means match exactly twice. Only the first sequence has the "CAT" repeat so only it matches.

Character Classes


In addition to quantifiers, another way to encode a regular expression is to replace specific characters with classes of characters. You use the square brackets, "[]", to create a character class and you just list out the characters you want to include in the class, without any form of separation.

#!/usr/bin/perl
use strict;
use warnings;
 
# declare the input file name
my $file = 'sequences.txt';
 
# open filehandle to file
open (my $fh, '<', $file) or die "can't open $file";
 
# loop through file looking for lines that match my regex
while (my $line = <$fh>) {
    chomp($line);
    print $line;
    # match GG then characters from the class comprised
    # of A and T one or more times and then C
    if ($line =~ m/GG[AT]+C/) {
        print " matched\n";
    }
    else {
        print " no match\n";
    }
}
 
__END__

So this regular expression is looking to match "GG", then one or more characters from the class comprised of As and Ts, and then "C". All three sequences contain a match. In the first two sequences, "GGAC" matches, while in the third sequence, the whole thing is the match.

You can put as many characters as you want in a class. Sometimes, it's easier to list out which characters you don't want to match, rather than which characters you do want to match. This can be done with the class modifier, "[^]".

#!/usr/bin/perl
use strict;
use warnings;
 
# declare the input file name
my $file = 'sequences.txt';
 
# open filehandle to file
open (my $fh, '<', $file) or die "can't open $file";
 
# loop through file looking for lines that match my regex
while (my $line = <$fh>) {
    chomp($line);
    print $line;
    # match C then characters from the class comprised
    # of anything but C or G exactly twice and then C
    if ($line =~ m/C[^CG]{2}C/) {
        print " matched\n";
    }
    else {
        print " no match\n";
    }
}
 
__END__

This regular expression matches "C", then exactly two characters from the class comprised of all characters besides "C" or "G", and then "C". This only matches the first sequence, because only it has two Cs separated by exactly two characters that are neither "C" nor "G". The match was "CATC".

Metacharacters & Escaping


One last thing. We've introduced a bunch of characters that help you encode regular expressions. These special characters are called "metacharacters". The full list of metacharacters is:

\ | ( ) [ { ^ $ * + ? .

We haven't talked about the meaning of all of these, but you should know that if you put any of these in your regular expression, it will be treated as special syntax rather than matching that literal character.

So how do you match a metacharacter? Escaping. Placing a backslash "\" in front of any metacharacter (including itself) makes it a regular character.

#!/usr/bin/perl
use strict;
use warnings;
 
# make a target string
my $target = 'catch a falling *';
 
# use an if-else statement to see if something matches the target string
# match *
if ($target =~ m/\*/) {
    print "we got a match!\n";
}
else {
    print "no match bud.\n";
}
 
__END__
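
To see why escaping matters, here is a small sketch using the dot from the metacharacter list above. Unescaped, a dot matches any single character other than a newline. The target string and the variable names are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# make a target string that contains no literal dot
my $target = 'version 345';

# unescaped dot: stands for any single character, so 3.5 matches "345"
my $unescaped = ($target =~ m/3.5/) ? 1 : 0;

# escaped dot: only a literal "." will do, and the target has none
my $escaped = ($target =~ m/3\.5/) ? 1 : 0;

print "unescaped dot matched: $unescaped\n";
print "escaped dot matched: $escaped\n";
```

The first pattern matches and the second does not, which is usually what you want when you are really looking for the text "3.5".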

Review


OK, to review and to introduce a few more bits of syntax, here's a table for fast referencing:

m/A/i = match A case insensitive
m/A|B/ = match A or B
m/(AB)/ = match AB grouped together
m/A+/ = match A 1 or more times
m/A*/ = match A 0 or more times
m/A?/ = match A 0 or 1 time
m/A{COUNT}/ = match A exactly COUNT times
m/A{MIN,}/ = match A MIN or more times
m/A{MIN,MAX}/ = match A at least MIN times but no more than MAX times
m/[AB]/ = match 1 character from class comprised of A and B
m/[^AB]/ = match 1 character from class comprised of all characters except A or B
m/\META/ = match the metacharacter META
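
The {MIN,} and {MIN,MAX} forms in the table were not shown in a script above, so here is a minimal sketch. It assumes the same three sequences as sequences.txt, but keeps them in an array so the script runs without the file. The names @sequences, @results, $short_run and $long_run are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# the three sequences from sequences.txt, held in an array
my @sequences = ('GGACATCATTTC', 'GGACATTTC', 'GGATTC');

# collect one "short/long" result per sequence
my @results;
foreach my $seq (@sequences) {
    # match A, then T one or two times, and then C
    my $short_run = ($seq =~ m/AT{1,2}C/) ? 'yes' : 'no';

    # match A, then T three or more times, and then C
    my $long_run = ($seq =~ m/AT{3,}C/) ? 'yes' : 'no';

    print "$seq short run: $short_run long run: $long_run\n";
    push @results, "$short_run/$long_run";
}
```

The first sequence contains both "ATC" and "ATTTC", so both patterns match it. The second only has "ATTTC", so only the three-or-more pattern matches, and the third only has "ATTC", so only the one-or-two pattern matches.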

One last crazy example to demonstrate how what you've learned can be combined:

# match one character from the class comprised of G and T,
# then 300 As,
# then one character from the class comprised of all characters except C,
# all of that twice in a row and regardless of case
m/([GT]A{300}[^C]){2}/i
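
A scaled-down sketch of the same idea, with 3 As in place of 300 so the targets fit on one line. The two target strings and the array @matched are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# the first target has the unit twice in a row: "gaaat" then "TAAAg"
my $hit = 'ccgaaatTAAAgcc';

# the second target has a "c" where the second unit needs a non-C character
my $miss = 'ccgaaatTAAAccc';

# check each target and remember the ones that matched
my @matched;
foreach my $target ($hit, $miss) {
    # match G or T, then 3 As, then anything but C, all of that twice
    # in a row and regardless of case
    if ($target =~ m/([GT]A{3}[^C]){2}/i) {
        print "we got a match! $target\n";
        push @matched, $target;
    }
    else {
        print "no match bud. $target\n";
    }
}
```

Note that with the "i" modifier the class [^C] rejects a lowercase "c" as well, which is why the second target fails.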

Exercises


The following two links will be needed for these exercises:
Amino Acids
Amino Acid Properties

Problem1: Clean Fasta Checker

  • Write a Perl script that checks if a fasta file has non-amino-acid characters (see link above for single character amino-acid codes). Have it print a statement indicating the result of the search.
  • Try running your script on the following two files. One of the files is clean and the other has non-amino-acid characters. Did your script figure it out?



Problem2: Transmembrane Protein Scanner

  • Write a Perl script to identify if a peptide fasta file contains the sequence for a transmembrane protein.
  • Make the script ask for a peptide fasta file from the user, scan the peptide sequences for a hydrophobic pass and then print out if the file did or did not contain a hydrophobic pass. A hydrophobic pass is seven hydrophobic characters in a row (see link above for amino-acid properties).



Problem3: Restriction Site Mapper

  • Write a Perl script that finds EcoRI sites (GAATTC) in nucleotide fasta files.
  • Use the nucleotide fasta file below.
  • Make the first version of the script print out every line that has an EcoRI site.
  • Make the second version of the script print out how many lines contain EcoRI sites.
  • Make the third version of the script print out how many lines contain EcoRI sites or HindIII sites (AAGCTT).


Problem4: Codon Mapper

  • Write a Perl script that finds glutamate codons (GAA and GAG) in nucleotide fasta files.
  • Use the nucleotide fasta file from problem 3.
  • Make the first version of the script print out "Holy glutamate batman!" and the sequence of a line if it contains a glutamate codon.
  • Make the second version of the script count the number of lines containing each of the two different glutamate codons and then print out the total counts for each.

Problem5: Kinase Recognition Site Mapper

  • Write a Perl script that finds kinase recognition sites. The recognition site is an acidic amino acid, followed by 2, 3, or 4 threonines, followed by at least 1 hydrophobic residue.
  • Have your script simply test if a file contains a recognition site.
  • Run it on the following peptide fasta files.



Problem6: Repeat Finder

  • Write a Perl script that checks for a specific repeated sequence in a fasta file.
  • The repeat is any two acidic amino acids, followed by two residues with a terminal alcohol on the sidechain (Ser, Thr, or Tyr). The program should print "I found the repeat!" if it finds the sequence 3 or more times in a row.
  • Test your script on the following fasta files.



Bonus: Offending Sequence

  • Rewrite your Clean Fasta Checker from Problem 1 so that rather than reporting if a fasta file contains non-amino-acid characters it prints the fasta header for the sequence that contained the bad character.