We lied


Perl does care about whitespace... in REGEX. Inside a regular expression, every space and tab is part of the pattern and has to match literally, unless you add the x modifier.
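Here is a minimal sketch of what that means. The sample string and the sub name are made up for this illustration: a space inside the pattern has to match a space in the string, and the x modifier switches that behaviour off.

```perl
#!/usr/bin/perl

use warnings;
use strict;

# Hypothetical sample string, made up for this illustration.
my $name = "JaimeF";

# Returns two labels: one for the plain pattern, one for the x pattern.
sub whitespace_demo {
    my ($text) = @_;

    # The space in the pattern is literal, so the string must contain one.
    my $plain = ( $text =~ m/Jaime F/ ) ? "match" : "no match";

    # With the x modifier, whitespace in the pattern is ignored.
    my $extended = ( $text =~ m/Jaime F/x ) ? "match" : "no match";

    return ( $plain, $extended );
}

my ( $plain, $extended ) = whitespace_demo($name);
print "without x: $plain\n";
print "with x: $extended\n";
```

Only the second pattern should match "JaimeF", because only there is the space in the pattern ignored.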

The Dot

The dot matches any single character (except a newline).
#!/usr/bin/perl
 
use warnings;
use strict;
 
my $string="Jsf";
if ($string=~m/j.f/i){
    print "Congrats to JF!\n";
}
 
__END__

This is especially powerful when combined with the quantifier +, which means one or more times. To illustrate the difference, let's look at this script, which won't match:

 #!/usr/bin/perl
 
 use warnings;
 use strict;
 
 my $string = "Jaimesf";
 if ( $string =~ m/j.f/i ) {
     print "you caught me\n";
 }
 else {
     print "I got away!";
 }
 
__END__
But adding the + symbol allows us to catch me:
 #!/usr/bin/perl
 
 use warnings;
 use strict;
 
 my $string = "Jaimesf";
 
 if ( $string =~ m/j.+f/i ) {
    print "you caught me\n";
 }
 else {
     print "I got away!";
 }
 
__END__
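To see the two patterns side by side, here is a small sketch that runs both against a few test strings. The string "Jf" and the sub name classify are additions for illustration only. Note that both patterns need at least one character between the j and the f.

```perl
#!/usr/bin/perl

use warnings;
use strict;

# "Jsf" and "Jaimesf" come from the scripts above; "Jf" is a made-up extra case.
my @names = ( "Jsf", "Jaimesf", "Jf" );

# Reports whether a name matches j.f (exactly one character in between)
# and j.+f (one or more characters in between).
sub classify {
    my ($name) = @_;
    my $one  = ( $name =~ m/j.f/i )  ? "yes" : "no";
    my $many = ( $name =~ m/j.+f/i ) ? "yes" : "no";
    return "$name: j.f=$one j.+f=$many";
}

print classify($_), "\n" for @names;
```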
There are a few other shortcuts that can make your life easier as well:
  • [0-9] matches any digit
  • [a-z] matches any lowercase letter
  • [A-Z] matches any uppercase letter (if you couldn't figure this out, you're in trouble)
  • "\t" matches a tab
  • "\s" matches any whitespace
  • "\S" matches any non-whitespace
  • "\d" matches a digit [0123456789]
  • "\D" matches a non-digit [^0123456789]
  • "\w" matches a word character (includes letters, digits, and underscore)
  • "\W" matches a non-word character
  • "\s+" matches a run of whitespace, very useful!
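As a rough sketch of how these shortcuts combine, the script below splits a line on runs of whitespace and checks each piece for digits and non-word characters. The sample line and the sub name describe_fields are invented for this example.

```perl
#!/usr/bin/perl

use warnings;
use strict;

# Hypothetical line: three fields separated by spaces and a tab.
my $line = "gene_1   42\tAC-GT";

# Splits on runs of whitespace (\s+) and reports, for each field,
# whether it contains a digit (\d) or a non-word character (\W).
sub describe_fields {
    my ($text) = @_;
    my @report;
    for my $field ( split /\s+/, $text ) {
        my $has_digit    = ( $field =~ m/\d/ ) ? "yes" : "no";
        my $has_non_word = ( $field =~ m/\W/ ) ? "yes" : "no";
        push @report, "$field: digit=$has_digit non-word=$has_non_word";
    }
    return @report;
}

print "$_\n" for describe_fields($line);
```

The underscore in gene_1 counts as a word character, while the hyphen in AC-GT does not.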

Anchoring, or FASTA files made easy

Until this point, we have been using a very hacky way of testing to see if we are in a header line. Wouldn't it be nice to use a match statement? I will now introduce the concept of Anchors, which allow you to match specifically the beginning or end of a line.

  • "^" will require a pattern to start at the beginning of the line (don't confuse with a caret inside a character class [^...])
  • "$" will require a pattern to finish at the end of the line

It's important to keep the two uses of the circumflex apart: at the start of a character class it negates the class, while at the start of a pattern it anchors the match to the beginning of the line. Let's illustrate this point.
 #!/usr/bin/perl
 
 use warnings;
 use strict;
 
 my $string = "Jaime is my name. I was raised in the third-roughest neighbourhood in Toronto.";
 if ( $string =~ m/neighbour/i ) {
     print "high park, yo!\n";
 }
 else {
     print "you have no street cred\n";
 }
 
__END__
But if I add a circumflex, then it looks for matches at the beginning of the line.
 #!/usr/bin/perl
 
 use warnings;
 use strict;
 
 my $string = "Jaime is my name. I was raised in the third-roughest neighbourhood in Toronto.";
 if ( $string =~ m/^neighbour/i ) {
     print "high park, yo!\n";
 }
 else {
     print "you have no street cred\n";
 }
 
__END__
And suddenly I've lost my street cred. I can match at the beginning of the line with the circumflex and the letter J.
 #!/usr/bin/perl
 
 use warnings;
 use strict;
 
 my $string = "Jaime is my name. I was raised in the third-roughest neighbourhood in Toronto.";
 if ( $string =~ m/^J/i ) {
     print "high park, yo!\n";
 }
 else {
     print "you have no street cred\n";
 }
 
__END__
But I will not match if I negate the J inside a character class:
 #!/usr/bin/perl
 
 use warnings;
 use strict;
 
 my $string = "Jaime is my name. I was raised in the third-roughest neighbourhood in Toronto.";
 if ( $string =~ m/^[^J]/i ) {
     print "high park, yo!\n";
 }
 else {
     print "you have no street cred\n";
 }
 
__END__
But it will if I look for a line that starts with a non-digit:
 #!/usr/bin/perl
 
 use warnings;
 use strict;
 
 my $string = "Jaime is my name. I was raised in the third-roughest neighbourhood in Toronto.";
 if ( $string =~ m/^[^0-9]/i ) {
     print "high park, yo!\n";
 }
 else {
     print "you have no street cred\n";
 }
 
__END__
You can do similar things with the end of a line and the dollar sign (which is separate from the $ used in variable names).
 #!/usr/bin/perl
 
 use warnings;
 use strict;
 
 my $string = "Jaime is my name";
 if ( $string =~ m/me$/i ) {
     print "myself, and I!\n";
 }
 else {
     print "you\n";
 }
 
__END__
This is obviously very powerful and provides a much better route to finding a header than substrings.
 #!/usr/bin/perl
 
 use strict;
 use warnings;
 
 my $file_line = ">the greatest protein ever";
 
 if ($file_line=~m/^>/){
      print "we're in a header line\n";
 }
__END__
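In a real file you would run that test on every line. Here is a minimal sketch, assuming the lines have already been read into an array; the records and the sub name count_headers are made up for illustration.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Hypothetical FASTA content, already split into lines.
my @lines = (
    ">the greatest protein ever",
    "ACDEFGHIK",
    "LMNPQRSTV",
    ">another protein",
    "WYACD",
);

# Counts the lines that start with ">", i.e. the header lines.
sub count_headers {
    my @file_lines = @_;
    my $headers = 0;
    for my $file_line (@file_lines) {
        $headers++ if $file_line =~ m/^>/;
    }
    return $headers;
}

print "found ", count_headers(@lines), " header lines\n";
```

Because of the anchor, a ">" that shows up in the middle of a line is not counted as a header.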

CAPTURING MATCHES

You can capture any part of your match that is contained in parentheses "()"; Perl stores it in the special variable $1.
#!/usr/bin/perl
 
use warnings;
use strict;
 
my $fasta_line = "ACRTJGYQSFHPF";
 
if ($fasta_line =~ m/([^ACDEFGHIKLMNPQRSTVWY])/i){
    print "you have a bad character in your sequence\n";
    print "the bad character is $1\n";
}
else {
    print "your sequence is good ol' protein\n";
}
__END__
And you can use this syntax to capture several different parts of a match; each pair of parentheses gets its own variable ($1, $2, and so on).
#!/usr/bin/perl
 
use warnings;
use strict;
 
my $fasta_line = "ACRTJGYQSFHPF";
 
if ($fasta_line =~ m/([^ACDEFGHIKLMNPQRSTVWY])G(Y.+F)/i){
    print "you have a bad character in your sequence\n";
    print "the bad character is $1\n";
    print "the second match is $2\n";
}
else {
    print "your sequence is good ol' protein\n";
}
__END__
But why does Perl include everything out to the second F? Shouldn't it find the first run of Y to F and then return true? It doesn't, because Perl is greedy!

GREED

Perl quantifiers are greedy: they will always try to grab as much as possible while still allowing the match to succeed. Always. This can be annoying or advantageous. To prevent it, put a "?" after the quantifier (for example .+? instead of .+).
#!/usr/bin/perl
 
use warnings;
use strict;
 
my $fasta_line = "ACRTJGYQSFHPF";
 
if ($fasta_line =~ m/([^ACDEFGHIKLMNPQRSTVWY])G(Y.+?F)/i){
    print "you have a bad character in your sequence\n";
    print "the bad character is $1\n";
    print "the second match is $2\n";
}
else {
    print "your sequence is good ol' protein\n";
}
__END__
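To compare the two behaviours directly, here is a small sketch that runs a greedy and a non-greedy version of the Y-to-F pattern on the same sequence. The sub name greedy_and_lazy is made up for this example.

```perl
#!/usr/bin/perl

use warnings;
use strict;

my $fasta_line = "ACRTJGYQSFHPF";

# Returns what (Y.+F) and (Y.+?F) capture from the same sequence.
sub greedy_and_lazy {
    my ($seq) = @_;

    # Greedy: .+ runs as far as it can, so the match ends at the last F.
    my $greedy = ( $seq =~ m/(Y.+F)/i ) ? $1 : "";

    # Non-greedy: .+? stops as soon as it can, so the match ends at the first F.
    my $lazy = ( $seq =~ m/(Y.+?F)/i ) ? $1 : "";

    return ( $greedy, $lazy );
}

my ( $greedy, $lazy ) = greedy_and_lazy($fasta_line);
print "greedy match: $greedy\n";
print "non-greedy match: $lazy\n";
```

For this sequence the greedy pattern captures YQSFHPF and the non-greedy one captures YQSF.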

WHILE GLOBALLY MATCHING

What if you want to pull out all of the instances of your match in a line? To do this, you use a slightly different kind of matching, global matching (the g modifier), in a while loop, so that it will keep matching until the match no longer returns true.
#!/usr/bin/perl
 
use warnings;
use strict;
 
my $string="lt - what a beautiful set of initials. Lt, our Russian egomaniac. lT,LT - my hero, let us raise our vodka glasses to the sky!";
 
while ($string=~m/(lt)/ig){
   print "Found my favorite initials:'$1'\n";
}
__END__
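The same loop can also count the matches or record where they are. The sketch below is one way to do that, using Perl's built-in pos function, which gives the offset where the last global match ended. The sub name match_positions is made up for this example.

```perl
#!/usr/bin/perl

use warnings;
use strict;

my $string="lt - what a beautiful set of initials. Lt, our Russian egomaniac. lT,LT - my hero, let us raise our vodka glasses to the sky!";

# Returns the start offset of every case-insensitive "lt" in the text.
sub match_positions {
    my ($text) = @_;
    my @starts;
    while ( $text =~ m/(lt)/ig ) {
        # pos() is the offset just past the match, so step back by its length.
        push @starts, pos($text) - length($1);
    }
    return @starts;
}

my @positions = match_positions($string);
print "found ", scalar(@positions), " matches\n";
print "match starts at offset $_\n" for @positions;
```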

SUBSTITUTING

Perl is also like Microsoft Word find and replace on steroids. Let's say I'm not happy with all of the praise that Lenny is getting in the last statement and I want to replace it with my initials.
#!/usr/bin/perl
 
use warnings;
use strict;
 
my $string="lt - what a beautiful set of initials. Lt, our Russian egomaniac. lT,LT - my hero, let us raise our vodka glasses to the sky!";
 
$string =~ s/lt/jf/i;
$string =~ s/Russ.+?\s/Canadian /;
$string =~ s/vodka/rye whiskey/;
print $string;
__END__
We can also make these substitutions global, so that they properly replace all of Lenny's initials with mine.
#!/usr/bin/perl
 
use warnings;
use strict;
 
my $string="lt - what a beautiful set of initials. Lt, our Russian egomaniac. lT,LT - my hero, let us raise our vodka glasses to the sky!";
 
$string =~ s/lt/jf/ig;
$string =~ s/Russ.+?\s/Canadian /;
$string =~ s/vodka/rye whiskey/;
print $string;
__END__
Of course it's perfectly legal to use variables in these statements, but it can be very confusing if you are at all unsure of what the variables might contain.
#!/usr/bin/perl
 
use warnings;
use strict;
 
my $string="lt - what a beautiful set of initials. Lt, our Russian egomaniac. lT,LT - my hero, let us raise our vodka glasses to the sky!";
my $lenny = "lt";
my $jaime = "jf";
 
$string =~ s/$lenny/$jaime/ig;
$string =~ s/Russ.+?\s/Canadian /;
$string =~ s/vodka/rye whiskey/;
print $string;
__END__
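One place where this bites: if the variable contains regex characters such as the dot, they keep their special meaning inside the pattern. A minimal sketch, with made-up strings and a made-up sub name, showing how \Q and \E make Perl treat the variable's contents literally:

```perl
#!/usr/bin/perl

use warnings;
use strict;

# Hypothetical text and search term, for illustration only.
my $string = "version 3.50 and version 3x50";
my $needle = "3.50";

# Counts matches of $needle in $text. If $literal is true, the needle is
# wrapped in \Q...\E so that the dot only matches a real dot.
sub count_hits {
    my ( $text, $needle, $literal ) = @_;
    my $count = 0;
    if ($literal) {
        $count++ while $text =~ m/\Q$needle\E/g;
    }
    else {
        $count++ while $text =~ m/$needle/g;
    }
    return $count;
}

print "as a pattern: ", count_hits( $string, $needle, 0 ), " hits\n";
print "as literal text: ", count_hits( $string, $needle, 1 ), " hits\n";
```

Without the quoting, the dot in "3.50" also matches the x in "3x50".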
And importantly, you can use captured matches in the replacement, so that you put back a modified version of what you captured. Here \U converts the capture to upper case:
#!/usr/bin/perl
 
use warnings;
use strict;
 
my $string="acgwttaczgg";
 
$string=~ s/([^acgt])/\U$1/g;
print "$string\n";
__END__

It's very important to test your REGEX with sample input and to comment the heck out of it!
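One way to comment a pattern is the x modifier, which lets you spread it over several lines and put a comment next to each piece. The sketch below assumes a header of the form ">ID description"; that layout, the sample header, and the sub name parse_header are assumptions made for this example.

```perl
#!/usr/bin/perl

use warnings;
use strict;

# Hypothetical header line, for illustration only.
my $file_line = ">prot1 the greatest protein ever";

# Splits a header into ID and description. Returns an empty list
# if the line does not look like a header.
sub parse_header {
    my ($line) = @_;
    if ( $line =~ m/
            ^>        # a header line starts with a greater-than sign
            (\S+)     # capture 1: the ID, a run of non-whitespace
            \s+       # a run of whitespace
            (.+)      # capture 2: the rest of the line, the description
            $         # end of the line
        /x ) {
        return ( $1, $2 );
    }
    return;
}

my ( $id, $description ) = parse_header($file_line);
print "ID: $id\n";
print "description: $description\n";
```

Keep slashes and dollar signs out of such comments: a slash would end the pattern early, and a dollar sign followed by a name would be treated as a variable.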


Exercises

Problem1: peptide_checker.pl

  • Rewrite the script for problem 1 of 3.1 so that you print out the offending character.
  • Modify the script so that you print out all possible offending characters (think of two or more in a line)
  • Modify the script so that if offending characters are adjacent "acggQQta" you will print out both in one step "QQ"
  • Modify the script so that you print out at most 3 flanking amino acids around the offending character, in addition to the character itself.
  • For example, with the input "MEETTHEZKING", it should output "THEZKIN"
  • Modify the program to print out the flanking 3 amino acids in lower case, print the character itself in upper case, and put a space between the flanks and the illegal character.
  • For "MEETTHEZKING" the output should be "the Z kin"


Problem2: EcoRI finder

  • Rewrite problem 3 of 3.1 so that you find all occurrences of the EcoRI site.
  • Count the total number of occurrences, in each sequence of
  • Output the sequence header, followed by total number of EcoRI sites in that sequence.


Problem3: all possible numbers

  • Rewrite "string or number" exercise from 2.2 so that it detects all possible integers or numbers with decimals.
  • Use regular expressions - not loops to cycle through the string.
  • Test your program on this file:
  • The only strings in that file are "Lenny", "Dan", and "James".

Problem4: reverse complement

  • Can you write the reverse complement program (Problem 6 from 2.2, AAGTC => GACTT) with substitutions?
  • Why not?
  • (You sort of can, but in very convoluted, ugly ways.)
  • If you want to implement reverse-complement the way a Perl bioinformatician would, look up tr (perldoc -f tr).

Problem5: reformat_fasta_to_codon.pl

  • Rewrite fasta_to_codon.pl from 3.0 to use substitutions.

Go back and finish the matching exercises to master REGEX.

SOLUTIONS