Strings & Looping Through User Input and Files


Larry Wall's Three Virtues of Programming: Laziness, Impatience & Hubris


Jaime introduced hubris through the organization of your code into pseudo-code comments ("Hey, nice code!") & laziness through reuse of words or numbers in variables ("Anything to avoid typing Hello! again.").

Today we will delve further into laziness and also touch on impatience in the exercises.

Strings


What is a string?


A shoe string is a one-dimensional (to a first approximation) piece of fibrous material.

A string in Perl is a one-dimensional series of characters.

#!/usr/bin/perl
use strict;
use warnings;
 
#Print out to the screen a greeting
print "Hello, intro-to-programming for bioinformatics";
 
#End the program and return to the prompt
__END__

Everything between the double-quotes is a string, including the " " (space) character.

Jaime demonstrated that strings can be stored in variables:

#!/usr/bin/perl
use strict;
use warnings;
 
my $greeting = "Hello"; #this is a variable!
 
 
#Print out to the screen a greeting
print "$greeting, intro-to-programming for bioinformatics\n";
print "$greeting, especially to Lenny!\n";
 
#End the program and return to the prompt
__END__

The variable $greeting got the value of everything between the double-quotes. So we say that this variable is a string.

To contrast, Jaime also showed you:

#!/usr/bin/perl
use strict;
use warnings;
 
my $first_number = 12;
my $second_number = "3";
 
my $sum = $first_number + $second_number;
 
print $sum, "\n";
 
__END__

The variable $first_number is holding the number 12. The variable $second_number is holding the string "3". So a variable can hold either a string or a number. One thing that is confusing is that every value has both a numerical value and a string value, and Perl picks one or the other depending on how you use it. In the above example, Jaime added $first_number to $second_number with "+", so Perl evaluated both as numbers, not strings. More on this later.
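Here is a small sketch of that idea, reusing the same two variables. It uses the "+" operator from above and the "." (dot) operator, which joins strings and is covered in the next section. The names $as_numbers and $as_strings are made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $first_number = 12;
my $second_number = "3";

# "+" asks for numbers, so both values are treated as numbers
my $as_numbers = $first_number + $second_number;

# "." asks for strings, so both values are treated as strings
my $as_strings = $first_number . $second_number;

print "used as numbers: $as_numbers\n";
print "used as strings: $as_strings\n";
```

The same two variables give 15 when used as numbers and "123" when used as strings.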

What can we do with strings?


You can't add strings together in the math sense but you can combine or concatenate them together:

#!/usr/bin/perl
use strict;
use warnings;
 
my $genus = 'Drosophila';
my $species = 'melanogaster';
 
my $latin_binomial = "$genus $species";
 
print $latin_binomial, "\n";
 
__END__

Simply combining string variables in double-quoted text works.

(Quick note about single-quotes vs double-quotes. Single-quotes are literally interpreted as text while double-quotes allow variable and special character interpolation. So use single-quotes except when you want to include a variable or special character, in which case use double-quotes.)
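To make the difference concrete, here is a minimal sketch that puts the same text in both kinds of quotes. The variable names $single and $double are made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $genus = 'Drosophila';

# single-quotes: $genus and \n are kept as literal text
my $single = 'Genus: $genus\n';

# double-quotes: $genus is replaced by its value and \n becomes a newline
my $double = "Genus: $genus\n";

print $single, "\n";
print $double;
```

The first print shows the dollar sign and the backslash-n exactly as typed. The second shows "Genus: Drosophila" followed by a real newline.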

#!/usr/bin/perl
use strict;
use warnings;
 
my $genus = 'Drosophila';
my $species = 'melanogaster';
 
my $latin_binomial = $genus . $species;
 
print $latin_binomial, "\n";
 
__END__

The concatenation operator, "." (dot or period), can also be used.

#!/usr/bin/perl
use strict;
use warnings;
 
my $genus = 'Drosophila';
my $species = 'melanogaster';
 
my $latin_binomial = $genus . ' ' . $species;
 
print $latin_binomial, "\n";
 
__END__

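Directly concatenating variables does not put a space between them (the previous script prints "Drosophilamelanogaster"), so this version concatenates a ' ' (space) string in the middle.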

You can get the length of a string using the "length" function.

#!/usr/bin/perl
use strict;
use warnings;
 
my $RNA1 = 'AAAUGACGUCAUUU';
my $RNA2 = 'ACCCUUGUAAUGUUCCCA';
 
print "RNA1 $RNA1 has a length of ", length($RNA1), "\n",
    "RNA2 $RNA2 has a length of ", length($RNA2), "\n";
 
__END__

Note that "length" returns the number of characters in a string, regardless of their value. So if a string contains " " (blank spaces) or "\n" newline characters, those get counted in the length.
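As a quick sketch of why this matters, the script below measures a made-up line of sequence before and after removing its trailing newline with "chomp" (which Jaime introduced yesterday). The variable names are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# a line of sequence as it might arrive from a file or the keyboard
my $line = "AAAUGACGUCAUUU\n";

my $length_with_newline = length($line);

# chomp removes the trailing newline from $line
chomp($line);
my $length_without_newline = length($line);

print "with the newline: $length_with_newline\n",
    "without the newline: $length_without_newline\n";
```

The two lengths differ by one, which is the kind of off-by-one error that is easy to miss when counting bases.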

You can reverse the order of the characters in a string using the "reverse" function.

#!/usr/bin/perl
use strict;
use warnings;
 
my $RNA1 = 'AAAUGACGUCAUUU';
my $RNA2 = 'ACCCUUGUAAUGUUCCCA';
 
my $revRNA1 = reverse($RNA1);
my $revRNA2 = reverse($RNA2);
 
print "RNA1 $RNA1 when reversed becomes $revRNA1\n",
    "RNA2 $RNA2 when reversed becomes $revRNA2\n";
 
__END__

A more common task in biology is to reverse complement a sequence. We've got the reverse half of it here. In a later lecture we'll cover how to do the complementation.
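One thing to watch out for with "reverse": it only reverses the characters when its result is used as a single value, as in the assignments above. If you hand it straight to "print", it reverses the list of things you gave it instead, and a list of one string looks unchanged. The sketch below shows both behaviors. The built-in "scalar" function is used here to force the single-value behavior.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $RNA1 = 'AAAUGACGUCAUUU';

# assigning to a variable reverses the characters
my $revRNA1 = reverse($RNA1);
print $revRNA1, "\n";

# inside print, reverse sees a list of one item, so nothing appears to change
print reverse($RNA1), "\n";

# scalar() forces reverse to work on the characters again
print scalar(reverse($RNA1)), "\n";
```

Storing the reversed string in a variable first, as the script above this does, avoids the surprise.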

You can change the case of the characters in a string using the "uc" (uppercase) and "lc" (lowercase) functions.

#!/usr/bin/perl
use strict;
use warnings;
 
my $RNA1 = 'aaaugacgucauuu';
my $RNA2 = 'ACCCUUGUAAUGUUCCCA';
my $concatRNA = $RNA1 . $RNA2;
 
print "RNA1 $RNA1 is lowercase but after using uc it becomes ", uc($RNA1), "\n",
    "RNA2 $RNA2 is uppercase but after using lc it becomes ", lc($RNA2), "\n",
    "concatRNA $concatRNA is mixed case but after using uc it becomes ", uc($concatRNA), "\n";
 
__END__

You can extract a subset of the characters in a string using the "substr" (substring) function.

Unlike the above functions, "substr" takes more than one argument. Functions that take more than one argument often have a minimum number of required arguments and then some optional arguments. "substr" requires two arguments but can take as many as four:

substr($string, $start)
substr($string, $start, $length)
substr($string, $start, $length, $replacement)

Let's see how each of these works.

#!/usr/bin/perl
use strict;
use warnings;
 
my $genus = 'Drosophila';
 
my $hacked_up_genus = substr($genus, 2);
 
print "The genus is $genus but the hacked up genus is $hacked_up_genus\n";
 
__END__

When two arguments are passed to "substr" it starts at the specified start position and returns the rest of the string. In this case "osophila". Note that start is zero-based, not one-based, so a start value of 2 began on the third character.

#!/usr/bin/perl
use strict;
use warnings;
 
my $genus = 'Drosophila';
 
my $hacked_up_genus = substr($genus, -2);
 
print "The genus is $genus but the hacked up genus is $hacked_up_genus\n";
 
__END__

Here the start value is negative and it returned just the last two characters, "la". So a negative start value counts back from the end of the string and then returns everything after it.

Let's try the three argument case:

#!/usr/bin/perl
use strict;
use warnings;
 
my $genus = 'Drosophila';
 
my $hacked_up_genus = substr($genus, 2, 6);
 
print "The genus is $genus but the hacked up genus is $hacked_up_genus\n";
 
__END__

Like the two argument case, it started with the third character, but instead of returning the rest of the string it returned just the next 6 characters. In this case "osophi". So the third argument specifies the length of the substring being returned.
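The same calls work on sequences. As a sketch, here are the two and three argument forms applied to the $RNA1 string from earlier. The variable names $start_codon and $last_three are made up, and the start position of 2 is simply where "AUG" happens to sit in this particular string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $RNA1 = 'AAAUGACGUCAUUU';

# start at the third character (position 2) and take 3 characters
my $start_codon = substr($RNA1, 2, 3);

# a negative start counts back from the end of the string
my $last_three = substr($RNA1, -3);

print "start codon: $start_codon\n",
    "last three bases: $last_three\n";
```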

How about the four argument case?

#!/usr/bin/perl
use strict;
use warnings;
 
my $genus = 'Drosophila';
 
my $hacked_up_genus = substr($genus, 2, 6, 'acu');
 
print "The original genus was altered to be $genus but the hacked up genus is $hacked_up_genus\n";
 
__END__

In terms of what "substr" returned that was captured in the $hacked_up_genus variable, this behaved much as the three argument case did. As before, it returned "osophi". That said, something strange happened to the original value for the $genus variable. "Drosophila" turned into "Dracula". The characters that "substr" returned, "osophi", were replaced with the string in the fourth argument, "acu", creating "Dracula".
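If you want the replacement but also need to keep the original string, one simple approach is to make a copy first and let "substr" change the copy. This is only a sketch. The names $monster and $removed are made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $genus = 'Drosophila';

# copy the string so the original is left alone
my $monster = $genus;

# the four argument substr changes $monster and returns what it removed
my $removed = substr($monster, 2, 6, 'acu');

print "original: $genus\n",
    "altered copy: $monster\n",
    "removed characters: $removed\n";
```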

Ok, that's a good start for how to work with strings in Perl.

Loops


Let's say you wanted to print the integers 1 to 12 using a Perl script. You might write a script like the following:

#!/usr/bin/perl
use strict;
use warnings;
 
my $num = 1;
 
print $num, "\n";
$num = $num + 1;
print $num, "\n";
$num = $num + 1;
print $num, "\n";
$num = $num + 1;
print $num, "\n";
$num = $num + 1;
print $num, "\n";
$num = $num + 1;
print $num, "\n";
$num = $num + 1;
print $num, "\n";
$num = $num + 1;
print $num, "\n";
$num = $num + 1;
print $num, "\n";
$num = $num + 1;
print $num, "\n";
$num = $num + 1;
print $num, "\n";
$num = $num + 1;
print $num, "\n";
 
__END__

I was a bit lazy/clever: I used a variable that I incremented so that I didn't have to type out all these lines. I just typed the first two and then cut and pasted the rest. That said, I was not being nearly as lazy as I should have been.

"for" loops.


"for" takes three arguments and then a block of code to do each pass through the loop.

# for (initialize counter; satisfy this condition to continue looping; increment/decrement counter) {
#    this code gets run each pass
#}

While pseudocode is helpful, let's look at a real example:

#!/usr/bin/perl
use strict;
use warnings;
 
for (my $counter = 1; $counter <= 12; $counter++) {
    print $counter, "\n";
}
 
__END__

Obviously this was much easier to type than my previous script to print 1 through 12. The first argument initialized the counter variable with the value 1. The second argument is a condition that must be met in order for the loop to continue. In this case, the counter needs to be less than or equal to 12. The third argument increments or decrements the counter at the end of each pass through the loop. In this case it's incrementing. I used $counter++ which is shorthand for $counter = $counter + 1. So it adds 1 to counter each pass. Finally I have a block of code enclosed in the squiggly-brackets "{}". The block of code printed the value of the counter.
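The counter does not have to go up by 1, and it can be used for more than printing. As a sketch that combines "for" with "substr" and "length" from the strings section, the script below walks along a made-up RNA sequence three bases at a time. The sequence and the names $start, $codon and $codon_count are assumptions for this example only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# made-up sequence, 12 bases long
my $RNA = 'AUGACGUCAUUU';
my $codon_count = 0;

# start at position 0 and move forward 3 positions each pass
for (my $start = 0; $start < length($RNA); $start = $start + 3) {
    my $codon = substr($RNA, $start, 3);
    $codon_count++;
    print "codon $codon_count starts at position $start: $codon\n";
}
```

Note that the condition uses "<" rather than "<=" because string positions are zero-based, so the last valid position is one less than the length.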

"while" loops.


"while" is a simplified version of "for". It takes only one argument, a condition that must be met for the loop to continue, and then a block of code to do each pass through the loop.

# while (satisfy this condition to continue looping) {this code gets run each pass}
 

"while" loops come in handy when you don't know how many times you need to pass through the loop.

For example, say you wanted to take the product of a series of integers, starting at 1, and you wanted to know how far into the series you could go before the product became larger than a billion.

#!/usr/bin/perl
use strict;
use warnings;
 
my $counter = 1;
my $product = 1;
 
while ($product <= 1000000000) {
    $counter++;
    $product = $product * $counter;
    print "$counter $product\n";
}
 
__END__

Truth be told, you could have done that with a "for" loop, but it was easier to do with a "while" loop.
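For comparison, here is one way the "for" version might look. It is a sketch, not the only way to write it. The counter starts at 2 because the "while" version incremented before multiplying, and the extra variable $last_counter is made up so the final value can be reported after the loop ends.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $product = 1;
my $last_counter = 0;

# the condition tests $product, not $counter, which is what makes this awkward
for (my $counter = 2; $product <= 1000000000; $counter++) {
    $product = $product * $counter;
    $last_counter = $counter;
    print "$counter $product\n";
}

print "the product passed a billion at $last_counter\n";
```

It prints the same lines as the "while" version, but the loop condition and the counter no longer refer to the same variable, which makes it harder to read.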

There are cases where you really can't do it with a for loop though. For instance, if you are taking user input.

Jaime went over this yesterday. You can get user input using the <STDIN> filehandle. It causes the program to stop running until a user enters something and hits return. When placed in the context of a "while" loop, you can collect an unlimited amount of user input.

The following script collects user entered numbers and then calculates the sum of these entries.

#!/usr/bin/perl
use strict;
use warnings;
 
my $sum = 0;
 
print 'please enter some numbers separated by hitting return.  ',
    'when you are done entering numbers type control-d.', "\n";
 
while (my $entry = <STDIN>) {
    $sum = $sum + $entry;
}
 
print "the sum of the numbers you entered is $sum\n";
 
__END__

So the condition in the "while" loop that needed to be satisfied in order for the loop to continue was the user entering something. The catch is that <STDIN> freezes the program until something is entered, so the loop can't end on its own. Typing "control-d" signals the end of the input, which makes <STDIN> return an undefined value, so the condition fails and the loop stops. This is slightly ugly programming but it works.

While programs that take user input are fun, typically in computational biology, you'll be reading and writing files more often than taking user input.

You can open a file using the "open" function. It doesn't actually open a file so much as it opens a filehandle, like the <STDIN> filehandle. It's easier to demonstrate than explain:

#!/usr/bin/perl
use strict;
use warnings;
 
# get a file from the user
print "enter the path to a file: ";
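my $file = <STDIN>;
chomp($file); # remove the newline the user typed after the path
 
# open a filehandle to the file, read through it and print out the number of characters on each line
# if the file can't be opened, stop and print the reason (stored in $!)
open (my $fh, $file) or die "could not open $file: $!"; # this line could have been open (FH, $file);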
while (my $line = <$fh>) {
    chomp($line);
    print 'there are ', length($line), ' characters on this line: ', $line, "\n";
}
close ($fh);
 
__END__

So the script asked for an input file using the <STDIN> filehandle. Then it used the "open" function to create a filehandle for the file. As used here, "open" takes two arguments. The first is the variable that will hold the filehandle, in this case $fh. (Filehandles are sort of unique in that they can also be stored in uppercase barewords, like "FH".) The second argument is the path to the file. Once the filehandle is created, we can use a while loop to read through the file, line-by-line. Setting $line equal to <$fh> grabs one line at a time. The while loop will continue until the end of the file is reached. Within the block of code, "chomp" (as introduced yesterday by Jaime) takes the newline character off the end of the line. Then a print statement prints the length of the line as well as the line itself. Finally, it's good practice to close the filehandle when you are done with it using the "close" function.
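You will often see "open" written with three arguments instead of two, with the middle argument saying whether the file is being read ('<') or written ('>'). The sketch below uses that form. So that it can run on its own, it first writes a small file and then reads it back. The file name example_lines.txt and the variable names are made up for this illustration, and "print $out ..." is how you print to a filehandle instead of the screen.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# hypothetical file name, used only for this example
my $file = 'example_lines.txt';

# '>' opens the file for writing (and creates it if needed)
open (my $out, '>', $file) or die "could not write to $file: $!";
print $out "AAAUGACGUCAUUU\n";
print $out "ACCCUUGUAAUGUUCCCA\n";
close ($out);

# '<' opens the file for reading
open (my $in, '<', $file) or die "could not open $file: $!";
my $line_count = 0;
while (my $line = <$in>) {
    chomp($line);
    $line_count++;
    print "line $line_count: $line\n";
}
close ($in);

print "read $line_count lines from $file\n";
```

Spelling out the mode makes it clear at a glance whether a script is about to read a file or overwrite it.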

Now you are ready to manipulate strings, use loops to be lazy and read in user input and files.

Exercises


Problem1

  • Write a script that asks the user for the four-digit year they graduated from high school (e.g. "1994") and then separately asks them for the name of their high school (e.g. "South High").
  • Make the script print out the high school name followed by "class of" followed by the last two digits in the year (e.g. "South High class of 94").
  • Make a copy of your first script and make the following change to the output. The FBI often releases documents where parts of words are crossed out. Make the script print the high school and four-digit year graduated, replacing everything but the first and last characters of those strings with Xs (e.g. "SXXXXXXXXh class of 1XX4"). If you have no idea how to do this, come back to this part after completing the rest of the problems.

Problem2

  • Write a script to get numbers from the user (with no predefined limit)
  • Make the script calculate the number of values, the sum and the mean for the numbers entered and print all of these statistics out at the end of the script.

Problem3

  • Ask user for two numbers, small and large
  • Starting with the small number, increasing by one up to the large number, calculate and print each number to the third power (so if they entered 2 and 4 then you should print out 8, 27, 64; also "**" is the exponent operator if you want to use it).
  • What will happen if the user's input is accidentally such that small>large?

Problem4

  • Get a DNA sequence input from the user
  • Make a palindrome from the input; if input is "CATG", output should be "CATGGTAC"
  • Write one version of this program that uses a loop
  • Write another version that does not use a loop

Problem5

  • Write a program that asks the user to enter the number of seconds until SpaceShipOne takes off.
  • Make the program countdown from the specified number of seconds to zero using a "for" loop (e.g. 30 seconds ... 29 seconds ... etc).
  • Can you come up with two ways to mess up the "for" loop conditions to get an infinite loop? (You know how to break the loop once it gets stuck. FYI: You can also completely stop a script while it's running using control-c.)
  • Infinite loops are common programming bugs. Obviously you want to avoid them but if you happen to get into one, now you know what they look like and how to deal with them.

Problem6

  • The cerevisiae genome fasta file that you made yesterday is already on the CD here: /usr/local/data/nt/
  • Use "gunzip" to uncompress the file.
  • Calculate the size of the cerevisiae genome without the help of the wc command (i.e. write a script but feel free to use other unix commands besides "wc").
  • Write a second version of this script so that you use the length function only once.
  • Why is the rewrite a bad way to answer this particular question?
  • Find out how long it takes to run each version of the program on the same input using the time command. Can you feel the impatience flowing through you? Remember your virtues of programming. Be as impatient as possible.
time perl your_sizing_script.pl
 

Bonus


  • Modify the script for Problem 5 to do the countdown with actual 1-second delays.
(Use google to figure out how to pause the loop execution in a timed manner)
  • Now do a 2-second countdown, printing 10,8,6...