
Arrays: Pons Asinorum


After this lecture, there is no turning back - you are programmers.

We can do a lot of useful things already, but we have a major handicap - we need to know all the variables we will use in a program ahead of time. How can we read a file, storing each line in a variable? How can we sort the contents of a file?

What's an array?


An array is basically a freezer rack. Just as a rack is an ordered collection of boxes, an array is an ordered collection of scalar variables. The one subtle difference is that an array automatically expands to accommodate the number of elements that you want to put into it.

Declaring and Assigning to an array


Array variable names start with "@". If you prefix the array variable name with "$" instead of "@", you can refer to the variables in individual slots as "$array_name[slot_number]". As soon as you refer to an individual element, you are working with a regular scalar variable.

my @first_array = ();
 
$first_array[0] = "Lenny";
 
$first_array[1] = 5669;
 


Just as in "substr", arrays are 0-indexed and the first element is number 0.

Instead of assigning elements one by one, you can create the whole array at once:

my @array = ("Daphne", 5669, "Josephine");
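The automatic expansion described earlier can be seen by assigning to a slot past the current end of an array. This is a minimal sketch; the array name @rack and the value "Squiggy" are made up for illustration.

```perl
use strict;
use warnings;

my @rack = ("Lenny", 5669);

# Assigning to slot 5 makes Perl grow the array to fit it.
$rack[5] = "Squiggy";

# "scalar @rack" gives the number of slots in the array.
my $size = scalar @rack;
print "The rack now has $size slots\n";

# The slots that were skipped over (2, 3 and 4) exist but hold nothing.
my $gap = $rack[3];
print "Slot 3 is empty\n" unless defined $gap;
```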

Accessing elements of an array


  • Since $array_name[slot_number] is just a scalar variable, you can use it normally in your code.

my @array = ("Daphne", 5669, "Josephine");
 
print "1st element is: $array[0]", "\n";
print "2nd element is: $array[1]", "\n";
print "3rd element is: $array[2]", "\n";

  • The index can be a variable or expression.

my @array = ("Daphne", 5669, "Josephine");
 
my $number=1;
 
print "1st element is: ", $array[0], "\n";
print "2nd element is: ", $array[$number] , "\n";
print "3rd element is: ", $array[$number+1], "\n";

  • Accessing empty slots, just as in your freezer rack, is doable but there'll be no box there.

If you try to use "$array[55]" in an array with 3 elements, you get an undefined value, and Perl (with warnings enabled) will warn you that you are using an uninitialized value.
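One way to avoid that warning is to check the slot with "defined" before using it. The sketch below assumes the same three-element array as above; the variable names $missing and $size_after_read are only for illustration.

```perl
use strict;
use warnings;

my @array = ("Daphne", 5669, "Josephine");

# Slot 55 was never filled, so reading it gives an undefined value.
my $missing = $array[55];

if (defined $missing) {
    print "Slot 55 holds: $missing\n";
}
else {
    print "Slot 55 is empty\n";
}

# Just reading an empty slot does not make the array any bigger.
my $size_after_read = scalar @array;
print "The array still has $size_after_read elements\n";
```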


  • Iterating through an array

With loops, it's easy to iterate through all the elements of an array.

my @array = ("Daphne", "Josephine", "Sugar Kane", "Osgood" );
 
for (my $counter=0; $counter < 4; $counter++){
    print "Element $counter is: " , $array[$counter],"\n";
}

But what if you don't know how many elements are in the array ahead of time?

scalar @array tells you the size of the array.

my @array = ("Daphne", "Josephine", "Sugar Kane", "Osgood" );
 
print 'The @array has a total of: ', scalar @array, " elements\n";

Now we can iterate through arrays without knowing the size ahead of time.

my @array = ("Daphne", "Josephine", "Sugar Kane", "Osgood" );
 
for (my $counter=0; $counter < scalar @array; $counter++){
    print "Element $counter is: " , $array[$counter],"\n";
}

More manipulations of arrays


  • push will add an element to the end of the array.

my @array = ("Daphne", "Josephine", "Sugar Kane", "Osgood" );
 
push @array, "Jerry";
 
for (my $counter=0; $counter < scalar @array; $counter++){
    print "Element $counter is: " , $array[$counter],"\n";
}
 

The above will add "Jerry" as the fifth element of the array (slot number 4).

  • pop will do the opposite of push, removing the last element.

my @array = ("Daphne", "Josephine", "Sugar Kane", "Osgood" );
 
pop @array;
 
for (my $counter=0; $counter < scalar @array; $counter++){
    print "Element $counter is: " , $array[$counter],"\n";
}
 

But "pop" doesn't just remove the last element - it also returns it.
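A short sketch of catching the returned value; the variable name $last is just an illustration.

```perl
use strict;
use warnings;

my @array = ("Daphne", "Josephine", "Sugar Kane", "Osgood");

# pop shortens the array by one and hands back the element it removed.
my $last = pop @array;

print "Removed: $last\n";
print "Elements left: ", scalar @array, "\n";
```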

  • join allows you to turn an array into a string with desired separators.
join EXPR,@ARRAY

my @array = ("Daphne", "Josephine", "Sugar Kane", "Osgood" );
 
my $array_string = join (" ", @array);
 
print $array_string, "\n";
 
print join (",", @array) , "\n";
print join ("\t", @array),  "\n";
print join ("\n", @array), "\n";

  • Copying an array is no different from copying a scalar

my @array = ("Daphne", "Josephine", "Sugar Kane", "Osgood" );
 
my @array_copy = ();
 
print  join(",", @array_copy) , "\n";
 
@array_copy = @array;
 
print join(",", @array_copy), "\n";
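As with scalars, the copy is independent of the original: changing one does not change the other. A small sketch, using "Jerry" as an arbitrary replacement value:

```perl
use strict;
use warnings;

my @array      = ("Daphne", "Josephine");
my @array_copy = @array;

# Change the first slot of the copy only.
$array_copy[0] = "Jerry";

print "Original: $array[0]\n";
print "Copy:     $array_copy[0]\n";
```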

  • sort will order all the elements of an array
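A minimal sketch of "sort" on an array of names. Note that "sort" returns the ordered list rather than rearranging the original array, so the result is stored in a new array (called @sorted here for illustration).

```perl
use strict;
use warnings;

my @array = ("Osgood", "Daphne", "Sugar Kane", "Josephine");

# sort returns a new, ordered list; @array itself is left as it was.
my @sorted = sort @array;

print "Sorted:   ", join(", ", @sorted), "\n";
print "Original: ", join(", ", @array),  "\n";
```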


Split


This does the exact opposite of "join"; "split" will break up a string and put the individual strings into an array.

split /PATTERN/,STRING

my $names = "Daphne,Josephine,Sugar Kane,Osgood" ;
 
my @name_array = split (",", $names);
 
for (my $counter=0; $counter < scalar @name_array; $counter++){
    print "Element $counter is: " , $name_array[$counter],"\n";
}

While life without "join" would be just peachy, I would have to spend several extra years in graduate school without "split". That's because "split" accepts patterns.


my $names = "Daphne\tJosephine\tSugar Kane\tOsgood";
 
my @name_array = split (/\t/, $names);
 
for (my $counter=0; $counter < scalar @name_array; $counter++){
    print "Element $counter is: " , $name_array[$counter],"\n";
}

REAL patterns:

my $names = "Daphne, Josephine, Sugar Kane, Osgood";
 
my @name_array = split (/p.*?e/, $names);
 
for (my $counter=0; $counter < scalar @name_array; $counter++){
    print "Element $counter is: " , $name_array[$counter],"\n";
}
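A more practical pattern, sketched under the assumption that the input is a comma-separated list with inconsistent spacing around the commas (the sample string below is made up): splitting on a comma together with any surrounding whitespace leaves clean names.

```perl
use strict;
use warnings;

my $names = "Daphne, Josephine,Sugar Kane ,  Osgood";

# \s* matches zero or more whitespace characters, so the separator is
# a comma plus whatever spaces happen to sit on either side of it.
my @name_array = split(/\s*,\s*/, $names);

for (my $counter = 0; $counter < scalar @name_array; $counter++) {
    print "Element $counter is: [", $name_array[$counter], "]\n";
}
```

The square brackets in the output make it easy to see that no stray spaces are left at either end of a name, while the space inside "Sugar Kane" is kept.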

@ARGV


@ARGV is a special array variable in Perl for reading user input into your program.

When you run a Perl program like this:
$> perl myscript.pl argument1 argument2 argument3

@ARGV gets the values ("argument1", "argument2", "argument3") and brings them inside myscript.pl

This is very useful, because you can now pass information to the program on the command line when you run it, instead of prompting the user for this information or piping it into the script.

print "User passed in ", scalar @ARGV, " parameters\n";
 
my $arg1 = $ARGV[0];
 
my $arg2 = $ARGV[1];
 
print "argument1: ", $arg1, "\n";
print "argument2: ", $arg2, "\n";

As you remember from Jaime's lecture, the user is out to get you, so whenever you expect arguments, check that you got them.

my $filename = $ARGV[0];
my $n_lines  = $ARGV[1];
 
unless ($filename and $n_lines){
    die "Usage: perl perlhead.pl filename numlines\n";
}
my $lines_read = 0;
 
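# the file may not exist or may not be readable, so check that open worked
open(my $fh, "<", $filename) or die "Cannot open $filename: $!\n";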
 
# if you used a 'for' loop and there were fewer than n_lines in the file, you'd crash
while (my $new_line = <$fh>) {
 
    if ( $lines_read < $n_lines ) {
        $lines_read++;    #Update $lines_read
        print $new_line;
    }
    else {
        exit();
    }
}
close ($fh);

Exercises


Problem1

  • Is the array "sort" function numeric or alphanumeric? (does 12 come before 1000?)
  • Create an array that will test the sort order.
  • Sort the array and iterate through it, printing it out in the sorted order.

Problem2

  • Modify EcoRI finder (Session4.1, problem2) to search for exact match to a user-specified sequence in a user-specified fasta file
  • Do not prompt the user for input from within the program

Problem3

  • Write a program to read a user-specified file into an array (each element of the array is a separate line), sort it in reverse, and print it back out using a for-loop.
  • Modify the script to print out without a for-loop.
  • In both cases, each line of the original file should be a separate line of the output.

Problem4

  • Modify the reverse-file-sorter program from #3 to reverse the array without using the "reverse" function. This reversing should be done by referring to array indexes.

Problem5

  • Modify the reverse-file-sorter program from #3 to reverse the array without using the "reverse" function and without using array indexes. That is, you are not accessing elements as in $array[$i] anywhere in the program.

Problem6

  • Download the cerevisiae genome annotation file from SGD:
ftp://genome-ftp.stanford.edu/pub/yeast/data_download/chromosomal_feature/saccharomyces_cerevisiae.gff
  • GFF is the standard format for annotating genome sequences. Look at the GFF specification:
http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml
  • Parse the saccharomyces_cerevisiae.gff file and print out all the gene lengths in ascending order

Bonus


Bonus1

The Bubbler

  • You will use the perl function rand to construct a moderate sized array (~ 20 elements) containing random integers in the range 1 through 100.
  • You will then perform the following operation on this array:
  1. For every consecutive pair of elements i,j (this is known as the bubble) in the array, you will swap the values of the two elements if and only if the element i is greater in value than the element j.
  2. You will walk through the array performing this operation.
  3. Keep track of how many such swaps you performed.
  4. If the number of swaps was non-zero, you will go back to step 1, and repeat the process for the entire array, and calculate how many swaps were performed in this new iteration. You will continue to repeat this process until in some iteration, no swaps were performed.
  • What has happened to the array?

Bonus2

Pascal's triangle

  • Your goal is to compute the binomial coefficients :

(n, k) = n!/( k!.(n-k)! )

  • Unfortunately, the problem of calculating the factorials n!= n.(n-1).(n-2)....1 is not a trivial one (Why?).
  • Fortunately, there's a simpler solution that uses a mathematical rule, Pascal's Rule, explained in detail on Wikipedia and elsewhere.
  • It amounts to constructing the following structure:

r|
==================================================================
0|                                                1
1|                                             1     1
2|                                          1     2     1
3|                                       1     3     3     1
4|                                    1     4     6     4     1
5|                                 1     5     10    10    5     1
6|                              1     6     15    20    15    6     1
7|                           1     7     21    35    35    21    7     1
8|                        1     8     28    56    70    56    28    8     1
9|                     1     9     36    84    126   126   84    36    9     1
0|                  1     10    45    120   210   252   210   120   45    10    1
1|               1     11    55    165   330   462   462   330   165   55    11    1
2|            1     12    66    220   495   792   924   792   495   220   66    12    1
3|         1     13    78    286   715   1287  1716  1716  1287  715   286   78    13    1
4|      1     14    91    364   1001  2002  3003  3432  3003  2002  1001  364   91    14    1
5|   1     15    105   455   1365  3003  5005  6435  6435  5005  3003  1365  455   105   15    1

  • The logic here is that any number in the triangle is the sum of the two numbers diagonally adjacent and in the row above it.
  • For example, on row r=3, the number 3 is the sum of 1 and 2 on the row r=2

Part 1


  • You will accept the number of rows required as an argument to your script.
  • First, try to get the program working to print the right numbers out, without worrying about the geometric positioning of the numbers or the rows.

Part 2


  • For extra credit, get it to actually print out the triangle.
  • You may build your script with the condition that the maximum number of rows you'll need to print is 16, and that the numbers never exceed four digits (as in the example above).
  • You don't need to print out the ruler I've added to the figure above.



Solutions