Hash is good for you



Today you're going to learn about a third data structure from me in the morning. And then in the afternoon, from Dan, you're going to learn how to move these structures around different parts of your code in a much more powerful way.

Since we're now in the project phase of the course, the code that you write is absolutely going to be reused over and over again by you during the course. Give your scripts good names!

A while ago we talked about how regular expressions set Perl apart as a programming language and make it great for bioinformatics. Another great feature of the language (although it's found in other languages too) is what are known as hashes.

Imagine you want to store people's first and last names. You could create two arrays, @first_names and @last_names, and make sure that if the name "Jaime" is the fifth element of the @first_names array, then Jaime's last name is the fifth element of the @last_names array. Now, to find someone's last name, you could scan the @first_names array, see if the person is in the array, and use the index to print the last name from the @last_names array.

Isn't it nice that Perl can do that all for you with hashes?

You can think of a hash as two columns in an Excel spreadsheet: the first column has the keys (e.g. first names) and the second column has the values (e.g. last names).
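To make the comparison concrete, here is a minimal sketch of the two-array approach next to the hash version. The names find_last_name and the three example people are just assumptions for illustration, and the hash syntax itself is explained step by step in the rest of the lesson.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Two parallel arrays: position i in one matches position i in the other.
my @first_names = ("Jaime", "Brendan", "Lenny");
my @last_names  = ("Fraser", "Fraser", "Teytelman");

# find_last_name is a hypothetical helper: scan for the first name,
# then use its index to look up the last name.
sub find_last_name {
    my ($wanted) = @_;
    for ( my $i = 0; $i < @first_names; ++$i ) {
        return $last_names[$i] if $first_names[$i] eq $wanted;
    }
    return;    # not found
}

# The same information as a hash: no scanning, no index bookkeeping.
my %last_name = (
    "Jaime"   => "Fraser",
    "Brendan" => "Fraser",
    "Lenny"   => "Teytelman",
);

print find_last_name("Lenny"), "\n";
print $last_name{"Lenny"}, "\n";
```

Both print statements give the same answer, but the hash version needs no loop and no second array to keep in sync.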

Here is how you define a hash

-hash1.pl
#!/usr/bin/perl
 
use strict;
use warnings;
use Data::Dump qw (dump);
 
my %last_name;

I want you to notice two things about this first simple script. First, I've added a new "use" call at the top. We're going to be using the dump function of Data::Dump to help us print out hashes later in the lesson. (Data::Dump is a module, and you can get documentation on any installed module that is loaded at the beginning of a script with "use" by calling "perldoc Modulename"; in this case, "perldoc Data::Dump".) Second, the name of a hash starts with the percent sign (shift 5).

As some of you may have guessed from the name I gave the hash, a hash is a data structure that associates one string with another. So I can use the hash %last_name to store a string (a "key") and associate a VALUE with that string:

-hash2.pl

#!/usr/bin/perl
 
use strict;
use warnings;
use Data::Dump qw (dump);
 
my %last_name;
$last_name{"Jaime"} = "Fraser";

Notice the syntax of the assignment. The value that is ultimately stored is a scalar, which is why the line starts with a dollar sign, but the curly braces let Perl know "this is a hash: associate the key "Jaime" with the value "Fraser"". I can also set the value from a scalar variable ($great_person) or from an element of an array ($fileline[7]).
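As a rough sketch of setting keys and values from variables instead of typed-in strings, the loop below fills a hash from an array of tab-separated records. The array @filelines and its contents are made up for this example.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Hypothetical input: one "first<TAB>last" record per element.
my @filelines = ( "Jaime\tFraser", "Brendan\tFraser", "Lenny\tTeytelman" );

my %last_name;
foreach my $fileline (@filelines) {
    my ( $first, $last ) = split /\t/, $fileline;
    $last_name{$first} = $last;    # key and value both come from variables
}

print $last_name{"Lenny"}, "\n";
```

The assignment line is exactly the same shape as $last_name{"Jaime"} = "Fraser"; only the source of the key and value has changed.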

And I can use the key to access the value at a later point (in this case only 1 line of code later)

-hash2a.pl
#!/usr/bin/perl
 
use strict;
use warnings;
use Data::Dump qw (dump);
 
my %last_name;
$last_name{"Jaime"} = "Fraser";
 
print $last_name{"Jaime"};
 
#note, the keys are case sensitive!
#
# print $last_name{"jaime"};
#
# gives a warning for undefined

I can populate the lowercase key jaime as well.

-hash2b.pl
#!/usr/bin/perl
 
use strict;
use warnings;
use Data::Dump qw (dump);
 
my %last_name;
$last_name{"Jaime"} = "Fraser";
$last_name{"jaime"} = "lee curtis";
 
 
print $last_name{"Jaime"},"\n";
print $last_name{"jaime"},"\n";

But if I change the key for the actress back to the capitalized "Jaime":

 #!/usr/bin/perl
 
 use strict;
 use warnings;
 use Data::Dump qw (dump);
 
 my %last_name;
 $last_name{"Jaime"} = "Fraser";
 $last_name{"Jaime"} = "lee curtis";
 
 
 print $last_name{"Jaime"},"\n";

Then we will get back only the most recent key => value mapping, in this case "lee curtis". So we can only have one value per key. If we set a new value for an existing key, it overwrites the previous value. KEYS must be unique. However, we can have many keys with the same value.

#!/usr/bin/perl
 
use strict;
use warnings;
use Data::Dump qw (dump);
 
 
my $variable = "Fraser";
my %last_name;
 
$last_name{"Jaime"} = $variable;
$last_name{"Brendan"} = "Fraser";
 
print $last_name{"Jaime"},"\n";
print $last_name{"Brendan"},"\n";

This type of one-key-at-a-time assignment is great when we are populating the hash in a loop. But often we want to initialize a hash with a set of values. This can be accomplished with an alternative syntax, the so-called "Big Arrow" (=>, which you will also see called the "fat comma"). Note that here, rather than declaring an empty hash and then populating it with $name{"string"}, we populate %last_name directly.

hash3.pl
#!/usr/bin/perl
 
use strict;
use warnings;
use Data::Dump qw (dump);
 
my %last_name = ("Jaime" => "Fraser", "Brendan" => "Fraser");
 
print $last_name{"Jaime"},"\n";

Or we can play around with the whitespace to make this even more readable

hash3a.pl

#!/usr/bin/perl
 
use strict;
use warnings;
use Data::Dump qw (dump);
 
my %last_name = ( "Jaime" => "Fraser",
          "Brendan" => "Fraser"
        );
 
print $last_name{"Jaime"}, "\n";

Try to get into the habit of putting a comma after the last key/value pair in this type of all-at-once assignment. This will eliminate errors if you later add more key/value lines to the assignment.
my %last_name = ( "Jaime" => "Fraser",
          "Brendan" => "Fraser",
        );
 
print $last_name{"Jaime"}, "\n";

No matter how I assign the hash, I still access it the same way. I'm always accessing a scalar value that is assigned to a unique key!

I can get a list of all of the unique keys that are in a hash with the keys function

-hash4.pl
#!/usr/bin/perl
 
use strict;
use warnings;
use Data::Dump qw (dump);
 
my %last_name = (
 "Jaime" => "Fraser",
 "Brendan" => "Fraser",
);
 
my @first_names = keys (%last_name);
 
print join ("\n",@first_names), "\n";
 

Or alternatively, I can get a list of all of the values with the values function

#!/usr/bin/perl
 
use strict;
use warnings;
use Data::Dump qw (dump);
 
my %last_name = (
 "Jaime" => "Fraser",
 "Brendan" => "Fraser",
);
 
my @last_names = values (%last_name);
 
print join ("\n",@last_names), "\n";

To avoid ambiguity and clashes, give all your variables and data structures distinct, identifiable names:
my %last_name_hash = (
 "Jaime" => "Fraser",
 "Brendan" => "Fraser",
);
 
my @last_names_array = values (%last_name_hash);
 
print join ("\n",@last_names_array), "\n";

This is a good time to point out a GIANT note of caution about the keys and values functions. Unlike arrays, Perl stores hashes so that it can look up any key really quickly. It doesn't keep them in alphabetical order, or in the order that you put the key/value pairs into the hash, or in any other order that you could hope to predict. Never rely on the order in which keys or values come back.
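If you need the keys in a predictable order, one common approach is to sort them yourself. Here is a minimal sketch using Perl's built-in sort on the list returned by keys; the hash contents are the same example names used in this lesson, and @sorted_first_names is just a name I picked.

```perl
#!/usr/bin/perl

use strict;
use warnings;

my %last_name_hash = (
    "Jaime"   => "Fraser",
    "Brendan" => "Fraser",
    "Lenny"   => "Teytelman",
    "Dan"     => "Pollard",
);

# keys returns the names in no useful order; sort puts them in
# alphabetical (string) order so the output is the same on every run.
my @sorted_first_names = sort keys %last_name_hash;

print join( "\n", @sorted_first_names ), "\n";
```

With sort in place, the script prints Brendan, Dan, Jaime, Lenny every time, no matter how Perl happens to store the hash internally.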

There are a few different ways to go through a hash. The first way is to take advantage of the keys function that we just talked about. The first thing we do is collect keys in an array. And then loop through the array and lookup the corresponding values.

-hash5.pl
#!/usr/bin/perl
 
use strict;
use warnings;
use Data::Dump qw (dump);
 
my %last_name_hash = (
 "Jaime" => "Fraser",
 "Brendan" => "Fraser",
 "Lenny" => "Teytelman",
 "Dan" => "Pollard",
);
 
my @first_names_array = keys (%last_name_hash);
 
foreach my $given_name (@first_names_array){
 my $surname = $last_name_hash{$given_name};
 print "$given_name => $surname\n";
}

This pattern is very generalizable and easy to think about. For any hash, it looks like this:
-hash5a.pl

#!/usr/bin/perl
 
use strict;
use warnings;
use Data::Dump qw (dump);
 
my %any_hash = (
 "Jaime" => "Fraser",
 "Brendan" => "Fraser",
 "Lenny" => "Teytelman",
 "Venky" => "Iyer",
);
 
my @list_of_keys = keys %any_hash;
 
foreach my $key (@list_of_keys){
 my $value = $any_hash{$key};
 print "$key => $value\n";
}
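Another way to go through a hash, sketched here as an alternative to collecting the keys first, is Perl's built-in each function, which hands you a key and its value together. The counter $pairs_seen is only there for illustration.

```perl
#!/usr/bin/perl

use strict;
use warnings;

my %any_hash = (
    "Jaime"   => "Fraser",
    "Brendan" => "Fraser",
    "Lenny"   => "Teytelman",
    "Venky"   => "Iyer",
);

# each hands back one key/value pair per call and an empty list when
# the hash is exhausted, which ends the while loop.
my $pairs_seen = 0;
while ( my ( $key, $value ) = each %any_hash ) {
    print "$key => $value\n";
    ++$pairs_seen;
}

print "Saw $pairs_seen pairs\n";
```

The pairs still come out in no predictable order, and it is safest not to add or delete keys inside a loop like this.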

There are three other tricks that are very useful when it comes to hashes. First, we can test to see if a key exists:

#!/usr/bin/perl
 
use strict;
use warnings;
use Data::Dump qw (dump);
 
my %last_name = (
 "Jaime" => "Fraser",
 "Brendan" => "Fraser",
 "Lenny" => "Teytelman",
 "Venky" => "Iyer",
);
 
 
if (exists $last_name{"Jaime"}){
 print "Jaime has a last name... Yipee!";
}
else { print "You are in a class with Madonna and Bono";}

Second, we can delete a key/value pair. If you truly want to get rid of a key, this is the only safe way to do it. Setting the paired value to zero (or to undef) still leaves that key in the hash.

#!/usr/bin/perl
 
use strict;
use warnings;
use Data::Dump qw (dump);
 
my %last_name = (
 "Jaime" => "Fraser",
 "Brendan" => "Fraser",
 "Lenny" => "Teytelman",
 "Venky" => "Iyer",
);
 
delete $last_name{"Jaime"};
 
if (exists $last_name{"Jaime"}){
 print "Jaime has a last name... Yipee!";
}
else { print "You are in a class with Madonna and Bono";}
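To see the difference between emptying a value and deleting a key, here is a small sketch that checks exists after each step. The three $exists_after_... variables are just names chosen for this example.

```perl
#!/usr/bin/perl

use strict;
use warnings;

my %last_name = (
    "Jaime"   => "Fraser",
    "Brendan" => "Fraser",
);

# Setting the value to zero keeps the key in the hash.
$last_name{"Jaime"} = 0;
my $exists_after_zero = exists $last_name{"Jaime"} ? 1 : 0;

# Setting the value to undef also keeps the key in the hash.
$last_name{"Jaime"} = undef;
my $exists_after_undef = exists $last_name{"Jaime"} ? 1 : 0;

# Only delete removes the key itself.
delete $last_name{"Jaime"};
my $exists_after_delete = exists $last_name{"Jaime"} ? 1 : 0;

print "after zero:   $exists_after_zero\n";
print "after undef:  $exists_after_undef\n";
print "after delete: $exists_after_delete\n";
```

The first two checks report 1 (the key is still there) and only the last one reports 0.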

Third, I told you I would get around to using the Data::Dump functions at some point. We don't want you to use this for the actual function of your code, but it is extremely useful for debugging.

To print out a nice-looking view of a hash without writing any loops, simply call dump (\%hash). You'll discover what the backslash means this afternoon, but for now you can use it when debugging if necessary.

#!/usr/bin/perl
 
use strict;
use warnings;
use Data::Dump qw (dump);
 
my %last_name = (
 "Jaime" => "Fraser",
 "Brendan" => "Fraser",
 "Lenny" => "Teytelman",
 "Venky" => "Iyer",
);
 
delete $last_name{"Jaime"};
 
if (exists $last_name{"Jaime"}){
 print "Jaime has a last name... Yipee!";
}
else { print "You are in a class with Madonna and Bono";}
 
dump (\%last_name);
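If Data::Dump isn't installed on the machine you are working on, a similar debugging view is available from Data::Dumper (note the extra "r"), a different module that ships with Perl. This is only a sketch of that alternative, not the module used in the rest of this lesson; the variable $dumped is my own name.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use Data::Dumper;

# Ask Data::Dumper to print keys in sorted order so the output is stable.
$Data::Dumper::Sortkeys = 1;

my %last_name = (
    "Brendan" => "Fraser",
    "Lenny"   => "Teytelman",
    "Venky"   => "Iyer",
);

# Dumper returns a string; we pass it the hash with the same backslash.
my $dumped = Dumper( \%last_name );
print $dumped;
```

As with dump, this is for debugging only, not for the real output of your scripts.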

Here is how hashes can save you lines of code. We'll use the base composition exercise as the example. This is the version without hashes:

#!/usr/bin/perl
# Author : Venky Iyer
# Date : Wed Jan 3 15:18:35 UTC 2007
# Description : Basic Bioinformatics;
 
use strict;
use warnings;
 
# initialize
my $As = 0;
my $Gs = 0;
my $Cs = 0;
my $Ts = 0;
my $total = 0;
 
while ( my $line = <STDIN> ) {
 
 chomp $line;
 
 # update total counts
 my $line_length = length $line;
 $total += $line_length;
 
 for ( my $i = 0; $i < $line_length; ++$i ) {
 
 my $base = substr( $line, $i, 1 );
 if ( $base eq 'A' ) {
 ++$As;
 }
 elsif ( $base eq 'G' ) {
 ++$Gs;
 }
 elsif ( $base eq 'C' ) {
 ++$Cs;
 }
 elsif ( $base eq 'T' ) {
 ++$Ts;
 }
 else {
 print STDERR "Found unknown character $base\n";
 }
 }
}
 
# Make sure we don't divide by zero.
die "No bases were read\n" unless $total;
 
# Compute percentages
$As = 100 * $As / $total;
$Gs = 100 * $Gs / $total;
$Cs = 100 * $Cs / $total;
$Ts = 100 * $Ts / $total;
 
# output
print "%A: $As\n";
print "%C: $Cs\n";
print "%G: $Gs\n";
print "%T: $Ts\n";
 
__END__

I'm going to replace most of the business part of the for loop, and all of the per-base counters, with a single hash:

#!/usr/bin/perl
# Author : Venky Iyer
# Date : Wed Jan 3 15:18:35 UTC 2007
# Description : Basic Bioinformatics;
 
use strict;
use warnings;
use Data::Dump qw (dump);
 
# initialize
 
my %base_count;
my $total;
 
while ( my $line = <STDIN> ) {
 
 chomp $line;
 
 # update total counts
 my $line_length = length $line;
 $total += $line_length;
 
 for ( my $i = 0; $i < $line_length; ++$i ) {
 
#just like venky's code except now each time it sees a base it will
#up the count by one
#or create the key value pair in the hash at 1!!!
 
 my $base = substr( $line, $i, 1 );
 $base_count{$base} += 1;
 }
}
 
my @base_types = keys %base_count;
 
#you can then loop over the hash to get the base percentages
 
foreach my $type_of_base (@base_types){
 my $percentage = 100 * $base_count{$type_of_base}/$total;
 print "The percentage of $type_of_base was $percentage\n";
}
 
#notice how we didn't have to predefine any of the base types!
 
dump (\%base_count);
 
__END__
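The counting idea can also be pulled out into a small subroutine so you can reuse it. What follows is only a sketch under my own assumptions: count_bases is a hypothetical name, the sequence "GATTACA" is made up, and the subroutine hands the hash back as a plain list (this afternoon's lesson covers a better way to pass hashes around).

```perl
#!/usr/bin/perl

use strict;
use warnings;

# count_bases is a hypothetical helper: it returns a hash of
# character => count for one sequence string.
sub count_bases {
    my ($sequence) = @_;
    my %base_count;
    foreach my $base ( split //, $sequence ) {
        $base_count{$base} += 1;    # creates the key at 1 the first time
    }
    return %base_count;
}

my $sequence   = "GATTACA";    # made-up example sequence
my %base_count = count_bases($sequence);
my $total      = length $sequence;

# Sort the keys so the report comes out in the same order every run.
foreach my $base ( sort keys %base_count ) {
    my $percentage = 100 * $base_count{$base} / $total;
    printf "%s: %d of %d (%.1f%%)\n", $base, $base_count{$base}, $total, $percentage;
}
```

Notice that the subroutine never mentions A, C, G or T: whatever characters show up in the sequence become keys.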

Exercises


Basics


Problem1:hash_address.pl


  • Make a hash to store your name, street, city, state, and zip.
  • Print it out as:
name: Lenny Teytelman
street: 1 Santa Street
city: New York
state: NY
zip: 11235
  • Iterate through your hash and print out its values

Problem2:clear_hash.pl

  • Modify the script from Problem1, working with the same hash, to clear all values of the hash.


Problem3:make_a_hash.pl

  • Given a list of keys, create a hash with the lengths of keys as values
  • The input to the script will be the desired set of hash keys
  • Make an array with all the hash keys
  • Write a subroutine that will accept an array as input and will create a hash with keys as elements of the input array and lengths of keys as values
  • Return the hash from the subroutine
  • Print the created hash

If the subroutine input is an array:
my @some_array= qw(Jody Roseanne Rich);
Then the created hash should be:

("Rich" => 4,
 "Roseanne"     => 8,
 "Jody"     => 4,
);

Problem4:check_hash_for_keys.pl


  • Program will check a hash for user-specified keys.
  • Input arguments to the program are a list of keys to check
  • For each element of the input array, check whether it is a key in the input hash
  • Print the missing keys.

  • Test on the following hashes and key-inputs:

1
my %address_hash = (
            "name" => "Dan Pollard",
            "street" => "Building 84, 1 Cyclotron Road",
            "city" => "Berkeley",
            "state" => "CA",
            "zip" => "94720"
            );
 
my @keys_array=qw(name street zip);
Should print nothing.

2
my %address_hash = (
            "name" => "Dan Pollard",
            "street" => "Building 84, 1 Cyclotron Road",
            "city" => "Berkeley",
            "state" => "CA",
            "zip" => "94720"
            );
 
my @keys_array=qw(name street planet zip country);
Should print: "planet", "country"




Project


To get our hands on the cerevisiae proteins with a RAS domain, we will:

  1. Find gene coordinates of RAS-family genes in the GFF file
  2. Extract the DNA sequence for these gene CDSs from the cerevisiae genome
  3. Translate the CDSs into proteins.

Problem2:translate_orfs.pl

This script will accept a coding sequence fasta file as a command line argument and will translate each sequence into a protein. The script will consist of the following subroutines.

1: sub unwrap_linebreaks{}
  • Write a subroutine to unwrap fasta files, removing line breaks, using hashes (the header is your key and the sequence is your value).
  • The above subroutine should accept a file name and should return a hash structure with all the headers and their corresponding sequences.

2: sub reformat_to_codons{}
  • The input to this subroutine is the hash from part1, with sequence headers and sequences.
  • The output is a similar hash, but with the sequences in codon triplets (that is, the value in the hash will be the sequence string with space-separated codon triplets).

3: sub translate_to_proteins{}
  • The input is the hash from part2 with codon-formatted sequences.
  • The output is a hash with sequence headers and translated proteins.
  • To save you time, the codon-to-aa hash is provided below.
my %codon_table = (
   TCA => 'S',TCG => 'S',TCC => 'S',TCT => 'S',
   TTT => 'F',TTC => 'F',TTA => 'L',TTG => 'L',
   TAT => 'Y',TAC => 'Y',TAA => '*',TAG => '*',
   TGT => 'C',TGC => 'C',TGA => '*',TGG => 'W',
   CTA => 'L',CTG => 'L',CTC => 'L',CTT => 'L',
   CCA => 'P',CCG => 'P',CCC => 'P',CCT => 'P',
   CAT => 'H',CAC => 'H',CAA => 'Q',CAG => 'Q',
   CGA => 'R',CGG => 'R',CGC => 'R',CGT => 'R',
   ATT => 'I',ATC => 'I',ATA => 'I',ATG => 'M',
   ACA => 'T',ACG => 'T',ACC => 'T',ACT => 'T',
   AAT => 'N',AAC => 'N',AAA => 'K',AAG => 'K',
   AGT => 'S',AGC => 'S',AGA => 'R',AGG => 'R',
   GTA => 'V',GTG => 'V',GTC => 'V',GTT => 'V',
   GCA => 'A',GCG => 'A',GCC => 'A',GCT => 'A',
   GAT => 'D',GAC => 'D',GAA => 'E',GAG => 'E',
   GGA => 'G',GGG => 'G',GGC => 'G',GGT => 'G',
);

Problem3:get_gff_cds.pl

Given a list of gene IDs and a GFF file, extract the CDS entries from the GFF file.

  • Arguments to the program will be a file with gene IDs and a GFF file.
1:sub make_gene_list{}
  • Input parameter is the file name of the file with the IDs. Assume the IDs will be one per line of the file.
  • Function will create and return an array with the genes.
2:sub get_gff_records{}
  • Inputs are the GFF file name and the array with gene IDs.
  • This subroutine will not return anything but will print the matching GFF lines.

  • Use the cerevisiae GFF file:
  • Here are the 23 genes with a RAS domain:
  • Make sure you get the CDS entries and that you get exactly 23 of them
  • Use the "gene=" in the GFF file to match the IDs.
  • If you look for "YPT1", both "YPT1" and "YPT10" will match. Consult "Learning Perl" on the "\b" anchor to make sure that you only get "YPT1" when you check for it.
  • Save the output in "cer_ras_annotations.gff"
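As a small hint for the "\b" bullet above, here is a sketch of the difference the anchor makes. The two attribute strings are made up in the spirit of a GFF line (real lines have more columns and may be laid out differently), and the variable names are my own.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Made-up attribute strings; a real GFF file will look different.
my @attributes = (
    "ID=x1;gene=YPT1;Note=a",
    "ID=x2;gene=YPT10;Note=b",
);

my $gene_id = "YPT1";

# Without an anchor, "YPT1" also matches the start of "YPT10".
my @loose = grep { /gene=$gene_id/ } @attributes;

# \b demands a word boundary right after the ID, so "YPT10" is rejected.
# \Q...\E keeps any special characters in the ID from acting as regex syntax.
my @exact = grep { /gene=\Q$gene_id\E\b/ } @attributes;

print scalar(@loose), " loose matches, ", scalar(@exact), " exact match\n";
```

The loose pattern finds both strings, while the anchored one finds only the YPT1 entry.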

Problem4:extract_fasta_sequence.pl

Given a GFF annotations file and a fasta file, make a new fasta file with the GFF-specified subsequences.

  • Arguments to the program will be a fasta file and a GFF file.

  • Create a hash that will store header ids as keys and the fasta sequence as a value. You don't have to write this again! Copy-paste the "unwrap_linebreaks" subroutine from Problem2.
  • Use the hash to extract the GFF subsequences for each GFF entry.
  • As you print out the subsequence, the new header should be ">gene_name chromosome start stop"
  • Remember about the strand of the CDS - use Venky's rev_comp subroutine to reverse-complement the sequences.
  • We have modified the cerevisiae genome fasta so that the first word of each header is the chromosome. Use this file:
  • Run the program on the above fasta file and the cer_ras_annotations.gff from the previous exercise.
  • Save the output in cer_ras_codingsequence.fa

Problem5:Getting cerevisiae RAS proteins

  • Use the translate_orfs.pl from Problem2 to translate cer_ras_codingsequence.fa
  • Save the output in "cer_ras_proteins.fa"



Bonus problem:hash_implementation.pl

Imagine that there are no hashes in Perl (some other languages don't have them built in).
Implement hash functionality using arrays.

Your script should have subroutines:

  • add_to_hash("key","value")
  • delete_from_hash("key")
  • find_in_hash("key")
  • exists_in_hash("key")
  • print_hash()

Now use your script to create a hash of all cerevisiae genes and their lengths and then to print it out to the screen. Time this script and compare to a script that uses Perl's hashes.
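One possible way to time a piece of code from inside a script is the Time::HiRes module, which comes with Perl. This is just a sketch of the timing part: the 10,000 made-up gene names stand in for the real cerevisiae genes, and you would wrap your own array-based version in the same way to compare.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use Time::HiRes qw(time);    # replaces time with a sub-second version

# Build a made-up hash of 10,000 gene names and lengths.
my %gene_length;

my $start = time();
for my $i ( 1 .. 10_000 ) {
    $gene_length{"gene$i"} = $i;
}
my $elapsed = time() - $start;

printf "Inserted %d keys in %.4f seconds\n",
    scalar( keys %gene_length ), $elapsed;
```

The exact number of seconds will vary from machine to machine and from run to run, so compare the two versions on the same computer.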

Solutions