Multi-dimensional data structures 2: Hashes of Arrays, Arrays of Hashes, Hashes of Hashes, more sorting



  • Keep nested data structures homogeneous. Don't mix scalars and hashes within an array. An Array of Arrays should contain only array references as elements, not a mixture of array references, scalars, and hash references.

Hashes of Arrays


Consider the following code from the Arrays of Arrays lecture yesterday:
my @instructors = ("Jaime", "Lenny", "Dan");
my @assistants  = ("Rich", "Rose", "Jody");
my @students    = ("Aki",  "Chris", "Stephanie");
 
my @class_AoA=(\@instructors, \@assistants, \@students);
 

@class_AoA contains three elements, but it is up to the programmer to remember that the first element is the list of instructors, the second the assistants, and the third the students. The group names were lost when we built the array of arrays. The solution is to store the array references in a hash, keyed by group name, instead of in an array.
my %class_HoA=();
 
my @instructors = ("Jaime", "Lenny", "Dan");
my @assistants  = ("Rich", "Rose", "Jody");
my @students    = ("Aki",  "Chris", "Stephanie");
 
$class_HoA{"instructors"} = \@instructors;
$class_HoA{"assistants"}  = \@assistants;
$class_HoA{"students"}    = \@students;
 
And just as we could create arrays of arrays anonymously:
my @class_AoA=();
 
push (@class_AoA, [("Jaime", "Lenny", "Dan")]);
push (@class_AoA, [("Rich", "Rose", "Jody")]);
push (@class_AoA, [("Aki",  "Chris", "Stephanie")]);
 
We can do the same with hashes of arrays:
my %class_HoA=();
 
$class_HoA{"instructors"} = [("Jaime", "Lenny", "Dan")];
$class_HoA{"assistants"}  = [("Rich", "Rose", "Jody")];
$class_HoA{"students"}    = [("Aki",  "Chris", "Stephanie")];
 
To access elements within Hashes of Arrays:
my %class_HoA=();
 
$class_HoA{"instructors"} = [("Jaime", "Lenny", "Dan")];
$class_HoA{"assistants"}  = [("Rich", "Rose", "Jody")];
$class_HoA{"students"}    = [("Aki",  "Chris", "Stephanie")];
 
#the dereferenced array under "students" key
print @{$class_HoA{"students"}},"\n";
 
#the second element of the dereferenced array under "students" key
print $class_HoA{"students"}->[1],"\n";
 
#iterating through the whole Hash of Arrays
foreach my $key (keys %class_HoA){
    print "$key: ", join (",",@{$class_HoA{$key}}), "\n";
}
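
Hashes of Arrays can also grow after they are created. The following is a minimal sketch, assuming we want to add people to the class; the names "Pat" and "Sam" and the "auditors" key are made up for illustration. It shows push on a dereferenced inner array, and that Perl creates the inner array automatically when the key does not exist yet.

```perl
use strict;
use warnings;

my %class_HoA = (
    "instructors" => ["Jaime", "Lenny", "Dan"],
    "students"    => ["Aki", "Chris", "Stephanie"],
);

# add one more student to the existing array under "students"
# ("Pat" is a hypothetical name)
push @{ $class_HoA{"students"} }, "Pat";

# "auditors" does not exist yet; Perl creates the inner array for us
# ("auditors" and "Sam" are hypothetical)
push @{ $class_HoA{"auditors"} }, "Sam";

# number of elements in each inner array
foreach my $key (sort keys %class_HoA) {
    my $count = scalar @{ $class_HoA{$key} };
    print "$key has $count member(s)\n";
}
```
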

Arrays of Hashes

When learning about basic hashes, we created address records:
my %address_hash1 = (
            "name" => "Dan Pollard",
            "city" => "Berkeley",
            "state" => "CA",
            );
 
my %address_hash2 = (
            "name" => "Lenny Teytelman",
            "city" => "Albany",
            "state" => "CA",
            );
Just as we could assign array references as elements of arrays, we can assign hash references inside arrays.
my %address_hash1 = ("name" => "Dan","city" => "Berkeley","state" => "CA");
 
my %address_hash2 = ("name" => "Lenny","city" => "Albany","state" => "CA");
 
my @address_AoH = (\%address_hash1, \%address_hash2);
 
To do this anonymously, surround the key/value list in curly brackets {} instead of parentheses. As a mnemonic: individual array elements are accessed through an index inside "[]", and anonymous arrays are created with "[]"; individual hash values are accessed through a key inside "{}", and anonymous hashes are created with "{}".
my @address_AoH = ( {"name" => "Dan","city" => "Berkeley","state" => "CA"},
                    {"name" => "Lenny","city" => "Albany","state" => "CA"},
                  );
To access Arrays of Hashes:
my @address_AoH = ( {"name" => "Dan","city" => "Berkeley","state" => "CA"},
                    {"name" => "Lenny","city" => "Albany","state" => "CA"},
                  );
 
foreach my $hash_ref (@address_AoH){
    foreach my $key (keys %$hash_ref){
        print "$key => ", $hash_ref->{$key}, "\n";
    }
    print "\n";
}
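
Besides looping over everything, a single value in an Array of Hashes can be reached directly: array index first, then hash key. The sketch below assumes the same address records as above; the third record ("Pat" in "Oakland") is invented for illustration.

```perl
use strict;
use warnings;

my @address_AoH = (
    {"name" => "Dan",   "city" => "Berkeley", "state" => "CA"},
    {"name" => "Lenny", "city" => "Albany",   "state" => "CA"},
);

# city of the first record: index first, then key
my $first_city = $address_AoH[0]->{"city"};
print "$first_city\n";

# the arrow between two adjacent subscripts is optional
my $second_name = $address_AoH[1]{"name"};
print "$second_name\n";

# append a new anonymous hash (a made-up record for illustration)
push @address_AoH, {"name" => "Pat", "city" => "Oakland", "state" => "CA"};
my $record_count = scalar @address_AoH;
print "$record_count records\n";
```
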

Hashes of Hashes

What if instead of just storing address records in an array, we want to be able to access the address of a person, given the name?
my %address_HoH = ();
 
my %address_hash_dan = ("city" => "Berkeley","state" => "CA");
 
my %address_hash_lenny = ("city" => "Albany","state" => "CA");
 
$address_HoH{"Dan"}= \%address_hash_dan;
$address_HoH{"Lenny"}= \%address_hash_lenny;
And anonymously:
my %address_HoH = ( "Dan"   => {"city" => "Berkeley","state" => "CA"},
                    "Lenny" => {"city" => "Albany","state" => "CA"},
                  );
To get inside the hashes of hashes:
my %address_HoH = ( "Dan"   => {"city" => "Berkeley","state" => "CA"},
                    "Lenny" => {"city" => "Albany","state" => "CA"},
                  );
 
print $address_HoH{"Dan"}->{"state"}, "\n";
print $address_HoH{"Lenny"}{"city"}, "\n";
 
foreach my $name_key (keys %address_HoH){
    foreach my $address_key (keys %{$address_HoH{$name_key}}){
        print "$name_key => $address_key => ",
               $address_HoH{$name_key}{$address_key}, "\n";
    }
    print "\n";
}
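
Two other common operations on a Hash of Hashes are adding a field to an inner hash and checking whether a name is present at all. The sketch below is one way to do both; the "zip" value and the name "Pat" are assumptions made up for the example. Testing the outer key with exists first matters because looking up an inner key under a missing name quietly creates an empty inner hash for that name.

```perl
use strict;
use warnings;

my %address_HoH = (
    "Dan"   => {"city" => "Berkeley", "state" => "CA"},
    "Lenny" => {"city" => "Albany",   "state" => "CA"},
);

# add a field to an existing inner hash (the "zip" value is invented)
$address_HoH{"Dan"}{"zip"} = "94720";

# check the outer key before looking inside, so that a missing
# name is not silently created as an empty inner hash
my @found;
foreach my $name ("Lenny", "Pat") {
    if (exists $address_HoH{$name}) {
        push @found, "$name lives in $address_HoH{$name}{city}";
    }
    else {
        push @found, "$name: no address on file";
    }
}

foreach my $line (@found) {
    print "$line\n";
}
```
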

Deconvoluting complex datastructures

Hashes of Hashes and the rest of the bestiary are insanely confusing. You can create such a structure and 5 lines later have no clue how to access what's inside. Try to zoom into the inner layer, one loop at a time.
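
One way to zoom in, sketched below under the assumption that we have a small lab hash like the one used in the sorting section, is to pull out one layer at a time into its own variable and ask Perl what it is with the built-in ref function before going deeper.

```perl
use strict;
use warnings;

my %lab_hash = (
    "rine"  => {"lenny" => 18, "bilge" => 19},
    "eisen" => {"dan" => 55},
);

# layer 1: what kind of thing sits under one outer key?
# ref returns "HASH", "ARRAY", or "" for a plain scalar
my $inner      = $lab_hash{"rine"};
my $inner_type = ref($inner);
print "under 'rine' there is a $inner_type reference\n";

# layer 2: now that we know it is a hash, list its keys
my @members = sort keys %$inner;
print "members: @members\n";

# layer 3: a single innermost value
my $size = $inner->{"lenny"};
print "lenny => $size\n";
```

Once each layer is understood on its own, the intermediate variables can be folded back into a single expression such as $lab_hash{"rine"}{"lenny"}.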

Sorting

When printing out Hashes of Hashes or Hashes of Arrays, we can use the sort function.
foreach my $key (sort keys %lab_hash){
    foreach my $innerkey (sort keys %{$lab_hash{$key}}){
        print "$key\t$innerkey\t" , $lab_hash{$key}{$innerkey},"\n";
    }
}
 
(But we can't do that with Arrays of Arrays or Arrays of Hashes - why?)

Recall that the default sort is alphabetic: it compares elements as strings, so 10 sorts before 9. In the subroutines lecture, Dan let us peek into the world of sort in a way that gave us more power.
 my @ascending_numerically_sorted = sort ascending_numerically @genome_sizes;
 
 sub ascending_numerically {
 # use spaceship operator to do the comparison and set the value to return
 return($a <=> $b);
 }
This is the same as the inline:
 my @ascending_numerically_sorted = sort {$a <=> $b} @genome_sizes;
Remember that the special $a and $b are just elements of the list that sort is working on. We can use them in very creative ways in the subroutine that governs the sorting behavior.

Sorting by brain size:
my %brains_hash = ("lenny" => 18,
                   "dan"   => 55,
                   "andy"  => 20,
                   "bilge" => 19,
                   "jamie" =>  7,
                   "emily" => 18,
                   "erin"  => 24,
                  );
 
 
foreach my $student_name (sort {$brains_hash{$a} <=> $brains_hash{$b}} keys %brains_hash){
    print "$student_name ", $brains_hash{$student_name}, "\n";
}
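
In the hash above, lenny and emily have the same brain size, so their relative order in the output is not determined by the comparison. A possible refinement, sketched here as an assumption about what we might want rather than part of the original lecture code, is to chain a second comparison with "or": it is used only when the first comparison returns 0 (a tie).

```perl
use strict;
use warnings;

my %brains_hash = (
    "lenny" => 18, "dan"   => 55, "andy" => 20, "bilge" => 19,
    "jamie" => 7,  "emily" => 18, "erin" => 24,
);

my @ordered = sort {
    $brains_hash{$a} <=> $brains_hash{$b}   # first by size, ascending
        or
    $a cmp $b                               # then by name for equal sizes
} keys %brains_hash;

foreach my $student_name (@ordered) {
    print "$student_name ", $brains_hash{$student_name}, "\n";
}
```
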
Sorting by number of lab members in Hash of Hashes:
my %lab_hash = ();

while (my $file_line=<STDIN>){
    chomp $file_line;

    # fields are tab-separated: lab name, member name, brain size
    my ($lab_name, $member, $brain_size) = split /\t/, $file_line;

    $lab_hash{$lab_name}{$member}=$brain_size;
}
 
foreach my $key (sort { scalar(keys %{$lab_hash{$b}}) <=> scalar (keys %{$lab_hash{$a}}) } keys %lab_hash){
    foreach my $innerkey (sort keys %{$lab_hash{$key}}){
        print "$key\t$innerkey\t" , $lab_hash{$key}{$innerkey},"\n";
    }
}
 
__END__
 input (fields must be separated by tabs):
 rine lenny 18
 eisen dan 55
 alber andy 20
 rine bilge 19
 alber jamie 7
 eisen emily 18
 rine erin 24
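
To try the lab-size sort without a separate input file, the sample records can be kept in the script itself. This is only a sketch: the @records array and @labs_by_size are names introduced here, and the tie-break on lab name is an added assumption so that labs of equal size come out in a predictable order.

```perl
use strict;
use warnings;

# the sample records, kept in the script instead of read from STDIN
my @records = (
    "rine\tlenny\t18", "eisen\tdan\t55",  "alber\tandy\t20",
    "rine\tbilge\t19", "alber\tjamie\t7", "eisen\temily\t18",
    "rine\terin\t24",
);

my %lab_hash;
foreach my $file_line (@records) {
    my ($lab_name, $member, $brain_size) = split /\t/, $file_line;
    $lab_hash{$lab_name}{$member} = $brain_size;
}

# labs with the most members first; equal-sized labs in name order
my @labs_by_size = sort {
    scalar(keys %{ $lab_hash{$b} }) <=> scalar(keys %{ $lab_hash{$a} })
        or
    $a cmp $b
} keys %lab_hash;

foreach my $lab (@labs_by_size) {
    my $count = scalar keys %{ $lab_hash{$lab} };
    print "$lab\t$count\n";
}
```
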
Sorting by brain size, within each lab:
foreach my $key (sort keys %lab_hash){
    foreach my $innerkey (sort {$lab_hash{$key}{$a} <=> $lab_hash{$key}{$b}} keys %{$lab_hash{$key}}){
        print "$key\t$innerkey\t" , $lab_hash{$key}{$innerkey},"\n";
    }
}
 
Sorting by brain size, within each lab, in descending order:
foreach my $key (sort keys %lab_hash){
    foreach my $innerkey (sort {$lab_hash{$key}{$b} <=> $lab_hash{$key}{$a}} keys %{$lab_hash{$key}}){
        print "$key\t$innerkey\t" , $lab_hash{$key}{$innerkey},"\n";
    }
}
 
Sorting by a field in Array of Arrays, via references:
my @lab_AoA =   (["lenny" , 18],
                 ["dan"   , 55],
                 ["andy"  , 20],
                 ["bilge" , 19],
                 ["jamie" ,  7],
                 ["emily" , 18],
                 ["erin"  , 24],
                );
 
print "sorting by brain size:\n";
foreach my $row_arrayref (sort {$a->[1] <=> $b->[1]  } @lab_AoA){
    print join("," , @$row_arrayref),"\n";
}
 
print "now sorting by name:\n";
 
foreach my $row_arrayref (sort {$a->[0] cmp $b->[0]  } @lab_AoA){
    print join("," , @$row_arrayref),"\n";
}
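
The same idea carries over to Arrays of Hashes: $a and $b are then hash references, and the comparison picks one field out of each. A minimal sketch, assuming the lab data were stored with "name" and "size" keys (key names chosen here for illustration):

```perl
use strict;
use warnings;

my @lab_AoH = (
    {"name" => "lenny", "size" => 18},
    {"name" => "dan",   "size" => 55},
    {"name" => "jamie", "size" => 7},
);

# $a and $b are hash references here, so compare one field of each;
# swapping $a and $b gives descending order
my @by_size = sort { $b->{"size"} <=> $a->{"size"} } @lab_AoH;

my @names;
foreach my $record_hashref (@by_size) {
    push @names, $record_hashref->{"name"};
}
print join(",", @names), "\n";
```
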
 

Basic Exercises

  • Problem1, seating_chart_HoA.pl
Modify the seating chart from 7.1 to use hashes of arrays. The hash keys should be row numbers, and the values should be array_refs with all the students in that row. Again, print out the data structure nicely without dump.

  • Problem2
Modify seating_chart_HoA.pl above to print out the students in each row in alphabetical order.

  • Problem3, gff_to_AoH.pl
Modify gff_to_matrix.pl to store the GFF gene records as hashes inside an array. The outermost array will contain all the gene records. The gff_record_hash should have a key for each GFF field.

  • Problem4, gff_to_HoH.pl
Modify the script from problem3 above to store each GFF record in a hash of hashes with the outermost key being the gene name. Retain just chromosome, start, stop, and strand in the inner hash.

  • Problem5, gff_parser.pl
Modify gff_to_HoH.pl above to parse a GFF file, pulling out all gene entries as follows:

- Using a complex data structure, store each gene, start, stop, and strand in a multi-dimensional hash (i.e., don't use arrays).
- The topmost key should be the chromosome number (sequence name).
- So, keys %{$gff_hash{"chrI"}} should return all gene names on chromosome I.

Project

blast_parser_with_HoH.pl

Your script will accept the name of a BLAST report file (tabular output, generated with the -m 9 option) as well as an e-value threshold on the command line.

  • For each query sequence, BLAST output contains multiple subject sequences (hits); for each subject sequence, it contains multiple HSPs. Each HSP is represented by a single row in the BLAST m9 output.
  • Parse this report file into a data structure in which all the HSPs for a given query-species are grouped together. Exclude HSPs that failed to pass the e-value threshold specified on the command line. What data structure would you use?
  • For this particular application, we do not need the percent identity, alignment length, and so on. We just need the best hit to our query Ras protein within each species (so store the e-value and hit name).
  • Sort the HSPs for a given query-species set in order of worsening e-value. Assuming that each subject can be assigned the e-value of its best HSP, now sort the subjects in order of worsening e-value.
  • The output from this script is the query protein ID, a tab, and then a comma-separated list of the best hit protein ID from each fungal genome (this is the subject with the best e-value for each species).
  • Use this script to discover the best hits for each of the Ras proteins, and collect the output into a single file.

The RAS proteins file for blasting is here:


Solutions



There are four slightly different solutions to the multi-d blast parsing (one instructor's, one student's, one from a TA, and one from a TA/former student). It may be highly beneficial to look over each of the different scripts.