Modules


Modules at their simplest, and that's all we'll be using them for, are collections of subroutines that you can store in a central location and access from a program.

Here is a sample module. Note the package declaration and the trailing one. You want to name this intro.pm and remember what directory you saved it in.

package intro;
 
#it's good to comment your modules a lot
 
#this subroutine takes in an array reference
#it will shift seating moving the last person from a row
#to the beginning of the next row.
#it includes special handling of the last row.
sub seating_shift {
    my ($original_chart_ref) = @_;
    my @original_chart = @$original_chart_ref;
    for (my $i=0; $i< scalar(@original_chart); $i++){
        if ($i == scalar(@original_chart)-1){
             unshift @{$original_chart[0]}, (pop@{$original_chart[$i]});
        }
        else {
            unshift @{$original_chart[$i+1]}, (pop@{$original_chart[$i]});
        }
    }
    my @shifted_chart = @original_chart;
    return(@shifted_chart);
}
 
1;
__END__
YOU MUST INCLUDE THE ONE!
To use this module:

#!/usr/bin/perl
# Author        :
# Date          : Wed Aug 15 10:52:43 UTC 2007
# Description   :
 
use lib '/home/student','/home/student/Desktop/';
use intro;
 
use strict;
use warnings;
use Data::Dump qw (dump);
use lib '/home/student','/home/student/Desktop/';
use intro;
 
my @row1 = ("Lenny", "Jaime", "Dan");
my @row2 = ("Jody", "Rich", "Rose");
my @row3 = ("Billy", "Ksenia");
 
my @seating_chart = (\@row1, \@row2, \@row3);
 
print "\nFRONT\n";
 
for (my $row=0; $row < scalar(@seating_chart); $row++){
    my $regular_count = $row+1;
    print "row $regular_count: @{$seating_chart[$row]}\n"
}
print "BACK\n";
 
 
@seating_chart = intro::seating_shift(\@seating_chart);#NOTE THE MODULE CALL
 
print "\nFRONT\n";
 
for (my $row=0; $row < scalar(@seating_chart); $row++){
    my $regular_count = $row+1;
    print "row $regular_count: @{$seating_chart[$row]}\n"
}
print "BACK\n";
 
__END__

And... it can co-exist with subroutines from the program itself. You use the :: to specify where the subroutine is from. But in general you want to have unique names.
#!/usr/bin/perl
# Author        :
# Date          : Wed Aug 15 10:52:43 UTC 2007
# Description   :
 
use lib '/home/student','/home/student/Desktop/';
use intro;
 
use strict;
use warnings;
use Data::Dump qw (dump);
use lib '/home/student','/home/student/Desktop/';
use intro;
 
my @row1 = ("Lenny", "Jaime", "Dan");
my @row2 = ("Jody", "Rich", "Rose");
my @row3 = ("Billy", "Ksenia");
 
my @seating_chart = (\@row1, \@row2, \@row3);
 
print "\nFRONT\n";
 
for (my $row=0; $row < scalar(@seating_chart); $row++){
    my $regular_count = $row+1;
    print "row $regular_count: @{$seating_chart[$row]}\n"
}
print "BACK\n";
 
 
@seating_chart = intro::seating_shift(\@seating_chart);#NOTE THE MODULE CALL
 
print "\nFRONT\n";
 
for (my $row=0; $row < scalar(@seating_chart); $row++){
    my $regular_count = $row+1;
    print "row $regular_count: @{$seating_chart[$row]}\n"
}
print "BACK\n";
 
__END__
You can use existing modules, for which there is documentation on the web. For example you can get access to Trig functions that are not natively built into perl using Math::Trig
#!/usr/bin/perl
# Author        :
# Date          : Thu Aug 16 10:02:33 UTC 2007
# Description   :
 
use strict;
use warnings;
use Getopt::Long;
use Math::Trig;
 
my $x = tan(0.9);
my $y = cosh(0.8);
 
print "$x  $y\n\n";
__END__
Or you can use Getopt::Long to deal with command line options (like you have seen in blast).
#!/usr/bin/perl
# Author        :
# Date          : Thu Aug 16 10:02:33 UTC 2007
# Description   : You can use --verbose or -v
 
use strict;
use warnings;
use Getopt::Long;
 
#GIVE DEFAULT VALUES
my $length  = 5;
my $data    = "";
my $verbose = '';
 
my $result = GetOptions(
    "length=i" => \$length,    # numeric
    "file=s"   => \$data,      # string,
    'verbose'  => \$verbose,
);
if ($verbose) {
    print "length is $length\n";
    print "the file I want to open is $data\n"
}
__END__

pymol

*Open with pymol at the prompt. Pymol has its own prompt at the bottom of it's screen which can be used to move around directories and enter commands.
*Pymol is great for representing proteins using different views. To get an overview of this, you can go to the Menu: Wizard->Demo->Representation
*The online manual (at pymol.org) is great for figuring out syntax and different capabilities. In general you want to format commands as: COMMAND option, selection. For example: colour red, chain a. Or: show spheres, residue 102 and chain a. etc.
*PDBs are flat text files that you can edit in gedit. One thing that you might want to do is delete all the waters.
*you can use the @ to run series of commands that are stored in a text file in your working directory
colour green, chain a
show surface, chain a
color yellow, residue 102 and chain a
*you can alter the PDB in memory to set bvalues to store any value you want in the b-factor field. this will come in handy for mapping conservation to the structure in the project. You can reset the field You can use a perl script to write out a list of commands for pymol to use. So to reset all bvalues on the entire PDB: alter 2GKG, b=5. To Reset just residue 50: alter residue 50, b=5. To alter just the protein residue: alter residue 50 and chain a, b=5.
*you can then use the command spectrum b to color by b factor.
*so, to put it all together... Get a value of positional conservation for each residue. Use perl to output "alter residue X, b=Y" commands to a file. Use pymol to run this command file using the "@" symbol at the pymol prompt. Ray the image. Bingo-> you have automatically created a kick ass figure.

ploticus

*A great open source graphing program.
*use perl to reformat input and ploticus prefabs (-prefab) to format pretty graphs.
*you can use perl system calls to run ploticus (including varialbes in the call if you want)!
1 1
2 4
3 9
4 16
5 25

ploticus -prefab scat data=input.data x=1 y=2 -png
It will also output jpeg, svg, etc

PRACTICE PROBLEMS

Download the PDB file for 2GKG from the PDB. Using this list of comma separated values alter the B factors for the protein chain. Colour the PDB by b factor. Did you reset all the values first (including the waters?). Include in your "@" script lines to make all residues with conservation values above 2 show as spheres.

In one perl script: open your sac.gff files and count the number of genes on each chromosome, then call ploticus and output the result as a jpeg bargraph (prefab vbars) using ploticus and then as a pie chart (prefab pie).

PROJECT

Part 3: Get target proteins


  • Once you've identified the best hit fungal proteins for the S.cer Ras dataset, write a script to read this information in from the file you created in Part 2.
  • Your script should now read the fungal proteins FASTA file (/usr/local/data/pep/fungal.combined.pep.fasta.gz) and obtain the protein sequence of each of the best hit proteins. You may only read through the FASTA file once, line-by-line. You may not read the entire file into a data structure or into memory. You may not loop through the IDs of the best hit proteins multiple times after you've read them into memory.
  • You will also need to read in the FASTA file containing the S.cer Ras protein sequences, and get the protein sequence of each of the S.cer Ras proteins. This is a smaller file, and it is acceptable to read this into memory.
  • The result of running this script will be a set of multi-FASTA files, each file containing one S.cer protein plus all of its best hit fungal sequences. Each file will be named for the ID of the S.cer protein. Each sequence in the file will be named for the species it came from, like >scer. Make sure you don't lose any information, though -- append the original identifier to the FASTA header.

Solution:

Part 4: Multiple alignments


  • Add the human protein to each of your FASTA files (see below).
Solution:

  • Run the FASTA files through ClustalW. Output the files in PIR format. You should write a perl script that will do this for all 23 multiple FASTA files and create an alignment for each one.
Solution:

  • Parse the resulting alignments, and in each column that is not a gap for the human protein, return the percent of rows that match the amino acid of the human protein (percent identity). The human pdb and fasta files start at coordinate 2. Use perl to generate a pymol input file that takes advantage of ("alter residue #, b=?") to map percent identity onto the structure..

>human_cdc42
QTIKCVVVGDGAVGKTCLLISYTTNKFPSEYVPTVFDNYAVTVMIGGEPYTLGLFDTAGLEDYDRLRPLSYPQTD
VFLVCFSVVSPSSFENVKEKWVPEITHHCPKTPFLLVGTQIDLRDDPSTIEKLAKNKQKPITPETAEKLARDLKAVKYVE
CSALTQKGLKNVFDEAILAALEPPEPKKSRRCVL