Modules Continued: CPAN


The Standard Perl Library


You have been using modules from day 1 of the course.

use strict;
use warnings;

"strict" and "warnings" are both modules that are part of the Standard Perl Library. They are examples of pragmas, which are modules that change the way Perl actually compiles. Typically, these are the only pragmas you'll use but to learn more about pragmas check out Chapter 31 in Programming Perl:

http://www.unix.org.ua/orelly/perl/prog3/ch31_01.htm
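To see what a pragma does in practice, here is a minimal sketch of "strict" catching a misspelled variable name. The variable names are made up for illustration, and the faulty code is compiled inside a string eval only so that the script can print the error instead of refusing to start.

```perl
use strict;
use warnings;

# Code with a typo: $totl was never declared (the intended name is $total).
# It is kept in a string so that we can compile it with eval and inspect
# the error, rather than having the whole script refuse to compile.
my $code = q{
    use strict;
    my $total = 5;
    $totl = $total + 1;
    1;
};

my $compiled_ok = eval $code;
my $error       = $@;

if ($compiled_ok) {
    print "Compiled without complaint\n";
}
else {
    print "strict caught a problem: $error";
}
```

Without "use strict", Perl would quietly create a new variable called $totl and carry on, which is a very hard kind of bug to track down.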

Modules that are not pragmas contain subroutines (or functions) that you can use in your code. Jaime introduced some of the Standard Modules that are in the Standard Perl Library. For instance:

use Math::Trig;
use Getopt::Long;

There are dozens of modules in the Standard Perl Library that can always be used in Perl code. To learn more about these, check out Chapter 32 in Programming Perl:

http://www.unix.org.ua/orelly/perl/prog3/ch32_01.htm
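As a small illustration of using a standard module, the sketch below converts angles with Math::Trig. The particular angles are arbitrary; the functions shown are ones that Math::Trig exports by default.

```perl
use strict;
use warnings;
use Math::Trig;   # exports pi, deg2rad, rad2deg and friends by default

# Convert an angle from degrees to radians and back again.
my $radians = deg2rad(180);
my $degrees = rad2deg(pi / 2);

printf "180 degrees = %.4f radians\n", $radians;
printf "pi/2 radians = %.1f degrees\n", $degrees;
```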

To reiterate: every platform that has Perl installed also has the Standard Perl Library, and therefore any Perl script written using only the Standard Perl Library will work on any platform.

Also, if you are ever unsure if a module is installed on your computer or what functions are available in a given module, use the unix command "perldoc":

 > perldoc Math::Trig

If the perldoc doesn't tell you everything you need to know to use a module, you can google search for it and usually find some nice examples of using the module.
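You can also check for a module from inside Perl. The sketch below tries to load each module and reports whether that worked. The helper name "module_installed" is made up for this example, and note that the check really does load the module if it is present.

```perl
use strict;
use warnings;

# Return 1 if the named module can be loaded on this computer, else 0.
# "module_installed" is a made-up helper name, not a built-in.
sub module_installed {
    my ($module) = @_;

    # require expects a file path, so Math::Trig becomes Math/Trig.pm
    (my $file = "$module.pm") =~ s{::}{/}g;

    return eval { require $file; 1 } ? 1 : 0;
}

for my $module ('Math::Trig', 'Getopt::Long', 'Statistics::Lite') {
    printf "%-18s %s\n", $module,
        module_installed($module) ? 'installed' : 'NOT installed';
}
```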

Your Own Modules


As Jaime introduced, you can write your own modules. As you develop a code-base, it is great to have a central place to keep subroutines that you expect to use over and over again. It is generally a good idea to organize your subroutines into modules with a common theme. For instance, you might have a FASTA module that holds all of your subroutines that read in, print out, manipulate and make calculations on FASTA files. You might have a BLAST module that has subroutines for parsing BLAST output. You get the idea. Naming is really important when writing your subroutines and modules. Be descriptive and specific, but also try to keep it brief. As Jaime said, comment comment comment comment!
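Here is a minimal sketch of what such a themed module might look like. The module name "FastaTools" and the subroutine "gc_content" are hypothetical. Normally the package would be saved in its own file (FastaTools.pm); it is written inline here only so the example runs as a single script.

```perl
use strict;
use warnings;

# In practice this package would live in its own file, FastaTools.pm,
# ending with a true value ("1;"). It is inlined here so the sketch runs
# as a single script. The name FastaTools is hypothetical.
package FastaTools;

# Return the fraction of G and C bases in a DNA sequence (0 to 1).
sub gc_content {
    my ($sequence) = @_;
    return 0 unless length($sequence);

    # tr/// with an empty replacement just counts matching characters
    my $gc_count = ($sequence =~ tr/GCgc//);
    return $gc_count / length($sequence);
}

package main;

# A script that had loaded the module with "use FastaTools;" would call
# the subroutine by its full name like this:
printf "GC content: %.2f\n", FastaTools::gc_content('ATGCGCTA');
```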

Perhaps this is obvious, but if you write code that uses your own modules, the script on its own will not work on another computer. If you move the script AND the modules, and you remember to "use lib" to tell your script where to look for the modules, then everything will work fine.
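One way to keep a script and its modules portable is to store the modules in a directory next to the script and point "lib" at that location. The sketch below assumes a directory called "modules" beside the script; that layout is an assumption for this example, not a requirement.

```perl
use strict;
use warnings;

# FindBin is a standard module; $FindBin::Bin holds the directory that
# contains the running script.
use FindBin;

# Assumed layout: a "modules" directory sitting next to this script.
use lib "$FindBin::Bin/modules";

# With the search path extended, a line such as "use FastaTools;" would
# now find modules/FastaTools.pm (FastaTools is a hypothetical module).
print "Perl will search these directories for modules:\n";
print "  $_\n" for @INC;
```

Because the path is worked out from wherever the script lives, the script and its "modules" directory can be copied together to another computer without editing the path.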

All Other Modules


As you can imagine, because any Perl programmer can write their own modules, thousands of modules have been written by Perl programmers all around the world to accomplish millions of tasks. Almost anything you can think of has probably already been written in a Perl module. Instead of reinventing the wheel, it is often nice to use what's out there.

The primary repository for Perl modules is CPAN:

http://www.cpan.org/

These modules have been tested and, though not guaranteed to work, are very likely to work as their descriptions imply. That said, BE CAREFUL. Test a module with a simple example to make sure it is doing what you think it is doing. Even if the module doesn't have bugs, it might be doing something slightly different from what you think it's doing, just because you misunderstood the description.

An Example


Let's say you wanted to get some statistics for a dataset. You could write subroutines to calculate basic statistics for an array of numbers (you've already done some of this), but perhaps there is a module in CPAN that already does everything you need.

Searching For And Picking A Module


Go to cpan.org, click on "Perl modules" and then on Randy Kobes' search. Searching for "Statistics" returns 131 modules. Typically your top hit will be your best choice, although it's worth scanning through the results to see if something else appears to be a better fit.

"Statistics::Descriptive" was the top hit for this search. Clicking on the name gives you a page with basic information on the module as well as a download link. Clicking again on the name takes you to an HTML version of what you would see if you used "perldoc". Reading through it, the module may or may not seem like what you want. In this case it looks like a reasonable choice, but it's talking about "object oriented" Perl and is perhaps a little too hardcore for what you need.

Going back to the search results, the next hit seems too specific, but the third hit, "Statistics::Basic", might be OK. Clicking through, though, reveals that it doesn't even have a documentation page. That's no good. Glancing down the search results, you come upon "Statistics::Lite" with the brief description "Small stats stuff". That sounds promising. Clicking through, the documentation describes the module as an easy-to-use, non-object-oriented stats module. That sounds perfect.

Downloading And Adding To Perl Library


Depending on what kind of operating system you are using, you may have a couple of options for this step.

On Linux you can actually use a program called "cpan" to download and install a module. This is super easy. In theory this should also work on Macs, but it appears to involve some expert tinkering.

Fortunately, the alternative method will work on any operating system that has "make" (to add "make" to your Mac, use the installation discs to install the "Xcode Tools"). Manually download the module you want, in our case the "Statistics::Lite" module, then unpack the tarball and move into the unpacked directory.

There should be a README type file to explain what to do. Typically you need to run a Perl script in the directory and then run "make" and "make install".

Run the Perl script, followed by "make" and "make install":

 > perl Makefile.PL
 > make
 > sudo make install

The module should now be installed and you should be able to perldoc it. You don't need to keep the downloaded directory after this point since the module is now in your library.
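Once the module is installed, it is worth checking it against numbers you can work out by hand. The sketch below does that for "Statistics::Lite" using a small made-up dataset. The hand-written subroutines are reference calculations for this example only, and the module is loaded inside an eval so the script still runs on a computer where it is not installed.

```perl
use strict;
use warnings;

# Small dataset with statistics that are easy to work out by hand:
# mean = 5, population standard deviation = 2.
my @data = (2, 4, 4, 4, 5, 5, 7, 9);

# Hand-written reference calculations to compare the module against.
sub simple_mean {
    my @values = @_;
    my $sum = 0;
    $sum += $_ for @values;
    return $sum / @values;
}

sub population_stddev {
    my @values = @_;
    my $mean = simple_mean(@values);
    my $squared_diffs = 0;
    $squared_diffs += ($_ - $mean) ** 2 for @values;
    return sqrt($squared_diffs / @values);
}

printf "by hand:  mean=%.3f  population stddev=%.3f\n",
    simple_mean(@data), population_stddev(@data);

# Load the module only if it is installed, so the script still runs
# (and says so) on a computer that does not have it yet.
if (eval { require Statistics::Lite; 1 }) {
    printf "module:   mean=%.3f  stddev=%.3f  stddevp=%.3f\n",
        Statistics::Lite::mean(@data),
        Statistics::Lite::stddev(@data),
        Statistics::Lite::stddevp(@data);
}
else {
    print "Statistics::Lite is not installed on this computer\n";
}
```

According to the module's documentation, "stddev" is the sample standard deviation (dividing by N-1) while "stddevp" is the population version (dividing by N), so on this dataset only "stddevp" should match the hand-computed value of 2. That is exactly the kind of subtle difference a small test brings to light.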

One reason the above might not work is that you don't have administrator permissions on the computer. In this case, you can simply download and unpack the module and then keep the module in a modules directory in your home directory. You will then need to use "lib" with this directory in your Perl scripts that use this module, just like with your own modules. In order to see the perldoc, you will need to be in the module directory and call it directly on the module file.
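When you keep a module in your own directory, the directory layout matters: for "use Statistics::Lite" Perl looks for a file called Statistics/Lite.pm underneath each library directory. The sketch below prints where that file is expected to be. The helper name "expected_module_file" and the location of the "modules" directory are assumptions made for this example.

```perl
use strict;
use warnings;
use File::Spec;

# Work out where Perl expects to find a module's file underneath a
# library directory. "expected_module_file" is a made-up helper name.
sub expected_module_file {
    my ($lib_dir, $module) = @_;
    my @parts = split /::/, $module;
    $parts[-1] .= '.pm';
    return File::Spec->catfile($lib_dir, @parts);
}

# Assumed location: a "modules" directory in your home directory.
my $home    = defined $ENV{HOME} ? $ENV{HOME} : '.';
my $lib_dir = File::Spec->catdir($home, 'modules');
my $file    = expected_module_file($lib_dir, 'Statistics::Lite');
my $status  = -e $file ? 'found' : 'not there yet';

print "use lib '$lib_dir';\n";
print "use Statistics::Lite;   # Perl looks for $file ($status)\n";
```

The path printed here is also the file you would hand to "perldoc" to read the documentation.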

Bioperl


CPAN is full of modules for all aspects of computing, including biology-specific modules. In addition to these individual modules, there is a toolkit called Bioperl that has dozens of biology-related modules.

http://www.bioperl.org

Bioperl can be downloaded from CPAN.

Bioperl is mostly object-oriented, so that may take some getting used to. It can handle a very broad array of bioinformatics tasks, so it can be very handy. It's particularly helpful for parsing complicated output from common sequence analysis programs, like alignment programs. It even has a nice module for manipulating phylogenetic trees.

Despite all these wonderful attributes, Bioperl has some limitations. It can be very slow: it was written to be very extensible (easy to add functionality to), not to optimize efficiency. It also isn't always doing exactly what you might expect. A very common issue with Bioperl is that a user assumes a certain functionality but is wrong about it. The documentation is good but not 100% complete, which leaves some room for misinterpretation.

The bottom line on Bioperl is that it can really save you a bunch of time but it should be used with caution and tested thoroughly before being put to heavy use.

Exercises


Problem 1: Get A Module From CPAN


  • Search CPAN for a module to accomplish any task you can think of. If you can't think of anything, look for a module that parses comma-separated files, or CSV format as it's commonly called (it turns out it's really hard to parse CSV because commas are allowed in quoted text).
  • Download your module. Create a directory called "modules" and copy the module into this directory.
  • Run "perldoc" on the module.
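To see why CSV is harder than it looks, here is a small sketch of the naive approach going wrong. The record is made-up example data with three fields, one of which contains a comma inside quotes.

```perl
use strict;
use warnings;

# One CSV record with three fields; the middle field contains a comma
# and is therefore wrapped in quotes. (Made-up example data.)
my $line = 'gene1,"kinase, putative",chr2';

# The naive approach: split on every comma.
my @fields = split /,/, $line;

printf "Expected 3 fields, got %d:\n", scalar @fields;
print "  [$_]\n" for @fields;
```

The quoted field is cut in two at its internal comma, which is the problem a dedicated CSV module is meant to solve.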

Project: Pipeline


  • One of the simply amazing aspects of computational research is that once you have completed a project, it can be repeated automatically with the click of a button (or close to that). Perhaps you want to change one aspect of the project but otherwise do everything the same. Maybe you want to try a range of conditions. All of this is very easy to accomplish. This is typically done by writing a "control" script to run the "pipeline" of code you used in the project. Perl works very well for this task.
  • If you have completed the project, try to make a "control" script that will run through all of your scripts automatically.
  • Use "system" commands to run the scripts and programs from within the master pipeline script. Create appropriate subdirectories to organize the different output files. Don't forget to fully specify the path of each of the input files.
  • For the first pass, write it so that it works specifically on the exact files you used as input for the project. Be careful to run it in a different directory from the one you originally did the project in, so that you don't overwrite your old files.
  • Alter your "control" script so it can take any list of input ids.
  • The inputs to the script should be:
    -file of gene ids (the 23 RAS genes)
    -name of the gff file that contains the genes (s.cer gff)
    -name of the fasta file with the genome corresponding to the gff (s.cer genome fasta)
    -name of the blast database (fungal proteins blastdb)
    -name of the fasta file with the blasted sequences (fungal proteins fasta)
    -name of the fasta file with the additional protein for multiple alignments (human cdc42 in our case)
  • You should use Getopt::Long to parse the command-line inputs to this script.
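The two building blocks of a "control" script, parsing options and running a step with "system", might be sketched as follows. This is only a starting skeleton under stated assumptions: the option names (--ids, --gff, --outdir), the file names and the helper names "parse_options" and "run_step" are all hypothetical, and only two of the inputs listed above are handled.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Parse command-line style arguments into a hash of options.
# The option names (--ids, --gff, --outdir) are hypothetical.
sub parse_options {
    my @args = @_;
    my %opt = (outdir => 'pipeline_output');

    GetOptionsFromArray(
        \@args,
        'ids=s'    => \$opt{ids},
        'gff=s'    => \$opt{gff},
        'outdir=s' => \$opt{outdir},
    ) or die "Could not parse the command-line options\n";

    for my $required (qw(ids gff)) {
        die "Missing required option --$required\n"
            unless defined $opt{$required};
    }
    return %opt;
}

# Run one step of the pipeline and stop if it fails.
# system returns 0 when the command succeeds.
sub run_step {
    my @command = @_;
    print "Running: @command\n";
    system(@command) == 0
        or die "Step failed: @command\n";
    return 1;
}

# Demonstration with made-up arguments; a real script would pass @ARGV.
my %opt = parse_options('--ids', 'ras_ids.txt', '--gff', 's_cer.gff');
print "ids file: $opt{ids}, gff file: $opt{gff}, output: $opt{outdir}\n";

# $^X is the path of the perl that is running this script. The one-liner
# stands in for a real step such as one of your project scripts.
run_step($^X, '-e', 'print "step finished\n"');
```

Checking the return value of every "system" call means the pipeline stops at the first failed step instead of feeding bad or missing output into the next one.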