Command-line tools

Loose ends

There are a few Perl things that we have not mentioned or emphasized enough.

  • Hashes for counting (is actually in the 6.1 printout/lecture).

  • defined to test variables

Often, testing whether a variable is true ("if ($somevar){...}") is not appropriate. Suppose you are testing whether a file name has been given when the program was called. Well, what if the filename happens to be "0"?
Use the "defined" keyword instead.

unless (defined $fasta_file and defined $search_string){
    die "Usage: perl fasta_file  search_string\n";

  • using lists in split
Dan has already discussed lists when covering subroutines (we pass arguments into subroutines as lists).
sub some_routine{
     my ($arg1, $arg2, $arg3) = @_;
Similarly, can use lists of variables to store multiple elements returned from functions such as split.
my $dna_string = "chrI    SGD    gene    7236    9017    .    -    .    ID=YAL067C";
my ($chr, undef, $type, $start, $stop, undef, $strand, undef,$desc) = split (/\t/, $dna_string);
print "$chr\t$type\t$start\t$stop\t$strand\t$desc\n";

  • Watch out for LISTS in SCALAR context.
The most common error with subroutines is assigning a single argument in scalar context.
"my $filename = @_" will make $filename equal 1 instead of the first element of @_. That's because without parantheses, $filename is a scalar, and assigning an array to a scalar is like putting "scalar" in front of the array, returning its size.
Always remember to put the variable in parentheses.
sub test_routine{
    my ($arg1) = @_;


All the tools are already on the LiveCD, thanks to Venky!

For help on any given utility:
transeq -h


Before BLASTing

The input to blast is a fasta-format file and a special database.
Before blasting, you have to set up this database using formatdb.

Just running 'man formatdb' will list all the input paramters, but here are
the crucial ones that MUST be specified:

    • -i is the input fasta file
    • -p T/F tells formatdb whether the input file has protein sequence (T) or nucleotides (F).

To run it:
formatdb -i ~/class/datafiles/s_cerevisiae.nt.niceheader.fasta -p F


To run blast, type 'blastall.' This will also list the million options. The
basic ones are:

    • -p the blast program you want to run (eg blastn,blastp)
    • -d name of database file
    • -i name query file in fasta format
    • -o name of output file (out.blast)
    • -e the maximum e-value for matches
    • -m output format (0-9); use 9 (tab-delimitted) for easiest parsing

blastall -p blastn -d s_cerevisiae.nt.fasta -i query_dna.fa -e 1E-5
blastall -p blastn -d s_cerevisiae.nt.fasta -i query_dna.fa -e 1E-5 -m 9



Here is the file with the 23 s.cer RAS gene open reading frames.

1. Lookup EMBOSS "transeq" and use it to make a fasta file with all 23 RAS proteins.
2. Venky has prepared a combined BLAST database of C. albicans, C. glabrata, K. lactis, and S. pombe proteins here:
BLAST the 23 cerevisiae proteins against the four other species.
3. Parse the BLAST results, extracting, for each cerevisiae protein, the name of the top hit from each of the four species.

Extra Practice

Write a program that will accurately predict whether Abby will give birth before or after the class ends.

Generate a random genome of user-specified size and AT-content.

Use regular expressions to find and mask with "X" all simple repeats that occur more than a user-specified number of times in a row.
If user number is 5, should mask out:
Make sure that it's not just the first n repeats that are masked. If user number is 5, you should mask "CGAAAAAAAAAAATC" the following sequence as "CGXXXXXXXXXXXTC"

The from 6.1 is not a brilliant way to solve this task. We are storing huge genomes (what if it's human?) into a hash. That's an enormous amount of memory, wasted. We really only need one chromosome at a time.
Rewrite your script to avoid this waste.