Command-line tools



Loose ends


There are a few Perl things that we have not mentioned or emphasized enough.

  • Hashes for counting (is actually in the 6.1 printout/lecture).

  • defined to test variables

Often, testing whether a variable is true ("if ($somevar){...}") is not appropriate. Suppose you are testing whether a file name has been given when the program was called. Well, what if the filename happens to be "0"?
Use the "defined" keyword instead.

unless (defined $fasta_file and defined $search_string){
    die "Usage: perl fasta_file  search_string\n";
}

  • using lists in split
Dan has already discussed lists when covering subroutines (we pass arguments into subroutines as lists).
sub some_routine{
     my ($arg1, $arg2, $arg3) = @_;
}
Similarly, can use lists of variables to store multiple elements returned from functions such as split.
my $dna_string = "chrI    SGD    gene    7236    9017    .    -    .    ID=YAL067C";
 
my ($chr, undef, $type, $start, $stop, undef, $strand, undef,$desc) = split (/\t/, $dna_string);
 
print "$chr\t$type\t$start\t$stop\t$strand\t$desc\n";

  • Watch out for LISTS in SCALAR context.
The most common error with subroutines is assigning a single argument in scalar context.
"my $filename = @_" will make $filename equal 1 instead of the first element of @_. That's because without parantheses, $filename is a scalar, and assigning an array to a scalar is like putting "scalar" in front of the array, returning its size.
Always remember to put the variable in parentheses.
sub test_routine{
    my ($arg1) = @_;
}

EMBOSS

http://emboss.sourceforge.net/

All the tools are already on the LiveCD, thanks to Venky!

For help on any given utility:
transeq -h

BLAST


Before BLASTing


The input to blast is a fasta-format file and a special database.
Before blasting, you have to set up this database using formatdb.

Just running 'man formatdb' will list all the input paramters, but here are
the crucial ones that MUST be specified:

    • -i is the input fasta file
    • -p T/F tells formatdb whether the input file has protein sequence (T) or nucleotides (F).

To run it:
formatdb -i ~/class/datafiles/s_cerevisiae.nt.niceheader.fasta -p F

BLASTING

To run blast, type 'blastall.' This will also list the million options. The
basic ones are:

    • -p the blast program you want to run (eg blastn,blastp)
    • -d name of database file
    • -i name query file in fasta format
    • -o name of output file (out.blast)
    • -e the maximum e-value for matches
    • -m output format (0-9); use 9 (tab-delimitted) for easiest parsing

Example:
blastall -p blastn -d s_cerevisiae.nt.fasta -i query_dna.fa -e 1E-5
blastall -p blastn -d s_cerevisiae.nt.fasta -i query_dna.fa -e 1E-5 -m 9

Exercises


Project


Here is the file with the 23 s.cer RAS gene open reading frames.

1. Lookup EMBOSS "transeq" and use it to make a fasta file with all 23 RAS proteins.
2. Venky has prepared a combined BLAST database of C. albicans, C. glabrata, K. lactis, and S. pombe proteins here:
/usr/local/data/blast/fungal.combined.pep
BLAST the 23 cerevisiae proteins against the four other species.
3. Parse the BLAST results, extracting, for each cerevisiae protein, the name of the top hit from each of the four species.

Extra Practice


abby_predictor.pl

Write a program that will accurately predict whether Abby will give birth before or after the class ends.

random_genome.pl

Generate a random genome of user-specified size and AT-content.

simple_repeat_masker.pl


Use regular expressions to find and mask with "X" all simple repeats that occur more than a user-specified number of times in a row.
If user number is 5, should mask out:
AAAAA, ATATATATAT, CGGTCGGTCGGTCGGTCGGT
Make sure that it's not just the first n repeats that are masked. If user number is 5, you should mask "CGAAAAAAAAAAATC" the following sequence as "CGXXXXXXXXXXXTC"

extract_fasta_sequence_smart.pl

The extract_fasta_sequence.pl from 6.1 is not a brilliant way to solve this task. We are storing huge genomes (what if it's human?) into a hash. That's an enormous amount of memory, wasted. We really only need one chromosome at a time.
Rewrite your script to avoid this waste.

Solutions