Structure of FASTA files

Piping

Remember to use the power of unix. The pipe "|" can send the contents of files to standard in!

#!/usr/bin/perl
 
use warnings;
use strict;
 
while(my $input = <STDIN>){
 
print $input;
}
__END__
If you cat in a file, it will cat it back out! You've just rewritten cat!

Scoping

A few of you have already encountered problems with scoping, so I suppose we should introduce you to it. Each set of curly braces is a scope. Scopes are hierarchical. Variables that are declared with "my" inside a lower hierarchy scope cannot exist in a higher scope. To demonstrate this, let's play around with some code.
#!/usr/bin/perl
 
use strict;
use warnings;
 
my $foo = 5;
print "foo outside the scopes is $foo\n";
 
{
    print "foo at the beginning of the first scope is $foo\n";
 
    my $foo = 10;
    print "foo in the first scope is $foo\n";
 
    {
        print "foo at the beginning of the second scope is $foo\n";
        my $bar = 20;
        print "bar in the second scope is $bar\n";
    }
 
    print "foo at the end of the first scope is $foo\n";
 
    # will throw error! $bar doesn't exist here!
    print "bar at the end of the first scope is $bar\n";
}
print "foo outside the end of the scopes is $foo\n";
 
__END__
#  :-
#  foo outside the scopes is 5
#  foo at the beginning of the first scope is 5
#  foo in the first scope is 10
#  foo at the beginning of the second scope is 10
#  bar in the second scope is 20
#  foo at the end of the first scope is 10
#  foo outside the end of the scopes is 5
 

Printing to STDERR

A very useful feature in debugging is using multiple outputs to define different files. Until now we have been using standard output to output everything, but we can also take advantage of perl's ability to switch outputs and unix's ability to capture outputs.
#!/usr/bin/perl
 
use strict;
use warnings;
 
my $foo = 5;
print "foo outside the scopes is $foo\n";
 
{
    print "foo at the beginning of the first scope is $foo\n";
 
    my $foo = 10;
    print "foo in the first scope is $foo\n";
 
    {
        print STDERR "foo at the beginning of the second scope is $foo\n";
        my $bar = 20;
        print "bar in the second scope is $bar\n";
    }
 
    print "foo at the end of the first scope is $foo\n";
 
    # will throw error! $bar doesn't exist here!
    print STDERR "bar at the end of the first scope is $bar\n";
}
print "foo outside the end of the scopes is $foo\n";
 
__END__
#  :-
#  foo outside the scopes is 5
#  foo at the beginning of the first scope is 5
#  foo in the first scope is 10
#  foo at the beginning of the second scope is 10
#  bar in the second scope is 20
#  foo at the end of the first scope is 10
#  foo outside the end of the scopes is 5
# use 2> logfile.log

Warn

You can also switch the output using warn. This can be less confusing, provided you know you want to have that print statement always go to STDERR. Warn, then, is like "print STDERR".
#!/usr/bin/perl
 
use strict;
use warnings;
 
my $foo = 5;
print "foo outside the scopes is $foo\n";
 
{
    print "foo at the beginning of the first scope is $foo\n";
 
    my $foo = 10;
    print "foo in the first scope is $foo\n";
 
    {
        warn "foo at the beginning of the second scope is $foo\n";
        my $bar = 20;
        print "bar in the second scope is $bar\n";
    }
 
    print "foo at the end of the first scope is $foo\n";
 
    # will throw error! $bar doesn't exist here!
    warn "bar at the end of the first scope is $bar\n";
}
print "foo outside the end of the scopes is $foo\n";
 
__END__
#  :-
#  foo outside the scopes is 5
#  foo at the beginning of the first scope is 5
#  foo in the first scope is 10
#  foo at the beginning of the second scope is 10
#  bar in the second scope is 20
#  foo at the end of the first scope is 10
#  foo outside the end of the scopes is 5
# use 2> logfile.log
Using warn, or print STERR, or even just print statements is very powerful. As you code, you should add print statements to check what your variables main in all scopes. I print every 2-4 lines of code I write. I print every 2-4 lines of code I write. Lenny prints every 2-4 lines of code he writes. Dan prints every 2-4 lines of code he writes. I print every 2-4 lines of code I write.

Exit


This is another concept that we've talked about, but haven't formally explained. It does what you think it does.
james-fraser:~ jamesfraser$ perl test.pl
#!/usr/bin/perl
 
use strict;
use warnings;
 
my $name = "Jaime";
 
if ($name eq "Jaime"){
    print "you got my name\n";
    exit;
}
 
print "Lenny is the best, I love him so much. Dan ain't half bad either.\n";
 
__END__


Die

Is like warn and exit.
#!/usr/bin/perl
 
use strict;
use warnings;
 
my $name = "Jaime";
 
if ($name eq "Jaime"){
    die "you got my name\n";
}
 
print "Lenny is the best, I love him so much. Dan ain't half bad either.\n";
 
__END__

Debugging


The gedit setup gives you a few tools that come in handy when programming. Perl Tidy is a little plugin that automatically formats your code in accordance with a reliable and readable standard. Perl Check is another that works like the use warnings tags, but without having to run the program.

Debugging is very personal and, I think, hard to teach. It's also very difficult to get good at in a course setting like this when you have your neighbours, TAs, and instructors to talk to - Although, talking to other programmers is often an effective debugging strategy. I'm including this section as a hitlist of how to debug code effectively on your own. You can use this as a first line of defense before asking the TAs and hopefully when you begin to program on your own.

SYNTAX ERRORS

  • are your variables spelled the same way throughout your code - perl will warn for variables not initialized with "my"
Global symbol "$test" requires explicit package name at test.pl line 5.
Execution of test.pl aborted due to compilation errors.
  • do your lines end with semi-colons - perl will fail when it can't interpret things properly. This can result in many types of error messages, for example the program:
#!/usr/bin/perl
use strict;   # line 2
use warnings; # line 3
              # line 4
my $test = 0  # line 5
print "hello world\n";
returns the following error:
syntax error at test.pl line 6, near "print"
Execution of test.pl aborted due to compilation errors.
But, the error is actually on line 5. So the lesson is perl's warnings are sometimes a bit inaccurate and it's a good idea to check not only the line, but the surrounding lines as well.

LOGIC ERRORS

  • what are the values of your variables at each stage of the program? Is it incrementing properly through loops? Are lines being read properly?
    • I usually include print statements in my programming every time I add a new set of braces.
    • This means checking your variables with a print statement every time scope is potentially changed!
    • You can use STERR and capture it to a log file!
  • Make a simple test file
  • WRITE PSEUDOCODE!

So the hitlist is:
  1. start out with pseudocode
  2. spelling and declaration of variables
  3. checking the line (AND SURROUNDING LINES!) for proper semicolon usage
  4. print out your questionable variables to STDERR and capture STDERR in a log
  5. make a simple test file and validate on that!
  6. print out your questionable variables!!!
  7. run perl tidy and check that your indenting, and hence your scoping, makes sense
  8. print out your questionable variables!!!


Good Coding Practices and Debugging


Writing clean, readable code


What's in a name?

Well, by not using naming conventions, your programs might lose their sweet smell and start to stink.

  • Give descriptive names to the scripts themselves
  • Choose descriptive variable names

Comments

These are so easy to leave out as you go, but you'll regret it later.
  • You'll thank yourself; other readers of your code will kiss your feet
  • If half of your program consists of comments, that's not a bad thing.
  • Comment before you start coding - helps with the logic

Program Description

Give a detailed description/pseudocode BEFORE you start writing the code.

Formatting

  • Indent, indent, indent
  • Use perltidy

Debugging


Error Types


  • Comiple-time: usually a simple syntax error; easiest to catch
  • Run-time: often due to unanticipated input (file doesn't exists, division by zero)
  • Logical: this is a real bitch

Execute the code frequently

Don't add more than 4-5 lines of code without testing as you go along.

perltidy and perlcheck

  • If perltidy messes up your formatting, that alerts you to something being wrong
  • Perlcheck goes through the syntax, without actually executing the program


Test input

  • Simplify sample input
  • Pipe test user input

Users are out to get you

  1. Users are incredibly adept at screwing up your program
  2. You are a user of your own program
  3. Data files from others (genome centers, ncbi, etc.) are user input
Try to anticipate what will go wrong

Print statements

Love them, use them.
Print variables before, during, after loops and logic steps.



Exercises


Problem1: "fasta_unwrap.pl"

  • Name the script "fasta_unwrap.pl"
  • Write this as a filter, accepting piped input of the cerevisiae genome fasta
  • Reformat each sequence in a fasta file to be a single line.
  • Make sure you are getting all the sequences (test on smaller input)

Problem2: "unique_fasta.pl"

  • Add a unique incremented identifier to the beginning of each header in a fasta file
  • Ask user for a fasta file_name and open it.
  • Ask for a prefix from a user and add the counter to the prefix
  • If no prefix is given, simply add the counter
    • with no prefix, you should get:
      • ">1 YAL001C TFC3..."
      • "ATGGTA..."
      • ">2 YAL002W..."
      • "ATGGAGC..."

    • with a prefix of "gene":
      • ">gene1 YAL001C TFC3..."
      • "ATGGTA..."
      • ">gene2 YAL002W..."
      • "ATGGAGC..."


Problem3: "reformat_fasta_to_codon.pl"


  • Use the orf_coding.fasta as the input file. Open it with a filehandle rather than STDIN stream.
  • This file has CDS of cerevisiae genes. Reformat it to print out spaced codons

    • ">YAL002W VPS8 SGDID:S000000002, Chr I from 143709-147533, Verified ORF, "Membran"
    • "ATG GAG CAA AAT GGC CTT GAC CAC ..."


Solutions


#!/usr/bin/perl
# Author        : Lenny Teytelman
# Date          : Sat Jan  6 12:15:11 UTC 2007
# Description   : Remove line breaks in a fasta file
 
use strict;
use warnings;
 
my $header='';
my $sequence='';
 
while ( my $file_line = <STDIN> ) {
    chomp $file_line;
    #if we're in a header line, just print the header
    if( substr($file_line,0,1) eq ">"){
        if($sequence){
            print "$header\n$sequence\n";
        }
        $header=$file_line;
        $sequence='';
    }
    else{
        $sequence = $sequence . $file_line;
    }
}
 
#don't forget the last sequence
print "$header\n$sequence\n";
 
__END__

#!/usr/bin/perl
# Author        : Lenny Teytelman
# Date          : Sat Jan  6 12:15:11 UTC 2007
# Description   : Add a unique identifier to each fasta header.
 
use strict;
use warnings;
 
my $seq_counter=1;
 
print "Please enter a prefix:\n";
chomp(my $user_prefix=<STDIN>);
 
 
print "Please give the name of the fasta file to process:\n";
chomp(my $file_name=<STDIN>);
 
open(FH, $file_name) or die "couldn't open the file $file_name\n";
 
while ( my $file_line = <FH> ) {
    chomp $file_line;
    #if we're in a header line, add the identifier
    if( substr($file_line,0,1) eq ">"){
        print ">" , $user_prefix, $seq_counter, " " , substr($file_line,1),"\n";
        $seq_counter++;
    }
    else{
        print $file_line, "\n";
    }
}
 
__END__

#!/usr/bin/perl
# Author        : Lenny Teytelman
# Date          : Sat Jan  6 15:03:26 UTC 2007
# Description   : Reformat DNA coding sequence to print it out in codon triplets
 
 
use strict;
use warnings;
 
my $header = '';
my $dna_sequence='';
 
print "Please Enter a fasta file to reformat: \n";
 
chomp(my $file = <STDIN>);
 
open( FH, $file );
 
while ( my $file_line = <FH> ) {
    chomp $file_line;
    if ( substr($file_line,0,1) eq ">" ) {
        if ($dna_sequence){
            print "$header\n";
 
#            $dna_sequence =~ s/(...)/$1 /g;
            #run through the whole sequence and print triplets
 
            for (my $counter=0;$counter< length($dna_sequence); $counter=$counter+3){
                print substr($dna_sequence,$counter,3)," ";
            }
            print "\n";
 
        }
        $header=$file_line;
        $dna_sequence='';
 
    }
    else{
        $dna_sequence .= $file_line;
    }
}
#let's not forget the very last sequence
print "$header\n";
 
for (my $counter=0;$counter< length($dna_sequence); $counter=$counter+3){
    print substr($dna_sequence,$counter,3)," ";
}
print "\n";
 
 
 
 
 
 
__END__