You already know how to open a file or die, but today we're going to be working with several variations of file handles, so I'm going to get a bit more specific in my syntax. In particular, we want you to adopt three specific coding practices. First, we want you to always use a variable for your $FILEHANDLE. Otherwise, you will have trouble passing your file handles around in future exercises.

-file1.pl

#!/usr/bin/perl
use strict;
use warnings;
 
my $filename = "file1.pl";
 
open my $FILE, $filename or die "you suck!";
#notice the declaration of my $FILE during the open call!
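Why does this matter? A lexical file handle is just a scalar, so you can hand it to a subroutine like any other variable. Here's a minimal sketch (the subroutine name print_first_line is made up for illustration):

#!/usr/bin/perl
use strict;
use warnings;
 
sub print_first_line {
    my ($fh) = @_;    #the file handle arrives as an ordinary scalar
    my $line = <$fh>;
    print $line;
}
 
my $filename = "file1.pl";
open my $FILE, $filename or die "you suck!";
print_first_line($FILE);    #pass the handle around like any other variable
close $FILE;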

Second, we are going to be explicit by adding another parameter to the open call. If you recall the unix redirection operators, then we'll be fine. So to read IN a file, we want to change the parameters like so:
-file1a.pl
#!/usr/bin/perl
use strict;
use warnings;
 
my $filename = "file1.pl";
 
open my $FILE, "<", $filename or die "you suck!";
#notice the declaration of my $FILE during the open call, and the read mode declared with "<"!

The third thing that we need to do is remember to always ALWAYS close our files as soon as we are done using them. This can save you memory, keep you from accidentally changing files when you don't want to, etc. The time is NOW to get in the habit of doing this, even for programs that have only one file to work with. So something like this should be your standard file-opening pattern:

-file1b.pl
#!/usr/bin/perl
use strict;
use warnings;
 
my $filename = "file1.pl";
 
open my $FILE, "<", $filename or die "you suck!";
 #HERE I would do something
close $FILE;
 
#and now I can sleep at night because the file was opened properly
#but is now closed
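
One optional refinement you will see in a lot of Perl code: the special variable $! holds the operating system's error message, so including it in the die string tells you why the open failed. A minimal sketch:

open my $FILE, "<", $filename or die "you suck! could not open $filename: $!";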

Now that we've gotten that syntax out of the way, I'd like to explain something about the @ARGV array that is really powerful for dealing with files. When you put unix wildcards on the command line, the shell will first interpret them and then pass the expanded list to your script in the @ARGV array.

So if we are in a directory with a bunch of fasta files with the extension ".fa", then running a perl script called file1s.pl like so:

~>perl file1s.pl *.fa

will populate the @ARGV array with all the .fa files in that directory. This can be a real timesaver!
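
If you want to convince yourself that the shell really did the expanding, a tiny sketch like this will just echo back whatever landed in @ARGV:

#!/usr/bin/perl
use strict;
use warnings;
 
print "I was given ", scalar(@ARGV), " arguments:\n";
print "$_\n" foreach @ARGV;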


-file1s.pl
#!/usr/bin/perl
 
use strict;
use warnings;
 
#pass @ARGV immediately to a descriptively named array
my @files = @ARGV;
 
#recall using a foreach loop to go through all the strings in array
foreach my $file_name (@files) {
 
    #we're using good coding style now that includes:
    #1 variable $FILEHANDLE
    #2 how we're going to use the file, in this case
    #         for input "<"
 
    open my $FILEHANDLE, "<", $file_name or die "you suck!";
    while ( my $file_line = <$FILEHANDLE> ) {
        chomp $file_line;
        if ( $file_line =~ m/^>/ ) {
            print "the header line for $file_name is:\n$file_line\n";
        }
    }
 
    #DON'T FORGET TO CLOSE THE FILE HANDLE!
    close $FILEHANDLE;
 
}
 
__END__

By changing the input to the @ARGV array, we could use a similar program to grab all the blast reports, etc., and we can change the internal control structure however we want.
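For example, to pull the header lines out of every blast report instead (assuming they end in the extension .bla, as in the exercises below):

~>perl file1s.pl *.bla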

One of the things we can do with our new open syntax is open a file for WRITING! Just as a > on the unix prompt meant output TO this file, opening with the syntax open my $FILE, ">", $filename means open with the intention of writing to this file. In this code I'm doing a few things: opening up the input files, creating a new variable for the output file name that is unique to each input file, and printing directly to those output files.

-file2.pl *.fa

#!/usr/bin/perl
 
use strict;
use warnings;
 
#pass @ARGV immediately to a descriptively named array
my @files = @ARGV;
 
foreach my $file_name (@files) {
    open my $FILEHANDLE, "<", $file_name or die "you suck1";
    while ( my $file_line = <$FILEHANDLE> ) {
        chomp $file_line;
        if ( $file_line =~ m/^>/ ) {
 
            #output_file_name is scoped here!
            my $output_file_name = $file_name . ".out";
 
            #notice how we use a file for output
            open my $OUTPUTFILE, ">", $output_file_name or die "you suck2";
 
            #we can use the filehandle $OUTPUTFILE to print to the outputfile
            #because we used the ">" flag in open!
            print $OUTPUTFILE $file_line;
 
            #DON'T FORGET TO CLOSE THE FILE HANDLE!
            close $OUTPUTFILE;
        }
    }
 
    #DON'T FORGET TO CLOSE THE FILE HANDLE!
    close $FILEHANDLE;
 
}
 
__END__

YES! Try it again, after closing the files. I'm now going to edit sample1.fa. If we rerun the program, then each time it goes to open the ">" file handle it will clear the entire contents of the file. This is especially a problem if we want to send all of the output to the same file without being smart about where we open and close our output file handle.

-file2a.pl *.fa

#!/usr/bin/perl
 
use strict;
use warnings;
 
#pass @ARGV immediately to a descriptively named array
my @files = @ARGV;
my $output_file_name = "output.txt";
 
foreach my $file_name (@files) {
    open my $FILEHANDLE, "<", $file_name or die "you suck1";
    while ( my $file_line = <$FILEHANDLE> ) {
        chomp $file_line;
        if ( $file_line =~ m/^>/ ) {
 
            #$output_file_name is now scoped to the whole program,
            #so every header goes to the same output.txt
            open my $OUTPUTFILE, ">", $output_file_name or die "you suck2";
 
            print $OUTPUTFILE $file_line;
 
            #even if we don't close the file handle, the file still
            #gets written over every time open is called!
        }
    }
 
    #DON'T FORGET TO CLOSE THE FILE HANDLE!
    close $FILEHANDLE;
 
}
__END__

After a run like that, output.txt contains only the last header that was printed, because every open with ">" wiped out what came before. One thing that we could do is open the output file once for the entire run of the program, but that means keeping handles (and their buffers) around for the whole run, which gets unwieldy if you want to open several files at once for writing in different conditions. For example, taking all the header lines and putting them in one file and taking all the sequence and putting it in another file... hmm... that sounds like a good idea for an exercise!

So what you can do is use the >> operator. Let's try this out on the command line first.

~>echo "jf is great" > great.txt
~>more great.txt
~>echo "lenny is great" > great.txt
~>more great.txt
~>echo "venky is great" >> great.txt
~>more great.txt

Notice that the second > wiped out "jf is great", so great.txt held only "lenny is great", while the >> appended "venky is great" to the end. So now if we want to output both headers to the same file in our perl script we can use the same type of operator.

-file3.pl
#!/usr/bin/perl
 
use strict;
use warnings;
 
#pass @ARGV immediately to a descriptively named array
my @files = @ARGV;
 
foreach my $file_name (@files) {
    open my $FILEHANDLE, "<", $file_name or die "you suck1";
    while ( my $file_line = <$FILEHANDLE> ) {
        chomp $file_line;
        if ( $file_line =~ m/^>/ ) {
            my $output_file_name = "captureboth.out";
 
            #notice how we use a different tag for the output file
            open my $OUTPUTFILE, ">>", $output_file_name or die "you suck2";
            print $OUTPUTFILE "$file_line\n"; #notice how I've added a newline
 
            #DON'T FORGET TO CLOSE THE FILE HANDLE!
            close $OUTPUTFILE;
        }
    }
 
    #DON'T FORGET TO CLOSE THE FILE HANDLE!
    close $FILEHANDLE;
 
}
 
__END__


What if we want to know something about the file before we open it for writing? Let's say we want to create a file for writing only if that file doesn't exist already. Perl gives us some handy shortcuts for doing that kind of test.

-file4.pl

#!/usr/bin/perl
 
use strict;
use warnings;
 
 
my $output_file = "store_initials.txt";
 
#here is the new test!
if ( -e $output_file ) {
 print "not so fast mister.you tried to open an existing file\n";
}
else {
 open my $INITIALS_OUTPUT, ">", $output_file or die "you suck!";
 print $INITIALS_OUTPUT "all is good jf lt vi";
 close $INITIALS_OUTPUT;
}
 
 
__END__

This can be quite helpful for keeping us from writing over files that we don't want to touch. There are a few other file tests like this that you can use, and you can also assign the results of these tests to variables.
-e   file or directory exists
-r   readable
-w   writable
-z   file exists, but has zero size (all directories have size)
-s   file or directory exists and has non-zero size (returns the size in bytes)
-d   is a directory
-M   days since the file was last modified
-A   days since the file was last accessed

So you can assign the results of these tests to variables like so (although keep in mind that most of them return only true or false):

-file5.pl
#!/usr/bin/perl
 
use strict;
use warnings;
 
my $filename = "file1.pl";
my $size = (-s $filename);
print "size is $size\n";

By now you guys are old pros at moving around directories with cd, renaming files with mv, etc. on the command line. You can also do a lot of this within perl. This is incredibly useful if you want to move to a certain directory to create a bunch of output files, use a different directory for error files, etc.

So we can read a directory (and essentially duplicate the unix command ls) with the following code:

-file6.pl

#!/usr/bin/perl
 
use strict;
use warnings;
 
my $dir = "/bin";
opendir my $DH, $dir or die "sukas";
#note the readdir call (it also returns the . and .. entries)
while (my $name = readdir $DH){
    print "$dir/$name\n";
}
#directory handles get closed too, with closedir
closedir $DH;

Perl starts in the directory in which it is called. But we can quickly change the working directory with the chdir command (like cd in unix).

UNIX    PERL
cd      chdir     chdir "/bin";
rm      unlink    unlink "badfile.txt", "notsogoodfile.txt";
                  (this is irreversible so be careful!)
mv      rename    rename "badfile.txt", "goodfile.txt";
                  (it's really good to test if the destination file exists first!)
mkdir   mkdir     mkdir "newdirectory";
rmdir   rmdir     rmdir "newdirectory";
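
Here is a minimal sketch putting the rename advice into practice, testing for the destination with -e first (the file names are just for illustration):

#!/usr/bin/perl
 
use strict;
use warnings;
 
my $old = "badfile.txt";
my $new = "goodfile.txt";
 
#refuse to write over an existing destination
if ( -e $new ) {
    die "not so fast mister! $new already exists\n";
}
rename $old, $new or die "could not rename $old: $!";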

Since we are looking at the interaction of perl with unix, this is a good time to bring up how perl can call unix commands. Using backticks (the key beside 1) allows you to capture the result, either with everything (including all newlines) in a single scalar variable, or with each line as an element in an array.

#!/usr/bin/perl
 
use strict;
use warnings;
 
my $string = `ls -l`;
print $string;
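
In list context, backticks give you one element per line instead, so in the same script we could write:

my @lines = `ls -l`;
print "ls -l produced ", scalar(@lines), " lines\n";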


As an alternative you can use qx/command/ to start processes, which looks a little nicer.
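A minimal sketch, equivalent to the backticks version above:

#!/usr/bin/perl
 
use strict;
use warnings;
 
my $string = qx/ls -l/;
print $string;

Another option is to have perl spawn off a separate process using the system command: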

#!/usr/bin/perl
 
use strict;
use warnings;
 
system "pymol";

Perl will wait for that process to finish unless you add the unix & to the command, which runs the process in the background - the background process won't be killed until it exits on its own or because of user input.
#!/usr/bin/perl
 
use strict;
use warnings;
 
system "pymol &";

Exercises


Problem 1

(file_list.pl) - Write a perl program that runs and captures the output of "ls -l" for your /bin directory and appends it (>>) to a file called overwrite.txt.
Run this program at least twice before moving on to the next part (i.e. capture the ls -l output twice in overwrite.txt).
(reverse_file_list.pl) - Modify the program so that it prints the lines in reverse order to a file called reverse.txt.
if calling the command on the prompt gives
~>ls -l
ab
cd
it should produce
cd
ab
in reverse.txt.

Problem 2

(parse_date.pl) - Write a perl program that runs and captures the output of the unix utility "date" and parses it into an array (what delimiter can you use?). Use regular expressions to change the part of the array that contains the time so that it removes the colons (i.e. 12:39:45 should become 123945). Output that string (123945 in the example), the year, the seconds, and the day of the month, separated by the letter "j".
if the date is
Sat Jan 6 00:04:12 PST 2007
then the output should be
000412j2007j12j6

Problem 3

(checker_file_list.pl) - Use the program from problem 1, but modify it so that now it checks to see if overwrite.txt exists before writing to it. If the file exists, use perl to create a backup directory and move the previous version of overwrite.txt to that directory. Make sure you have a relatively unique identifier as the filename in the backup directory (HINT: see problem 2) so that if you run the program multiple times you don't overwrite your backup! What are some other ways you can make unique names for your files?

Problem 4

(blast_parser.pl) - Extract the tar file. It will make a directory called blast_results (it has 3 .bla files in it). Write a program to go through all the results files and print out three files: query.txt should have only the query ids for all three input files, delimited by newlines. subject.txt should have only the subject ids for all three input files, delimited by tabs. evalue.txt should have the evalues for all three input files, separated by commas. Decide whether you want to access the output files once for each input file or store everything and output each entire output file at once.

For example, if the only file was initialsblast.bla:
>query,subject,evalue
jsf,lt,0.1
jsf,dp,10
dp,lt,0.00001
it should produce query.txt:
jsf
jsf
dp

subject.txt:
lt dp lt

evalue.txt:
0.1,10,0.00001

Problem 5

(directory_blast_parser.pl) - Write a program to go through all the results files to look for significant or not significant hits (let's say we consider an evalue of less than 0.1 to be significant). Store the results so that there is a different directory for significant and for not significant hits. In each directory, you should output a file for each original results file that has a header and the matching results.
Modify your program so that it checks to make sure that your evalue contains only digits and at most a single decimal point. Output any results that don't match to a new directory and new file for errors. Where did your errors get sorted before?
For example, for initialsblast.bla: in the significant directory we should have a file specific for initialsblast.bla that contains only a header line and the dp,lt match separated by semicolons
>query,subject,evalue
dp;lt;0.00001
and any other blast reports would have their own file in that directory.

Problem 6

(launch_hydro_fox.pl) - Write a program that checks all the fasta files (.fa) in the directory created by this tar file for transmembrane passes. Don't use the @ARGV array to get the file names! If it finds a transmembrane pass, print "I found a pass, let's check out the structure" and launch pymol. After searching all of the files, print out a list of filenames for all of the proteins that did NOT contain a transmembrane pass.



If you have time, go back and work on the problems from yesterday!

Solutions

solutions