Session 8.2


Protein Alignment Using Clustal


Much of our understanding of protein function comes from comparing sequences across species. BLAST lets us find putative homologs, and a careful alignment of their amino acid sequences tells us a great deal about how the sequences are related and whether function has been conserved or has evolved.

A number of computer programs have been written specifically to align peptide sequences. These programs tend to be aware of the odds of each amino acid being substituted by another amino acid, as well as the additional constraints of protein secondary structure.

Only a time machine will give you perfect certainty in an alignment. That said, crystal structures in multiple species allow researchers to guess which parts of a sequence are homologous based on structural homology. Databases of structural homologs have been used to assess the accuracy of alignment programs. Different programs have different strengths and weaknesses and there is no one program that is always much better than others.

Clustal


Although it is older and in many ways not as sophisticated as newer programs, Clustal has been the most commonly used alignment tool for the last decade. We will use it for the project and you should feel comfortable using it for your own research. Just keep in mind that anything you might want to publish based on alignments should be confirmed using at least two different alignment programs.

There are a couple of different ways to use Clustal. The most user-friendly version is ClustalX, which opens a GUI window with menus. The command-line version, ClustalW, is interactive by default (e.g. it will ask you for your input sequences) but can be run non-interactively with arguments. This last way is the most common way to use it for someone with programming skills: you can call ClustalW multiple times in a Perl script that loops through a list of protein families, for instance.

Command-Line Use: Formats and Syntax


The typical input sequence format is FASTA, and that is the format you will use.

The simplest syntax for running ClustalW is:

 > clustalw -align -infile=bicoid_protein.fa

This spits a bunch of stuff to the screen and writes two new files: bicoid_protein.dnd & bicoid_protein.aln.

The DND file is the guide tree that ClustalW estimates to decide the order in which the sequences are aligned. ClustalW is an alignment program, not a tree-building program, so do not assume that this is a good phylogenetic tree.

The ALN file is the output alignment. By default ClustalW returns alignments in a very human-readable format. This is great for getting a feeling for what the alignment looks like. However, this format is difficult to parse for downstream analysis.
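If you do need to read the default output in a script, the following is a minimal sketch of a parser. It assumes the usual Clustal layout: a header line starting with "CLUSTAL", then blocks of "name SEQUENCE" lines separated by blank lines and by conservation lines that begin with whitespace. The sub name parse_aln and the tiny seqA/seqB alignment are made up for illustration, so check the layout against your own .aln file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse Clustal-format alignment text.
# Returns a hash ref (name => aligned sequence) and an array ref of names
# in the order they first appear.
sub parse_aln {
    my ($text) = @_;
    my ( %seq_for, @order );
    foreach my $line ( split /\n/, $text ) {
        next if $line =~ /^CLUSTAL/;    # header line
        next if $line =~ /^\s*$/;       # blank line between blocks
        next if $line =~ /^\s/;         # conservation line (* : .)
        my ( $name, $chunk ) = split ' ', $line;
        next unless defined $chunk;
        push @order, $name unless exists $seq_for{$name};
        $seq_for{$name} .= $chunk;      # join the pieces from each block
    }
    return ( \%seq_for, \@order );
}

# Made-up example alignment, split into two blocks.
my $example = <<'END_ALN';
CLUSTAL W (1.83) multiple sequence alignment

seqA      MKT-AYIAK
seqB      MKTQAYLAK
          *** **:**

seqA      QRQ
seqB      QR-
          **
END_ALN

my ( $seq_for, $order ) = parse_aln($example);
print ">$_\n$seq_for->{$_}\n" for @$order;
```

Joining the blocks gives one gapped string per sequence, which is usually the form downstream code wants.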

You can change the output format using the "-output" argument. Unfortunately ClustalW does not output FASTA format. It does output PIR (Protein Information Resource) format which is very close to FASTA.

 > clustalw -align -infile=bicoid_protein.fa -output=pir

This creates a new alignment file called bicoid_protein.pir. Note that this is not exactly FASTA format: the headers have been altered, and there are "*" characters and empty lines throughout. You should easily be able to convert this into proper FASTA format.
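One way the conversion might look is sketched below. It assumes each PIR record has a header of the form ">P1;name" (a two-character type code, then a semicolon), followed by one description line that may be empty, followed by sequence lines ending in "*". The sub name pir_to_fasta and the example records are hypothetical; compare the assumed layout with your own .pir file before relying on it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Convert PIR-style alignment text to FASTA text.
# Assumed record layout:
#   >P1;name        header with a two-character type code
#   description     exactly one line, possibly empty
#   SEQUENCE...     one or more lines, with "*" marking the end
sub pir_to_fasta {
    my ($pir) = @_;
    my $fasta     = '';
    my $skip_next = 0;
    foreach my $line ( split /\n/, $pir ) {
        if ( $line =~ /^>\w\w;(.*)/ ) {
            $fasta .= ">$1\n";    # keep only the name in the header
            $skip_next = 1;       # the next line is the description
            next;
        }
        if ($skip_next) {
            $skip_next = 0;
            next;
        }
        $line =~ s/[\s*]//g;      # drop whitespace and the "*" terminator
        $fasta .= "$line\n" if length $line;
    }
    return $fasta;
}

# Made-up example with two records.
my $example = ">P1;seqA\n\nMKT-AYIAK\nQRQ\n*\n>P1;seqB\n\nMKTQAYLAK\nQR-*\n";
print pir_to_fasta($example);
```

Gap characters are left in place, so the result is an aligned FASTA file rather than raw sequences.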

You can look at all the possible arguments by typing "clustalw -help". A lot of these are pretty complicated and require a fairly deep knowledge of how multiple alignments are created. One additional argument that may be helpful lets you determine the output file name.

 > clustalw -align -infile=bicoid_protein.fa -output=pir -outfile=bicoid_protein_aln.pir

This creates a file named "bicoid_protein_aln.pir".
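The non-interactive calls above can be wrapped in a Perl loop over several input files. This is only a sketch: the file hunchback_protein.fa, the sub name clustalw_args, and the "_aln.pir" naming scheme are assumptions for illustration. It prints the commands as a dry run; set $dry_run to 0 to actually call ClustalW.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the argument list for one non-interactive ClustalW run.
# The output name is derived from the input name (an illustrative choice).
sub clustalw_args {
    my ($infile) = @_;
    ( my $outfile = $infile ) =~ s/\.fa$/_aln.pir/
        or die "Expected a file name ending in .fa, got $infile\n";
    return ( 'clustalw', '-align', "-infile=$infile", '-output=pir',
        "-outfile=$outfile" );
}

# Hypothetical list of FASTA files, one per protein family.
my @families = ( 'bicoid_protein.fa', 'hunchback_protein.fa' );

my $dry_run = 1;    # 1 = only print the commands, 0 = run them

foreach my $infile (@families) {
    my @cmd = clustalw_args($infile);
    print "@cmd\n";
    next if $dry_run;

    # The list form of system() passes each argument without shell quoting.
    system(@cmd) == 0
        or warn "clustalw failed on $infile\n";
}
```

Refusing names that do not end in .fa keeps the output file from accidentally overwriting the input.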

Visualizing Alignments


In addition to looking at human-readable alignment formats like the ClustalW default, you can run ClustalX, which color codes the alignment, or other alignment viewers. Another excellent alignment viewer is JalView. Both can be started by typing their names without arguments at the Unix prompt:

 > clustalx
 > jalview

Wget

Wget is a handy little Unix utility that you can use to grab things off of the web. Its format is very simple:

wget "URL" -O output

There are of course many options, which you can list using wget --help.

You can get files from NCBI automatically using wget and a special NCBI address called eutils. Eutils defines simple URL formats for downloading sequences. We can use wget to request URLs that conform to those formats and automagically download sequences!

#!/usr/bin/perl
# Author        :
# Date          : Tue Aug 14 08:46:04 PDT 2007
# Description   : Download one protein sequence from NCBI in FASTA format using wget
 
use strict;
use warnings;
 
 
my $url
    = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=77417679&retmode=text&rettype=fasta";
 
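# Quote the URL so the shell does not interpret the "&" characters.
system("wget \"$url\" -O test.seq2") == 0
    or die "wget failed for $url\n";
sleep 3;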
 
__END__

The sleep command keeps you from hammering the NCBI server when you use this in a loop, as in the following script:

#!/usr/bin/perl
# Author        :
# Date          : Tue Aug 14 08:46:04 PDT 2007
# Description   : Download several protein sequences from NCBI, one file per protein ID
 
use strict;
use warnings;
 
my @proteins = ( "77417679", "13361104" );
 
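foreach my $download_protein (@proteins) {
    my $url = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=$download_protein&retmode=text&rettype=fasta";

    # Save each sequence to a file named after its protein ID.
    system("wget \"$url\" -O $download_protein.txt") == 0
        or warn "wget failed for $download_protein\n";
    sleep 3;    # pause between requests to go easy on the NCBI server
}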
 
__END__
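If the downloaded sequences are meant to go into a single ClustalW input file, one request for all of the IDs may be more convenient than a loop. The sketch below only builds and prints the URL. It assumes that efetch accepts a comma-separated list in its "id" parameter and returns all of the records in one FASTA response; the sub name efetch_url is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build an efetch URL for one or more protein IDs.
# Assumption: several IDs can be joined with commas in the "id" parameter.
sub efetch_url {
    my (@ids) = @_;
    my $base = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi';
    return "$base?db=protein&id="
        . join( ',', @ids )
        . '&retmode=text&rettype=fasta';
}

my @proteins = ( '77417679', '13361104' );
print efetch_url(@proteins), "\n";
```

The printed URL can be passed to wget in the same way as in the scripts above.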
 
 

Wget is not included in the latest versions of Mac OS X. The following link explains how to add it:

http://radio.javaranch.com/bear/2005/03/02/1109829480523.html