Main Goal


We would like to find homologues of the S. cerevisiae RAS-family proteins in other yeast species and to analyze how well each one is conserved.

In order to achieve this, we will need to complete the steps below.

Outline


Getting the cerevisiae proteins

  • Find the coordinates of the 23 cerevisiae RAS genes in the GFF3 genome annotation file.
  • Using the annotated coordinates, pull out the DNA sequence of each gene from the cerevisiae genome fasta.
  • Translate the coding sequence into protein for each gene.

Finding homologues

  • BLAST each cerevisiae RAS-family protein against all the yeast ORFs.
  • Parse the BLAST results and pull out the best hit from each species.

Analyzing conservation

  • For each cerevisiae protein, make multiple alignments of the homologues.
  • Map conservation to structure.

Automate the analysis

  • Write a master Perl script that runs all parts of the project automatically.



Project Details

Getting the cerevisiae proteins

For complete specification, see exercise descriptions under lecture 6.1.


Finding homologues

For complete specification, see exercise descriptions under lecture 8.1.

  • A combined BLAST database of C. albicans, C. glabrata, K. lactis, and S. pombe proteins is available here: /usr/local/data/blast/fungal.combined.pep
    BLAST the 23 cerevisiae proteins against these four species.

  • Write a BLAST parser to find the best hits for each of the Ras proteins, and collect the output into a single file. Each output line is the query protein ID, a tab, and a comma-separated list of the best-hit protein IDs, one per fungal genome (for each species, the subject with the best e-value).
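The parsing step above could be sketched roughly as follows. This is a minimal sketch in Python rather than the Perl used in the course, and it rests on two assumptions that are not in the exercise text: that BLAST was run with tabular output (the 12-column layout produced by BLAST+ `-outfmt 6`, with the e-value in column 11), and that the species can be read from the subject ID. The `species_by_prefix` helper and the `Calb_`-style ID prefixes are hypothetical; replace them with whatever distinguishes the species in the real database.

```python
def species_by_prefix(subject_id):
    """Hypothetical rule: the text before the first '_' names the species."""
    return subject_id.split("_", 1)[0]


def best_hits(blast_lines, species_of=species_by_prefix):
    """Return {query: {species: (evalue, subject)}} keeping the lowest e-value."""
    best = {}
    for line in blast_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment and blank lines
        cols = line.rstrip("\n").split("\t")
        # Assumed tabular layout: query, subject, ..., e-value in column 11
        query, subject, evalue = cols[0], cols[1], float(cols[10])
        species = species_of(subject)
        per_query = best.setdefault(query, {})
        if species not in per_query or evalue < per_query[species][0]:
            per_query[species] = (evalue, subject)
    return best


def format_hits(best):
    """Format one line per query: query ID, a tab, comma-separated best hits."""
    lines = []
    for query in sorted(best):
        subjects = [best[query][sp][1] for sp in sorted(best[query])]
        lines.append(query + "\t" + ",".join(subjects))
    return lines
```

Calling `format_hits(best_hits(open("ras.blast.tsv")))` (a hypothetical file name) would give the lines to write to the single output file. Ties in e-value keep the first hit encountered, which is one possible choice among several.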

Analyzing conservation

For complete specification, see exercise descriptions under lecture 9.2.

  • The FASTA file with the fungal proteins is here: /usr/local/data/pep/fungal.combined.pep.fasta.gz
    For each of the 23 Ras proteins, use the parsed BLAST results to make a FASTA file containing the cerevisiae Ras protein and the top hits from the other species.
  • The Ras structure is from the human protein, so the human Ras sequence must be added to each of our multi-FASTA files if we want to map conservation to structure.
  • Make a multiple alignment of the Ras homologs for each of the 23 fasta files.
  • Calculate percent identity for each residue of the human RAS across the multiple alignment.
  • Map conservation to structure using PyMOL.
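The per-residue identity step above could be sketched as follows, in Python rather than the Perl used in the course. The definition of "percent identity" used here is an assumption: for every non-gap position of the human sequence, it is the percentage of the other aligned sequences that carry the same residue in that column, with gaps counted as mismatches. The function names and the `human` record ID are hypothetical.

```python
def read_fasta(lines):
    """Read (aligned) FASTA records into {id: sequence}."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # ID is the first word of the header
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def percent_identity_per_residue(alignment, ref_id):
    """Return one score (0-100) per residue of the reference sequence."""
    ref = alignment[ref_id]
    others = [seq for name, seq in alignment.items() if name != ref_id]
    if not others:
        raise ValueError("need at least one sequence besides the reference")
    if any(len(seq) != len(ref) for seq in others):
        raise ValueError("aligned sequences must all have the same length")
    scores = []
    for col, ref_res in enumerate(ref):
        if ref_res == "-":
            continue  # column has no residue in the reference protein
        matches = sum(1 for seq in others if seq[col].upper() == ref_res.upper())
        scores.append(100.0 * matches / len(others))
    return scores
```

Because alignment columns where the human sequence has a gap are skipped, the returned list has exactly one entry per human residue, in sequence order.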

Pipeline

For complete specification, see exercise descriptions under lecture 10.1.

Write a master Perl script that runs all the parts of the project automatically.
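The control flow of such a master script might look like the sketch below. It is written in Python for illustration only, since the exercise itself asks for Perl, and every script name in `STEPS` is hypothetical; substitute the scripts written for the earlier exercises. The idea is simply to run the steps in order and stop at the first one that fails.

```python
import subprocess
import sys


def run_pipeline(steps, runner=subprocess.run):
    """Run each (label, argv) step in order; stop at the first failure."""
    completed = []
    for label, argv in steps:
        print("running:", label, file=sys.stderr)
        result = runner(argv)
        if result.returncode != 0:
            raise RuntimeError("step failed: " + label)
        completed.append(label)
    return completed


# Hypothetical step scripts, one per part of the project.
STEPS = [
    ("get cerevisiae proteins", ["perl", "get_ras_proteins.pl"]),
    ("blast and parse best hits", ["perl", "find_homologues.pl"]),
    ("align and score conservation", ["perl", "analyze_conservation.pl"]),
]
```

Calling `run_pipeline(STEPS)` would then execute the whole project from start to finish.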



Solutions

Getting the cerevisiae proteins

  • Find the coordinates of the 23 cerevisiae RAS genes in the GFF3 genome annotation file.

  • Extract the DNA sequence of each gene's CDS from the cerevisiae genome.

  • Translate the coding sequence into protein for each gene.
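The three steps above could be sketched as follows, in Python rather than the Perl used in the course. This is a sketch under stated assumptions, not the course solution: it assumes that each CDS row in the GFF3 file names its gene in the `Parent` attribute (annotation files differ on this, so the matching rule may need adjusting), that the chromosome sequence is already loaded as a string, and that the standard genetic code applies. All function names are hypothetical.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def cds_segments(gff_lines, gene_id):
    """Return (seqid, strand, sorted [(start, end), ...]) for a gene's CDS rows."""
    seqid = strand = None
    segments = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        # Assumption: the CDS row lists the gene ID in its Parent attribute
        if gene_id not in attrs.get("Parent", "").split(","):
            continue
        seqid, strand = cols[0], cols[6]
        segments.append((int(cols[3]), int(cols[4])))
    return seqid, strand, sorted(segments)


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def extract_cds(chrom_seq, strand, segments):
    """Join the CDS segments; GFF3 coordinates are 1-based and inclusive."""
    dna = "".join(chrom_seq[start - 1:end] for start, end in segments)
    return reverse_complement(dna) if strand == "-" else dna


def translate(dna):
    """Translate codon by codon, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # X for ambiguous codons
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)
```

For a gene on the minus strand, the segments are joined in chromosome order first and the result is then reverse-complemented, which also handles genes with introns.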


Finding homologues

  • BLAST the 23 cerevisiae proteins against the four other yeast species.


  • Write a BLAST parser to find the best hits for each of the Ras proteins. Each output line is the query protein ID, a tab, and a comma-separated list of the best-hit protein IDs, one per fungal genome.


Analyzing conservation

  • For each of the 23 Ras proteins, using the parsed blast results, make a fasta file with the cerevisiae Ras protein and the top hits from the other species.


  • The Ras structure is from the human protein, so the human Ras sequence must be added to each of our multi-FASTA files if we want to map conservation to structure.


  • Make a multiple alignment of the Ras homologs for each of the 23 fasta files.


  • Calculate percent identity for each residue of the human RAS across the multiple alignment.
  • Map conservation to structure using PyMOL.
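One common way to get per-residue scores into PyMOL is to write them into the B-factor column of the PDB file and then color by B-factor. The sketch below shows that approach in Python; it is an illustration, not necessarily the method intended by the exercise. It assumes one score per human residue (for example a list of percent identities), that PDB residue number N corresponds to the N-th score, and that the protein is chain A. The function name is hypothetical.

```python
def set_bfactors(pdb_lines, scores, chain="A"):
    """Return PDB lines with scores written into the B-factor column."""
    out = []
    for line in pdb_lines:
        line = line.rstrip("\n")
        # ATOM records: chain ID in column 22, residue number in columns 23-26
        if line.startswith("ATOM") and len(line) >= 54 and line[21] == chain:
            resnum = int(line[22:26])
            # Assumption: PDB residue N corresponds to scores[N - 1]
            score = scores[resnum - 1] if 1 <= resnum <= len(scores) else 0.0
            line = line.ljust(80)
            # B-factor occupies columns 61-66 of the record
            line = line[:60] + "%6.2f" % score + line[66:]
        out.append(line)
    return out
```

After loading the rewritten file in PyMOL, a command such as `spectrum b` colors the structure by these values, so conserved and variable residues can be distinguished at a glance. Residues outside the scored range are set to 0.0 here, which is a choice worth revisiting if the structure covers residues missing from the alignment.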