Session 9.1


Databases


The concept of a database may be very familiar to you. I imagine 935 Pennsylvania Avenue having lots of rooms full of computers filled with databases on everything 'interesting' anyone does. You hear about big online retailers, like Amazon, hiring lots of very smart people to work on their databases. Sometimes specific types of databases come up, like Oracle.

While all of this sounds mysterious, databases can be very simple. For instance, a text file with a list of anything is technically a database, although not a very interesting one.

In bioinformatics, nearly everything you need to do can be done with text files and Perl. Using multi-dimensional data structures you can read in lists and tables and do complicated comparisons and queries. So if you are a good Perl programmer and are working on your own, you don't need to know about databases. But as the size of a project grows, reading in data to fill a multi-dimensional data structure and then computing on this structure can take some time. Databases are organized such that looking up information and comparing information is very very fast. Thus knowing a bit about their potential will do you well.

SQL/SQLITE3


Today we are going to preview a type of database language called SQL. Specifically we are going to use a light-weight version of SQL called SQLITE3.

SQL stands for Structured Query Language. SQL is used to create and manipulate a type of database, called a relational database. This is just a fancy name for a database that uses tables to store information. You are probably very used to using Excel to store and manipulate data in tables. SQL works similarly, except you don't use a GUI editor. Instead you use commands to populate, manipulate and query your tables.

SQLITE3 is on your live cd and is also standard on Mac OS X (you might need to add it using the Install Discs).

SQLITE3 is started by typing it on the command-line plus the name of the database you are either creating or restoring from a previous session:

 > sqlite3 your_database_name.db

The following gets printed to the screen:

 SQLite version 3.3.5
 Enter ".help" for instructions
 sqlite>

You have a new prompt. This is the environment in which you use the SQL language to work on databases. To get a quick reference list of basic SQLITE3 commands, it tells you to type ".help". This prints the following:

.databases             List names and files of attached databases
.dump ?TABLE? ...      Dump the database in an SQL text format
.echo ON|OFF           Turn command echo on or off
.exit                  Exit this program
.explain ON|OFF        Turn output mode suitable for EXPLAIN on or off.
.header(s) ON|OFF      Turn display of headers on or off
.help                  Show this message
.import FILE TABLE     Import data from FILE into TABLE
.indices TABLE         Show names of all indices on TABLE
.mode MODE ?TABLE?     Set output mode where MODE is one of:
                         csv      Comma-separated values
                         column   Left-aligned columns.  (See .width)
                         html     HTML <table> code
                         insert   SQL insert statements for TABLE
                         line     One value per line
                         list     Values delimited by .separator string
                         tabs     Tab-separated values
                         tcl      TCL list elements
.nullvalue STRING      Print STRING in place of NULL values
.output FILENAME       Send output to FILENAME
.output stdout         Send output to the screen
.prompt MAIN CONTINUE  Replace the standard prompts
.quit                  Exit this program
.read FILENAME         Execute SQL in FILENAME
.schema ?TABLE?        Show the CREATE statements
.separator STRING      Change separator used by output mode and .import
.show                  Show the current values for various settings
.tables ?PATTERN?      List names of tables matching a LIKE pattern
.timeout MS            Try opening locked tables for MS milliseconds
.width NUM NUM ...     Set column widths for "column" mode

These are all SQLITE3-specific commands. I will be introducing general SQL commands which will be UPPERCASE. The SQLITE3-specific commands start with a dot and will be lowercase.

So you don't get trapped, the most important command in the help file is ".exit". This will get you back to the unix prompt.

CREATE


Typically you want to make a table and put some data into it. The way you make tables is with the "CREATE" command.

sqlite> CREATE TABLE your_table_name (first_col_name first_col_data_type,
second_col_name second_col_data_type, ...);

The data types available are:

TEXT(size)    - maximum
NUMERIC(size) - maximum
varchar(size) - maximum
char(size)    - defined size
decimal(size) - maximum
integer(size) - maximum
 
NOTE: NUMERIC and DECIMAL will understand blast E-values

While varchar will work for everything, it is best to be as specific as possible, both because it enables you to run the appropriate commands on the values in the columns and also because it makes searches faster.

Let's make a GFF style table for the saccharomyces_cerevisiae.gff file with which you've been working.

sqlite> CREATE TABLE scer_annots (seqid varchar, source varchar,
type varchar, start integer, end integer, score decimal, strand
varchar, phase integer, attributes varchar);

Now we've made an empty table.

SELECT


The way you query for information in a table is the "SELECT" command. This is the most basic usage for the "SELECT" command:

sqlite> SELECT * FROM your_table_name;

SQL reads like a natural language. This "SELECT" statement means "select all columns from your table".

So we can verify that our table is empty by entering:

sqlite> SELECT * FROM scer_annots;

.import


So we have this GFF file and we want to get it into this table in our database. How does this work? In SQLITE3 you use the ".import" command.

sqlite> .import your_file your_table_name

You need to set the separator to be whatever form of separation your input file has. To change it to tab-delimited enter:

sqlite> .separator "\t"

To import our GFF file:

sqlite> .import saccharomyces_cerevisiae_clean.gff scer_annots

Notice that I'm using a "clean" version of the GFF file. SQL will get upset if there are any lines in the file that do not have exactly the columns you specified when you created the table. There are ways to relax this but I won't get into that here. So the "clean" version contains GFF format lines only.

More SELECT


Now when we try to use the "SELECT" command we get everything printed to the screen. To limit the amount of output printed to the screen (sort of like "head"), you can add "limit n" to the "SELECT" command to get "n" lines.

sqlite> SELECT * FROM scer_annots LIMIT 4;

We can turn headers on so that this is easier to read:

sqlite> .headers on

"SELECT" is actually a very powerful command that can return every imaginable slice of the data.

For instance, if you wanted the count of the genes:

sqlite> SELECT COUNT(*) FROM scer_annots WHERE type="gene";

Or perhaps you want the mean gene length:

sqlite> SELECT AVG(1 + end - start) FROM scer_annots WHERE type="gene";

Or perhaps the maximum gene length for each chromosome:

sqlite> SELECT seqid, MAX(1 + end - start), attributes FROM scer_annots WHERE type="gene" GROUP BY seqid;

Or perhaps you want to list out all the distinct type values:

sqlite> SELECT DISTINCT type FROM scer_annots;

Or perhaps you want the list of features with lengths between 1 and 100 on chromosome I and further you want the list ordered by position on the chromosome:

sqlite> SELECT type, start, end, (1 + end - start) FROM scer_annots WHERE seqid="chrI" AND (1 + end - start) BETWEEN 1 AND 100 ORDER BY start;

You get the idea. There are many many ways to query the data and all of these can be done quickly and easily.

Editing


I won't explain how they work but you can use the "INSERT", "UPDATE", "DELETE" and "DROP" commands to add, alter and subtract information in tables.

Interacting With SQL


For SQLITE3 you can run an SQL script using the ".read" command and can output the results of a "SELECT" command to a file using the ".output" command.

Quite importantly, in addition to these methods, one can run SQL from within a Perl script using the DBI (Database Interface) module. This is extremely powerful as you can quickly access information in databases or fill databases on the fly with Perl scripts. The specifics of how you do this are beyond the scope of this course but are a simple extension of what you have already learned.

Online Resources


Tutorials:
http://dev.mysql.com/doc/refman/4.1/en/index.html
http://www.w3schools.com/sql/default.asp

Perl DBI:
http://dbi.perl.org/
http://search.cpan.org/~juerd/DBIx-Simple-1.31/


Exercises


Problem 1: Playing with GFFs


  • Parse your Scer GFF file in Perl to get rid of comments and create seperate GFF files for each chromosome. Create tables in SQL and import these files.
  • Pick your favourite chromosome and work on the table for that chromosome.
  • Run SELECT commands to get a count of the different feature types (hint: GROUP BY might be very helpful).
  • What is the average start position for a gene? (Yes this is kind of a silly question)
  • Get comfortable using SELECT WHERE and GROUP BY commands
  • Output just the gene start, stops, and attributes to a new file using .output. Output the same for a different chromosome to a different file.

Problem 2: BLAST Output


  • Finish your BLAST parsing for the Project before working on this problem.
  • Parse your BLAST file in Perl to get rid of the comment lines and separate species_subject from subject_id. e.g.:

#Some comments
RAS1    cgla:CAGL0B04521g   54.71   340 118 3   4   309 2   339 4e-85    311
 
becomes
 
RAS1    cgla    CAGL0B04521g   54.71   340 118 3   4   309 2   339 4e-85    311

  • Create a database for your BLAST results files.
  • Import your blast results from CDC42 and RAS1 into seperate tables (remember you must first create the tables, then import the data in sqlite3 on the command line).
  • Use SQL to query each of these tables to return the average E-value grouped by species.
  • Output the query, subject, evalue to a file for RAS1 hits from k. lactis to a file using .output.
  • How many distinct evalues are there in the CDC42 hsps? Output an ordered list.

Return to the Project