Linux: The very basics

Why Unix, orienting yourself, Ubuntu CD+memory stick, text editor, web browser

> [need a shell before any work]

Before typing any of the commands, you need a place to type them.
Open the shell/terminal/console (same thing) as a first step.

> apropos 'search text' [what's the command to do 'xxx'?]

There are many more commands in Linux than we can cover in a single lecture or in 10 days. Use the 'apropos' command to search for ways to do something particular.
e.g. apropos 'rename'
e.g. apropos 'rename files'

> man command_name [look up information on a particular command]

Most commands have many useful options. For information on a particular command, look at the manual pages with the "man" command.
e.g. man rename

>pwd [where am I?]

(Print Working Directory) Prints your current location, such as "/home/lenny/Docs". This is the directory in which you are at the current moment. If you create any files, they will appear in this spot. When you first open the terminal shell, you will be in your "home"/base directory ("/home/lenny").

>cd directory_path [move to the named directory]

(Change Directory) Given a complete path, this command moves your "current
location" to the specified directory.
e.g. cd /home/lenny/ [will take me to my home directory]
If you simply want to descend into a subdirectory from your current position,
you can omit the full path and just specify the directory name. This is called giving
a relative path; it's a frequent source of error in programming, because the same
relative path points to different places depending on where you run it from.
e.g. cd Docs [if I execute this from /home/lenny/, I'll end up in
/home/lenny/Docs]
To move back to the parent directory, just do "cd .."
e.g. cd .. [If I execute this from /home/lenny/Docs/, I will go to
/home/lenny/]
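
A small sketch of absolute and relative moves, run inside a throwaway directory so nothing of yours is touched. The "lenny/Docs" layout is made up for the demo; it only mimics the examples above.

```shell
# Build a hypothetical directory tree in a temporary location.
demo=$(mktemp -d)
mkdir -p "$demo/lenny/Docs"

cd "$demo/lenny"              # absolute path: works from anywhere
cd Docs                       # relative path: works only because we are in .../lenny
in_docs=$(basename "$(pwd)")  # name of the directory we are in now
echo "$in_docs"

cd ..                         # move back up to the parent directory
in_parent=$(basename "$(pwd)")
echo "$in_parent"
```

Running "pwd" after every "cd" is a cheap way to catch relative-path mistakes early.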

>ls [lists contents of a directory]

(LiSt) Shows the files and directories in the current location.
Options:
ls -l [lists permissions, owners of files, sizes,
date last modified]
ls -F [directories will be shown with a "/" after the name,
so it's easier to tell them apart from files]
ls path [lists contents of the specified directory]
ls .. [list contents of the parent directory]
Options can be combined as: ls -lF /home/lenny/Docs/
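
To see what the options above actually print, here is a sketch using a temporary directory with one subdirectory and one file. The names "Docs" and "notes.txt" are invented for the demo.

```shell
# Create a hypothetical directory holding one subdirectory and one file.
demo=$(mktemp -d)
mkdir "$demo/Docs"
touch "$demo/notes.txt"

listing=$(ls -F "$demo")   # directories get a trailing "/"
echo "$listing"

ls -lF "$demo"             # long listing and "/" markers combined
```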

Now that we can move around directories and look inside them,
a few words about the Linux directory structure. The topmost level
is "/" and everything else resides somewhere below in the hierarchy of
directory branches.

>mkdir directory_name [Create a given directory]

(MaKe DIRectory) Exactly what it says - lets you create new directories.

>cp original_name copy_name [copy file or directory]

(CoPy) It is possible to copy all subdirectories and their files
under a given directory recursively with the command:
cp -r source_directory destination_directory

>mv source destination [move files or directories]

(MoVe) Similar to the copy command, but this actually moves the desired
file or directories instead of copying them.
This is the command that is used to rename files or directories.
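
The three commands above (mkdir, cp, mv) can be tried together in a scratch directory. This is only a sketch; the file and directory names are made up.

```shell
# Work in a temporary directory so no real files are affected.
demo=$(mktemp -d)
cd "$demo"

mkdir project
echo "ACGT" > project/seq.txt

cp project/seq.txt project/seq_copy.txt       # copy a single file
cp -r project project_backup                  # copy a directory and everything in it
mv project/seq_copy.txt project/renamed.txt   # renaming is just moving to a new name

ls project project_backup
```

After the "mv", the name "seq_copy.txt" no longer exists in "project", but the backup made earlier still contains it.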

>more file_name [view contents of a file]

Shows contents of a file. Allows scrolling, jumping, searching.
Can be used only for text files (won't show you MS Word or PowerPoint -
need special programs for that)
Most useful options (once inside the viewer):
-"space" to scroll down a page
-"G" (capital G) to go to the end of the file (works in "less")
-"/" to type in text to search for
-"q" to quit
-"h" for a complete list of functions

>less file_name [view contents of a file - "less is more, enhanced"]

Everything from "more" applies, but this is a nicer viewer.

>most file_name [most is most]


>head filename [print first 10 lines of the file]

By default, prints the top 10 lines of the input file.
To print a different number of lines, execute:
head -n number filename
So, "head -n 100 unix_ref.txt" will print the
top 100 lines of this file.

>tail filename [print the last ten lines of the file]

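Just the reverse of "head": by default, prints the last 10 lines of the
input file; "tail -n number filename" prints a different number of lines.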
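
A quick way to convince yourself what "head" and "tail" return is to run them on a file whose contents you know. The sketch below assumes the "seq" utility is available (it is on typical Linux systems); "numbers.txt" is a made-up file name.

```shell
# Make a hypothetical 50-line file containing the numbers 1 to 50.
demo=$(mktemp -d)
seq 1 50 > "$demo/numbers.txt"

first_three=$(head -n 3 "$demo/numbers.txt")       # lines 1, 2, 3
last_two=$(tail -n 2 "$demo/numbers.txt")          # lines 49, 50
default_count=$(head "$demo/numbers.txt" | wc -l)  # no -n given: 10 lines

echo "$first_three"
echo "$last_two"
echo "$default_count"
```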

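>unzip file_name [unpack a ".zip" file]

Often, files that you download will be compressed so they take up
less space, but you cannot use them in this form.
If the file is compressed and has a ".zip" at the end, such as
"yeast_genome.fa.zip", simply invoke:

unzip yeast_genome.fa.zip [and you should end up with
yeast_genome.fa in the directory, next to the original ".zip" file.]

Note that files ending in a capital ".Z" were made by the older "compress"
program; unpack those with "uncompress" or "gunzip", not "unzip".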

>gunzip file_name [uncompress a ".gz" file]


Same as above, but for files with a ".gz" extension. If the
file type is ambiguous ("yeast_genome.compressed"), try both
unzip and gunzip - one of them will give you an error, and
the other one should uncompress.

>tar -xf file_name [unpack an archive file]


Frequently, many files will be combined or packaged into
a single one (archived). To extract them, use this command.
Such files usually have a ".tar" ending. Also,
these archives can be compressed, and you may get a file such
as "cerevisiae_chromosomes.tar.gz". In that case, first
uncompress the file with gunzip, and then unpackage it with
tar (or do both steps at once with "tar -xzf file_name").
Of course, "tar" also allows you to combine many files into one,
not just extract them. Get more info on the tar command with
"man tar".

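The full round trip (pack, compress, uncompress, unpack) can be rehearsed on a couple of tiny files. This is a sketch under the assumption that "tar" and "gzip" are installed; the "chromosomes" directory and its contents are invented.

```shell
# Set up a hypothetical directory with two small files.
demo=$(mktemp -d)
cd "$demo"
mkdir chromosomes
echo ">chr1" > chromosomes/chr1.fa
echo ">chr2" > chromosomes/chr2.fa

tar -cf chromosomes.tar chromosomes   # -c creates an archive, -f names the archive file
gzip chromosomes.tar                  # replaces it with chromosomes.tar.gz
rm -r chromosomes                     # remove the originals to prove extraction works

gunzip chromosomes.tar.gz             # back to chromosomes.tar
tar -xf chromosomes.tar               # -x extracts the archived files
ls chromosomes
```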

>cat file1 file2 ... [print named files to the screen]

(conCATenate) If given just one file, cat will simply spit out the
contents of the file to the screen. However, given multiple files,
this will concatenate all of them, printing one after the other.

>egrep 'search_string' file

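(Global Regular Expression Print) "egrep" is "grep" with extended
regular expressions (the same as "grep -E"); for plain text searches
like the ones here, the two behave the same.

Searches for the "search string" in a text file and
prints out all lines where it finds the desired text.

e.g.: grep '>' fasta_file.fa [Will print to screen the headers
of the fasta file, as each header line
contains ">". To match ">" only at the
beginning of a line, search for '^>'.]

-v will invert the search.
e.g.: grep -v '>' fasta_file.fa
[Prints out all non-header lines, that is,
the sequences only.]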
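
Here is a sketch of those searches on a tiny, made-up fasta file ("tiny.fa" and its sequences are invented), so you can check the output by eye.

```shell
# Write a hypothetical fasta file with two records.
demo=$(mktemp -d)
printf '>seq1\nACGT\nGGCC\n>seq2\nTTAA\n' > "$demo/tiny.fa"

headers=$(grep '>' "$demo/tiny.fa")          # lines containing ">"
sequences=$(grep -v '>' "$demo/tiny.fa")     # every line that does NOT contain ">"
header_count=$(grep -c '^>' "$demo/tiny.fa") # -c counts lines that start with ">"

echo "$headers"
echo "$sequences"
echo "$header_count"
```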

>wc file_name

(Word Count) Reports the number of:
lines, words, characters (in that order)
of a given file.

star "*"

The star of unix is simple and incredibly useful. This symbol
can be used with all of the commands listed above, and means that you
want it to MATCH ANYTHING. This is the "wild-card".

e.g.: ls *.fa [list all fasta files - files with ".fa" on the end]
wc * [Do the word-count on all files in the current directory.]
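
It helps to know that the shell, not the command, fills in the star. A sketch with three made-up files shows this: "echo" knows nothing about files, yet it prints the matching names.

```shell
# Create three hypothetical empty files in a scratch directory.
demo=$(mktemp -d)
cd "$demo"
touch chr1.fa chr2.fa notes.txt

fasta_files=$(ls *.fa)   # the shell turns *.fa into: chr1.fa chr2.fa
expanded=$(echo *.fa)    # echo just prints the names the shell handed it

echo "$fasta_files"
echo "$expanded"
```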

pipe '|'

Piping with "|" connects unix commands, allowing the output
of one command to "flow through the pipe" to another.

e.g.: grep '>' file.fasta | wc [get all header lines from the
fasta file, but instead of printing
them to the screen, send the output
to the "wc" command, to count the
number of headers. Effectively,
this will count number of sequences
in the file.]
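
The same pipe, tried on a made-up three-record file. This sketch uses "wc -l" so that only the line count is reported, which is the number of headers.

```shell
# Hypothetical fasta file with three records.
demo=$(mktemp -d)
printf '>seq1\nACGT\n>seq2\nTTAA\n>seq3\nGGCC\n' > "$demo/file.fasta"

# grep writes the header lines into the pipe; wc -l counts them.
n_headers=$(grep '>' "$demo/file.fasta" | wc -l)
echo "$n_headers"
```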

">"

In addition to redirecting output to another command, the results can
be sent into a file with ">".

e.g. cat file1 file2 file3 > file4 [Combine files 1-3 into file4.]

The ">" will create a new file or overwrite an existing one. If you
simply want to add to a file, use ">>".
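
The difference between ">" and ">>" is easy to get wrong, so here is a sketch on throwaway files (all file names are invented) that shows both overwriting and appending.

```shell
# Scratch directory with two small hypothetical files.
demo=$(mktemp -d)
cd "$demo"
echo "one" > file1
echo "two" > file2

cat file1 file2 > file4   # file4 is created and holds: one, two
echo "three" >> file4     # ">>" appends a third line
echo "replaced" > file1   # ">" silently overwrites the old contents

cat file4
cat file1
```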

>cut -f NUMBER file_name [Extract one or more columns from a file]

Prints only the specified column/field from a text file. By
default, expects the fields to be tab-separated.

Options:
-d ' ' [specifies the character separating the columns.
If fields are separated by spaces, just use
"cut -d ' ' -f..."; if they are separated by
commas, "cut -d ',' -f...", and so on.]

Examples:

cut -f 1 some_file.txt [get the first column of the file]
cut -d ' ' -f 3-5 some_file.txt [get columns 3,4,5 from
a space-separated file]
cut -f 2,6,7 some_file.txt [get columns 2,6,7 from the file]
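
A sketch of "cut" on two tiny made-up files, one tab-separated and one comma-separated. The gene names and file names are invented for illustration.

```shell
# Hypothetical tab-separated table with three columns.
demo=$(mktemp -d)
printf 'geneA\tORF\tVerified\ngeneB\tORF\tDubious\n' > "$demo/table.tsv"
printf 'a,b,c\n' > "$demo/row.csv"

first_col=$(cut -f 1 "$demo/table.tsv")         # tab is the default separator
cols_2_3=$(cut -f 2,3 "$demo/table.tsv")        # several columns at once
csv_second=$(cut -d ',' -f 2 "$demo/row.csv")   # -d switches the separator to a comma

echo "$first_col"
echo "$cols_2_3"
echo "$csv_second"
```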

>sort file_name [Order lines in a file alphabetically/numerically]

Will sort lines in a text file. There are many useful
options:

sort -r file_name [will do a bottom-up reverse sort]
sort -k # file_name [will sort on the specified column; by default,
columns are separated by blanks (tabs or spaces)]
sort -k # -t ' ' file [also sort on given column, but every single
space now counts as a column separator]
sort -n file_name [do a numeric rather than alphabetical sort]

Of course, all the above options can be combined.
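
The difference between alphabetical and numeric sorting trips up most beginners. The sketch below uses two small made-up files to show it, along with a combined column sort.

```shell
# Hypothetical input files.
demo=$(mktemp -d)
printf '10\n9\n100\n' > "$demo/nums.txt"
printf 'b 2\na 10\nc 1\n' > "$demo/pairs.txt"

alpha=$(sort "$demo/nums.txt")                      # compares text: 10, 100, 9
numeric=$(sort -n "$demo/nums.txt")                 # compares values: 9, 10, 100
reversed=$(sort -n -r "$demo/nums.txt")             # largest first: 100, 10, 9
by_column=$(sort -t ' ' -k 2 -n "$demo/pairs.txt")  # numeric sort on column 2

echo "$alpha"
echo "$numeric"
echo "$reversed"
echo "$by_column"
```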

>uniq file_name [print distinct lines from a sorted file]


This will run through the whole file, comparing every two
adjacent lines, and will remove the duplicate lines.
Because only adjacent lines are compared, duplicates that are
not next to each other survive, so unless you have a good
specific reason not to, you should always sort the file first.
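
To see why sorting first matters, here is a sketch on a made-up file in which the duplicates are never adjacent. It also shows "uniq -c", which prefixes each distinct line with the number of times it occurred.

```shell
# Hypothetical file: two distinct values, interleaved.
demo=$(mktemp -d)
printf 'Verified\nDubious\nVerified\nDubious\nVerified\n' > "$demo/types.txt"

unsorted_lines=$(uniq "$demo/types.txt" | wc -l)   # nothing adjacent repeats: still 5 lines
distinct=$(sort "$demo/types.txt" | uniq)          # Dubious, Verified
counts=$(sort "$demo/types.txt" | uniq -c)         # count per distinct line

echo "$unsorted_lines"
echo "$distinct"
echo "$counts"
```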

Exercises

Problem1

  • Make a directory called "fasta_files" and change into it
  • Go to http://www.yeastgenome.org followed by "FTP" (under Data Download), then "sequence", "genomic_sequence", "chromosomes", "fasta"
  • Download one-by-one all cerevisiae chromosomes
  • Make a single whole genome file called "cerevisiae_genome.fasta"
  • Count the chromosomes in the whole genome file using commands from the lecture
  • Get size of genome, excluding the header lines

Problem2


Problem3

  • Use "apropos" to find the command to look up 'disk space' usage.
  • Use "man" to find out how the command works.
  • Run the command on your system to see how much free space you have.

Problem4

  • Make a primer file for the following pair of primers:
    • forward:GTTGGAGCTGGAGCAGAAGA reverse:AGCTCCACCACTGAAAGCTC, product size=245
    • The primer file format is "name forward_primer reverse_primer product_size", with tabs separating the columns.
  • Run e-PCR against the cerevisiae genome
  • Change e-PCR parameters to allow a 1bp mismatch
  • Count matches (not by hand!) with a 2bp mismatch
  • Count matches with a 200bp margin, in addition to the 2bp mismatch

Problem5

  • Make a temporary directory under "fasta" and "cd" into it.
  • Connect to the SGD ftp server with the 'ftp' program:
    • "ftp genome-ftp.stanford.edu"
  • Use "ls" and "cd" to navigate to the chromosomes directory.
  • Look at the man pages for the "ftp" program to figure out how to download, with a single command, all of the chromosomes for cerevisiae.

Bonus


[There's a trick necessary here. Look under the "FTP Options" of the man page]

  • Figure out how to count the different types of genes in #3 without the "wc" command.
You should be able to get the breakdown of all the different gene types and their counts with a single statement (with pipes of course)

Solutions


Problem1


  • Number of chromosomes (once the chromosomes are all in one directory)
cat *.fsa > cerevisiae_genome.fa
 
egrep '>' cerevisiae_genome.fa  | wc
 
or
egrep -c '>' cerevisiae_genome.fa

The total number of chromosomes is 17. This includes the mitochondrial chromosome, and if you did
"egrep -c 'chr' cerevisiae_genome.fa" you would have missed it. In general, fasta file format
always has a header line starting with ">" before the sequence, whether DNA or protein.

  • Genome size
 egrep -v '>' cerevisiae_genome.fa  | wc
Being picky here, notice that "wc" includes line breaks in the total character count ("wc" on a file with "Hello" will give 6).
So to get the real genome size, subtract the number of lines (202620) from the character count (12359298).
Cerevisiae thus has 12,156,678bp.
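
The subtraction can also be left to the shell. The sketch below repeats the calculation on a tiny made-up fasta file (8 bases in total) rather than the real genome, and shows an alternative that deletes the line breaks with "tr" before counting.

```shell
# Hypothetical fasta file: sequence lines ACGT, AC and GG, i.e. 8 bases.
demo=$(mktemp -d)
printf '>chr1\nACGT\nAC\n>chr2\nGG\n' > "$demo/tiny.fa"

chars=$(grep -v '>' "$demo/tiny.fa" | wc -c)   # 11: 8 bases plus 3 line breaks
lines=$(grep -v '>' "$demo/tiny.fa" | wc -l)   # 3 sequence lines
bases=$((chars - lines))                       # characters minus line breaks

# Alternative: strip the line breaks first, then count characters.
bases_tr=$(grep -v '>' "$demo/tiny.fa" | tr -d '\n' | wc -c)

echo "$bases"
echo "$bases_tr"
```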

Problem2

  • Chromosome count using the features file.
cut -f 7 SGD_features.tab | egrep 'chr' | sort | uniq | wc

  • Count total genes: 6605.
cut -f 2 SGD_features.tab| egrep 'ORF' | wc

  • Verified Genes:4648.

 egrep ORF SGD_features.tab  | cut -f 1-3 | egrep 'Verified' | wc 

  • Uncharacterized: 1142.
 egrep ORF SGD_features.tab  | cut -f 1-3 | egrep 'Unchar' | wc 

  • All gene types:Dubious,Uncharacterized,Verified,Verified|silenced_gene.

cut -f 2-3 SGD_features.tab | egrep 'ORF' | cut -f 2 | sort | uniq

Bonus


  • "wget"
wget --retr-symlinks ftp://genome-ftp.stanford.edu/pub/yeast/sequence/genomic_sequence/chromosomes/fasta/*.fsa

  • Counting different gene types

 cut -f 2-3 SGD_features.tab | egrep 'ORF' | cut -f 2 | sort | uniq -c

815 Dubious
1142 Uncharacterized
4644 Verified
4 Verified|silenced_gene