Grep

Grep stands for "globally search for a regular expression and print". This is one of the more useful commands you will come across, since it can pull the lines you care about out of a large file and leave the unwanted information behind. For this tutorial you might want to refresh your copy of foo.

Also note that I haven't yet mentioned the fact that you can easily copy and paste using only your mouse. If you want to execute any of the commands shown here, simply highlight it using the left mouse button, move your mouse to the terminal, and paste using your middle mouse button.

The most basic thing grep does is search for the occurrence of a word or phrase. Recall that this is a protein structure file, and apart from the actual structural coordinates, there is a lot of other information in the file (use more or less to view the file). If we wanted to pull out information about the journal where this structure was published, we could use:

grep JRNL foo

and we get all the lines that contain "JRNL". Notice that the search is case sensitive, so the command:

grep jrnl foo

doesn't return anything (you can use the -i flag to ignore case). If we wanted sequence information:

grep SEQ foo

but now notice that in addition to the sequence lines (the ones that start with SEQRES), we get other lines that also contain SEQ. You might have missed these lines as they scrolled by, so use the pipe "|" to send the output of the grep command to the more command:

grep SEQ foo | more

If we wanted to get rid of these extra lines, we could do any number of things:

Use a more exact search string:
grep SEQRES foo

Find SEQ at the beginning of a line:
grep "^SEQ" foo

Take the first output and remove lines that contain REMARK:
grep SEQ foo | grep -v REMARK
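The three approaches are not quite interchangeable, and a small experiment makes the difference visible. The sketch below builds a made-up three-line sample (it is not taken from foo, and whether foo contains any lines like the third one is an assumption) and runs each search on it.

```shell
# Made-up sample lines standing in for foo; not real data.
sample=$(mktemp)
printf '%s\n' \
  'REMARK 999 SEQUENCE NOTE (MADE UP)' \
  'SEQRES   1 A    3  MET ALA TRP' \
  'SEQADV MADE-UP LINE THAT ALSO STARTS WITH SEQ' > "$sample"

exact=$(grep SEQRES "$sample")                    # only the SEQRES line
anchored=$(grep '^SEQ' "$sample")                 # every line starting with SEQ
filtered=$(grep SEQ "$sample" | grep -v REMARK)   # SEQ lines minus REMARK lines

printf 'exact:\n%s\n\nanchored:\n%s\n\nfiltered:\n%s\n' "$exact" "$anchored" "$filtered"
rm -f "$sample"
```

On this sample the exact string returns one line, while the other two searches return both lines that start with SEQ. Which behaviour you want depends on what else is in your file.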

Although the other information contained in the file is useful, we are most often just interested in the atomic coordinates. These are lines that begin with ATOM. If we want to restrict ourselves to these entries and put them in a new file, we could use:

grep "^ATOM" foo > bar

This takes only the lines that begin with ATOM and puts them in a new file called bar. If you look at the file bar, you will see that the different proteins within this complex have different chain letters (this is column 5 in the file, between the residue name and the residue number). If we wanted only chain G we could do (watch the spaces):

grep " G " bar > barG

or, in a single step starting from the original file:

grep "^ATOM" foo | grep " G " > barG
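A quick way to convince yourself that the spaces matter is to try the search on a few made-up lines. The sample below is hypothetical and only imitates the layout of ATOM records; it is not taken from foo.

```shell
# Made-up, PDB-like lines; not taken from foo.
sample=$(mktemp)
printf '%s\n' \
  'ATOM      1  CA  GLY A   1      10.000  20.000  30.000' \
  'ATOM      2  CA  GLY G   1      11.000  21.000  31.000' \
  'HETATM    3  O   HOH G 101      12.000  22.000  32.000' > "$sample"

with_spaces=$(grep '^ATOM' "$sample" | grep ' G ')    # chain letter surrounded by spaces
without_spaces=$(grep '^ATOM' "$sample" | grep 'G')   # also hits the G in GLY

echo "with spaces:"
echo "$with_spaces"
echo "without spaces:"
echo "$without_spaces"
rm -f "$sample"
```

With the spaces, only the chain G atom is kept; without them, the chain A glycine sneaks in too. One caveat worth keeping in mind: in fixed-column files a four-digit residue number leaves no space after the chain letter, so " G " would miss those lines.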

Grep also allows you to print the lines before or after each line that it matches. If we look for a tryptophan in chain G, we can use:

grep TRP barG

If we were just looking for the gamma carbon, we would use:

grep "CG  TRP" barG     (there are 2 spaces between CG and TRP).

This gives us only one atom, but we can get the rest of the residue by also asking for the 5 lines before the match (-B5) and the 8 lines after it (-A8):

grep -A8 -B5 "CG  TRP" barG
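The context flags can be tried on a small scale first. The lines below are made up (a shortened, hypothetical residue, not taken from barG), so the counts are 1 and 1 rather than 5 and 8. Note that -A and -B are found in GNU and BSD grep but are not part of the POSIX specification, so their availability on other systems is an assumption.

```shell
# Made-up, PDB-like lines; not taken from barG.
sample=$(mktemp)
printf '%s\n' \
  'ATOM     10  CA  TRP G   5      10.000  20.000  30.000' \
  'ATOM     11  CB  TRP G   5      11.000  21.000  31.000' \
  'ATOM     12  CG  TRP G   5      12.000  22.000  32.000' \
  'ATOM     13  CD1 TRP G   5      13.000  23.000  33.000' \
  'ATOM     14  CD2 TRP G   5      14.000  24.000  34.000' > "$sample"

# One line before (-B1) and one line after (-A1) the CG match.
context=$(grep -B1 -A1 'CG  TRP' "$sample")
echo "$context"
rm -f "$sample"
```

This prints the CB, CG and CD1 lines. The counts in the real command work the same way, and they rely on the atoms of a residue always being listed in the same order.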