Awk

Awk is a powerful scripting language, and we will cover only some of its basic functionality here. Like grep, awk can search for and match strings, but it can also manipulate or perform operations on what it matches. One of its strengths is that it understands columns: by default it splits each line into whitespace-separated fields. Since our protein files have regular columns, awk is an ideal tool for manipulating them. In awk, $0 represents the entire line, $1 is column one, $2 is column two, and so on.
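To make the field numbering concrete, here is a minimal sketch that pipes a single ATOM record into awk instead of reading a file, so it runs on its own. NF is a built-in awk variable holding the number of fields on the current line; the shell variable names line and fields are only for illustration.

```shell
# One ATOM record, passed to awk on standard input.
line='ATOM      2  CA  GLY A   3      21.740  42.192  56.340  1.00 31.83           C'

# $1, $3 and $4 are whitespace-separated fields; NF is the field count.
fields=$(printf '%s\n' "$line" | awk '{print $1, $3, $4, NF}')
echo "$fields"
```

For this record the command prints ATOM CA GLY 12: the record type, the atom name, the residue name, and the number of fields.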

Hopefully you still have your bar file around. If you take a look at the start or head of the file (using head bar) you see:

ATOM      1  N   GLY A   3      20.981  41.727  55.144  1.00 32.31           N  
ATOM      2  CA  GLY A   3      21.740  42.192  56.340  1.00 31.83           C  
ATOM      3  C   GLY A   3      23.199  41.923  56.131  1.00 31.04           C  
ATOM      4  O   GLY A   3      23.575  41.578  55.006  1.00 31.51           O  
ATOM      5  N   ARG A   4      24.036  42.204  57.126  1.00 30.76           N  
ATOM      6  CA  ARG A   4      25.434  41.770  57.056  1.00 30.41           C  
ATOM      7  C   ARG A   4      25.641  40.242  56.848  1.00 28.22           C  
ATOM      8  O   ARG A   4      26.668  39.803  56.312  1.00 27.92           O  
ATOM      9  CB  ARG A   4      26.395  42.248  58.231  1.00 31.02           C  
ATOM     10  CG  ARG A   4      25.866  42.297  59.675  1.00 32.91           C  

Suppose we were interested in how many glycines (GLY) are in this structure. We could find that with grep, but it is just as easy with awk:

awk '{if($3=="CA" && $4=="GLY"){num++}}END{print num+0, "glycines"}' bar

Since there is only one alpha carbon (CA) per residue, counting the CA atoms that belong to GLY gives us the total number of glycine residues.
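The same idea extends to every residue type at once. The following is a minimal sketch, assuming the whitespace-separated layout shown above. It uses an awk associative array (the name count is arbitrary) keyed on the residue name in column 4. Four records copied from the listing stand in for bar so the snippet runs on its own, and sort is added only because awk does not guarantee the order of a for-in loop.

```shell
# Four records from the listing above, standing in for the real file.
sample='ATOM      2  CA  GLY A   3      21.740  42.192  56.340  1.00 31.83           C
ATOM      3  C   GLY A   3      23.199  41.923  56.131  1.00 31.04           C
ATOM      6  CA  ARG A   4      25.434  41.770  57.056  1.00 30.41           C
ATOM      7  C   ARG A   4      25.641  40.242  56.848  1.00 28.22           C'

# Count one CA per residue, grouped by residue name, then sort the report.
tally=$(printf '%s\n' "$sample" |
  awk '$3=="CA" {count[$4]++} END {for (r in count) print r, count[r]}' |
  sort)
echo "$tally"
```

On the real file, the awk program would be run as awk '...' bar | sort, giving one line per residue type.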


As you will learn, at neutral pH some amino acids are acidic (negatively charged) and some are basic (positively charged). To get a rough idea of the overall charge of this protein complex, we can simply compute (number of basic) - (number of acidic). The basic residues are lysine (LYS) and arginine (ARG). The acidic ones are aspartic acid (ASP) and glutamic acid (GLU), so:

awk '{if($3=="CA" && ($4=="GLU" || $4=="ASP")){charge--};if($3=="CA" && \
($4=="ARG" || $4=="LYS")){charge++}}END{print "Total charge is", charge}' bar


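The (x,y,z) coordinates for each atom are in columns 7, 8, and 9. To estimate the center of this protein complex, we can average the coordinates over all atoms. This gives the geometric center, which only approximates the center of mass because the atoms are not weighted by their masses: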

awk '{x+=$7; y+=$8; z+=$9; atom++}END{print "The center is", x/atom, y/atom, z/atom}' bar

This simple script sums up each of the columns 7, 8, and 9 (referenced by $7, etc.) and increments the variable atom for each line it reads. It then prints out the average coordinates. If we instead wanted to use just the CA atoms:

awk '{if($3=="CA"){x+=$7; y+=$8; z+=$9; atom++}}END{print "Center is", x/atom, y/atom, z/atom}' bar

where now it only considers entries that have "CA" in column 3.
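As a worked check of the averaging, here is a minimal sketch that runs the CA-only version on three records copied from the listing, supplied inline so it does not depend on bar. Two things are additions and not part of the commands above: printf is used to fix the output at three decimals, and a guard avoids dividing by zero when no CA atoms are found.

```shell
# Three records from the listing above; only two of them are CA atoms.
sample='ATOM      2  CA  GLY A   3      21.740  42.192  56.340  1.00 31.83           C
ATOM      3  C   GLY A   3      23.199  41.923  56.131  1.00 31.04           C
ATOM      6  CA  ARG A   4      25.434  41.770  57.056  1.00 30.41           C'

# Sum x, y, z over CA atoms, then divide by the number of atoms counted.
center=$(printf '%s\n' "$sample" | awk '
  $3=="CA" {x+=$7; y+=$8; z+=$9; atom++}
  END {
    if (atom > 0) {
      printf "Center is %.3f %.3f %.3f\n", x/atom, y/atom, z/atom
    } else {
      print "No CA atoms found"
    }
  }')
echo "$center"
```

The two CA atoms average to (21.740+25.434)/2 = 23.587, (42.192+41.770)/2 = 41.981, and (56.340+57.056)/2 = 56.698, so the command prints Center is 23.587 41.981 56.698.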

These are some very simple examples, but awk is much more powerful.