Awk
Awk is a very powerful scripting language and we will only cover some basic functionality. Just like grep, awk can search and match strings, but it can also manipulate or perform operations on the things it matches. One of its strengths is that awk understands columns. Since our protein files have regular columns, awk is an ideal tool for manipulating them. In awk, $0 represents the entire line, but $1 is column one, $2 is column two, and so on.
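As a minimal sketch of how awk addresses columns, the following feeds it a single whitespace-separated line and prints a few fields. The variable names (line, whole, fields) are only for illustration; NF is awk's built-in count of the fields on the current line.

```shell
# One record in the same layout as the protein file (illustration only).
line='ATOM 2 CA GLY A 3 21.740 42.192 56.340'

# $0 is the entire line, unchanged.
whole=$(printf '%s\n' "$line" | awk '{print $0}')

# $3 and $4 are columns three and four; NF is the number of columns found.
fields=$(printf '%s\n' "$line" | awk '{print $3, $4, NF}')

echo "$whole"
echo "$fields"
```

By default awk splits each line on runs of spaces or tabs, so the columns do not need to be perfectly aligned.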
Hopefully you still have your bar file around. If you take a look at the start or head of the file (using head bar) you see:
ATOM 1 N GLY A 3 20.981 41.727 55.144 1.00 32.31 N
ATOM 2 CA GLY A 3 21.740 42.192 56.340 1.00 31.83 C
ATOM 3 C GLY A 3 23.199 41.923 56.131 1.00 31.04 C
ATOM 4 O GLY A 3 23.575 41.578 55.006 1.00 31.51 O
ATOM 5 N ARG A 4 24.036 42.204 57.126 1.00 30.76 N
ATOM 6 CA ARG A 4 25.434 41.770 57.056 1.00 30.41 C
ATOM 7 C ARG A 4 25.641 40.242 56.848 1.00 28.22 C
ATOM 8 O ARG A 4 26.668 39.803 56.312 1.00 27.92 O
ATOM 9 CB ARG A 4 26.395 42.248 58.231 1.00 31.02 C
ATOM 10 CG ARG A 4 25.866 42.297 59.675 1.00 32.91 C
Suppose we want to know how many glycines (GLY) are in this structure. We could find that with grep, but it is just as easy with awk:
awk '{if($3=="CA" && $4=="GLY"){num++}}END{print num+0, "glycines"}' bar
Since there is only one alpha carbon (CA) per residue, counting the CA atoms that belong to GLY residues gives us the number of glycines.
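The same count can also be sketched in awk's pattern { action } form, where the condition sits outside the braces instead of inside an if. The sample file below is a small subset of the head of bar shown earlier, and adding 0 to num is an optional touch that prints 0 rather than an empty string when nothing matches.

```shell
# A few records taken from the head of bar, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
ATOM 1 N GLY A 3 20.981 41.727 55.144 1.00 32.31 N
ATOM 2 CA GLY A 3 21.740 42.192 56.340 1.00 31.83 C
ATOM 5 N ARG A 4 24.036 42.204 57.126 1.00 30.76 N
ATOM 6 CA ARG A 4 25.434 41.770 57.056 1.00 30.41 C
EOF

# The action runs only on lines where the pattern is true.
gly=$(awk '$3=="CA" && $4=="GLY" {num++} END {print num+0, "glycines"}' "$sample")

# No tryptophan in the sample: num is never set, so num+0 prints 0.
trp=$(awk '$3=="CA" && $4=="TRP" {num++} END {print num+0, "tryptophans"}' "$sample")

echo "$gly"
echo "$trp"
rm -f "$sample"
```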
As you will learn, at neutral pH some amino acids are acidic (negatively charged) and some are basic (positively charged). If we wanted to get a rough idea of the overall charge in this protein complex, we could simply find (number of basic) - (number of acidic). The basic residues are lysines (LYS) and arginines (ARG). The acidic ones are aspartic acid (ASP) and glutamic acid (GLU), so:
awk '{if($3=="CA" && ($4=="GLU" || $4=="ASP")){charge--};if($3=="CA" && \
($4=="ARG" || $4=="LYS")){charge++}}END{print "Total charge is", charge}' bar
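To make the bookkeeping easier to check, here is a sketch that keeps separate tallies of acidic and basic residues and reports the difference at the end. The six CA records and their coordinates are made up for illustration and are not from the bar file; the variable names acidic and basic are likewise just choices for this example.

```shell
# Made-up CA records (coordinates are placeholders, not real data).
sample=$(mktemp)
cat > "$sample" <<'EOF'
ATOM 1 CA GLY A 1 1.000 1.000 1.000 1.00 10.00 C
ATOM 2 CA ARG A 2 2.000 2.000 2.000 1.00 10.00 C
ATOM 3 CA ASP A 3 3.000 3.000 3.000 1.00 10.00 C
ATOM 4 CA GLU A 4 4.000 4.000 4.000 1.00 10.00 C
ATOM 5 CA LYS A 5 5.000 5.000 5.000 1.00 10.00 C
ATOM 6 CA ARG A 6 6.000 6.000 6.000 1.00 10.00 C
EOF

# Count each class once per residue (CA lines only), then subtract.
summary=$(awk '$3=="CA" {
    if ($4=="ASP" || $4=="GLU") acidic++
    else if ($4=="LYS" || $4=="ARG") basic++
}
END { printf "acidic=%d basic=%d net=%d\n", acidic, basic, basic - acidic }' "$sample")

echo "$summary"
rm -f "$sample"
```

Printing the two tallies alongside the net value makes it easy to confirm the sign convention: each basic residue adds one and each acidic residue subtracts one.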
The (x,y,z) coordinates for each atom are in columns 7, 8, and 9. If we wanted to find the center of this protein complex (the unweighted average position of its atoms, which is a rough stand-in for the center of mass), we could use:
awk '{x+=$7; y+=$8; z+=$9; atom++}END{print "The center is", x/atom, y/atom, z/atom}' bar
This simple script sums up each of the columns 7, 8, and 9 (referenced by $7, etc.) and increments the variable atom for each line it reads. It then prints out the average coordinates. If we instead wanted to use just the CA atoms:
awk '{if($3=="CA"){x+=$7; y+=$8; z+=$9; atom++}}END{print "Center is", x/atom, y/atom, z/atom}' bar
where now it only considers entries that have "CA" in column 3.
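A slightly more defensive sketch of the CA-only average is shown here, using the two CA records from the head of bar. It formats the result with printf and checks that at least one CA atom was found before dividing; the counter name n and the fallback message are choices made for this example.

```shell
# The two CA records from the head of bar, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
ATOM 2 CA GLY A 3 21.740 42.192 56.340 1.00 31.83 C
ATOM 5 N ARG A 4 24.036 42.204 57.126 1.00 30.76 N
ATOM 6 CA ARG A 4 25.434 41.770 57.056 1.00 30.41 C
EOF

# Sum columns 7-9 over CA lines; divide only if something was counted.
prog='$3=="CA" {x+=$7; y+=$8; z+=$9; n++}
END {
    if (n > 0) printf "Center is %.3f %.3f %.3f\n", x/n, y/n, z/n
    else print "No CA atoms found"
}'

center=$(LC_ALL=C awk "$prog" "$sample")
empty=$(LC_ALL=C awk "$prog" /dev/null)

echo "$center"
echo "$empty"
rm -f "$sample"
```

Without the check, an input with no CA lines would make awk attempt a division by zero in the END block.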
These are some very simple examples, but awk is much more powerful.