Sed - the stream editor
This tutorial is a brief introduction to sed, meant to give the beginner a solid foundation in how sed works. It omits several commands and will not bring you to sed enlightenment by itself. To reach sed enlightenment, your best bet is to follow the seders mailing list. See the end of this page for some useful tips on manipulating pdb and other files.
Sed regular expressions
The sed regular expressions are essentially the same as the grep regular expressions. They are summarized below.
^ | Matches the beginning of the line
$ | Matches the end of the line
. | Matches any single character
(character)* | Matches arbitrarily many occurrences of (character), including none
(character)? | Matches 0 or 1 instance of (character). In the basic syntax used by GNU sed this is written \?; a bare ? works only in extended mode (-E).
[abcdef] | Matches any character enclosed in [] (in this instance, a b c d e or f). Ranges of characters such as [a-z] are permitted. The behaviour of this deserves more description. See the page on grep for more details about the syntax of lists.
[^abcdef] | Matches any character NOT enclosed in [] (in this instance, any character other than a b c d e or f)
(character)\{m,n\} | Matches m to n repetitions of (character)
(character)\{m,\} | Matches m or more repetitions of (character)
(character)\{,n\} | Matches n or fewer (possibly 0) repetitions of (character)
(character)\{n\} | Matches exactly n repetitions of (character)
\(expression\) | Group operator
\n | Backreference: matches the nth group
expression1\|expression2 | Matches expression1 or expression2. Works with GNU sed, but this feature might not work with other forms of sed.
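A few of the entries in the table can be tried directly from the shell. The following is a minimal sketch; the sample strings and variable names are made up for illustration.

```shell
# Interval: replace a run of exactly three digits.
interval=$(printf 'abc123def\n' | sed -e 's/[0-9]\{3\}/<num>/')
echo "$interval"

# Groups: capture two words and emit them in the opposite order.
swapped=$(printf 'hello world\n' | sed -e 's/\([a-z]*\) \([a-z]*\)/\2 \1/')
echo "$swapped"

# Backreference inside the pattern: find a character followed by itself.
doubled=$(printf 'hello\n' | sed -e 's/\(.\)\1/[\1\1]/')
echo "$doubled"
```

Note that the group and interval operators need their backslashes in this syntax; without them, sed treats the parentheses and braces as ordinary characters.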
Special Characters
The special characters in sed are the same as those in grep, with one key difference: the forward slash / is a special character in sed. The reason for this will become very clear when studying sed commands.
How it Works: A Brief Introduction
Sed works as follows: it reads from the standard input, one line at a time. For each line, it executes a series of editing commands, then the line is written to STDOUT. An example shows how this works. We use the s command; s means "substitute", or search and replace.
The format is
s/regular-expression/replacement text/{flags}
We won't discuss all the flags yet. The one we use below is g, which means "replace all matches".
>cat file
I have three dogs and two cats
>sed -e 's/dog/cat/g' -e 's/cat/elephant/g' file
I have three elephants and two elephants
>
OK. So what happened? Firstly, sed read in the line of the file and executed
s/dog/cat/g
which produced the following text:
I have three cats and two cats
and then the second command was performed on the edited line, and the result was
I have three elephants and two elephants
We actually have a name for the "current text": it is called the pattern space. So a precise definition of what sed does is as follows:
sed reads one line of input into the pattern space, performs a sequence of editing commands on the pattern space, then writes the pattern space to STDOUT, and repeats this for each remaining line.
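Because every command works on the pattern space left behind by the previous one, the order of the commands matters. A small sketch, reusing the sentence from the example above (the variable names are only for illustration):

```shell
line='I have three dogs and two cats'

# dog -> cat first, then every cat (old and new) -> elephant.
forward=$(printf '%s\n' "$line" | sed -e 's/dog/cat/g' -e 's/cat/elephant/g')
echo "$forward"

# Same two commands in the opposite order: the cats created by the
# second command are never seen by the first one.
reversed=$(printf '%s\n' "$line" | sed -e 's/cat/elephant/g' -e 's/dog/cat/g')
echo "$reversed"
```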
Getting Started: Substitute and delete Commands
Sed is usually invoked in one of the following ways:
>sed -e 'command1' -e 'command2' -e 'command3' file
>{shell command}|sed -e 'command1' -e 'command2'
>sed -f sedscript.sed file
>{shell command}|sed -f sedscript.sed
So sed can read from a file or STDIN, and the commands can be specified in a file or on the command line.
Note that if the commands are read from a file, trailing whitespace can be fatal: it can cause scripts to fail for no apparent reason. I recommend editing sed scripts with an editor such as vim, which can show end-of-line characters so that you can "see" trailing whitespace at the end of a line.
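If you would rather check a script from the shell than from an editor, something like the following should work. It is only a sketch: the script is written to a temporary file created with mktemp, and grep is used to count lines ending in whitespace.

```shell
# Write a two-command sed script to a temporary file.
script=$(mktemp)
printf '%s\n' 's/dog/cat/g' 's/cat/elephant/g' > "$script"

# Count script lines that end in a space or tab; 0 means the script is clean.
# grep exits non-zero when nothing matches, hence the "|| true".
trailing=$(grep -c '[[:space:]]$' "$script" || true)
echo "lines with trailing whitespace: $trailing"

# Run the script file against a line of input.
from_file=$(printf 'I have three dogs and two cats\n' | sed -f "$script")
echo "$from_file"

rm -f "$script"
```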
The Substitute Command
The format for the substitute command is as follows:
[address1[,address2]]s/pattern/replacement/[flags]
The flags can be any of the following:
n | replace the nth instance of pattern with replacement
g | replace all instances of pattern with replacement
p | write the pattern space to STDOUT if a successful substitution takes place
w file | write the pattern space to file if a successful substitution takes place
If no flags are specified, only the first match on the line is replaced. Note that we will almost always use the s command with either the g flag or no flag at all.
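The effect of the flags is easiest to see side by side. Here is a short sketch on a made-up line; the last command also uses the -n option, which this tutorial does not otherwise cover, to suppress sed's normal output so that only what the p flag prints is shown.

```shell
line='one two one two one'

# No flag: only the first match is replaced.
first=$(printf '%s\n' "$line" | sed -e 's/one/1/')
echo "$first"

# Numeric flag: only the second match is replaced.
second=$(printf '%s\n' "$line" | sed -e 's/one/1/2')
echo "$second"

# g flag: every match is replaced.
all=$(printf '%s\n' "$line" | sed -e 's/one/1/g')
echo "$all"

# p flag with -n: print only the lines where a substitution happened.
printed=$(printf 'cat\ndog\n' | sed -n -e 's/cat/CAT/p')
echo "$printed"
```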
If one address is given, then the substitution is applied only to lines matching that address. An address can be either a regular expression enclosed by forward slashes, /regex/, or a line number. The $ symbol can be used in place of a line number to denote the last line.
If two addresses are given, separated by a comma, then the substitution is applied to every line from the line matching address1 through the line matching address2, inclusive.
This requires some clarification when both addresses are patterns. More precisely, the range starts at the first line matching address1 and ends at the next line matching address2. After that, sed looks for address1 again, and a new range starts at the next line that matches it, and so on. If address2 is never matched, the range runs to the end of the input. Don't worry if this seems confusing (it is); the examples will clarify this.
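The simpler address forms (a line number, the $ symbol, and a numeric range) can be sketched as follows, using a made-up four-line input:

```shell
input=$(printf 'a\nb\nc\nd\n')

# A single line number: only line 2 is changed.
second_only=$(printf '%s\n' "$input" | sed -e '2s/.*/X/')
echo "$second_only"

# $ as the address: only the last line is changed.
last_only=$(printf '%s\n' "$input" | sed -e '$s/.*/X/')
echo "$last_only"

# Two line numbers: lines 2 through 3 are changed.
range=$(printf '%s\n' "$input" | sed -e '2,3s/.*/X/')
echo "$range"
```

The single quotes around the sed command matter here: inside double quotes the shell would try to expand $s as a variable.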
The Delete Command
The delete command is very simple in its syntax: it goes like this
[address1[,address2]]d
It deletes the content of the pattern space. All following commands are skipped (after all, there's very little you can do with an empty pattern space), and a new line is read into the pattern space.
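To see that the commands after a delete really are skipped, consider this small sketch with two made-up input lines. The second command appends a marker to every line it gets to run on.

```shell
# Line 1 does not match /drop/, so it reaches the s command and is marked.
# Line 2 is deleted, so the s command never runs on it and nothing is printed.
result=$(printf 'keep me\ndrop me\n' | sed -e '/drop/d' -e 's/$/ (seen)/')
echo "$result"
```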
Example 1
>cat file
http://pegasus.rutgers.edu/
>sed -e 's@http://pegasus.rutgers.edu/@http://andromeda.rutgers.edu/@' file
http://andromeda.rutgers.edu/
Note that we used a different delimiter, @, for the substitution command. Sed accepts any single character other than a backslash or a newline as the delimiter for the s command; common choices include @ % , ; and :. These alternative delimiters are good for substitutions that include strings such as filenames or URLs, as they make your sed code much more readable.
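For comparison, here is the same substitution on a made-up path written three ways: once with the default / delimiter, where every slash in the text has to be escaped, and twice with alternative delimiters.

```shell
# Default delimiter: each / in the pattern and replacement needs a backslash.
escaped=$(printf '/usr/bin/perl\n' | sed -e 's/\/usr\/bin/\/usr\/local\/bin/')
echo "$escaped"

# Alternative delimiters: no escaping needed.
with_at=$(printf '/usr/bin/perl\n' | sed -e 's@/usr/bin@/usr/local/bin@')
with_bar=$(printf '/usr/bin/perl\n' | sed -e 's|/usr/bin|/usr/local/bin|')
echo "$with_at"
echo "$with_bar"
```

All three commands produce the same output; the last two are simply easier to read.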
Example 2
>cat file
the black cat was chased by the brown dog
>sed -e 's/black/white/g' file
the white cat was chased by the brown dog
That was pretty straightforward. Now we move on to something more interesting.
Example 3
>cat file
the black cat was chased by the brown dog.
the black cat was not chased by the brown dog.
>sed -e '/not/s/black/white/g' file
the black cat was chased by the brown dog.
the white cat was not chased by the brown dog.
In this instance, the substitution is only applied to lines matching the regular expression not. Hence it is not applied to the first line.
Example 4
>cat file
line 1 (one)
line 2 (two)
line 3 (three)
Example 4a
>sed -e '1,2d' file
line 3 (three)
Example 4b
>sed -e '3d' file
line 1 (one)
line 2 (two)
Example 4c
>sed -e '1,2s/line/LINE/' file
LINE 1 (one)
LINE 2 (two)
line 3 (three)
Example 4d
>sed -e '/^line.*one/s/line/LINE/' -e '/line/d' file
LINE 1 (one)
4a: This was pretty simple: we just deleted lines 1 to 2.
4b: This was also pretty simple. We deleted line 3.
4c: In this example, we performed a substitution on lines 1-2.
4d: Now this is more interesting, and deserves some explanation. Firstly, it is clear that lines 2 and 3 get deleted. But let's look closely at what happens to line 1.
First, line 1 is read into the pattern space. It matches the regular expression
^line.*one
So the substitution is carried out, and the resulting pattern space looks like this:
LINE 1 (one)
So now the second command is executed, but since the pattern space no longer matches the regular expression line (the match is case-sensitive), the delete command is not executed.
Example 5
>cat file
hello
this text is wiped out
Wiped out
hello (also wiped out)
WiPed out TOO!
goodbye
(1) This text is not deleted
(2) neither is this ... ( goodbye )
(3) neither is this
hello
but this is
and so is this
and unless we find another g**dbye
every line to the end of the file gets deleted
>sed -e '/hello/,/goodbye/d' file
(1) This text is not deleted
(2) neither is this ... ( goodbye )
(3) neither is this
This illustrates how the addressing works when two pattern addresses are specified. sed finds the first line matching "hello" and deletes every line from there up to and including the next line matching "goodbye". The second "hello" falls inside that range, so it does not start a new one. sed doesn't apply the delete command to any more lines until it comes across the expression "hello" again; the "goodbye" in line (2) is ignored because no range is open at that point. Since the expression "goodbye" is not on any subsequent line, the delete command is applied to all remaining lines.
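The same behaviour can be reproduced on a much smaller input. In this sketch (the input lines are made up), the first input closes its second range and the second input leaves it open:

```shell
# Both ranges are closed by a "goodbye": only the lines outside them survive.
closed=$(printf 'hello\nx\ngoodbye\nkept 1\nhello\ny\ngoodbye\nkept 2\n' |
    sed -e '/hello/,/goodbye/d')
echo "$closed"

# The second range is never closed, so it runs to the end of the input.
open=$(printf 'hello\nx\ngoodbye\nkept 1\nhello\ny\nz\n' |
    sed -e '/hello/,/goodbye/d')
echo "$open"
```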
Manipulating Real Files
So how is all this useful for me? Let's go back to bar to see how this works. There are different naming conventions for atoms within pdb files. One such difference is the delta carbon in isoleucine (ILE). In bar it is currently CD1, but we may want it to be simply CD. To make this change we simply use:
sed "s/CD1 ILE/CD  ILE/" bar > bar2
N.B. There is one space in the pattern and two in the replacement, to keep the columns aligned.
This output was directed to a new file, bar2. To see where the two files differ, use the diff command:
diff bar bar2
and you should just see the ILE lines that you changed. Again, this is a simple example of a powerful tool.
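If you want to try this without a real pdb file, the following sketch runs the same substitution on three made-up, pdb-like lines written to temporary files. The atom records are invented for illustration and are not taken from bar.

```shell
before=$(mktemp)
after=$(mktemp)

# Three invented atom records: only the second one is a CD1 of an ILE.
printf '%s\n' \
    'ATOM      7  CG2 ILE A   1' \
    'ATOM      8  CD1 ILE A   1' \
    'ATOM      9  CD1 LEU A   2' > "$before"

# One space in the pattern, two in the replacement, so columns stay aligned.
sed 's/CD1 ILE/CD  ILE/' "$before" > "$after"

# Count the lines diff reports as changed in the new file.
# diff exits non-zero when the files differ, hence the "|| true".
changed=$(diff "$before" "$after" | grep -c '^>' || true)
echo "changed lines: $changed"

new_line=$(grep 'CD  ILE' "$after")
echo "$new_line"

rm -f "$before" "$after"
```

The LEU line is left alone because the pattern includes the residue name, not just the atom name.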