Psych 290 | Graduate Research Methods:
How to do stuff


grep, sed, and awk

grep, sed, and awk are extraordinarily useful text processing tools. In general, I find it useful to think about Unix programming as a little bit like demonology. There is an element of ritual, a focus on knowing the names of strange beasts that perform arcane services for you, and the possibility that if you screw up all hell can break loose. To this end, we can consider a Unix machine to be nothing but a box of demons, to be summoned and put to work. The shell is our familiar, one who speaks the language of the abyss and interprets our commands. So let's don dubious robes, drape ourselves in occult jewellery, light candles made of suspicious substances, and learn the proper names of some of these beasts...

grep

The grep is a small, rotund, green demon that eats lines of text. Its name comes from the ed editor command g/re/p, which Globally searches for a Regular Expression and Prints the lines that match, and this is the purpose to which it is put by most aspiring demonologists. When applied to a file, it pulls out the lines of text that contain a search phrase, and eats the rest. The command grep word < file will produce all the lines from file that contain word. We need to use quotation marks around search phrases that contain more than one word, or characters that mean something special to the shell.

grep is useful in situations where we want to pull out only some parts of a text file, or text output from another command. For example, we could get all of the subject lines from a file that contains e-mail messages, or just all the dates. It can also be used to count things, by piping to wc -l. wc counts the lines, words, and characters in a document. Using the flag -l just outputs the number of lines. So grep 'Subject:' < file | wc -l would count the number of lines that contain the word Subject.
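As a rough sketch of the counting ritual, assuming a made-up mail file (the name mailfile and its contents are invented for illustration):

```shell
# Build a tiny stand-in for a file of e-mail messages (contents invented).
tmpdir=$(mktemp -d)
printf '%s\n' \
  'From: alice@example.com' \
  'Subject: lab meeting' \
  'See you at 3.' \
  'From: bob@example.com' \
  'Subject: data files' \
  'Attached.' > "$tmpdir/mailfile"

# Pull out only the subject lines; grep eats everything else.
subjects=$(grep 'Subject:' < "$tmpdir/mailfile")
echo "$subjects"

# Count the subject lines by piping grep's output to wc -l.
count=$(grep 'Subject:' < "$tmpdir/mailfile" | wc -l)
echo $count

rm -r "$tmpdir"
```

The first echo shows the two subject lines, and the second shows how many there were.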

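One other convenient trick with grep is the use of the -v flag. -v tells grep to reverse its usual behaviour: now it eats only the lines that contain the search phrase. This is a useful trick for getting rid of junk lines in a file. For example, I often use -1 to denote missing data in my experiments, so I can type grep -v -e '-1' < data > cleandata to remove all of the lines with missing entries. The -e is there because the search phrase itself starts with a hyphen, and grep would otherwise mistake it for a flag. (Be aware that this also removes lines containing things like -10, since they contain -1 too.)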
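Here is a small sketch of the missing-data clean-up, assuming an invented data file in which -1 marks a missing response. The search phrase is handed to grep with -e, since a phrase that begins with a hyphen would otherwise be read as a flag.

```shell
# Invented data: subject number, then two responses; -1 means missing.
tmpdir=$(mktemp -d)
printf '%s\n' \
  '1 350 420' \
  '2 -1 510' \
  '3 298 -1' \
  '4 401 388' > "$tmpdir/data"

# -v reverses grep: lines containing -1 are eaten, the rest survive.
# -e marks what follows as the search phrase, not an option.
grep -v -e '-1' < "$tmpdir/data" > "$tmpdir/cleandata"

clean=$(cat "$tmpdir/cleandata")
echo "$clean"

rm -r "$tmpdir"
```

Only subjects 1 and 4 have complete data, so only their lines end up in cleandata.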

grep is actually the name of a small family of demons. There is also fgrep, which operates just like grep but treats its search phrases as literal text rather than as regular expressions. Given the -f flag, it reads a whole file of potential search phrases, one per line. So if I want to check whether people are reading their e-mail during class, and have the class usernames on unique lines in a file called classlist, I can type w | fgrep -f classlist to get a list of what everybody on Psych is up to, and then pull out the lines that correspond to the class. egrep is a version of grep that understands extended regular expressions, which allow some other nice features, like multiple search phrases specified on the command line. w | egrep 'gruffydd|jmcg' would tell you what both I and Julie are doing on Psych - the inclusion of the | in the search phrase means that we want lines containing what comes before it, or what comes after it. On current systems the same two demons answer to grep -F and grep -E.
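A sketch of both family members at work, using grep -F and grep -E (the standard equivalents of fgrep and egrep). Since the output of w depends on who is logged in, a small invented file stands in for it here; the extra usernames and the commands shown are made up.

```shell
tmpdir=$(mktemp -d)

# Invented stand-in for the output of w: username, terminal, current command.
printf '%s\n' \
  'gruffydd pts/0 pine' \
  'jmcg     pts/1 emacs notes.txt' \
  'student1 pts/2 pine' \
  'visitor  pts/3 ls' > "$tmpdir/who"

# Class usernames, one per line (hypothetical names).
printf '%s\n' 'student1' 'student2' > "$tmpdir/classlist"

# -f reads the search phrases from classlist; -F treats them as literal text.
inclass=$(grep -F -f "$tmpdir/classlist" < "$tmpdir/who")
echo "$inclass"

# -E allows | inside the phrase: lines containing either name are kept.
both=$(grep -E 'gruffydd|jmcg' < "$tmpdir/who")
echo "$both"

rm -r "$tmpdir"
```

On a real machine you would replace the stand-in file with a pipe from w, as in w | grep -F -f classlist.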

Things to do with grep:

sed

sed is a slimy purple beast with many tentacles and several eyes. It gets its name from the fact that it is a Stream EDitor. Like grep, it acts on one line at a time. However, sed doesn't eat any lines (most of the time). Instead, it changes things.

There are all sorts of things that sed can be told to do. Perhaps the most useful are deletions and substitutions. sed -e 1d < file will delete the first line of a file. In general, you can substitute any line number for the 1. Substitutions are specified as sed -e s/a/b/ < file, where a is what you want to replace, and b is what you want to replace it with. This replaces only the first a on each line; to replace every a, add a g to the end, as in s/a/b/g.
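To make deletion and substitution concrete, here is a minimal sketch on an invented data file with a header line, in which NA marks a missing value (both the file and the NA convention are assumptions for the example).

```shell
tmpdir=$(mktemp -d)
printf '%s\n' \
  'subject rt1 rt2' \
  's1 NA NA' \
  's2 NA 410' > "$tmpdir/file"

# Delete the first line (the header).
nohead=$(sed -e 1d < "$tmpdir/file")
echo "$nohead"

# Substitute: only the first NA on each line is replaced.
first=$(sed -e 's/NA/-1/' < "$tmpdir/file")
echo "$first"

# With a trailing g, every NA on each line is replaced.
all=$(sed -e 's/NA/-1/g' < "$tmpdir/file")
echo "$all"

rm -r "$tmpdir"
```

Comparing the last two outputs on the s1 line shows the difference the g makes.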

If you want to do multiple replacements on a given file, it's best to write a little sed program. This is just a file with one sed command on each line. So we could have a program that does several substitutions, and looks like

s/a/b/
s/c/d/
s/e/f/
s/g/h/
Then to run this program, we type sed -f program < file, where our program is saved in a file called program.
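A sketch of running that four-line program, with an invented two-line input file. Each command is applied in order to every line, and each plain s command touches only the first match on the line.

```shell
tmpdir=$(mktemp -d)

# The sed program from the text, one command per line.
printf '%s\n' 's/a/b/' 's/c/d/' 's/e/f/' 's/g/h/' > "$tmpdir/program"

# Invented input.
printf '%s\n' 'ace' 'egg cage' > "$tmpdir/file"

# Run the program over the file.
result=$(sed -f "$tmpdir/program" < "$tmpdir/file")
echo "$result"

rm -r "$tmpdir"
```

The second line of output still contains a g and an e, because only the first of each was replaced.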

To find out more about writing programs for sed, and all the other things of which it is capable, type man sed, or consult one of the many webpages that describe it.

Things to do with sed

awk

awk is a somewhat rulebound little blue man, with several legs. It runs through data checking if particular rules are satisfied, and performing certain actions when they are. Just as with sed, it is often easiest to write small programs that give awk a set of instructions, although short instructions can be given on the command line.

The basic format of instructions to awk is condition { action } . You write a condition, and then enclose the corresponding action in curly brackets immediately afterwards. awk then runs through the file, line by line, and checks whether each condition is satisfied. For those conditions that are satisfied, it performs the corresponding action. If you don't specify a condition, and just give an action in curly brackets, that action is performed on every line. If you don't specify an action, the default action is just to print that line.

awk assumes that data are stored in columns. Columns are specified using $ signs. Column 1 is referred to as $1, column 2 as $2, and so forth. The conditions that you specify to awk can thus correspond to the values of particular columns. For example, we could write a program for awk that reads

$1==1 {print $2}
$1==0 {print $3}
and save it in a file called program.awk. Then when we type awk -f program.awk < data, awk will run through the data and print the value in the second column of a line if the first column contains a 1, and the value in the third column if the first column contains a 0.
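Here is a small sketch of that program in action, on an invented data file. Note that a line whose first column is neither 1 nor 0 satisfies neither condition, so nothing is printed for it.

```shell
tmpdir=$(mktemp -d)

# The awk program from the text.
printf '%s\n' '$1==1 {print $2}' '$1==0 {print $3}' > "$tmpdir/program.awk"

# Invented data with three columns.
printf '%s\n' \
  '1 10 20' \
  '0 30 40' \
  '1 50 60' \
  '2 70 80' > "$tmpdir/data"

out=$(awk -f "$tmpdir/program.awk" < "$tmpdir/data")
echo "$out"

rm -r "$tmpdir"
```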

We can use awk to do things like change the way that data files are set up. I usually enter my data so that I have a single subject on each line of the data file, with their responses to different questions in separate columns. To analyse my data, it is more convenient to have each response on a separate line, with the other columns coding for subject, condition, and so forth. I can do this with awk.
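One way this reshaping might look, as a sketch only: it assumes a layout in which column 1 is the subject, column 2 is the condition, and every remaining column is a response. That layout and the data are invented for the example; NF is awk's built-in count of the columns on the current line.

```shell
tmpdir=$(mktemp -d)

# Invented wide data: subject, condition, then one response per question.
printf '%s\n' \
  's1 A 3 5 2' \
  's2 B 4 1 6' > "$tmpdir/wide"

# For every response column, print subject, condition, question number, response.
long=$(awk '{ for (i = 3; i <= NF; i++) print $1, $2, i - 2, $i }' < "$tmpdir/wide")
echo "$long"

rm -r "$tmpdir"
```

Each input line becomes three output lines, one per response.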

awk also provides a convenient way of pulling specific lines from a data file. If my data are organised in columns, and I'm interested just in those cases where one of the columns has a particular value (say the second column having the value 3), I can type awk '$2==3' < data | more. If I wanted those cases where the value is greater than 3, I would type awk '$2>3' < data | more. I can also take those not having a 3, '$2!=3', or those having a 3 or a 1, '($2==3)||($2==1)'. Finally, we can also combine conditions, so that we take only those that have a 3 in the second column and a 2 in the first column, '($2==3)&&($1==2)'. Notice that since these are all short sets of instructions, I've specified them on the command line.
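The conditions above can be tried out on a small invented data file, as in this sketch. The single quotes keep the shell from interpreting the $ signs itself.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' \
  '2 3 a' \
  '1 3 b' \
  '2 1 c' \
  '2 5 d' > "$tmpdir/data"

# Second column equal to 3.
eq=$(awk '$2==3' < "$tmpdir/data")
# Second column greater than 3.
gt=$(awk '$2>3' < "$tmpdir/data")
# Second column anything but 3.
ne=$(awk '$2!=3' < "$tmpdir/data")
# Second column 3 or 1.
either=$(awk '($2==3)||($2==1)' < "$tmpdir/data")
# Second column 3 and first column 2.
both=$(awk '($2==3)&&($1==2)' < "$tmpdir/data")

echo "$eq"; echo "$gt"; echo "$ne"; echo "$either"; echo "$both"

rm -r "$tmpdir"
```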

If you're going to do anything creative with awk, it's worth knowing about two special conditions that you can use when giving instructions. The instruction BEGIN { command } executes command before awk reads any of the file, and the instruction END { command } executes command after the last line has been read. We can also assign values to variables in commands, using any text string to denote a variable. So if we wanted an awk program that would add up a string of numbers, we would set up a file called something like sum.awk that contains

BEGIN {sum=0}
{sum=sum+$1}
END {print sum}

Obviously there are all sorts of more complicated things you could consider doing, such as introducing conditions and printing the sum every time a condition is satisfied, or only adding to it when a different condition is satisfied.
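As a sketch of the summing program and two of its possible variations, on an invented column of numbers (the threshold of 10 is an arbitrary choice for illustration):

```shell
tmpdir=$(mktemp -d)

# The summing program from the text.
printf '%s\n' 'BEGIN {sum=0}' '{sum=sum+$1}' 'END {print sum}' > "$tmpdir/sum.awk"

# Invented numbers, one per line.
printf '%s\n' 4 8 15 16 23 42 > "$tmpdir/numbers"

# Plain total of the first column.
total=$(awk -f "$tmpdir/sum.awk" < "$tmpdir/numbers")
echo "$total"

# Variation: only add to the sum when the number is greater than 10.
big=$(awk 'BEGIN {sum=0} $1>10 {sum=sum+$1} END {print sum}' < "$tmpdir/numbers")
echo "$big"

# Variation: add every number, but print the running sum when it is greater than 10.
running=$(awk '{sum=sum+$1} $1>10 {print sum}' < "$tmpdir/numbers")
echo "$running"

rm -r "$tmpdir"
```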

One other kind of condition that comes in really handy is the length of a line. We can type awk 'length > 0' to filter blank lines from a file, or otherwise condition on the length of a line measured in characters. To condition on the number of columns in a line instead, use the built-in variable NF, as in awk 'NF == 3'.
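A short sketch of conditioning on length, using an invented file. In awk, length counts the characters on a line, so length > 0 keeps every non-blank line, while length > 1 would also drop lines holding a single character. The number of columns lives in the built-in variable NF.

```shell
tmpdir=$(mktemp -d)

# Invented file with two blank lines and one single-character line.
printf '%s\n' 'a b c' '' 'x' '' 'd e' > "$tmpdir/file"

# Keep lines with at least one character (drops the blank lines only).
nonblank=$(awk 'length > 0' < "$tmpdir/file")
echo "$nonblank"

# Keep lines longer than one character (also drops the lone x).
longer=$(awk 'length > 1' < "$tmpdir/file")
echo "$longer"

# Keep lines with exactly three columns.
three=$(awk 'NF == 3' < "$tmpdir/file")
echo "$three"

rm -r "$tmpdir"
```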

Things to do with awk:

Worked example: Lera's PsyScope data
