Awk is a small but capable programming language for processing text. It was developed by Aho, Weinberger, and Kernighan at Bell Labs.
Julia Evans made an awesome intro to awk:
Awk scans its input as a sequence of lines and splits each line into fields. The field separator defaults to whitespace, but you can change it to any character.
An awk program is a sequence of pattern-action pairs: for each input line, awk checks whether the line matches a pattern and, if it does, performs the associated action on that line. Awk can be used interactively or to run saved programs.
Here is what Awk does written in Python-like pseudocode:
initialize()                              # initialize variables in the BEGIN block
for line in input_lines:                  # awk divides the input into a list of lines
    for condition, action in conditions:  # a program is a list of condition-action pairs
        if condition(line):               # match the line against the condition
            action()                      # perform the action on a match
Here are some small snippets of Awk:
1. Hello World!
You can run awk programs inline or from a file:
awk 'BEGIN{ print "Hello, World!"}'
Alternatively, you can save this to a file hello.awk:
BEGIN{ print "Hello, World!"}
Then run it as awk -f hello.awk
2. Reading a CSV and printing a specific column
Let's now do something useful! Download this CSV, which contains 2010 census data by zip code for the city of Los Angeles.
Read the first 3 lines of the CSV: head -3 2010_Census_Populations_by_Zip_Code.csv
Zip Code,Total Population,Median Age,Total Males,Total Females,Total Households,Average Household Size
91371,1,73.5,0,1,1,1
90001,57110,26.6,28468,28642,12971,4.4
We will print just the Total Population column using awk -F, '{print $2}' 2010_Census_Populations_by_Zip_Code.csv
The -F, option sets the field separator to a comma, since the fields in a CSV file are separated by commas. $n refers to the value in the nth column.
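For example (a small sketch along the same lines, not taken verbatim from the article), you can print several fields at once by listing them:
awk -F, '{print $1, $2}' 2010_Census_Populations_by_Zip_Code.csv
This prints the zip code and its total population, separated by a space (awk's default output field separator).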
3. Computing some statistics
Awk allows the use of variables and functions. Let's see how to use them by computing the total population in the entire city.
# total.awk
{s += $2}
END {print "Total population:", s}
Variables are initialized to 0 by default. Here, we use a variable s to hold the running total.
Running this script as awk -F, -f total.awk 2010_Census_Populations_by_Zip_Code.csv, we get the output: Total population: 10603988
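Awk also supports user-defined functions, which we have not needed yet. As a minimal sketch (the file name and function name here are just illustrative, not from the article), the same total can be reported in millions:
# millions.awk
function to_millions(x) { return x / 1000000 }
{ s += $2 }
END { print "Total population (millions):", to_millions(s) }
Running awk -F, -f millions.awk 2010_Census_Populations_by_Zip_Code.csv should print roughly 10.6.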
Special variables and built-in functions
Awk provides some special variables and built-in functions to make your programs more compact (a quick example follows the list):
- NF: the number of fields in the current line
- NR: the current line number
- $0: the entire input line
- length: the number of characters in a string
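As a quick illustration (a sketch, not from the original), this one-liner prints the line number, field count, and character count of every line:
awk -F, '{print NR, NF, length($0)}' 2010_Census_Populations_by_Zip_Code.csv
For this CSV, every line should report 7 fields.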
Now we will compute the average household size, which is the total population divided by the total number of households. The columns of interest are $2 and $6. We also want the average population per zip code. Here is our script:
# stats.awk
{ s += $2; h += $6;}
END {print "Total population:", s, "\nTotal households:", h, "\nAverage household size:", s/h, "\nAverage population per zip code:", s/NR}
NR gives us the total number of lines, but we do not want the header line counted. We can use the tail command to skip the first line with tail -n +2. Running tail -n +2 2010_Census_Populations_by_Zip_Code.csv | awk -F, -f stats.awk gives us:
Total population: 10603988
Total households: 3497698
Average household size: 3.0317
Average population per zip code: 33241.3
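As an alternative to piping through tail (a sketch, not what the article does), you could skip the header inside awk itself by guarding the actions with NR > 1 and dividing by NR - 1 at the end:
# stats_noheader.awk
NR > 1 { s += $2; h += $6 }
END {print "Total population:", s, "\nTotal households:", h, "\nAverage household size:", s/h, "\nAverage population per zip code:", s/(NR-1)}
This version is run directly on the file, without the tail pipeline.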
4. Pattern matching
We have done some useful things with awk so far, but we have ignored its biggest strength: pattern matching. We can match on field values, regular expressions, and line numbers (a regex sketch follows the examples below).
- Print every 2nd line: NR % 2 == 0 {print $0}. Here $0 stands for the entire line.
- Print all zip codes with population > 100,000: $2 > 100000 {print $1}
- Print all zip codes with population > 10,000 and average household size > 4: $2 > 10000 && $7 > 4 {print $1}. We can combine conditions using && and ||, which stand for logical and and or respectively.
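Regex patterns work the same way (a small sketch, not from the article): the ~ operator matches a field against a regular expression, and !~ negates the match. For example, to print the population of every zip code starting with 902:
awk -F, '$1 ~ /^902/ {print $1, $2}' 2010_Census_Populations_by_Zip_Code.csv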
Further reading
There is a lot more to Awk. Here are some references:
The best resource for learning Awk is The AWK Programming Language, written by the same trio. This book goes above and beyond a typical programming-language tutorial and teaches you how to use your Awk superpowers to build versatile systems like a relational database, a parser, an interpreter, etc.
The GNU Awk manual, Effective AWK Programming, is a thorough reference.
Top comments (15)
Nice intro/overview! Awk is great, I've used it a lot ... I remember I often used it to read/parse log files and then generate SQL using Awk, in order to perform database changes. It's very flexible for this kind of extract/transform/output work.
Some Awk one-liners replace my use of other Unix tools like cut, grep, join. Some people use perl as a better awk, but I prefer the simplicity of Awk.
Awk is extremely flexible and easy to use. Can indeed replace cut/grep/join if you want to. And perl, never got into it, too complicated and cryptic, I prefer the C/javascript-like syntax of Awk.
Awesome Article! Great Hands on tutorial ❤
Best awk intro/summary I have read, thank you.
Output formatting becomes much better when you ditch print in favor of printf.
That was a fun introduction. Glad to add that as a tool in my toolbelt.
Nice intro, good pointers. Thanks!
The picture is enough for me, thanks the author!
Interesting! Just wondering why a whole new language for a feature? (Not to sound critical.)
Are there any specialised optimizations specific to file I/O and parsing the file at a lower level? If so, it would be great to have it also as a wrapper for other languages. Any benchmarking?
🙂🙂
Hi Ishani, awk is a very old language (1977), predating scripting languages like perl and python. As part of the Unix philosophy, it is used in combination with other Unix tools. It is simpler and faster to write than, say, a python script. Most awk uses are simple one-liners to extract particular columns. It is indeed very fast, as all it does for each line is match it against the patterns and run the matching actions.
There was this famous article which showed that clever use of command-line tools can be several times faster than some big-data tools.
Very helpful! Thanks!
LOVE AWK!!!!!!!!!!!