awk

linux

Important:
http://www.grymoire.com/Unix/Awk.html - done reading
http://www.ibm.com/developerworks/library/l-awk1/ - done looking through
http://www.staff.science.uu.nl/~oostr102/docs/nawk/nawkA4.pdf - the awk manual
http://www.vectorsite.net/tsawk.html - done looking through

http://www.hcs.harvard.edu/~dholland/computers/awk.html - done looking through
https://quickleft.com/blog/command-line-tutorials-sed-awk/ - done looking through
http://www.theunixschool.com/p/awk-sed.html - done looking through
https://www.digitalocean.com/community/tutorials/how-to-use-the-awk-language-to-manipulate-text-in-linux - done looking through
https://www.cse.iitb.ac.in/~br/courses/cs699-autumn2013/refs/awk-tutorial.html - done looking through

What is awk?

It is an excellent filter and report writer. Many UNIX utilities generate rows and columns of information. AWK is an excellent tool for processing these rows and columns, and it is easier to use than most conventional programming languages. It can be considered a pseudo-C interpreter, as it understands the same arithmetic operators as C. AWK also has string manipulation functions, so it can search for particular strings and modify the output. AWK also has associative arrays, which are incredibly useful, and are a feature many programming languages lack. Associative arrays can make a complex problem a trivial exercise.

Awk supports a range of mathematical functions: cos, sin, atan2, log, exp, sqrt, rand, srand and int. There is no built-in tan; use sin(x)/cos(x) instead.

What is the basic structure of awk?

The essential organization of an AWK program follows the form:

pattern { action }

The pattern specifies when the action is performed. Like most UNIX utilities, AWK is line oriented. That is, the pattern specifies a test that is performed with each line read as input. If the condition is true, then the action is taken. The default pattern is something that matches every line. This is the blank or null pattern.
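A minimal sketch of the pattern { action } form, run from the shell. The sample data and the function names are made up for illustration. It shows a pattern with an action, a pattern alone (the default action prints the line), and an action alone (the empty pattern matches every line).

```shell
# Hypothetical sample data: a name and a score, separated by whitespace.
sample() { printf 'alice 90\nbob 55\ncarol 72\n'; }

# Pattern and action: print the name when the score is above 60.
pattern_and_action() { sample | awk '$2 > 60 { print $1 }'; }

# Pattern only: the default action prints the whole line.
pattern_only() { sample | awk '$2 > 60'; }

# Action only: the empty pattern matches every line.
action_only() { sample | awk '{ print $1 }'; }

pattern_and_action
pattern_only
action_only
```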

What are some useful one-liners?

kill -9 `ps -ef | awk '$1 == "chavez" {print $2}'` # kill every process owned by chavez (with ps -ef, field 1 is the user and field 2 is the PID)
awk '{ print $1 }' # print the first field of each line read from standard input
awk '{ print }' /etc/passwd # print each line of /etc/passwd
awk '{ print $0 }' /etc/passwd # same thing: $0 is the whole line
awk -F":" '{ print $1 $3 }' /etc/passwd # use a different input field separator; the two fields are printed with nothing between them
awk -F":" '{ print $1 " " $3 }' /etc/passwd # same, with a space between the two fields
awk -f myscript.awk myfile.in # run the program stored in myscript.awk on myfile.in
/foo/ { print } # a regular expression pattern selects lines, much like an address in sed
$1 == "fred" { print $3 } # a condition can be used as a pattern
$5 ~ /root/ { print $3 } # if the fifth field contains "root", print the third field
awk '{ print $2, $1 }' file # print the first two fields in opposite order
awk 'length > 72' file # print lines that are longer than 72 characters
awk '{print length($2)}' file # print the length of the string in the 2nd column
awk '{ for (i = NF; i > 0; --i) print $i }' file # print the fields in reverse order, one per line
awk '{print NR, $0}' file # add line numbers
awk '{$2 = ""; print}' file # print every line after erasing the second field
awk 'END { print }' file # print the last line of a file (POSIX awk keeps $0 in END; very old versions may not), or just use tail -n 1
awk '/root/' ~/temp.txt # print all lines that match root. awk is similar to sed here. The default action in awk is to print the entire line.
awk '$9 ~ /^\./ {gsub(/jessica/, "nicelady"); print;}' ~/temp.txt # if the 9th field starts with a dot, replace every "jessica" on the line with "nicelady", then print the line
awk '/UUID/' /etc/fstab # print every line that contains the pattern

What is the purpose of the BEGIN pattern?

The BEGIN pattern is a special keyword / pattern. Its purpose is to specify actions to be taken before the first line is read.

What is the purpose of the END pattern?

The END pattern is a special keyword / pattern. Its purpose is to specify actions to be taken after the last line is read.

What does a bare-bones awk program look like?

BEGIN { print "START" }
      { print         }
END   { print "STOP"  }
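As a sketch of why BEGIN and END are useful together, the following sums the first column of its input. The function name sum_column and the input numbers are made up for illustration.

```shell
# BEGIN runs before any input is read, the middle rule runs once per line,
# and END runs after the last line.
sum_column() {
    awk 'BEGIN { total = 0 }
               { total += $1 }
         END   { print "total:", total }'
}

printf '10\n20\n12\n' | sum_column
```

The BEGIN rule is optional here, since an unset awk variable already counts as 0 in arithmetic.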

How is awk different from sed or other utilities?

AWK is designed to process lines that contain fields separated by separators.

Names such as "$8" and "$3" look like the positional parameters of a shell script. However, instead of the eighth and third argument, they mean the eighth and third field of the input line. You can think of a field as a column, and the action you specify operates on each line or row read in.

AWK does not evaluate variables within strings.

In scripting languages like Perl and the various shells, a dollar sign means the word following is the name of the variable. Awk is different. The dollar sign means that we are referring to a field or column in the current line.

BEGIN { x=5 }
{ print x, $x}

The above code prints the number 5 and the fifth field (column) of each input line.

How can we do initialization?

BEGIN { x=5 }
{ print x, $x}

How can we make a script?

As an awk script that is made executable and run directly:

#!/bin/awk -f
BEGIN { print "File\tOwner" }
{ print $8, "\t", $3}
/startPattern/,/endPattern/ { ... }
END { print " - DONE -" }

Or as a similar program wrapped in a shell script:

#!/bin/sh
# Linux users have to change $8 to $9
awk '
BEGIN { print "File\tOwner" }
{ print $8, "\t", $3}
END { print " - DONE -" }
'

How can we use variables?

variableName = arithmetic_expression

How can we do math with awk?

Awk supports a full range of arithmetic operators: +, -, *, /, %, ++, --, +=, -=, *=, /=, %=

What are the comparison operators?

==    Is equal
!=    Is not equal to
>    Is greater than
>=    Is greater than or equal to
<    Is less than
<=    Is less than or equal to

These operators are the same as the C operators. They can be used to compare numbers or strings. With respect to strings, lower case letters are greater than upper case letters (in the C locale, where comparison follows ASCII order).
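A small sketch of the difference between comparing numbers and comparing strings. The function name compare_demo is made up, and LC_ALL=C is set as an assumption so that string order follows ASCII.

```shell
compare_demo() {
    LC_ALL=C awk 'BEGIN {
        num = (10 < 9)        # numeric comparison: false, so 0
        str = ("10" < "9")    # string comparison: "1" sorts before "9", so 1
        cas = ("a" > "A")     # in the C locale, lower case sorts after upper case
        print num, str, cas
    }'
}

compare_demo
```

A comparison yields 1 for true and 0 for false, so the result can be stored or printed like any other number.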

Does awk support regular expression?

Yes. Two operators are used to compare strings to regular expressions:

~    Matches
!~    Doesn't match

The regular expression must be enclosed by slashes, and comes after the operator. AWK supports extended regular expressions.

word !~ /START/
lawrence_welk ~ /(one|two|three)/

What are the boolean operators?

&&    And
||    Or
!    Not

Does awk support loop and conditional branching?

Yes.

if ( conditional ) statement [ else statement ]
while ( conditional ) statement
for ( expression ; conditional ; expression ) statement
for ( variable in array ) statement
break
continue
{ [ statement ] ...}
variable=expression
print [ expression-list ] [ > expression ]
printf format [ , expression-list ] [ > expression ]
next
exit

How can we use the while loop construct?

i=1;
while (i <= 10) {
        printf "The square of %d is %d\n", i, i*i;
        i = i+1;
}

How can we use the for loop construct?

for (i=1; i <= 10; i++) {
        printf "The square of %d is %d\n", i, i*i;
}

What is the definition of positional variable?

A positional variable, such as $1 or $2, represents a particular field or column in the input. Positional variables start with a dollar sign.

What is the definition of user-defined variable?

User-defined variables are the ones that you define yourself, such as:

x=5;

What is $0?

The variable "$0" refers to the entire line that AWK reads in.

Can we modify the value of positional variables?

Yes.

What happens if we assign an empty string to a positional variable?

The actual number of fields does not change. Setting a positional variable to an empty string does not delete the variable. It's still there, but its contents have been deleted.
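A quick sketch that makes the empty field visible. The function name blank_field and the sample line are made up; OFS is set to a colon only so that the empty slot shows up in the output.

```shell
# Blank out field 2, then print the field count and the rebuilt line.
blank_field() {
    printf 'a b c\n' | awk 'BEGIN { OFS = ":" } { $2 = ""; print NF; print }'
}

blank_field
```

NF is still 3, and the rebuilt line has two separators in a row where the second field used to be.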

How can we change the input separator character?

AWK can be used to parse many system administration files. However, many of these files do not have whitespace as a separator. As an example, the password file uses colons. You can easily change the field separator character to be a colon using the "-F" command line option.

awk -F: '{if ($2 == "") print $1 ": no password!"}' </etc/passwd

You can also set the input field separator inside the script with the "FS" variable:

#!/bin/awk -f
BEGIN {
    FS=":";
}
{
    if ( $2 == "" ) {
        print $1 ": no password!";
    }
}

Can we use a string as an input field separator?

Yes. When the input field separator is more than one character long, AWK treats it as a regular expression. If you specify:

FS=": ";

then AWK will split a line into fields wherever it sees those two characters, in that exact order. Some old versions of AWK could not do this on the command line, but POSIX-compatible versions (nawk, gawk, mawk) accept a multi-character separator after "-F" as well.
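A minimal sketch of a two-character separator set inside the program. The function name value_of and the sample "key: value" lines are made up for illustration.

```shell
# Split each line on a colon followed by a space, and print the second field.
value_of() {
    awk 'BEGIN { FS = ": " } { print $2 }'
}

printf 'name: alice smith\nrole: admin\n' | value_of
```

Because the separator is the exact pair ": ", the space inside "alice smith" does not split the value.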

Inside an awk script, how frequent can we change the input field separator?

You can change the field separator character as many times as you want while reading a file. Well, at most once for each line. You can even change it depending on the line you read.

Suppose you had the following file, which contains the numbers 1 through 7 in three different formats. Lines 4 through 6 have colon-separated fields, while the others are separated by spaces.

ONE 1 I
TWO 2 II
#START
THREE:3:III
FOUR:4:IV
FIVE:5:V
#STOP
SIX 6 VI
SEVEN 7 VII

The AWK program can easily switch between these formats:

#!/bin/awk -f
{
    if ($1 == "#START") {
        FS=":";
    } else if ($1 == "#STOP") {
        FS=" ";
    } else {
        #print the Roman number in column 3
        print $3
    }
}
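The same program can be tried from the shell. This sketch feeds it the sample file through printf; the function name roman_column is made up.

```shell
# Print the third column, switching FS when the marker lines are seen.
roman_column() {
    awk '{
        if ($1 == "#START")     { FS = ":" }
        else if ($1 == "#STOP") { FS = " " }
        else                    { print $3 }
    }'
}

printf 'ONE 1 I\nTWO 2 II\n#START\nTHREE:3:III\nFOUR:4:IV\nFIVE:5:V\n#STOP\nSIX 6 VI\nSEVEN 7 VII\n' | roman_column
```

A new FS value applies from the next line read, which is why the switch happens on the marker lines themselves.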

What is the default value for the output field separator?

Space.

How can we change the output field separator?

Normally this is a space, but you can change this by modifying the variable "OFS".

OFS=":";

What does the print command do if we do not specify any value?

It prints $0, which is the entire line.

How does the print command behave if we specify multiple variables with commas and without commas?

There is an important difference between

print $2 $3

and

print $2, $3

The first example prints out one field, and the second prints out two fields. In the first case, the two positional parameters are concatenated together and output without a space. In the second case, AWK prints two fields, and places the output field separator between them.
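A short sketch of the two forms side by side. The function name print_forms and the sample line are made up; OFS is set to a dash so the separator is easy to see.

```shell
# First print: concatenation, no separator. Second print: OFS between the fields.
print_forms() {
    printf 'x y z\n' | awk 'BEGIN { OFS = "-" } { print $2 $3; print $2, $3 }'
}

print_forms
```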

What is the NF variable?

The Number of Fields variable. It is useful to know how many fields are on a line. You may want to have your script change its operation based on the number of fields. As an example, the command "ls -l" may generate eight or nine fields, depending on which version you are executing. The System V version, "/usr/bin/ls -l", generates nine fields, which is equivalent to the Berkeley "/usr/ucb/ls -lg" command. If you wanted to print the owner and filename, then the following AWK script would work with either version of "ls":

#!/bin/awk -f
# parse the output of "ls -l"
# print owner and filename
# remember - Berkeley ls -l has 8 fields, System V has 9
{
    if (NF == 8) {
        print $3, $8;
    } else if (NF == 9) {
        print $3, $9;
    } 
}

What happens if we prefix the NF variable with a dollar sign?

Don't forget the variable can be prepended with a "$". Since NF holds the number of fields, "$NF" is the last field, so this prints the last field of any line.

#!/bin/awk -f
{ print $NF; }

What is the limit on the number of fields per single line?

Old versions of AWK had a limit of 99 fields in a single line. Modern implementations such as gawk have no such fixed limit.

What is the purpose of the NR built-in variable?

Number of Records. This tells you the number of records, or the line number. You can use AWK to only examine certain lines. This example prints lines after the first 100 lines, and puts a line number before each line after 100:

#!/bin/awk -f
{
    if (NR > 100) {
        print NR, $0;
    }
}

What is the purpose of the RS built-in variable?

Record Separator. Normally, AWK reads one line at a time, and breaks up the line into fields. You can set the "RS" variable to change AWK's definition of a "line". If you set it to an empty string, then AWK reads in "paragraph mode": each record is a block of lines, and records are separated by one or more blank lines. A file with no blank lines is therefore read as a single record. You can combine this with changing the "FS" variable.

This example treats each line as a field, and prints out the second and third line of each record:

#!/bin/awk -f
BEGIN {
# change the record separator from newline to the empty string
# (records are now separated by blank lines)
    RS=""
# change the field separator from whitespace to newline
    FS="\n"
}
{
# print the second and third line of the record
    print $2, $3;
}

On old versions of AWK with a limit of 99 fields per record, this only works if each record is less than 100 lines, so there the technique is limited.

You can also use this technique to break words up, one word per line, using this:

#!/bin/awk -f
BEGIN {
    RS=" ";
}
{
    print ;
}

But this only works if all of the words are separated by a single space. If there is a tab or punctuation between them, it does not work.

What is the purpose of the ORS built-in variable?

The default output record separator is a newline, like the input. This can be set to be a newline and carriage return, if you need to generate a text file for a non-UNIX system.

#!/bin/awk -f
# this filter adds a carriage return to all lines
# before the newline character
BEGIN {    
    ORS="\r\n"
}
{ print }

What is the purpose of the FILENAME built-in variable?

It tells you the name of the file being read.

How can I use associative arrays?

username[i]
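The index of an associative array can be any string. A minimal sketch, with made-up user names and paths and a hypothetical function name assoc_demo, that stores values, looks one up, and tests for a missing index with the "in" operator:

```shell
assoc_demo() {
    awk 'BEGIN {
        home["root"] = "/root"
        home["fred"] = "/home/fred"
        user = "fred"
        # "in" tests for an index without creating it
        if (user in home) print user, home[user]
        if (!("nobody" in home)) print "nobody: not set"
    }'
}

assoc_demo
```

Merely referring to home["nobody"] would create that element with an empty value, which is why the "in" test is used here.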

How can we use the for in construct?

#!/bin/awk -f
{
    username[$3]++;
}
END {
    for (i in username) {
        print username[i], i;
    }
}

There is one minor problem with associative arrays, especially if you use the for command to output each element: you have no control over the order of output. You can create an algorithm to generate the indices to an associative array, and control the order this way.

How can we implement multi-dimensional array?

You can put anything in the index of an associative array.

a[1 "," 2] = y;

The AWK string concatenation operator is the space. It combines the three strings into the single string "1,2". Then it uses it as an index into the array. That's all there is to it.
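A sketch of both ways to fake two dimensions. The first array uses the hand-built "1,2" index described above. The second uses the a[i,j] form, which is not covered in these notes: in POSIX awk it joins the subscripts with the built-in SUBSEP variable, so the index can be split apart again. The function name grid_demo is made up.

```shell
grid_demo() {
    awk 'BEGIN {
        a[1 "," 2] = "x"                  # index is the string "1,2"
        if ("1,2" in a) print "found 1,2"

        b[1,2] = "y"                      # index is 1 SUBSEP 2
        for (k in b) {
            split(k, idx, SUBSEP)         # recover the two subscripts
            print idx[1], idx[2], b[k]
        }
    }'
}

grid_demo
```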

Does awk have user-defined functions?

The original awk does not support user-defined functions, but nawk and gawk do.

What is the purpose of the printf function?

The printf is very similar to the C function with the same name. C programmers should have no problem using printf function. Printf has one of these syntactical forms:

printf ( format);
printf ( format, arguments...);
printf ( format) >expression;
printf ( format, arguments...) > expression;

The first argument to the printf function is the format. This is a string, or variable whose value is a string. This string, like all strings, can contain special escape sequences to print control characters.

The escape sequences for the printf format:

\a    ASCII bell (NAWK/GAWK only)
\b    Backspace
\f    Formfeed
\n    Newline
\r    Carriage Return
\t    Horizontal tab
\v    Vertical tab (NAWK only)
\ddd    Character (1 to 3 octal digits) (NAWK only)
\xdd    Character (hexadecimal) (NAWK only)
\<Any other character>    That character

The format specifiers:

%c    ASCII Character
%d    Decimal integer
%e    Floating Point number (scientific notation)
%f    Floating Point number (fixed point format)
%g    The shorter of e or f, with trailing zeros removed
%o    Octal
%s    String
%x    Hexadecimal
%%    Literal %

What are the differences between the print function and the printf function beside the formatting?

The print function terminates the line with the ORS character, and separates each field with the OFS separator. The printf function adds nothing that you do not specify in the format. Therefore you will frequently end each format with the newline character "\n", and you must specify the separating characters explicitly.
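A small sketch contrasting the two. The function name print_vs_printf and the values are made up for illustration.

```shell
print_vs_printf() {
    awk 'BEGIN {
        OFS = ":"
        print "a", "b"                        # OFS between items, ORS at the end
        printf "%s %s", "a", "b"              # no separator or newline is added
        printf "\n"                           # so the newline is written explicitly
        printf "%-5s|%3d|%.2f\n", "ab", 7, 3.14159
    }'
}

print_vs_printf
```

The last line shows a left-justified string in a width of 5, an integer right-justified in a width of 3, and a number rounded to two decimals.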

How can we implement conditional branching?

#!/usr/bin/awk -f
{
    # lots of code here, where you may find an error
    if ( numberOfErrors > 0 ) {
        exit
    }
}

What is the purpose of the exit function?

Causes awk to stop reading its input and execute END operations, if any are specified.

What is the purpose of the next statement?

Causes awk to immediately get the next line of input and begin scanning it from the first pattern in the program.

Join lines.

What is the purpose of the int function?

Truncate a floating-point number to make an integer.

How can we use the length function?

The length() function calculates the length of a string. I often use it to make sure my input is correct. If you wanted to ignore empty lines, check the length of each line before processing it with:

if (length($0) > 0) {
    . . .
}

How can we use the index function to search for a sub-string?

If you want to search for a special character, the index() function will search for specific characters inside a string. To find a comma, the code might look like this:

sentence="This is a short, meaningless sentence.";
if (index(sentence, ",") > 0) {
    printf("Found a comma in position %d\n", index(sentence,","));
}

How can we use the substr function to extract a sub-string?

The substr() function can extract a portion of a string. There are two ways to use it:

substr(string,position)
substr(string,position,length)

where string is the string to search, position is the position of the first character to extract (counting from 1), and length is the number of characters to extract (by default, everything up to the end of the string).
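A quick sketch of both forms on a made-up string (the function name substr_demo is hypothetical). Positions count from 1, and leaving out the length returns the rest of the string.

```shell
substr_demo() {
    awk 'BEGIN {
        s = "hello world"
        print substr(s, 7)         # from position 7 to the end
        print substr(s, 1, 5)      # five characters starting at position 1
        print substr(s, 7, 100)    # a length past the end is cut off at the end
    }'
}

substr_demo
```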

Can we assign value to a variable and use it in a condition?

Yes.

#!/bin/awk -f
{
# field 1 is the e-mail address - perhaps
    if ((x=index($1,"@")) > 0) {
        username = substr($1,1,x-1);
        hostname = substr($1,x+1,length($1));
# the above is the same as
#        hostname = substr($1,x+1);
        printf("username = %s, hostname = %s\n", username, hostname);
    }
}

How can we use the split function?

The split function takes three arguments: the string, an array, and the separator. The function returns the number of pieces found.

#!/usr/bin/awk -f
BEGIN {
# this script breaks up the sentence into words, using 
# a space as the character separating the words
    string="This is a string, is it not?";
    search=" ";
    n=split(string,array,search);
    for (i=1;i<=n;i++) {
        printf("Word[%d]=%s\n",i,array[i]);
    }
    exit;
}

How can we convert a string to lower or upper case?

GAWK has the toupper() and tolower() functions (they are also part of POSIX awk):

#!/usr/local/bin/gawk -f
{
    print tolower($0);
}

What is the purpose of the sub function?

Performs a string substitution. To replace "old" with "new" in a string, use:

sub(/old/, "new", string)

If the third argument is missing, $0 is assumed to be the string searched. The function returns 1 if a substitution occurs, and 0 if not. If no slashes are given in the first argument, the first argument is assumed to be a variable containing a regular expression. The sub() function only changes the first occurrence.

What is the difference between the sub function and the gsub function?

The sub() function only changes the first occurrence. The gsub() function is similar to the g option in sed: all occurrences are converted, and not just the first.
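A minimal sketch of the difference, run on two copies of the same made-up string (the function name sub_vs_gsub is hypothetical). It also prints what each call returns; in POSIX awk, gsub() returns the number of replacements it made.

```shell
sub_vs_gsub() {
    awk 'BEGIN {
        s = "a-b-c"
        t = s
        n1 = sub(/-/, "+", s)      # replaces only the first "-"
        n2 = gsub(/-/, "+", t)     # replaces every "-"
        print n1, s
        print n2, t
    }'
}

sub_vs_gsub
```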

What is the return value of the sub function and the gsub function?

The sub() function returns 1 if a substitution occurs, and 0 if not. The gsub() function returns the number of substitutions made, which is 0 if nothing matched.

How can we use the match function?

As the above demonstrates, sub() and gsub() return a positive value if a match is found. However, they have the side effect of changing the string tested. If you don't want this, you can copy the string to another variable, and test the spare variable. NAWK also provides the match() function. If match() finds the regular expression, it sets two special variables, RSTART and RLENGTH, that indicate where the match begins and how long it is. Here is an example that does this:

#!/usr/bin/nawk -f
# demonstrate the match function

BEGIN {
    regex="[a-zA-Z0-9]+";
}
{
    if (match($0,regex)) {
#           RSTART is where the pattern starts
#           RLENGTH is the length of the pattern
            before = substr($0,1,RSTART-1);
            pattern = substr($0,RSTART,RLENGTH);
            after = substr($0,RSTART+RLENGTH);
            printf("%s<%s>%s\n", before, pattern, after);
    }
}

What is the purpose of the system function?

NAWK has a function system() that can execute any program. It returns the exit status of the program.

if (system("/bin/rm junk") != 0)
print "command didn't work";

What is the purpose of the getline function?

AWK has a command, getline, that allows you to force the next line to be read. In its simplest form it doesn't take any arguments. It returns 1 if successful, 0 if end-of-file is reached, and -1 if an error occurs. As a side effect, the current input line ($0) changes. This next script filters the input, and if a backslash occurs at the end of the line, it reads the next line in, eliminating the backslash as well as the need for it.

#!/usr/bin/awk -f
# look for a \ as the last character.
# if found, read the next line and append
{
    line = $0;
    while (substr(line,length(line),1) == "\\") {
# chop off the last character
        line = substr(line,1,length(line)-1);
        i=getline;
        if (i > 0) {
            line = line $0;
        } else {
            printf("missing continuation on line %d\n", NR);
        }
    }
    print line;
}

Instead of reading into the standard variables, you can specify the variable to set:

getline a_line
print a_line;

NAWK and GAWK allow the getline function to be given an optional filename or string containing a filename.

NAWK's getline can also read from a pipe. If you have a program that generates a single line, you can use:

"command" | getline;
print $0;

"command" | getline abc;
print abc;

If you have more than one line, you can loop through the results:

while (("command" | getline) > 0) {
    cmd[i++] = $0;
}
for (i in cmd) {
    printf("%s=%s\n", i, cmd[i]);
}

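Some old versions of AWK allow only one pipe to be open at a time. There, if you want to open another pipe, you must first execute:

close("command");

This is necessary even if the end of file is reached. In any version, closing the pipe is also what lets you run the same command a second time.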
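A sketch of the whole read-then-close cycle. The command string "echo one; echo two" is only a placeholder, and the function name read_pipe is made up. The loop tests for "> 0" so that an error return of -1 ends the loop instead of repeating it forever.

```shell
read_pipe() {
    awk 'BEGIN {
        cmd = "echo one; echo two"      # placeholder command, run through sh
        n = 0
        while ((cmd | getline line) > 0)
            out[++n] = line
        close(cmd)                      # the string must match the one used above
        for (i = 1; i <= n; i++)
            print i, out[i]
    }'
}

read_pipe
```

Keeping the command in a variable makes it easy to pass exactly the same string to close(). The numeric loop prints the lines in order, which "for (i in out)" would not guarantee.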

What is the purpose of the systime function?

The systime() function (a GAWK extension, not part of POSIX awk) returns the current time as the number of seconds since midnight UTC, January 1, 1970.

How can we define our own user-defined function?

function functionName ( variableName ) {
}
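A minimal sketch of a complete function, for awk versions that support them (nawk, gawk, POSIX awk). The function max and the wrapper name row_max are made up for illustration. The "+ 0" forces a numeric comparison whatever the arguments look like.

```shell
# Print the larger of the first two fields on each line.
row_max() {
    awk 'function max(a, b) { return (a + 0 > b + 0) ? a : b }
         { print max($1, $2) }'
}

printf '3 7\n10 2\n' | row_max
```

The function definition sits outside any pattern { action } rule, and it can be called from any rule in the program.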

What is a pattern in awk?

The basic syntax for awk:

pattern {commands}

So far, we've only used the BEGIN and END special patterns. Other patterns are possible, yet I haven't used any. There are several reasons for this. The first is that these patterns aren't necessary. You can duplicate them using an if statement. Therefore this is an "advanced feature". Patterns, or perhaps the better word is conditions, tend to make an AWK program obscure to a beginner. You can think of them as an advanced topic, one that should be attempted after becoming familiar with the basics.

A pattern or condition is simply an abbreviated test. If the condition is true, the action is performed. All relational tests can be used as a pattern. The "head -10" command, which prints the first 10 lines and stops, can be duplicated with:

{if (NR <= 10 ) {print}}

Changing the if statement to a condition shortens the code:

NR <= 10 {print}

Besides relational tests, you can also use containment tests, i.e., do strings contain regular expressions? Printing all lines that contain the word "special" can be written as:

{if ($0 ~ /special/) {print}}

or more briefly:

$0 ~ /special/ {print}

This type of test is so common, the authors of AWK allow a third, shorter format:

/special/ {print}

These tests can be combined with the AND (&&) and OR (||) operators, as well as the NOT (!) operator. Parentheses can also be added if you are in doubt, or to make your intention clear.

The following condition prints the line if it contains the word "whole" or columns 1 and 2 contain "part1" and "part2" respectively.

($0 ~ /whole/) || (($1 ~ /part1/) && ($2 ~ /part2/)) {print}

This can be shortened to:

/whole/ || $1 ~ /part1/ && $2 ~ /part2/ {print}

There is one case where adding parentheses can hurt. The condition:

/whole/ {print}

works everywhere, but some old versions of AWK reject

(/whole/) {print}

even though POSIX-compatible versions accept it. If parentheses are used and old versions matter, explicitly specify the test:

($0 ~ /whole/) {print}

Can variable be used as a condition?

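It depends on the version. The original AWK does not accept a bare variable as a condition (or pattern). There,

NF {print}

does not work and results in a syntax error. To prevent the error, change it to:

NF != 0 {print}

POSIX-compatible versions (nawk, gawk, mawk) accept any expression as a pattern, so both forms work there: the action runs when the value is non-zero or a non-empty string.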

Does awk have address range like sed?

Yes.

/start/,/stop/ {print}

This form defines, in one line, the condition to turn the action on, and the condition to turn the action off. That is, when a line containing "start" is seen, it is printed. Every line afterwards is also printed, until a line containing "stop" is seen. This one is also printed, but the line after, and all following lines, are not printed.
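A quick sketch of the range in action on made-up input (the function name between_markers is hypothetical). With no action given, the default action prints each selected line.

```shell
# Print from the first line matching "start" through the next line matching "stop".
between_markers() {
    awk '/start/,/stop/'
}

printf 'a\nstart\nb\nstop\nc\n' | between_markers
```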

What is the purpose of the FNR builtin variable?

The FNR variable contains the number of lines read, but is reset for each file read. The NR variable accumulates for all files read. Therefore if you execute an awk script with two files as arguments, with each containing 10 lines:

nawk '{print NR}' file file2
nawk '{print FNR}' file file2

The first program would print the numbers 1 through 20, while the second would print the numbers 1 through 10 twice, once for each file.
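The two counters can also be printed side by side. This sketch creates two small throwaway files in a temporary directory (it assumes the mktemp command is available; the file names and the function name nr_vs_fnr are made up).

```shell
nr_vs_fnr() {
    dir=$(mktemp -d) || return 1
    printf 'a\nb\n' > "$dir/file1"
    printf 'c\nd\n' > "$dir/file2"
    # NR keeps counting across files; FNR starts again at 1 for each file.
    awk '{ print NR, FNR, $0 }' "$dir/file1" "$dir/file2"
    rm -r "$dir"
}

nr_vs_fnr
```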

What is the purpose of the OFMT built-in variable?

The OFMT variable specifies the default format for numbers. The default value is "%.6g".

What is the purpose of the RSTART variable and RLENGTH variable?

After the match() function is called, RSTART contains the position in the string where the match starts (0 if there was no match), and RLENGTH contains the length of the match (-1 if there was no match).

How can we delete an element from an associative array?

delete fooarray[1]