awk

linux

Advance stuff

Important:
http://www.grymoire.com/Unix/Awk.html - done reading
http://www.ibm.com/developerworks/library/l-awk1/ - done looking through
http://www.staff.science.uu.nl/~oostr102/docs/nawk/nawkA4.pdf - the awk manual
http://www.vectorsite.net/tsawk.html - done looking through

http://www.hcs.harvard.edu/~dholland/computers/awk.html - done looking through
https://quickleft.com/blog/command-line-tutorials-sed-awk/ - done looking through
http://www.theunixschool.com/p/awk-sed.html - done looking through
https://www.digitalocean.com/community/tutorials/how-to-use-the-awk-language-to-manipulate-text-in-linux - done looking through
https://www.cse.iitb.ac.in/~br/courses/cs699-autumn2013/refs/awk-tutorial.html - done looking through

Example 1. How can we use awk to extract one column of a csv file?

awk -F "\"*,\"*" '{print $2}' textfile.csv

Change '$2' to the nth column you want.

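In the above code, -F sets the field separator. Here the separator is not a fixed string but a regular expression: a comma with any number of double quotes on either side. The command prints the second column of the csv file. This is a quick approximation rather than a real CSV parser: the opening quote of the first field and the closing quote of the last field are left in place, and a comma inside a quoted field is still treated as a separator.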
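As a quick check of the separator above, here is a minimal sketch that feeds one made-up CSV line through the same command. The sample data and the shell variable names are invented for illustration.

```shell
# One sample CSV line with every field quoted (hypothetical data).
line='"alice","30","NYC"'

# The separator regex "*,"* matches a comma plus any quotes around it.
second=$(printf '%s\n' "$line" | awk -F '"*,"*' '{print $2}')
echo "$second"   # prints 30
```

Printing $1 instead would show the leftover opening quote on the first field.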

Example 2. Basic explanations:

awk 'BEGIN {sum=0} \
 {sum=sum + $2} \
END {print "tot:", sum}' Yourinputfile.txt

In the above example, we invoke the awk program with two parameters. The first parameter is the awk script to be parsed and executed. The second parameter is the input file that we want our script to process.

The basic structure of an AWK script follows the form:

pattern { action }

The pattern specifies when the action is performed. Like most UNIX utilities, AWK is line oriented. That is, the pattern specifies a test that is performed with each line read as input. If the condition is true, then the action is taken. The default pattern is something that matches every line. This is the blank or null pattern.

The above example contains 3 patterns and 3 corresponding action blocks. The first pattern is the BEGIN pattern. The BEGIN pattern is a special keyword / pattern. Its purpose is to specify actions to be taken before the first line is read. The BEGIN pattern is useful for initialization. The second pattern is the empty or blank pattern. This is also known as the default pattern. The third pattern is the END pattern. The END pattern is used to generate the summary or aggregate result.
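To see the three blocks at work, the same program can be run on a few lines of input. This is only a sketch; the fruit data is made up.

```shell
# Three sample records; the second column holds the numbers to add up.
total=$(printf 'apples 3\npears 4\nplums 5\n' | awk '
  BEGIN { sum = 0 }           # runs once, before any input is read
        { sum = sum + $2 }    # runs for every input line
  END   { print "tot:", sum } # runs once, after the last line
')
echo "$total"   # prints tot: 12
```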

Example 3. Useful one-liners:

kill -9 $(ps -ef | awk '$1 == "chavez" {print $2}') # Kill every process owned by user chavez (with ps -ef, field 1 is the user and field 2 is the PID)
awk '{ print $1 }' # Print the first field of each line read from standard input
awk '{ print }' /path/file # Print each line of /path/file
awk '{ print $1 }' /path/file # Print the first field of each line
awk -F":" '{ print $1 $3 }' /path/file # Use a different input field separator
awk -F":" '{ print $1 " " $3 }' /path/file # Same, with a space between the two fields
awk -f myscript.awk myfile.in # Run the awk program stored in myscript.awk
/foo/ { print } # A regular expression pattern selects lines, much like an address in sed
$1 == "fred" { print $3 } # A condition can be used as a pattern
$5 ~ /root/ { print $3 } # If the fifth field contains "root", print the third field
awk '{ print $2, $1 }' file # Print the first two fields in opposite order
awk 'length > 72' file # Print lines that are longer than 72 characters
awk '{print length($2)}' file # Print the length of the string in the 2nd column
awk '{ for (i = NF; i > 0; --i) print $i }' file # Print the fields of each line in reverse order, one per line
awk '{print NR, $0}' file # Add line numbers
awk '{$2 = ""; print}' file # Print every line after erasing the second field
awk 'END { print }' # Print the last line of the input (POSIX awk keeps $0 in END), or just use the tail command
awk '/root/' ~/temp.txt # Print all lines that match root. As in sed, the pattern selects lines. The default action in awk is to print the entire line.
awk '$9 ~/^\./ {gsub(/jessica/, "nicelady"); print;}' ~/temp.txt # If the ninth field starts with a dot, do the global substitution on the entire line and then print it
awk '/UUID/' /etc/fstab # Print every line that contains the pattern

What is awk?

It is an excellent filter and report writer. Many UNIX utilities generate rows and columns of information. AWK is an excellent tool for processing these rows and columns, and is easier to use than most conventional programming languages. It can be considered a pseudo-C interpreter, as it understands the same arithmetic operators as C. AWK also has string manipulation functions, so it can search for particular strings and modify the output. AWK also has associative arrays, which are incredibly useful and which many languages of its era lacked. Associative arrays can make a complex problem a trivial exercise.

Awk supports the usual mathematical functions: cos, sin, atan2, log, exp, sqrt, rand, srand and int. There is no built-in tan; compute it as sin(x)/cos(x).

What is the definition of positional variable?

A positional variable such as $1 or $2 represents a particular field (column) of the current input line. Positional variables start with a dollar sign.

What is the definition of user-defined variable?

User-defined variables are the ones that you define yourself, such as x in:

BEGIN { x=5 }
{ print x, $x}

The above code first initializes the variable x to 5. Then, for every input line, it prints the value of x followed by the field (column) whose number is the value of x, here $5. The general form of an assignment is:

variableName = arithmetic_expression

What is the purpose of $0?

The variable "$0" is a special positional variable that contains the entire line that AWK read in.

Can we modify the value of positional variables?

Yes. This is what we need to do if we want to change the value of a field. Assigning to a field makes AWK rebuild $0 from the fields, joined by the output field separator, so print or print $0 outputs the updated line.

What happens if we assign an empty string to a positional variable?

The actual number of fields does not change. Setting a positional variable to an empty string does not delete the variable. It's still there, but its contents have been deleted.
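A small sketch of this behaviour on a made-up three-field line: NF stays at 3, and the rebuilt line keeps both separators around the now-empty field.

```shell
# Blank out field 2, then show the field count and the rebuilt line.
out=$(echo 'a b c' | awk '{ $2 = ""; print NF; print }')
echo "$out"
# first line:  3
# second line: a, two spaces, c
```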

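Can a variable be used as a condition?

It depends on the version. The original AWK did not accept a bare variable as a condition (or pattern). There,

NF {print}

did not work. It resulted in a syntax error and had to be changed to:

NF != 0 {print}

NAWK, GAWK, MAWK and any other POSIX awk accept any expression as a pattern, so both forms work there. The pattern is true when the value is a nonzero number or a non-empty string, which makes NF {print} a common idiom for skipping blank lines.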

What is the purpose of the BEGIN pattern?

The BEGIN pattern is a special keyword / pattern. Its purpose is to specify actions to be taken before the first line is read.

What is the purpose of the END pattern?

The END pattern is a special keyword / pattern. Its purpose is to specify actions to be taken after the last line is read.

How can we do math with awk?

Awk supports a full range of arithmetic operators: +, -, *, /, %, ^, ++, --, +=, -=, *=, /=, %=, ^=
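A short sketch exercising a few of these operators inside a BEGIN block, so no input file is needed. The variable names are arbitrary.

```shell
math=$(awk 'BEGIN {
  x = 7
  print x % 3, x / 2, int(x / 2)   # remainder, division, truncated division
  x += 3                            # same as x = x + 3
  x++                               # increment by one
  print x
}')
echo "$math"
# first line:  1 3.5 3
# second line: 11
```

Note that division is always floating point; use int() when an integer result is wanted.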

How can we do initialization?

Use the BEGIN pattern:

BEGIN { x=5 }
{ print x, $x}

How can we change the input separator character?

AWK can be used to parse many system administration files. However, many of these files do not use whitespace as a separator. As an example, the password file uses colons. You can easily change the field separator character to a colon with the "-F" command line option.

awk -F: '{if ($2 == "") print $1 ": no password!"}' </path/file

You can also change the input field separator inside the script, by setting the FS variable in a BEGIN block:

#!/bin/awk -f
BEGIN {
    FS=":";
}
{
    if ( $2 == "" ) {
        print $1 ": no password!";
    }
}

Can we use a string as an input field separator?

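Yes. The input field separator can be more than one character, and a value longer than one character is treated as a regular expression. If you specify:

FS=": ";

then AWK will split a line into fields wherever it sees those two characters, in that exact order. The original AWK accepted only a single character after -F on the command line, so there this was a difference between the command line option and the internal variable. POSIX versions (NAWK, GAWK, MAWK) accept the same value on the command line, as long as it is quoted for the shell, as in -F ': '.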
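A minimal sketch of a two-character separator set through FS. The sample line is invented.

```shell
# Fields are separated by a colon followed by a space.
field=$(echo 'name: bob: 42' | awk 'BEGIN { FS = ": " } { print $2 }')
echo "$field"   # prints bob
```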

Inside an awk script, how frequent can we change the input field separator?

You can change the field separator as many times as you want while reading a file, although only one change per line has any effect. You can even choose the new value depending on the line you just read. In POSIX awk a new FS value takes effect on the next line read, because the current line has already been split.

Suppose you had the following file which contains the numbers 1 through 7 in three different formats. Lines 4 through 6 have colon separated fields, while the others are separated by spaces.

ONE 1 I
TWO 2 II
#START
THREE:3:III
FOUR:4:IV
FIVE:5:V
#STOP
SIX 6 VI
SEVEN 7 VII

The AWK program can easily switch between these formats:

#!/bin/awk -f
{
    if ($1 == "#START") {
        FS=":";
    } else if ($1 == "#STOP") {
        FS=" ";
    } else {
        #print the Roman number in column 3
        print $3
    }
}

What is the default value for the output field separator?

Space.

How can we change the output field separator?

Normally this is a space, but you can change this by modifying the variable "OFS".

OFS=":";
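A small sketch of OFS in use, on a made-up line. OFS is placed between the arguments of print, and it is also used when AWK rebuilds $0 after a field assignment (here the no-op $1 = $1).

```shell
out=$(echo 'a b c' | awk 'BEGIN { OFS = ":" } { print $1, $2; $1 = $1; print }')
echo "$out"
# first line:  a:b
# second line: a:b:c
```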

What does the print command do if we do not specify any value?

It prints $0, which is the entire line.

How does the print command behave if we specify multiple variables with commas and without commas?

There is an important difference between

print $2 $3

and

print $2, $3

The first example prints out one field, and the second prints out two fields. In the first case, the two positional parameters are concatenated together and output without a space. In the second case, AWK prints two fields, and places the output field separator between them.
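The difference is easy to see on a sample line (the input here is made up):

```shell
out=$(echo 'x foo bar' | awk '{ print $2 $3; print $2, $3 }')
echo "$out"
# first line:  foobar   (concatenated, no separator)
# second line: foo bar  (joined by the output field separator)
```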

What is the purpose of the NF variable?

The Number of Fields variable. It is useful to know how many fields are on a line. You may want to have your script change its operation based on the number of fields. As an example, the command "ls -l" may generate eight or nine fields, depending on which version you are executing. The System V version, "/usr/bin/ls -l", generates nine fields, which is equivalent to the Berkeley "/usr/ucb/ls -lg" command. If you wanted to print the owner and filename, then the following AWK script would work with either version of "ls":

#!/bin/awk -f
# parse the output of "ls -l"
# print owner and filename
# remember - Berkeley ls -l has 8 fields, System V has 9
{
    if (NF == 8) {
        print $3, $8;
    } else if (NF == 9) {
        print $3, $9;
    } 
}

What happens if we prefix the NF variable with a dollar sign?

Don't forget that the variable can be prefixed with a "$". $NF is the field whose number is NF, so this allows you to print the last field of any line.

#!/bin/awk -f
{ print $NF; }
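Because the value after the dollar sign can be any expression, fields can also be counted from the end. A sketch with a made-up four-field line:

```shell
out=$(echo 'a b c d' | awk '{ print NF, $NF, $(NF-1) }')
echo "$out"   # prints 4 d c
```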

What is the limit on the number of fields per single line?

The original AWK had a limit of 99 fields in a single line. Modern implementations raise or remove this limit; GAWK has no fixed limit.

What is the purpose of the NR built-in variable?

Number of Records. This tells you the number of records, or the line number. You can use AWK to only examine certain lines. This example prints lines after the first 100 lines, and puts a line number before each line after 100:

#!/bin/awk -f
{
    if (NR > 100) {
        print NR, $0;
    }
}

What is the purpose of the RS built-in variable?

Record Separator. Normally, AWK reads one line at a time, and breaks up the line into fields. You can set the "RS" variable to change AWK's definition of a "line". If you set it to an empty string, AWK switches to paragraph mode: records are separated by one or more blank lines, so a file that contains no blank lines is read into memory as a single record. You can combine this with changing the "FS" variable.

This example treats each line as a field, and prints out the second and third line of each record (of the whole file, if it has no blank lines):

#!/bin/awk -f
BEGIN {
# change the record separator from newline to blank lines (paragraph mode)
    RS=""
# change the field separator from whitespace to newline
    FS="\n"
}
{
# print the second and third line of each record
    print $2, $3;
}

In the original AWK this only works if the record is less than 100 lines long, because of the 99-field limit, therefore this technique is limited there. Versions without a field limit, such as GAWK, do not have this restriction.
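As a further sketch of RS and FS working together, assuming a POSIX awk (where an empty RS means records are separated by blank lines) and made-up address data:

```shell
# Two records separated by a blank line; each line of a record is one field.
out=$(printf 'Alice\n1 Main St\nParis\n\nBob\n2 High St\nRome\n' |
  awk 'BEGIN { RS = ""; FS = "\n" } { print $1 " lives in " $3 }')
echo "$out"
# first line:  Alice lives in Paris
# second line: Bob lives in Rome
```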

You can also use this technique to break words up, one word per line, using this:

#!/bin/awk -f
BEGIN {
    RS=" ";
}
{
    print ;
}

But this only works if all of the words are separated by a single space. A tab or punctuation mark stays attached to its word, and a newline is kept as part of the record rather than treated as a separator.

What is the purpose of the ORS built-in variable?

The default output record separator is a newline, like the input. This can be set to be a newline and carriage return, if you need to generate a text file for a non-UNIX system.

#!/bin/awk -f
# this filter adds a carriage return to all lines
# before the newline character
BEGIN {    
    ORS="\r\n"
}
{ print }

What is the purpose of the FILENAME built-in variable?

It tells you the name of the file being read.

What is the purpose of the exit statement?

Causes awk to stop reading its input and execute the END actions, if any are specified.
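A small sketch on made-up numbers: the sum stops as soon as a line equal to 3 is seen, and the END block still runs.

```shell
# Lines 1 and 2 are added; line 3 triggers exit; line 4 is never read.
sum=$(printf '1\n2\n3\n4\n' | awk '$1 == 3 { exit } { sum += $1 } END { print sum }')
echo "$sum"   # prints 3
```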

What is the purpose of the next statement?

Causes awk to immediately read the next line of input and start again from the first pattern in the script, skipping the remaining patterns for the current line.
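A typical use is skipping lines that should not reach the later rules. A sketch with invented input, where lines starting with # are treated as comments:

```shell
# Comment lines hit next and never reach the second rule.
out=$(printf '# note\ndata 1\n# note\ndata 2\n' | awk '/^#/ { next } { print $2 }')
echo "$out"
# first line:  1
# second line: 2
```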

Join lines.

What is the purpose of the int function?

Truncate a floating-point number to make an integer.

How can we use the length function?

The length() function calculates the length of a string. I often use it to make sure my input is correct. If you wanted to ignore empty lines, check the length of each line before processing it with:

if (length($0) > 0) {
    . . .
}
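The same check can be written as a pattern instead of an if. A sketch on made-up input, comparing against 0 so that one-character lines are kept:

```shell
# The empty middle line is skipped; the others are printed with their length.
out=$(printf 'a\n\nbcd\n' | awk 'length($0) > 0 { print length($0) ": " $0 }')
echo "$out"
# first line:  1: a
# second line: 3: bcd
```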

How can we use the index function to search for a sub-string?

If you want to search for a special character, the index() function will search for specific characters inside a string. To find a comma, the code might look like this:

sentence="This is a short, meaningless sentence.";
if (index(sentence, ",") > 0) {
    printf("Found a comma in position %d\n", index(sentence,","));
}

How can we use the substr function to extract a sub-string?

The substr() function can extract a portion of a string. There are two ways to use it:

substr(string,position)
substr(string,position,length)

where string is the string to extract from, position is the index of the first character to extract (counting from 1), and length is the number of characters to extract. If length is omitted, everything from position to the end of the string is returned.
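A quick sketch of both forms on a made-up string:

```shell
out=$(awk 'BEGIN {
  s = "hello world"
  print substr(s, 7)      # from character 7 to the end of the string
  print substr(s, 1, 5)   # five characters starting at character 1
}')
echo "$out"
# first line:  world
# second line: hello
```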

Can we assign value to a variable and use it in a condition?

Yes.

#!/bin/awk -f
{
# field 1 is the e-mail address - perhaps
    if ((x=index($1,"@")) > 0) {
        username = substr($1,1,x-1);
        hostname = substr($1,x+1,length($1));
# the above is the same as
#        hostname = substr($1,x+1);
        printf("username = %s, hostname = %s\n", username, hostname);
    }
}

How can we use the split function?

The split function takes three arguments: the string, an array, and the separator. If the separator is omitted, FS is used. The function returns the number of pieces found, which are stored in array[1] through array[n].

#!/usr/bin/awk -f
BEGIN {
# this script breaks up the sentence into words, using 
# a space as the character separating the words
    string="This is a string, is it not?";
    search=" ";
    n=split(string,array,search);
    for (i=1;i<=n;i++) {
        printf("Word[%d]=%s\n",i,array[i]);
    }
    exit;
}

How can we convert a string to lower or upper case?

The toupper() and tolower() functions are missing from the original AWK, but GAWK, NAWK and any other POSIX awk have them:

#!/usr/local/bin/gawk -f
{
    print tolower($0);
}

What is the purpose of the sub function?

Performs a string substitution. To replace "old" with "new" in a string, use:

sub(/old/, "new", string)

If the third argument is missing, $0 is the string searched. The function returns 1 if a substitution occurs, and 0 if not. If the first argument is not written between slashes, it is treated as an expression whose string value is used as the regular expression. sub() only changes the first occurrence.

What is the difference between the sub function and the gsub function?

sub() only changes the first occurrence. The gsub() function is similar to the g option in sed: all occurrences are converted, not just the first.
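A side-by-side sketch on a made-up line. sub() is applied to a copy named line (an arbitrary variable name) so that gsub() still sees the original $0.

```shell
out=$(echo 'aaa bbb aaa' | awk '{
  line = $0
  sub(/aaa/, "x", line)   # replaces the first match only, in the copy
  print line
  gsub(/aaa/, "x")        # replaces every match in $0
  print
}')
echo "$out"
# first line:  x bbb aaa
# second line: x bbb x
```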

What is the return value of the sub function and the gsub function?

sub() returns 1 if a substitution occurs, and 0 if not. gsub() returns the number of substitutions made, which is 0 if nothing matched.

How can we use the match function?

As the above demonstrates, sub() and gsub() return a positive value if a match is found. However, they have the side effect of changing the string tested. If you don't want this, you can copy the string to another variable, and test the spare variable. NAWK also provides the match() function, which tests a string without modifying it. match() returns the position where the regular expression matches, or 0 if it does not match. On a match it also sets two special variables: RSTART, where the match begins, and RLENGTH, the length of the match. Here is an example that does this:

#!/usr/bin/nawk -f
# demonstrate the match function

BEGIN {
    regex="[a-zA-Z0-9]+";
}
{
    if (match($0,regex)) {
#           RSTART is where the pattern starts
#           RLENGTH is the length of the pattern
            before = substr($0,1,RSTART-1);
            pattern = substr($0,RSTART,RLENGTH);
            after = substr($0,RSTART+RLENGTH);
            printf("%s<%s>%s\n", before, pattern, after);
    }
}

What is the purpose of the system function?

NAWK has a function system() that can execute any program. It returns the exit status of the program.

if (system("/bin/rm junk") != 0)
    print "command didn't work";

What is the purpose of the getline function?

AWK has a command, getline, that forces the next line of input to be read. In its simplest form it doesn't take any arguments. It returns 1 if successful, 0 if end-of-file is reached, and -1 if an error occurs. As a side effect, $0, the fields, NF and NR change to those of the new line. This next script filters the input, and if a backslash occurs at the end of the line, it reads the next line in and appends it, eliminating the backslash as well as the need for it.

#!/usr/bin/awk -f
# look for a \ as the last character.
# if found, read the next line and append
{
    line = $0;
    while (substr(line,length(line),1) == "\\") {
# chop off the last character
        line = substr(line,1,length(line)-1);
        i=getline;
        if (i > 0) {
            line = line $0;
        } else {
            printf("missing continuation on line %d\n", NR);
        }
    }
    print line;
}

Instead of reading into the standard variables, you can specify the variable to set:

getline a_line
print a_line;

NAWK and GAWK allow the getline function to be given an optional filename or string containing a filename.

NAWK's getline can also read from a pipe. If you have a program that generates a single line, you can use:

"command" | getline;
print $0;

"command" | getline abc;
print abc;

If you have more than one line, you can loop through the results:

while ("command" | getline) {
    cmd[i++] = $0;
}
for (i in cmd) {
    printf("%s=%s\n", i, cmd[i]);
}

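Older versions of AWK allow only one pipe to be open at a time; current versions allow several, up to a system-dependent limit. Either way, when you are finished with a pipe, or want to open another one or run the same command again, you must execute: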

close("command");

This is necessary even if the end of file is reached.
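Putting the pieces together, a minimal sketch that reads every line a command prints and then closes the pipe. The command string "echo one; echo two" and the variable names are placeholders for illustration. Comparing the return value with 0 keeps the loop from running forever if getline returns -1.

```shell
out=$(awk 'BEGIN {
  cmd = "echo one; echo two"          # any shell command that prints lines
  n = 0
  while ((cmd | getline line) > 0)    # 1 = got a line, 0 = end, -1 = error
    lines[++n] = line
  close(cmd)                          # the string must match the one used above
  for (i = 1; i <= n; i++)
    print i "=" lines[i]
}')
echo "$out"
# first line:  1=one
# second line: 2=two
```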

What is the purpose of the systime function?

The systime() function returns the current time of day as the number of seconds since midnight UTC, January 1, 1970. It is a GAWK extension and is not part of POSIX awk.

How can we define our own user-defined function?

function functionName ( variableName ) {
}
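A filled-in sketch of the skeleton above. The function name max and its parameters are invented for this example; user-defined functions need NAWK, GAWK or another POSIX awk.

```shell
out=$(awk '
  # max is a hypothetical helper, defined only for this example
  function max(a, b) {
    return (a > b) ? a : b
  }
  BEGIN { print max(3, 7) }
')
echo "$out"   # prints 7
```

A function can be defined before or after the rules that call it.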

Does awk have address range like sed?

Yes.

/start/,/stop/ {print}

This form defines, in one line, the condition to turn the action on, and the condition to turn the action off. That is, when a line containing "start" is seen, it is printed. Every line afterwards is also printed, until a line containing "stop" is seen. This one is also printed, but the line after, and all following lines, are not printed.
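A sketch of the range on five made-up lines; only the lines from "start" through "stop" are printed.

```shell
out=$(printf 'a\nstart\nb\nstop\nc\n' | awk '/start/,/stop/ { print }')
echo "$out"
# prints three lines: start, b, stop
```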

What is the purpose of the FNR builtin variable?

The FNR variable contains the number of lines read, but is reset for each file read. The NR variable accumulates for all files read. Therefore if you execute an awk script with two files as arguments, with each containing 10 lines:

nawk '{print NR}' file file2
nawk '{print FNR}' file file2

The first program would print the numbers 1 through 20, while the second would print the numbers 1 through 10 twice, once for each file.
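Both counters can be printed side by side. This sketch creates two throwaway two-line files in a temporary directory (the file contents are made up) and removes them afterwards.

```shell
tmp=$(mktemp -d)
printf 'a\nb\n' > "$tmp/file"
printf 'c\nd\n' > "$tmp/file2"

# NR keeps counting across files; FNR restarts at 1 for each file.
counts=$(awk '{ print NR, FNR, $0 }' "$tmp/file" "$tmp/file2")
rm -r "$tmp"
echo "$counts"
# 1 1 a
# 2 2 b
# 3 1 c
# 4 2 d
```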

What is the purpose of the OFMT built-in variable?

The OFMT variable specifies the format that print uses for numbers that are not integers. The default value is "%.6g".
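A small sketch of the effect, using an arbitrary value for x. Integer values are printed as integers whatever OFMT is set to.

```shell
out=$(awk 'BEGIN {
  x = 3.14159265
  print x            # formatted with the default OFMT
  OFMT = "%.2f"
  print x            # formatted with two decimal places
  print 42           # integers are not affected
}')
echo "$out"
# 3.14159
# 3.14
# 42
```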

What is the purpose of the RSTART variable and RLENGTH variable?

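After the match() function is called, these variables describe where the regular expression matched. RSTART contains the position in the string where the match starts, or 0 if there was no match. RLENGTH contains the length of the match, or -1 if there was no match.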

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License