AWK for Human Beings

This is a quick and dirty AWK tutorial demonstrating the most common usages and features that human beings can use on a daily basis.  Three variants are provided: standalone AWK, AWK embedded in ksh, and a one-liner.

 

On my system, awk points to GAWK.  GAWK has an expanded command set, but here we will focus only on what is specific to AWK.

 

AWK is a powerful language that is rarely used beyond printing a few columns, yet it can often stack up better in performance than other scripting languages.  AWK is also relatively easy to learn due to its small set of functions.  It is similar to C without complexities such as pointers:

[root@tom awk]# ./dfm.awk
Filesystem           1M-blocks      Used Available Use% Mounted on
/dev/sdc2                29526     26568      1458  95% /
/dev/sdc5                19689     18226       463  98% /home
/dev/sdb1                 9538      9079       459  96% /mnt/c
[root@tom awk]# cat dfm.awk
#!/bin/awk -f

function dfm ( USAGEL ) {
        CMD="df -m";
        while (1) {
                rvar = ( ( CMD | getline ) > 0 );

                if ( rvar == 0 )                 
                        break;

                DUSE=$5;                                      
                sub(/[%]/, "", DUSE);    
                                                         

                if ( DUSE + 0 > USAGEL || $0 ~ /Filesystem/ )   # + 0 forces a numeric comparison
                        print $0;
        }
}

BEGIN {
        dfm(85);
}

[root@tom awk]#

Below is a breakdown of the above commands.

  function dfm ( USAGEL ) {
This declares a function named dfm.  Since we want to pass a parameter to this function, we pick a name and enter it within the parentheses ().  That name is how we reference the parameter inside the function.

  CMD="df -m";
Set a variable, CMD, to the text we will execute as the command: df -m

  while (1) {
This starts an infinite loop.  The condition in this case, 1, is always true.

  rvar = ( ( CMD | getline ) > 0 );
This executes the text in the variable CMD and pipes its output to getline, which reads one row per call.  getline returns 1 if a row was read, 0 at the end of the output and -1 on error.  Comparing that result with > 0 saves 1 in our custom variable rvar while rows remain and 0 otherwise.  Behind the scenes, each row read sets $0 followed by $1, $2, $3 etc.

  if ( rvar == 0 )
      break;
Check if rvar is 0.  If it is, no more rows can be read and break is called to exit the innermost loop.

  DUSE=$5;
Set another variable to the fifth column.  This is the percentage field, such as 95%.

  sub(/[%]/, "", DUSE);
Substitute the percent sign in the usage field with an empty string "", essentially removing it.  This leaves only the number.

  if ( DUSE + 0 > USAGEL || $0 ~ /Filesystem/ )
Print the header and any line where the usage is greater than the limit passed in the function parameter USAGEL.  The OR operator || indicates that we want to match either the header or any filesystem whose usage is above that limit.  Adding 0 to DUSE forces a numeric comparison: sub() leaves DUSE as a string, and compared as text a value such as 100 would sort below 85.

  print $0;
If the above condition is met, print the entire line to the terminal.

  dfm(85);
This is how the function is called.  The 85 is assigned to the parameter USAGEL above.
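
The loop described above can also be written more compactly.  The following is a minimal sketch, not the script from this article: the wrapper name report_full and its two arguments (the command to run and the usage limit) are made up for illustration, and the command is passed in so the function can be tried against any df-style output.

```shell
# report_full CMD LIMIT  (hypothetical helper, not part of dfm.awk)
# Runs CMD from inside awk and prints the header row plus every row whose
# fifth column (Use%) is numerically greater than LIMIT.
report_full() {
    awk -v cmd="$1" -v limit="$2" '
    BEGIN {
        # getline returns 1 for each row read, 0 at end of output and -1 on
        # error, so "> 0" ends the loop in both of the last two cases.
        while ((cmd | getline) > 0) {
            use = $5                 # for example "95%"
            sub(/%/, "", use)        # strip the percent sign
            # use + 0 forces a numeric comparison with the limit
            if ($0 ~ /^Filesystem/ || use + 0 > limit + 0)
                print $0
        }
        close(cmd)                   # release the pipe when done
    }'
}

# Usage: report_full "df -m" 85
```

Folding the getline call into the while condition removes the need for the rvar variable and the explicit break.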

 

The script below demonstrates how to embed the above AWK script into a KSH script.  Everything is the same except for the parts we add for KSH: the surrounding KSH functions and the sleep call in the BEGIN block.  We'll add a twist by running the KSH function as a background process that completes the task 5 seconds after the main script exits.

 

[root@tom awk]# cat dfm.ksh
#!/bin/ksh

function dfm {
        awk '
                function dfm ( USAGEL ) {
                        CMD="df -m";
                        while (1) {
                                rvar = ( ( CMD | getline ) > 0 );

                                if ( rvar == 0 )                        # 0 means no more rows
                                        break;

                                DUSE=$5;                          
                                sub(/[%]/, "", DUSE);
                                                                  

                                if ( DUSE + 0 > USAGEL || $0 ~ /Filesystem/ )   # + 0 forces a numeric comparison
                                        print $0;
                        }
                }

                BEGIN {
                        SCMD="sleep 5";
                        system(SCMD);

                        dfm(85);
                        close(CMD);     # close the df pipe opened inside dfm()
                }
        ';
}

function main {
        dfm &
}
main;
 

A breakdown of the components:

  SCMD="sleep 5";
We are setting this variable to the text of the OS command we want to run.

  system(SCMD);
The AWK system built-in function executes the text in the SCMD variable and waits for it to finish.  In this case it sleeps for 5 seconds.

  dfm &
The & at the end of the function call runs the function as a separate background process, detached from the script that started it.  The new process runs in parallel with the main script.

The background process created above sleeps for 5 seconds while the main function continues and exits, due to the & after the dfm call.  After the 5 seconds, the background process prints the results and exits:

# ./dfm.ksh
#  (press enter a few times or run ps -aux|grep dfm.ksh to verify the background process is still there.)
#
#
# Filesystem           1M-blocks      Used Available Use% Mounted on
/dev/sdc2                29526     26569      1457  95% /
/dev/sdc5                19689     18226       463  98% /home
/dev/sdb1                 9538      9079       459  96% /mnt/c
#
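
To try the same behaviour without df, here is a minimal sketch.  The function names slow_report and run_in_background are hypothetical, the sleep is shortened to 1 second, and a POSIX-style shell (ksh, bash or sh) with awk on the PATH is assumed.

```shell
# slow_report stands in for the dfm function: awk sleeps, then prints.
slow_report() {
    awk 'BEGIN {
        system("sleep 1")       # system() runs the command and waits for it
        print "report done"
    }'
}

run_in_background() {
    slow_report &               # & starts the function as a background process
    bgpid=$!                    # $! holds the process ID of that job
    echo "main continues"       # printed at once, before the sleep finishes
    wait "$bgpid"               # optional: block until the background job ends
}
```

Dropping the wait line gives the behaviour of dfm.ksh, where the script returns to the prompt first and the report appears later.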

 

The line below is a one-liner that does the same.  It can simply be pasted into the CLI and run.  The result is the same in a more compact form:

[root@tom awk]# df -m|awk '{ DUSE=$5; sub(/[%]/, "", DUSE); if ( DUSE + 0 > 85 || $0 ~ /Filesystem/ ) print; }'
Filesystem           1M-blocks      Used Available Use% Mounted on
/dev/sdc2                29526     26569      1457  95% /
/dev/sdc5                19689     18226       463  98% /home
/dev/sdb1                 9538      9079       459  96% /mnt/c
[root@tom awk]#
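
The one-liner can be shortened further with awk's pattern-only form, where a bare condition prints every matching line.  This is a sketch rather than a drop-in replacement: the function name over_limit is invented here, and it assumes the header is the first line and the usage percentage is in column 5, as in the df -m output above.

```shell
over_limit() {
    # NR == 1 keeps the header row.  $5 + 0 turns "95%" into the number 95,
    # because awk converts the leading numeric part of a string.
    awk -v limit="$1" 'NR == 1 || $5 + 0 > limit + 0'
}

# Usage: df -m | over_limit 85
```

Passing the limit with -v keeps the threshold out of the awk program text, so the same function works for any value.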

Performance of a script can be tested using the date command, as in the example below:

# date +"%s:%N"; df -m|awk '{ DUSE=$5; sub(/[%]/, "", DUSE); if ( DUSE + 0 > 85 || $0 ~ /Filesystem/ ) print; }'; date +"%s:%N";
1327730034:862739185
Filesystem           1M-blocks      Used Available Use% Mounted on
/dev/sdc2                29526     26569      1457  95% /
/dev/sdc5                19689     18226       463  98% /home
/dev/sdb1                 9538      9079       459  96% /mnt/c
1327730034:873508735
#

Doing the math on the nanoseconds (873508735 - 862739185 = 10769550), we can see the code above took approximately 10.8ms to produce the output, or essentially just a little over one time slice.
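
The subtraction can be left to awk itself.  A minimal sketch, assuming timestamps in the seconds:nanoseconds form printed by date +"%s:%N" above; the function name elapsed_ms is made up for this example.

```shell
elapsed_ms() {
    awk -v t0="$1" -v t1="$2" 'BEGIN {
        split(t0, s, ":")        # s[1] = seconds, s[2] = nanoseconds
        split(t1, e, ":")
        ns = (e[1] - s[1]) * 1000000000 + (e[2] - s[2])
        printf "%.1f\n", ns / 1000000
    }'
}

# Usage: elapsed_ms 1327730034:862739185 1327730034:873508735   (prints 10.8)
```

Subtracting the seconds and nanoseconds separately before combining them keeps the arithmetic exact, since each difference is a small integer.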

 

UPDATE: Wed April 25th 2012

Here's another handy example that someone asked me about recently.  It is the equivalent of the UNIX comm command, which is available on most distributions, and the logic below performs the same comparison on systems lacking comm.  Here's the awk version, which the reader can tweak to suit their tastes:

# cat scmp.awk
#!/bin/awk -f

function scmp ( SERVERF, HOSTF ) {
        FIL=SERVERF;
        while (1) {
                if ( FIL != "" )        rvar = ( ( getline < FIL ) > 0 );
                if ( rvar == 0 ) break;
                SVR[$1]=1;
        }

        FIL=HOSTF;
        while (1) {
                if ( FIL != "" )        rvar = ( ( getline < FIL ) > 0 );

                if ( rvar == 0 ) break;
                if ( SVR[$1] == 1 ) SVR[$1]++; else SVR[$1]=-1;
        }

        for ( key in SVR ) {
                if ( SVR[key] == 2 )    printf("%12s %12s %12s\n", key, key, "=");
                if ( SVR[key] == -1 )   printf("%12s %12s %12s\n", "-", key, ">")
                if ( SVR[key] == 1 )    printf("%12s %12s %12s\n", key, "-", "<")
        }
}

BEGIN {
        scmp( ARGV[1], ARGV[2] );
}

 

And the output is:

# ./scmp.awk server.txt host.txt
       host8                         <
                    host9            >
      host10                         <
       host1                         <
                    host2            >
       host3        host3            =
       host4        host4            =
                    host5            >
                    host6            >
       host7                         <
#
#
# cat server.txt

host1
host10
host3
host4
host7
host8
# cat host.txt
host2
host3
host4
host5
host6
host9
#
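
For comparison, the same idea can be expressed with awk's main input loop instead of explicit getline calls.  This is a sketch under a few assumptions: the function name compare_lists is hypothetical, the two files have different names, blank lines are ignored, and the output is piped through sort because the order of a for (key in array) loop is unspecified.  The symbols follow scmp.awk: = for both files, < for the first only, > for the second only.

```shell
compare_lists() {
    awk '
    NF == 0 { next }                              # ignore blank lines
    FILENAME == ARGV[1] { first[$1] = 1; next }   # rows from the first file
    { second[$1] = 1 }                            # rows from the second file
    END {
        for (h in first)
            print h, ((h in second) ? "=" : "<")
        for (h in second)
            if (!(h in first))
                print h, ">"
    }' "$1" "$2" | LC_ALL=C sort
}

# Usage: compare_lists server.txt host.txt
```

Using the in operator to test membership avoids creating empty array elements as a side effect of the lookup.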

 

UPDATE: Wed Oct 24 2012

For this example, we'll benchmark a few ways of filtering a file and then see how simple array assignments stack up in AWK vs KSH vs PERL.  Here's the test setup and the results of each:

[root@mbpc bin]# date "+%s:%N"; awk 'BEGIN { while (1){ if ( getline < "test.log" == 0 ) break; if ( $1 ~ /^sdg$/ ) print $2; } }' 1>./out.awk; date "+%s:%N"
1351126862:506934628
1351126862:618688198
[root@mbpc bin]# wc -l out.awk
5882 out.awk
[root@mbpc bin]#

About 112ms.  And now another variation:

[root@mbpc bin]# date "+%s:%N"; awk '{ if ( $1 ~ /^sdg/ ) print; }' < test.log 1>./out.awk; date "+%s:%N"
1351129013:669502069
1351129013:766415288
[root@mbpc bin]# wc -l out.awk
5882 out.awk
[root@mbpc bin]#

Nearly the same result: about 97ms to process the file.  Let's try a combination of grep and AWK instead:

[root@mbpc bin]# date "+%s:%N"; grep ^sdg test.log|awk '{ print $2; }' 1>./out.awk; date "+%s:%N";
1351132472:887647078
1351132472:921227085
[root@mbpc bin]# wc -l out.awk
5882 out.awk
[root@mbpc bin]#

And that gave us about 34ms for the grep / AWK combination.  (Which is somewhat surprising as I'm calling two separate executables and piping output.)  All right, all right, all right, let's remove AWK altogether (gasp) and use cut instead:

[root@mbpc bin]# date "+%s:%N"; grep ^sdg test.log| cut -d " " -f 1 1>./out.cut; date "+%s:%N";
1351135443:722405427
1351135443:774799428
[root@mbpc bin]# wc -l out.cut
5882 out.cut
[root@mbpc bin]#

All right, 52ms (though note this run cuts field 1 rather than field 2).  So the grep / AWK combination actually did better (phew).  Let's do some array assignments instead.  Here I get about 1.1 million per second with AWK:

# date "+%s:%N"; ./second.awk; date "+%s:%N"
1351129587:862415690
1351129588:857317365
[root@mbpc bin]#
[root@mbpc bin]# cat second.awk
#!/bin/gawk -f

function malloc () {
        for ( i = 0; i < 1100000; i++ ) {
                MARY[i]="abcdefghijklmnopqrstuvwxyz012345";
        }
}

BEGIN {
        malloc();
}
#
# rpm -aq|grep awk
gawk-3.1.7-6.el6.x86_64
#

Yet when using ksh (Version JM 93t+ 2010-06-21) the same loop takes about 4.6 seconds:

# cat second.ksh
#!/bin/ksh

for (( i = 0; i < 1100000; i++ )); do
        MARY[$i]="abcdefghijklmnopqrstuvwxyz012345";
done
# echo ${.sh.version}
Version JM 93t+ 2010-06-21
# rpm -aq|grep -i ksh
ksh-20100621-6.el6.x86_64
# date "+%s:%N"; ./second.ksh; date "+%s:%N"
1351130028:821851805
1351130033:426989480
#
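
When timing loops like these, it can be worth confirming that every element was really stored.  A minimal sketch, with a hypothetical function name fill_array and the element count passed as an argument, that fills the array the way second.awk does and then counts what is in it:

```shell
fill_array() {
    awk -v n="$1" 'BEGIN {
        for (i = 0; i < n; i++)
            MARY[i] = "abcdefghijklmnopqrstuvwxyz012345"
        count = 0
        for (key in MARY)        # walk the array to count stored elements
            count++
        print count
    }'
}

# Usage: time fill_array 1100000
```

The counting pass adds to the run time, so it is best used as a separate sanity check rather than inside the timed run.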

Here's my CPU make and model for this low end server:

AMD Athlon(tm) 5200 Dual-Core Processor

So now I was interested to see if this would improve with a larger number of elements.  (Perhaps the load time of each command plays a part.)  So I set it to 11 million elements per array and got this:

# date "+%s:%N"; ./second.ksh; date "+%s:%N"
1351130479:478252192
./second.ksh: line 4: MARY: subscript out of range
1351130497:110762105
#
# date "+%s:%N"; ./second.awk; date "+%s:%N"
1351130583:363844953
1351130593:171463455
# cat second.awk
#!/bin/gawk -f

function malloc () {
        for ( i = 0; i < 11000000; i++ ) {
                MARY[i]="abcdefghijklmnopqrstuvwxyz012345";
        }
}

BEGIN {
        malloc();
}
# cat second.ksh
#!/bin/ksh

for (( i = 0; i < 11000000; i++ )); do
        MARY[$i]="abcdefghijklmnopqrstuvwxyz012345";
done
#

So about 10 seconds for AWK, while KSH is indeterminate because of the error (it aborted after roughly 17.6 seconds).  Now I wonder if there isn't a quicker way of writing the above in KSH that would match AWK's speed in this regard.  So now let's give perl a test:

[root@mbpc-pc bin]# date "+%s:%N";./second.perl; date "+%s:%N";
1408641955:906252899
abcdefghijklmnopqrstuvwxyz012345
1408641960:032583093
[root@mbpc-pc bin]# cat ./second.perl
#!/usr/bin/perl
use strict;
my $i=0;
my @MARY = ();
for ( $i=0; $i < 11000000; $i++ ) {
        push( @MARY, "abcdefghijklmnopqrstuvwxyz012345" );
}
printf("%s\n", $MARY[$i-1]);
[root@mbpc-pc bin]#

 

Wow!  Beat out all of them at ~4.1s.  But now let's try the hashes in perl:

 

[root@mbpc-pc bin]# date "+%s:%N";./third.perl; date "+%s:%N";
1408642955:840563455
abcdefghijklmnopqrstuvwxyz012345
1408642983:929218669
[root@mbpc-pc bin]# cat ./third.perl
#!/usr/bin/perl
use strict;
my $i=0;
my %MHASH;
for ( $i=0; $i < 11000000; $i++ ) {
        $MHASH{$i}="abcdefghijklmnopqrstuvwxyz012345";
}
printf("%s\n", $MHASH{$i-1});
[root@mbpc-pc bin]#

 

Oh no!  28 seconds!  23 seconds on the second run!  🙁  So averaging perl array and hash performance, that's 14-16 seconds.  So the grand summary:
 

[root@mbpc-pc bin]# time ./third.perl; time ./second.perl; time ./second.awk

real    0m23.440s
user    0m21.765s
sys     0m1.637s

real    0m3.334s
user    0m2.723s
sys     0m0.601s

real    0m8.829s
user    0m8.091s
sys     0m0.723s
[root@mbpc-pc bin]# cat third.perl
#!/usr/bin/perl
use strict;
my $i=0;
my %MHASH;
for ( $i=0; $i < 11000000; $i++ ) {
        $MHASH{$i}="abcdefghijklmnopqrstuvwxyz012345";
}
[root@mbpc-pc bin]# cat second.perl
#!/usr/bin/perl
use strict;
my $i=0;
my @MARY = ();
for ( $i=0; $i < 11000000; $i++ ) {
        push( @MARY, "abcdefghijklmnopqrstuvwxyz012345" );
}
[root@mbpc-pc bin]# cat second.awk
#!/bin/gawk -f

function malloc () {
        for ( i = 0; i < 11000000; i++ ) {
                MARY[i]="abcdefghijklmnopqrstuvwxyz012345";
        }
}

BEGIN {
        malloc();
}
[root@mbpc-pc bin]#

Now let's bring out the heavy guns and give mawk a try:

[root@mbpc-pc bin]# time ./second.mawk

real    0m2.695s
user    0m2.304s
sys     0m0.386s
[root@mbpc-pc bin]# time ./second.mawk

real    0m2.714s
user    0m2.283s
sys     0m0.424s
[root@mbpc-pc bin]# time ./second.mawk

real    0m2.721s
user    0m2.281s
sys     0m0.434s
[root@mbpc-pc bin]# rpm -aq|grep -i mawk
mawk-1.3.4-1.20130219.el6.x86_64
[root@mbpc-pc bin]# cat second.mawk
#!/usr/bin/mawk -f

function malloc () {
        for ( i = 0; i < 11000000; i++ ) {
                MARY[i]="abcdefghijklmnopqrstuvwxyz012345";
        }
}

BEGIN {
        malloc();
}
[root@mbpc-pc bin]# time ./second.awk

real    0m8.982s
user    0m8.250s
sys     0m0.715s
[root@mbpc-pc bin]#

Ahem!  Well now.  There should be a warning, though: MAWK does come with an accuracy disclaimer, and that is really the deciding factor between the two.  More on mawk can be found here.  (Thanks Pete Ritter.  You're right, time did give slightly different results but the overall result was still the same.)

Cheers!
TK

3 Responses to “AWK for Human Beings”

  3. Rather than using date(1) twice to determine execution time of your various scripts, why not use time(1)?  You'd get a closer approximation to the true execution time, one that doesn't include those of the two invocations of date(1).

    petes-mbp:tmp pete$ time for (( i = 0; i < 1100000; i++ )); do MARY[$i]="abcdefghijklmnopqrstuvwxyz012345"; done

    real    0m1.767s
    user    0m1.766s
    sys    0m0.001s
    petes-mbp:tmp pete$

     


     
  Copyright © 2003 - 2013 Tom Kacperski (microdevsys.com). All rights reserved.

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 Unported License