awk(1)
NAME
awk - Pattern scanning and processing language
SYNOPSIS
awk [-F ERE] [-f program_file]... [-v var=val]... [argument]...
awk [-F ERE] [-v var=val]... ['program_text'] [argument]...
STANDARDS
Interfaces documented on this reference page conform to industry standards
as follows:
awk: XCU5.0
Refer to the standards(5) reference page for more information about
industry standards and associated tags.
OPTIONS
-F ERE
Defines ERE (extended regular expression) as the value of the input
field separator before any input is read. Using this option is
comparable to assigning a value to the built-in variable FS.
-f program_file
Specifies the pathname (program_file) of a file containing an awk
program. If multiple instances of this option are specified, the
concatenation of the files specified as program_file in the order
specified is the awk program. The awk program can alternatively be
specified on the command line as the single argument program_text.
-v var=val
The var=val argument is an assignment operand that specifies a value
(val) for a variable (var). The specified variable assignment occurs
prior to executing the awk program, including the actions associated
with BEGIN patterns (if any are in the program). Multiple occurrences
of the -v option can be specified on the awk command line.
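As a minimal sketch of the -F and -v options, the following commands use made-up sample input supplied by printf; the variable name prefix and the shell variables out_f and out_v are hypothetical.

```shell
# -F sets the input field separator; here fields are split on a colon.
out_f=$(printf 'root:x:0\nguest:x:405\n' | awk -F: '{ print $1 }')
echo "$out_f"

# -v assigns a variable before the program starts, so BEGIN can use it.
out_v=$(awk -v prefix=user 'BEGIN { print prefix "-1" }')
echo "$out_v"
```

The first command prints the first colon-separated field of each line; the second prints user-1 without reading any input, because the program contains only a BEGIN action.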
OPERANDS
'program_text'
If -f program_file is not specified, the first parameter to awk is
program_text, delimited by single quotation (') characters.
See the DESCRIPTION section for the processing of this parameter.
argument
The following two types of argument can be intermixed:
input_file
A pathname of a file that contains the input to be read, which is
matched against the set of patterns in the program. If no
input_file operands are specified, or if the input_file argument is
-, standard input is used.
var=val
The characters before the = represent the name of an awk variable.
If that name is an awk reserved word, the behavior is undefined.
The characters following the = are interpreted as if they appeared
in the awk program preceded and followed by a double quotation (")
character, in other words, as a string value. If the value is
considered a numeric string, the variable is assigned a numeric
value. Each such variable assignment occurs just prior to the
processing of the following input_file, if any. Thus, an
assignment before the first input_file argument is executed after
the BEGIN actions (if any), while an assignment after the last
input_file argument occurs before the END actions (if any). If
there are no input_file arguments, assignments are executed
before processing the standard input.
DESCRIPTION
The awk command executes programs written in the awk programming language,
a powerful pattern matching utility for textual data manipulation. An awk
program is a sequence of patterns and corresponding actions that are
carried out when an input record matches the pattern. The awk command is a
more powerful
tool for text manipulation than either sed or grep.
The awk command:
· Performs convenient numeric processing
· Allows variables within actions
· Allows general selection of patterns
· Allows control flow in the actions
· Does not require any compiling of programs
The pattern-matching and action statements of the awk language can be
specified either on the command line or in a program file. In either case,
the awk command first reads all program statements.
If -f program_file is not specified, the first operand to awk is
program_text, delimited by single quotation (') characters.
Execution of an awk program starts by executing the actions associated with
all BEGIN patterns in the order they occur in the program. Then, each
input_file operand (or standard input if no input file is specified) is
processed in turn by:
· Reading input data until a record separator is seen (a newline
character by default)
· Splitting the current record into fields using the current value of FS
· Evaluating each pattern in the program in the order of occurrence
· Executing the action associated with each pattern that matches the
current record
The action for a matching pattern is executed before evaluating
subsequent patterns. After all input has been read, the actions associated
with all END patterns are executed in program order.
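The processing order described above can be sketched with a short program. The two-line input and the messages are made up for illustration.

```shell
out=$(printf 'one\ntwo\n' | awk '
BEGIN { print "begin" }                 # runs before any input is read
/o/   { print "has o: " $0 }            # first pattern tried on each record
/^t/  { print "starts with t: " $0 }    # tried after the action above has run
END   { print "end, records read: " NR }
')
echo "$out"
```

For the record two, both patterns match, so both actions run in program order before the next record is read.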
Refer to the EXAMPLES section for an example that demonstrates the results
of specifying a variable assignment as a flag argument or command argument
in different positions on the awk command line.
The awk command reads input data in the order stated on the command line.
If you specify input_file as a - (dash) or do not specify a filename, awk
reads standard input.
The awk command reads input data from any of the following sources:
· Any input_file operands or their equivalents, which can be affected by
modifying the awk variables ARGV and ARGC
· Standard input, in the absence of any input_file operands
· Arguments to the getline function
Input files must be text files. When the built-in variable RS is set to a
value other than a newline character, awk supports records terminated with
the specified separator up to LINE_MAX bytes.
Pattern-action statements on the command line are enclosed in ' (single
quote characters) to protect them from interpretation by the shell.
Consecutive pattern-action statements on the same command line are
separated by a ; (semicolon), within one set of quote delimiters.
By default, the awk command treats each input line as a record and splits
it into fields separated by spaces, tabs, or a field separator you set with
the FS variable. (When a
space character is the field separator, multiple spaces are recognized as a
single separator.) Fields are referenced as $1, $2, and so on. The
reference $0 specifies the entire record (by default, a line).
Program Structure
An awk program is composed of pairs of the form:
pattern { action}
Either the pattern or the action (including the enclosing brace characters)
can be omitted.
If pattern lacks a corresponding action, awk writes the entire record that
contains the pattern to standard output. If action lacks a corresponding
pattern, awk applies the action to every record.
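A small sketch of the two shortened forms, using made-up input: a pattern with no action, and an action with no pattern.

```shell
# Pattern only: each matching record is written unchanged.
out_pattern=$(printf 'apple\nbanana\ncherry\n' | awk '/an/')
echo "$out_pattern"

# Action only: the action runs for every record.
out_action=$(printf 'apple\nbanana\ncherry\n' | awk '{ print length($0) }')
echo "$out_action"
```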
Actions
An action is a sequence of statements that follow C language syntax. Any
single statement can be replaced by a statement list enclosed in braces.
When statement is a list of statements, they must be separated by newline
characters or semicolons, and are executed sequentially in order of
appearance. Statements in the awk language include:
break
continue
delete array [expression]
exit [expression]
for (expression;expression;expression) statement
for (variable in array) statement
if (expression) statement [else statement]
next
print [expression_list][>file|>>file][| command]
printf format[ ,expression_list][>file|>>file][| command]
while (expression) statement
variable=expression
Statements can end with a semicolon, a newline character, or the right
brace enclosing the action:
{ [ statement ... ] }
Expressions can have string or numeric values and are built using the
operators +, -, *, /, %, a space for string concatenation, and the C
operators ++, --, +=, -=, *=, /=, =, ^=, ?:, >, >=, <, <=, ==, $, (), ~,
!~, in, ||, &&, !, and !=.
Because the actions process fields, input white space is not preserved in
the output.
The file and command arguments in awk statements can be literal names or
expressions enclosed in double quotation (") characters. Identical string
values in different statements refer to the same open file.
The print statement writes its arguments to standard output (or to a file
if > file or >> file is present), separated by the current output field
separator and terminated by the current output record separator.
The printf statement formats its expression list according to the format of
the printf subroutine and writes the result to standard output. Unlike
print, printf does not add the output field separator or the output record
separator. You can redirect the output into a file using the print ... >
file or printf ... > file statements.
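The following sketch contrasts print and printf on one made-up input line. The value assigned to OFS is chosen only to make the separator visible.

```shell
out=$(printf 'widget 4 2.5\n' | awk '
BEGIN { OFS = "-" }
{
    print $1, $2                             # arguments joined with OFS
    printf "%s costs %.2f\n", $1, $2 * $3    # newline comes from the format
}')
echo "$out"
```

The comma in the print statement is what inserts OFS; writing print $1 $2 instead would concatenate the two fields with nothing between them.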
Variables
Variables can be scalars, array elements (denoted x[i]), or fields. With
the exception of function parameters, variables are not explicitly
declared.
Variable names can consist of uppercase and lowercase alphabetic letters,
the underscore character, the digits (0 to 9), and extended characters.
Variable names cannot begin with a digit. Field variables are designated by
$ (dollar sign), followed by a number or numerical expression. The effect
of the field number expression evaluating to anything other than a non-
negative integer is unspecified.
Variables are initialized to the null string. Array subscripts can be any
string; they do not have to be numeric. This allows for a form of
associative memory. Enclose string constants in expressions in double
quotation (") characters.
There are several variables with special meaning to awk. They include:
ARGC
The number of elements in the ARGV array.
ARGV
An array of command line arguments, excluding options and the
program_file arguments, numbered from zero to ARGC-1.
The arguments in ARGV can be modified or added to; ARGC can be altered.
As each input file ends, awk treats the next non-null element of ARGV,
up to and including the current value of ARGC-1, as the name of the
next input file. Therefore, setting an element of ARGV to null means
that it is not treated as an input file. When the element is the
character -, standard input is specified. When the element matches the
format for an assignment (variable=value), the element is treated as an
assignment rather than as the name of an awk input file.
CONVFMT
The printf format for converting numbers to strings (except for output
statements, where OFMT is used); %.6g by default.
ENVIRON
The variable ENVIRON is an array representing the value of the
environment. The indexes of the array are strings consisting of the
names of the environmental variables, and the value of each array
element is a string consisting of the value of that variable.
FILENAME
The name of the current input file. Inside a BEGIN action, the
FILENAME value is undefined. Inside an END action, the value is the
name of the last input file processed.
FNR The ordinal number of the current input line (record) in the current
file. Inside a BEGIN action, the value is zero. Inside an END action,
the value is the number of the last record processed in the last file
processed.
FS Input field separator (default is a space). If it is a space, then any
number of spaces and tabs can separate fields.
NF The number of fields in the current input line (record) with a limit of
199.
NR The number of the current input line (record).
OFS The print statement output field separator (default is a space).
ORS The print statement output record separator (default is a newline
character).
OFMT
The printf statement output format for converting numbers to strings in
output statements (default is %.6g).
RLENGTH
The length of the string matched by the match function.
RS Input record separator (default is a newline character).
RSTART
The starting position of the string matched by the match function,
numbering from 1. This is always equivalent to the return value of the
match function.
SUBSEP
The subscript separator string for multi-dimensional arrays.
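As an illustrative sketch, the commands below print a few of these variables. The input lines and the operands first and second are made up; no file with those names needs to exist because the second program has only a BEGIN action and reads no input.

```shell
# NR is the record number, NF the field count, and $NF the last field.
out_rec=$(printf 'a b c\nd e\n' | awk '
{ print NR, NF, $NF }
END { print "total", NR }')
echo "$out_rec"

# ARGV holds the operands; ARGV[0] is the command name, so ARGC - 1 operands follow.
out_arg=$(awk 'BEGIN { print ARGC - 1, ARGV[1], ARGV[2] }' first second)
echo "$out_arg"
```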
Functions
There are a variety of built-in functions that can be used in awk actions.
Arithmetic Functions
The arithmetic functions, except for int, are based on the ISO C standard.
The behavior is undefined in cases where the ISO C standard specifies that
an error be returned or that the behavior is undefined.
atan2 (y,x)
Returns the arctangent of y/x.
cos (x)
Returns the cosine of x, where x is in radians.
sin (x)
Returns the sine of x where x is in radians.
exp (x)
Returns the exponential function of x (e raised to the power x).
log (x)
Returns the natural logarithm of x.
sqrt (x)
Returns the square root of x.
int (x)
Truncates its argument to an integer. It is truncated toward 0 when x
> 0.
rand()
Returns a random number n, such that 0 <= n < 1.
srand([expr])
Sets the seed value for rand to expr or uses the time of day if expr is
omitted. The previous seed value is returned.
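A brief sketch of several arithmetic functions with constant arguments chosen so that the results are exact or easy to check:

```shell
out=$(awk 'BEGIN {
    print int(3.9), int(7 / 2)         # truncation of positive values
    print sqrt(16), exp(0), log(1)
    printf "%.4f\n", atan2(0, -1)      # arctangent giving pi
}')
echo "$out"
```

The rand and srand functions are left out of the sketch because their results depend on the seed and the implementation.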
String Functions
gsub(ere, repl[, in])
Behaves like sub (see below), except that it replaces all occurrences of the
regular expression (like the ed utility global substitute) in $0 or in
the in argument, when specified.
index(s, t)
Returns the position, in characters, numbering from 1, in string s
where string t first occurs, or zero if it does not occur at all.
length[([s])]
Returns the length, in characters, of its argument taken as a string,
or of the whole record, $0, if there is no argument.
match(s, ere)
Returns the position, in characters, numbering from 1, in string s
where the extended regular expression ere occurs, or zero if it does
not occur at all. RSTART is set to the starting position, zero if no
match is found; RLENGTH is set to the length of the matched string, -1
if no match is found.
split(s, a[, fs])
Splits the string s into array elements a[1], a[2], ... a[n], and
returns n. The separation is done with the extended regular expression
fs or with the field separator FS if fs is not given. Each array
element has a string value when created. If the string assigned to any
array element, with any occurrence of the decimal point character from
the current locale changed to a period character, would be considered a
numeric string, the array element also has the numeric value of the
numeric string. The effect of a null string as the value of fs is
unspecified.
sprintf(fmt, expr, expr, ...)
Formats the expressions according to the printf format given by fmt and
returns the resulting string.
sub(ere, repl[, in])
Substitutes the string repl in place of the first instance of the
extended regular expression ere in string in and returns the number of
substitutions. An ampersand (&) appearing in the string repl is
replaced by the string from in that matches the regular expression.
For each occurrence of backslash (\) encountered when scanning the
string repl from beginning to end, the next character is taken
literally and loses its special meaning (for example, \& is interpreted
as a literal ampersand character). Except for & and \, it is
unspecified what the special meaning of any such character is. If in
is specified and it is not an lvalue, the behavior is undefined. If in
is omitted, awk substitutes in the current record ($0).
substr(s, m[,n])
Returns the at most n character substring of s that begins at position
m, numbering from 1. If n is missing, the length of the substring is
limited by the length of the string s.
tolower(s)
Returns a string based on the string s. Each character in s that is an
upper case letter specified to have a tolower mapping by the LC_CTYPE
category of the current locale is replaced in the returned string by
the lower case letter specified by the mapping. Other characters in s
are unchanged in the returned string.
toupper(s)
Returns a string based on the string s. Each character in s that is a
lower case letter specified to have a toupper mapping by the LC_CTYPE
category of the current locale is replaced in the returned string by
the upper case letter specified by the mapping. Other characters in s
are unchanged in the returned string.
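The string functions can be tried together in a BEGIN action. This is a sketch only; the sample strings and the names s, n, and parts are hypothetical.

```shell
out=$(awk 'BEGIN {
    s = "hello world"
    print index(s, "world"), length(s), substr(s, 1, 5)
    n = gsub(/o/, "0", s);  print n, s       # replaces every o and returns the count
    sub(/w0rld/, "[&]", s); print s          # & stands for the matched text
    if (match(s, /l+/)) print RSTART, RLENGTH
    n = split("a:b:c", parts, ":"); print n, parts[2], toupper(parts[3])
}')
echo "$out"
```

Note that sub and gsub modify their third argument in place and return a count, not the new string.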
Input/Output and General Functions
close(expression)
Closes the file or pipe opened by a print or printf statement or a call
to getline with the same string-valued expression. If the close was
successful, the function returns zero; otherwise, it returns non-zero.
expression | getline [var]
Reads a record of input from a stream piped from the output of a
command. The stream is created if no stream is currently open with the
value of expression as its command name. The stream created is
equivalent to one created by a call to the popen function with the
value of expression as the command argument and a value of r as the
mode argument. As long as the stream remains open, subsequent calls in
which expression evaluates to the same string read subsequent records
from the stream. The stream remains open until the close function is
called with an expression that evaluates to the same string value. At
that time, the stream is closed as if by a call to the pclose function.
If var is missing, $0 and NF are set; otherwise, var is set.
getline
Sets $0 to the next input record from the current input file. This
form of getline sets the NF, NR, and FNR variables.
getline var
Sets variable var to the next input record from the current input file.
This form of getline sets the FNR and NR variables.
getline [var] < expression
Reads the next record of input from a named file. The expression is
evaluated to produce a string that is used as a full pathname. If the
file of that name is not currently open, it is opened. As long as the
stream remains open, subsequent calls in which expression evaluates to
the same string value, read subsequent records from the file. The file
remains open until the close function is called with an expression that
evaluates to the same string value. If var is missing, $0 and NF are
set; otherwise, var is set.
system(expression)
Executes the command given by expression in a manner equivalent to the
system function and returns the exit status of the command.
All forms of getline return 1 for successful input, zero for end of file,
and -1 for an error.
The getline function sets $0 to the next input record from the current
input file; getline < file sets $0 to the next record from file. The
function getline x sets variable x instead. Finally, command | getline pipes
the output of command into getline. Each call of getline returns the next
line of output from command.
Where strings are used as the name of a file or pipeline, the strings must
be textually identical. The terminology "same string value" implies that
"equivalent strings", even those that differ only by space characters,
represent different files.
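The following sketch reads from a command with getline and then closes the stream. The command string and the names cmd and line are made up for illustration.

```shell
out=$(awk 'BEGIN {
    cmd = "echo first; echo second"
    while ((cmd | getline line) > 0)      # 1 on success, 0 at end of file
        print "got: " line
    close(cmd)                            # closing lets the command run again
    if ((cmd | getline line) > 0)
        print "again: " line
    close(cmd)
}')
echo "$out"
```

Keeping the command in one variable ensures that getline and close see the same string value. Without the first close, the final getline would return 0 because the stream would already be at end of file.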
User-defined Functions
The awk language also provides user-defined functions. Such functions can
be defined as:
function name(args,...) { statements }
A function can be referred to anywhere in an awk program; in particular,
the function's use can precede the function definition. The scope of a
function is global.
Function arguments can be either scalars or arrays; the behavior is
undefined if an array name is passed as an argument that the function uses
as a scalar, or if a scalar expression is passed as an argument that the
function uses as an array. Function arguments are passed by value if
scalar and by reference if array name. Argument names are local to the
function; all other variable names are global. The same name must not be used as
both an argument name and as the name of a function or special awk
variable. The same name must not be used both as a variable name with
global scope and as the name of a function. The same name must not be used
within the same scope both as a scalar variable and as an array.
The number of parameters in the function definition need not match the
number of parameters in the function call. Excess formal parameters can be
used as local variables. If fewer arguments are supplied in a function
call than are in the function definition, the extra parameters that are
used in the function body as scalars are initialized with a string value of
the null string and a numeric value of zero, and the extra parameters that
are used in the function body as arrays are initialized as empty arrays.
If more arguments are supplied in a function call than are in the function
definition, the behavior is undefined.
When invoking a function, no white space can be placed between the function
name and the opening parenthesis. Function calls can be nested and
recursive calls can be made upon functions. Upon return from any nested or
recursive function call, the values of all the calling function's
parameters are unchanged, except for array parameters passed by reference.
The return statement can be used to return a value.
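A short sketch of these rules follows. The functions fact and fill and the array squares are hypothetical names; the extra formal parameter i in fill is used as a local variable, set apart by additional spaces purely as a convention.

```shell
out=$(awk '
# Recursive function returning a value.
function fact(n) { if (n <= 1) return 1; return n * fact(n - 1) }

# arr is passed by reference; i is an extra parameter used as a local.
function fill(arr, count,    i) {
    for (i = 1; i <= count; i++) arr[i] = i * i
}

BEGIN {
    i = 99                     # global i, untouched by fill
    fill(squares, 3)
    print fact(5), squares[3], i
}')
echo "$out"
```

The global i keeps its value because the i inside fill is a separate, local parameter, while the changes made to arr are visible in the caller as squares.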
Patterns
Patterns are arbitrary Boolean combinations of patterns and relational
expressions (the !, ||, and && operators and parentheses for grouping).
You must start and end regular expressions with slashes. You can use
regular expressions as described for grep, including the following special
characters:
+ One or more occurrences of the pattern.
? Zero or one occurrence of the pattern.
| Either of two patterns (alternation).
( ) Grouping of expressions.
Isolated regular expressions in a pattern apply to the entire line.
Regular expressions can occur in relational expressions. Any string
(constant or variable) can be used as a regular expression, except in the
position of an isolated regular expression in a pattern.
If two patterns are separated by a comma, the action is performed on all
lines between an occurrence of the first pattern and the next occurrence of
the second.
There are two types of relational expressions that you can use. The first
type has the form:
expression match_operator pattern
where match_operator is either: ~ (for contains) or !~ (for does not
contain).
The second type has the form:
expression relational_operator expression
where relational_operator is any of the six C relational operators: <, >,
<=, >=, ==, and !=. An expression can be an arithmetic expression, a
relational expression, or a Boolean combination of these.
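The pattern forms described above can be sketched against a small made-up input. The shell variable names are hypothetical.

```shell
input='alpha 1
start 2
beta 30
stop 4
gamma 50'

# Range pattern: from a record matching start through the next one matching stop.
out_range=$(printf '%s\n' "$input" | awk '/start/,/stop/ { print $1 }')
echo "$out_range"

# Relational expression combined with a negated match.
out_rel=$(printf '%s\n' "$input" | awk '$2 > 10 && $1 !~ /^g/ { print $1 }')
echo "$out_rel"

# Match operator with alternation and grouping.
out_match=$(printf '%s\n' "$input" | awk '$1 ~ /^(alpha|gamma)$/ { print $2 }')
echo "$out_match"
```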
Special Patterns
You can use the BEGIN and END special patterns to capture control before
the first and after the last input line is read, respectively.
Each BEGIN pattern is matched once and its associated action executed
before the first record of input is read and before command line assignment
is done. Each END pattern is matched once and its associated action
executed after the last record of input has been read. Each of these two
patterns must have an associated action.
BEGIN and END do not combine with other patterns. Multiple BEGIN and END
patterns are allowed. The actions associated with the BEGIN patterns are
executed in the order specified in the program, as are the END actions. An
END pattern can precede a BEGIN pattern in a program.
You have two ways to designate an extended regular expression other than
white space to separate fields. You can use the -Fere option on the
command line, or you can assign a string with the expression to the built-
in variable FS. Either action changes the field separator to ere.
There are no explicit conversions between numbers and strings. To force an
expression to be treated as a number, add 0 to it. To force it to be
treated as a string, append a null string ("").
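As a sketch of forcing a conversion, the program below compares the same two made-up fields first as strings and then as numbers. The names a, b, order, and rel are hypothetical.

```shell
out=$(printf '10 9\n' | awk '{
    a = $1 ""; b = $2 ""                  # appending "" forces string values
    order = (a < b) ? "before" : "after"
    print "as strings, 10 sorts " order " 9"
    rel = ($1 + 0 < $2 + 0) ? "less" : "greater"
    print "as numbers, 10 is " rel " than 9"
}')
echo "$out"
```

As strings, the comparison looks at the first characters, 1 and 9, so 10 sorts first; as numbers, 10 is the larger value.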
Comment Delimiter
In the awk language, a comment starts with the sharp sign character, #, and
continues to the end of the line. The # does not have to be the first
character on the line. The awk language ignores the rest of the line
following a sharp sign. For example:
# This program prints a nice friendly message. It helps
# Keep novice users from being afraid of the computer.
The purpose of a comment is to help you or another person understand the
program at a later time.
EXIT STATUS
The following exit values are returned:
0 Successful completion.
>0 An error occurred.
EXAMPLES
1. To display the file lines that are longer than 72 bytes, enter:
% awk 'length >72' chapter1
This command selects each line of the file chapter1 that is longer
than 72 bytes. The command then writes these lines to standard output
because no action is specified.
2. To display all lines between the words start and stop, enter:
% awk '/start/,/stop/' chapter1
3. To run an awk program (sum2.awk) that processes a file (chapter1),
enter:
% awk -f sum2.awk chapter1
4. The following awk program computes the sum and average of the numbers
in the second column of the input file:
{
sum += $2
}
END {
print "Sum: ", sum;
print "Average:", sum/NR;
}
The first action adds the value of the second field of each line to
the sum variable. Like all awk variables, sum starts as the null string,
which has the numeric value 0 (zero). The keyword END before the second action
causes awk to perform that action after all of the input file is read.
The NR variable, which is used to calculate the average, is a special
variable containing the number of records (lines) that were read.
5. To print the names of the users who have the C shell as the initial
shell, enter:
% awk -F: '$7 ~ /csh/ {print $1}' /etc/passwd
6. To print the first two fields in reversed order, enter:
% awk '{ print $2, $1 }'
7. The following awk program prints the first two fields of the input
file in reversed order, with input fields separated by a comma, then
adds up the first column and prints the sum and average:
BEGIN { FS = "," }
{ print $2, $1}
{ s += $1 }
END { print "sum is", s, "average is", s/NR }
8. The following example shows how command line assignments synchronize
with awk program statements.
Consider the following set of awk statements that make up a program
named test_program:
BEGIN { if (RS == ":")
print "Assignment in effect for BEGIN statements"
}
{ if (RS == ":")
print "Assignment in effect for middle statements"
}
END { if (RS == ":")
print "Assignment in effect for END statements"
}
Notice the different results that are produced by different ways of
assigning a value to RS on the awk command line. The file text_file
contains the line "Hello, Hello".
% awk -f test_program -v RS=: text_file
Assignment in effect for BEGIN statements
Assignment in effect for middle statements
Assignment in effect for END statements
% awk -f test_program RS=: text_file
Assignment in effect for middle statements
Assignment in effect for END statements
% awk -f test_program text_file RS=:
Assignment in effect for END statements
ENVIRONMENT VARIABLES
The following environment variables affect the execution of awk:
LANG
Provides a default value for the internationalization variables that
are unset or null. If LANG is unset or null, the corresponding value
from the default locale is used. If any of the internationalization
variables contain an invalid setting, the utility behaves as if none of
the variables had been defined.
LC_ALL
If set to a non-empty string value, overrides the values of all the
other internationalization variables.
LC_CTYPE
Determines the locale for the interpretation of sequences of bytes of
text data as characters (for example, single-byte as opposed to multi-
byte characters in arguments).
LC_MESSAGES
Determines the locale for the format and contents of diagnostic
messages written to standard error.
NLSPATH
Determines the location of message catalogs for the processing of
LC_MESSAGES.
SEE ALSO
Commands: grep(1), lex(1), sed(1)
Routines: printf(3)
Programming Support Tools