If a program needs to receive and process input, there must be a means of analyzing the input before it is processed. You can analyze input with one or more routines within the program, or with a separate program designed to filter the input before passing it to the main program. The complexity of the input interface depends on the complexity of the input; complicated input can require significant code to parse it (break it into pieces that are meaningful to the program).
This chapter describes the following two tools that help develop input interfaces:
The
lex
tool uses a set of rules to generate
a program, called a
lexical analyzer, which analyzes
input and breaks it into categories, such as numbers, letters, or operators.
The
yacc
tool uses a set of rules to generate
a program, called a
parser, which analyzes input using
the categories identified by the lexical analyzer and determines what to do
with the input.
The
yacc
tool generates look-ahead LR (LALR) parsers.
For further information about LALR grammars,
refer to a compiler book such as
Compilers: Principles, Techniques, and Tools, by Alfred Aho, Ravi Sethi, and Jeffrey Ullman.
To avoid confusion between the
lex
and
yacc
programs and the programs they generate,
lex
and
yacc
are referred to throughout this chapter as tools.
The following sections explain the two tools:
4.1 How the Lexical Analyzer Works
The lexical analyzer that
lex
generates
is a
deterministic finite-state automaton.
This design provides for
a limited number of states that the lexical analyzer can exist in, along with
the rules that determine what state the lexical analyzer moves to upon reading
and interpreting the next input character.
The compiled lexical analyzer performs the following functions:
Reads an input stream of characters.
Copies the input stream to an output stream.
Breaks the input stream into smaller strings that match the
regular expressions in the
lex
specification file.
Executes
an action for each regular expression that it recognizes.
Actions are C language
program fragments in the
lex
specification file.
An action
fragment does not have to be complete within itself; it can call subroutines
or other actions.
Figure 4-1
illustrates a simple lexical analyzer that
has three states:
start
,
good
, and
bad
.
The program reads an input stream of characters.
It begins
in the
start
condition.
When it receives the first character,
the program compares the character with the rule.
If the character is alphabetic
(according to the rule), the program changes to the
good
state; if it is not alphabetic, the program changes to the
bad
state.
The program stays in the
good
state until it finds
a character that does not match its conditions, and then it moves to the
bad
state, which terminates the program.
Figure 4-1: Simple Finite State Model
The automaton allows the generated
lexical analyzer to look ahead more than one or two characters in an input
stream.
For example, suppose the
lex
specification file defines a rule that
looks for the string ab and another rule that looks for the string abcdefg.
If the lexical analyzer gets an input string of abcdefh, it reads enough
characters to attempt a match on abcdefg.
When the h disqualifies a match
on abcdefg, the analyzer returns to the rule that looks for ab.
The first
two characters of the input match ab, so the analyzer performs any action
specified in that rule and then begins trying to find another match using
the remaining input, cdefh.
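A minimal sketch of a specification with these two rules follows; the printf actions are illustrative assumptions, not part of the example above:
ab        printf("matched ab");
abcdefg   printf("matched abcdefg");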
4.2 Writing a Lexical Analyzer Program with lex
The
lex
tool helps write a C language lexical analyzer
program that can receive character stream input and translate that input into
program actions.
To use
lex
, you must write a specification
file that contains the following parts:
Regular expressions - Character patterns that the generated lexical analyzer will recognize
Action statements - C language program fragments that define how the generated lexical analyzer is to react to regular expressions that it recognizes
The actual format and logic allowed in the specification file are discussed in Section 4.3.
The
lex
tool uses the information in the specification
file to generate the lexical analyzer.
The tool names the created analyzer
program
lex.yy.c
.
The
lex.yy.c
program contains a set of standard functions together with the
analysis code that is generated from the specification file.
The analysis
code is contained in the
yylex
function.
Lexical analyzers
created by
lex
recognize simple grammar structures and
regular expressions.
You can compile a simple
lex
analyzer
program with the following command:
%
cc lex.yy.c -ll
The
-ll
option tells the compiler to use the
lex
function
library.
This command yields an executable lexical analyzer.
If your program
uses complex grammar rules, or if it uses no grammar rules, you should create
a parser (by combining the
lex
and
yacc
tools) to ensure proper handling of the input.
(See
Section 4.6.)
The
lex.yy.c
output file can be moved to any other
system having a C compiler that supports the
lex
library
functions.
4.3 The lex Specification File
The format of
the
lex
specification file is as follows:
[ { definitions } ]
%%
[ { rules } ]
[ %%
{ user subroutines } ]
Except for the first pair of percent signs ( %%
),
which mark the beginning of the rules, all parts of the specification file
are optional.
The minimum
lex
specification file contains
no definitions, no rules, and no user subroutines.
Without a specified action for a pattern
match, the lexical analyzer copies the input pattern to the output without
changing it.
Therefore, this minimum specification file produces a lexical
analyzer that copies all input to the output unchanged.
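Such a minimum file consists of nothing but the rules delimiter:
%%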
This section contains the following information:
4.3.1 Defining Substitution Strings
You can define string macros before
the first pair of percent signs in the
lex
specification
file.
The
lex
tool expands these macros when it generates
the lexical analyzer.
Any line in this section that begins in column 1 and
that does not lie between
%{
and
%}
delimiters defines a
lex
substitution string.
Substitution string definitions have the
following general format:
name translation
The
name
and
translation
elements are separated by at least one blank or tab, and
name
begins with a letter.
When
lex
finds the
string
name
enclosed
in braces
( { }
)
in the rules part of the specification file, it changes
name
to the string defined in
translation
and deletes the braces.
For example, to define the names
D
and
E
, place the following definitions before the first
%%
delimiter in the specification file:
D    [0-9]
E    [DEde][-+]?{D}+
These definitions can be used in the rules section to make identification of integers and real numbers more compact:
{D}+                 printf("integer");
{D}+"."{D}*({E})?    |
{D}*"."{D}+({E})?    |
{D}+{E}              printf("real");
You can also include the following items in the definitions section:
Character set table (described in Section 4.3.3)
List of start conditions (described in Section 4.3.6)
Changes to size of arrays to accommodate larger source programs
4.3.2 Rules
The rules section of the specification
file contains control decisions that define the lexical analyzer that
lex
generates.
The rules are in the form of a two-column table.
The left column of the table contains regular expressions; the right column
of the table contains actions, one for each expression.
Actions
are C language
program fragments that can be as simple as a semicolon (the null statement)
or as complex as needed.
The lexical analyzer that
lex
creates contains both the expressions and the actions; when it finds a match
for one of the expressions, it executes the corresponding action.
For example, to create a lexical analyzer to look for the string, integer, and print a message when the string is found, define the following rule:
integer printf ("found keyword integer");
This example uses the C language library function
printf
to print a message string.
The first blank or tab character in the rule indicates
the end of the regular expression.
When you use only one statement in an
action, put the statement on the same line and to the right of the expression
(integer
in this example).
When you use more than one
statement, or if the statement takes more than one line, enclose the action
in braces, as in a C language program.
For example:
integer   {
            printf("found keyword integer");
            hits++;
          }
A lexical analyzer that changes some words in a file from British spellings to the American spellings would have a specification file that contains rules such as the following:
colour     printf("color");
mechanise  printf("mechanize");
petrol     printf("gas");
This specification
file is not complete, however, because it changes the word petroleum to gaseum.
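One possible repair, sketched here, relies on the longest-match behavior described in Section 4.3.2.2: add a rule for the longer word so that petroleum is matched first and printed unchanged:
petroleum  printf("petroleum");
petrol     printf("gas");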
4.3.2.1 Regular Expressions
With
a few specialized additions,
lex
recognizes the standard
set of extended regular expressions described in
Chapter 1.
Table 4-1
lists the expression operators that are special to
lex
.
Table 4-1: Regular Expression Operators for lex
Operator | Name | Description |
{name} | Braces | When braces enclose a name, the name represents a string defined earlier in the specification file. For example, {digit} looks for a defined string named digit and inserts that string at the point in the expression where {digit} occurs. Do not confuse this construct with an interval expression; both are recognized by lex. |
" " | Quotation marks | Enclose literal strings to be interpreted as text characters. For example, "$" prevents lex from interpreting the dollar sign as an operator. You can use quotation marks for only part of a string; for example, both "abc++" and abc"++" match the literal string abc++. |
a/b | Slash | Enables a match on the first expression (a) only if the second expression (b) follows it immediately. For example, dog/cat matches dog if, and only if, cat immediately follows dog. |
<x> | Angle brackets | Enclose a start condition. The lexical analyzer executes the associated action only if it is in the indicated start condition <x>. If the condition of being at the beginning of a line is start condition ONE, then the circumflex ( ^ ) operator would be the same as the expression <ONE>. |
\n | Newline character | Matches a newline. Do not use the actual newline character in an expression. Do not confuse this construct with the \n back-reference operator (where n is a digit) used in basic regular expressions. |
\t | Tab | Matches a literal tab character (09 hexadecimal). |
\b | Backspace | Matches a literal backspace (08 hexadecimal). |
\\ | Backslash | Matches a literal backslash. |
\digits | Digits | The character whose encoding is represented by the three-digit octal number digits. |
\xdigits | xDigits | The character whose encoding is represented by the hexadecimal integer digits. |
Usually, white
space (blanks or tabs) delimits the end of an expression and the start of
its associated action.
However, you can enclose blanks or tab characters
in quotation marks ( " "
) to include
them in an expression.
Use quotation marks around all blanks in expressions
that are not already within sets of brackets ( [ ]
).
4.3.2.2 Matching Rules
When more than one expression in the rules section of a specification file can match the current input, the lexical analyzer chooses which rule to apply using the following criteria:
The longest matching string of characters
Among rules that match the same number of characters, the rule that occurs first
For example, consider the following rules:
integer    printf("found int keyword");
[a-z]+     printf("found identifier");
If the rules are given in this order and integers is the input word,
the analyzer calls the input an identifier because
[a-z]+
matches all eight characters of the word while
integer
matches only seven.
However, if the input is integer, both rules match.
In this case,
lex
selects the keyword rule because it occurs
first.
A shorter input, such as int, does not match the expression
integer
, so
lex
selects the identifier rule.
4.3.2.2.1 Using Wildcard Characters to Match a String
Because the lexical analyzer chooses the longest match
first, you must be careful not to use an expression that is too powerful for
your intended purpose.
For example, a period followed by an asterisk and
enclosed in apostrophes ( '.*'
) might seem
like a good way to recognize any string enclosed in apostrophes.
However,
the analyzer reads far ahead, looking for a distant apostrophe to complete
the longest possible match.
Consider the following text:
'first' quoted string here, 'second' here
Given this input, the analyzer will match on the following string:
'first' quoted string here, 'second'
Because the
period operator does not match a newline character, errors of this type are
usually not far reaching.
Expressions like
.*
stop on
the current line.
Do not try to defeat this action with expressions like
the following:
[.\n]+
Given this expression, the lexical analyzer tries to read the entire input file, and an internal buffer overflow occurs.
The following rule finds the smaller quoted strings `first' and `second' from the preceding text example:
'[^'\n]*'
This rule stops after matching `first' because it looks for an apostrophe
followed by any number of characters except another apostrophe or a newline
character, then followed by a second apostrophe.
The analyzer then begins
again to search for an appropriate expression, and it will find `second' as
it should.
This expression also matches an empty quoted string ('').
4.3.2.2.2 Finding Strings Within Strings
Usually, the lexical analyzer program partitions the input stream. It does not search for all possible matches of each expression. Each character is accounted for exactly once. For example, to count occurrences of both `she' and `he' in an input text, consider the following rules:
she    s++;
he     h++;
\n     ;
.      ;
The last two rules ignore everything other than the two strings of interest. However, because `she' includes `he', the analyzer does not recognize the instances of `he' that are included within `she'.
A special action,
REJECT
, is provided
to override this behavior.
This directive tells the analyzer to execute the
rule that contains it and then, before executing the next rule, restore the
position of the input pointer to where it was before the first rule was executed.
For example, to count the instances of `he' that are included within `she',
use the following rules:
she    {s++; REJECT;}
he     {h++; REJECT;}
\n     ;
.      ;
After counting an occurrence of `she', the analyzer
rejects the input stream and then counts the included occurrence of `he'.
In this example, `she' includes `he' but the reverse is not true, and you
can omit the
REJECT
action on `he'.
In other cases, such
as when a wildcard regular expression is being matched, determining which
input characters are in both classes can be difficult.
In general,
REJECT
is useful whenever the purpose
is not to partition the input stream but rather to detect all examples of
some items in the input where the instances of these items can overlap or
include each other.
4.3.2.3 Actions
When the lexical analyzer matches one of the expressions in the rules section of the specification file, it executes the action that corresponds to the expression. Without rules to match all strings in the input stream, the lexical analyzer copies the input to standard output. Therefore, do not create a rule that only copies the input to the output. Use this default output to find conditions not covered by the rules.
When you
use a
lex
-generated analyzer to process input for a parser
that
yacc
produces, provide rules to match all input strings.
Those rules must generate output that
yacc
can interpret.
For information on using
lex
with
yacc
,
see
Section 4.5.
4.3.2.3.1 Null Action
To ignore the input associated with an
expression, use a semicolon ( ;
), the C
language null statement, as an action.
For example:
[ \t\n] ;
This rule ignores the three spacing characters
(blank, tab, and newline character).
4.3.2.3.2 Using the Same Action for Multiple Expressions
To
use the same action for several different expressions, create a series of
rules (one for each expression except the last) whose actions consist of only
a vertical bar character
( |
).
For the last expression, specify the action as you would
usually specify it.
The vertical bar character indicates that the action
for the rule containing it is the same as the action for the next rule.
For
example, to ignore blank, tab, and newline characters (shown in
Section 4.3.2.3.1),
you could use the following set of rules:
" " | "\t" | "\n" ;
The quotation marks around the special character
sequences (\n
and
\t
) in this example
are not mandatory.
4.3.2.3.3 Printing a Matched String
To find out what text matched an expression in the rules section
of the specification file, include a C language
printf
function as one of the actions for that expression.
When the lexical analyzer
finds a match in the input stream, the program puts that matched string in
an external character array, called
yytext
.
To print the matched string, use a rule like the following:
[a-z]+ printf("%s", yytext);
Printing the output in this way is common.
You can define an expression
like this
printf
statement as a macro in the definitions
section of the specification file.
If this action is defined as
ECHO
, then the rules section entry looks like the following:
[a-z]+ ECHO;
See
Section 4.3.1
for information on defining macros.
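For example, if your lex implementation does not already predefine ECHO (many versions do), one sketch of such a definition is a C macro placed between %{ and %} delimiters in the definitions section:
%{
#define ECHO printf("%s", yytext)
%}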
4.3.2.3.4 Finding the Length of a Matched String
To
find the number of characters that the lexical analyzer matched for a particular
expression, use the external variable
yyleng
.
For example,
the following rule counts both the number of words and the number of characters
in words in the input:
[a-zA-Z]+ {words++; chars += yyleng;}
This action totals the number of characters in the words
matched and assigns that value to the
chars
variable.
The following expression finds the last character in the string matched:
yytext[yyleng-1]
4.3.2.3.5 Appending to a Matched String
The lexical analyzer can run out of input before it completely matches an expression in a rules file.
In this case, include a call to the
lex
function
yymore
in the action for that rule.
Usually, the next string from
the input stream overwrites the current entry in
yytext
.
The
yymore
action appends the next string from the input
stream to the end of the current entry in
yytext
.
For example,
consider a language that includes the following syntax:
A string is any set of characters between quotation marks
( " "
).
A backslash ( \
) escapes
the next character to make that character part of the string.
For example, the combination of a backslash and a quotation mark ( \" ) indicates that the quotation mark is part of the string instead of being the closing delimiter for the string.
The following rule processes these lexical characteristics:
\"[^"]* { if (yytext[yyleng-l] == '\\') yymore(); else ... normal user processing }
When this lexical analyzer receives a string such
as "abc\"def" (with the quotation marks exactly as shown), it first matches
the first five characters, "abc\.
The backslash causes a call to
yymore
to add the next part of the string, "def, to the end.
The
part of the action code labeled "normal user processing" must
process the quotation mark that ends the string.
4.3.2.3.6 Returning Characters to the Input
In some cases the lexical analyzer does not need all of the characters that are matched by the currently successful regular expression; or it might need to return matched characters to the input stream to be checked again for another match.
To return characters to the input stream, use the
yyless(
n)
call, where
n
is the number of characters of the current string that you want to keep.
Characters beyond the
nth character in the stream
are returned to the input stream.
This function provides the same type of
look-ahead that the slash operator ( /
)
uses, but
yyless
allows more control over the look-ahead.
Using
yyless(0)
is equivalent to using
REJECT
.
Use the
yyless
function to process text more than
once.
For example, a C language expression such as
x=-a
is ambiguous.
It could mean
x = -a
, or it could
be an obsolete representation of
x -= a
, which
is evaluated as
x = x - a
.
To
treat this ambiguous expression as
x = -a
and
print a warning message, use a rule such as the following:
=-[a-zA-Z]   {
               printf("Operator (=-) ambiguous\n");
               yyless(yyleng-1);
               ... action for =- ...
             }
4.3.3 Using or Overriding Standard Input/Output Routines
The
lex
program provides a set of input/output
(I/O) routines for the lexical analyzer to use.
Include calls to the following
routines in the C code fragments in your specification file:
input - Returns the next input character
output(c) - Writes the character c on the output
unput(c) - Pushes the character c back onto the input stream to be read later by input
These routines are provided as macro definitions. You can override them by writing your own code for routines of the same names in the user subroutines section. These routines define the relationship between external files and internal characters. If you change them, change them all in the same way. They should follow these rules:
All routines must use the same character set.
The input routine must return a value of 0 to indicate end-of-file.
If you write your own code, you must undefine these macros in the definitions section of the specification file before the code for your own definitions:
%{
#undef input
#undef unput
#undef output
%}
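The following is a sketch of minimal replacement routines for the user subroutines section. It keeps the required contract (one character set throughout, and input returning 0 at end-of-file), but its one-character pushback is an assumption that is weaker than the 100-character backup provided by the standard routines:
#include <stdio.h>

int input()
{
    int c = getchar();

    return (c == EOF) ? 0 : c;   /* a return value of 0 means end-of-file */
}

void output(c)
int c;
{
    putchar(c);
}

void unput(c)
int c;
{
    ungetc(c, stdin);   /* one-character pushback only; the standard
                           routines allow backup of up to 100 characters */
}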
Note
Changing the relationship of
unput
to
input
causes the look-ahead functions not to work.
When you are using a
lex
-generated lexical analyzer
as a simple transformer/recognizer for piping from standard input to standard
output, you can avoid writing the framework by using the
lex
library (libl.a
).
This library contains the
main
routine, which calls the
yylex
function for
you.
The standard
lex
library lets the lexical analyzer
back up a maximum of 100 characters.
If you need to be able
to read an input file containing the
NUL
character (00
hexadecimal), you must create a different version of the
input
routine.
The standard version of
input
returns a value
of 0 when reading a null, and the analyzer interprets this value as indicating
the end of the file.
The lexical analyzers that
lex
generates process
character I/O through the
input
,
output
,
and
unput
routines.
Therefore, to return values in
yytext
, the analyzer uses the character representation that these
routines use.
Internally, however, each character is represented with a small
integer.
With the standard library, this integer is the value of the bit
pattern that the computer uses to represent the character.
Usually, the letter a is represented in the same form as the character constant 'a'.
If you change this interpretation with different
I/O routines, you must include a translation table in the definitions section
of the specification file.
The translation table begins and ends with lines
that contain only the
%T
keyword, and it contains lines
of the following form:
{integer} {character string}
The following example shows table entries that associate the letter A and the digit 0 (zero) with their standard values:
%T
{65}  {A}
{48}  {0}
%T
4.3.4 End-of-File Processing
When the lexical analyzer reaches the end of a file, it calls a library routine named
yywrap
.
This routine returns a value of 1 to indicate to the lexical
analyzer that it should continue with normal wrap-up (operations associated
with the end of processing).
However, if the analyzer receives input from
more than one source, you must change the
yywrap
function.
The new function must get the new input and return a value of 0 to the lexical
analyzer.
A return value of 0 indicates that the program should continue
processing.
Multiple files specified on the command line are treated as a single input file for the purpose of end-of-file handling.
You can also include code to print summary reports and tables in a special
version of
yywrap
.
The
yywrap
function
is the only way to force
yylex
to recognize the end of
the input.
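A sketch of such a yywrap replacement follows; the yyin file pointer follows a common lex library convention, and the second.input file name is a hypothetical example:
#include <stdio.h>

extern FILE *yyin;      /* input file pointer used by most lex analyzers;
                           its name here is an assumption about your library */
static int switched = 0;

int yywrap()
{
    if (!switched) {
        switched = 1;
        yyin = fopen("second.input", "r");   /* hypothetical next source */
        if (yyin != NULL)
            return 0;   /* tell the analyzer to continue processing */
    }
    return 1;           /* continue with normal wrap-up */
}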
4.3.5 Passing Code to the Generated Program
You can define
variables in either the definitions section or the rules section of the specification
file.
When you process a specification file,
lex
changes
statements in the file into a lexical analyzer.
Any line in the specification
file that
lex
cannot interpret is passed unchanged to the
lexical analyzer.
The following four types of entries can be passed to the
lexical analyzer in this manner:
Lines beginning with a blank or tab that are not a part of
a
lex
rule are copied into the lexical analyzer.
If this
entry occurs before the first pair of percent signs ( %%
)
in the specification file, the entry is external to any function in the code.
If the entry occurs after the first
%%
,
it must be a C language program fragment that defines a variable.
You must define these statements before the first
lex
rule
in the specification file.
Lines beginning with a blank or tab that are program comments are included as comments in the generated lexical analyzer. The comments must be in the C language format for comments.
Any lines that lie between lines containing only
%{
and
%}
are
copied to the lexical analyzer.
The symbols
%{
and
%}
are not copied.
Use this format to enter preprocessor statements
that must begin in column 1, or to copy lines that do not look like program
statements.
Any lines occurring after the second pair of percent signs ( %% ) are copied to the lexical analyzer without format restrictions.
4.3.6 Start Conditions
Any rule can be associated with a start condition; the lexical analyzer recognizes that rule only when the analyzer is in that start condition. You can change the current start condition at any time.
You define start conditions in the definitions section of the specification file by using a line with the following format:
%Start name1 [name2 ...]
The
name1
and
name2
symbols represent conditions.
There is no limit to the number
of conditions, and they can appear in any order.
You can abbreviate
Start
to either
S
or
s
.
Start-condition
names cannot be reserved words in C, nor can they be declared as the names
of variables, fields, and so on.
When using a start condition in the rules section of the specification
file, enclose the name of the start condition in angle brackets ( < >
) at the beginning of the rule.
The following
format defines a rule with a start condition:
<name1[,name2 ...]> expression
The lexical analyzer recognizes
expression
only when the analyzer is in the condition corresponding to one of the names.
To put
lex
in a particular
start condition, execute the following action statement (in the action part
of a rule):
BEGIN name;
This statement changes the start condition to name. To resume the normal state, use the following action:
BEGIN 0;
As shown in the preceding syntax diagram, a rule can be active in several start conditions. For example:
<start1,start2,start3> [0-9]+ printf("integer");
This rule prints integer only if it finds an integer while
in one of the three named start conditions.
Any rule that does not begin
with a start condition is always active.
4.4 Generating a Lexical Analyzer
Generating a
lex
-based lexical analyzer program is a two-step process, as follows:
Run
lex
to change the specification file
into a C language program.
The resulting program is in a file named
lex.yy.c
.
Process
lex.yy.c
with the
cc -ll
command to compile the program and link it with a library of
lex
subroutines.
The resulting executable program is named
a.out
.
For example, if the
lex
specification file is called
lextest
, enter the following commands:
%
lex lextest
%
cc lex.yy.c -ll
Although the default
lex
I/O routines use the C language
standard library, the lexical analyzers that
lex
generates
do not require them.
You can include different copies of the
input
,
output
, and
unput
routines
to avoid using those in the library.
(See
Section 4.3.3.)
Table 4-2
describes the options for the
lex
command.
Table 4-2: Options for the lex Command
Option | Description |
-n |
Suppresses the statistics summary that is
produced by default when you set your own table sizes for the finite state
machine.
See the
lex(1)
reference page for information about specifying
the state machine. |
-t |
Writes the generated lexical analyzer code
to standard output instead of to the
lex.yy.c
file. |
-v |
Provides a one-line summary of the general finite state machine statistics. |
Because
lex
uses fixed names for intermediate and output files, you can
have only one
lex
-generated program
in a given directory unless you use the
-t
option
to specify an alternative file name.
4.5 Using lex with yacc
When
used alone, the
lex
tool creates a lexical analyzer that
recognizes simple one-word input or receives statistical input.
You can also
use
lex
with a parser generator, such as
yacc
.
The
yacc
tool generates a program, called a
parser, that analyzes the construction of multiple-word input.
This parser
program operates well with lexical analyzers that
lex
generates;
these lexical analyzers recognize only regular expressions and format them
into character packages called
tokens.
A token
is
the smallest independent unit of meaning as defined by either the parser or
the lexical analyzer.
A token can contain data, a language keyword, an identifier,
or other parts of a language syntax.
A token can be any string of characters;
it can be part or all of a word or series of words.
The
yacc
tool produces parsers that recognize many types of grammar with no regard
to context.
These parsers need a preprocessor, such as a
lex
-generated
lexical analyzer, to recognize input tokens.
When a
lex
-generated lexical analyzer is used as
the preprocessor for a
yacc
-generated
parser, the lexical analyzer partitions the input stream.
The
parser assigns structure to the resulting pieces.
Figure 4-2
shows how
lex
and
yacc
generate programs
and how the programs work together.
You can also use other programs along
with those generated by
lex
or
yacc
.
Figure 4-2: Producing an Input Parser with lex and yacc
The parser program requires that its preprocessor (the lexical analysis
function) be named
yylex
.
This is the name
lex
gives to the analysis code in a lexical analyzer it generates.
If a lexical analyzer is used by itself, the default
main
program in the
lex
library calls the
yylex
routine, but if a
yacc
-generated parser is loaded and its
main
program
is used, the parser calls
yylex
.
In this case, each
lex
rule
should end with the following line, where the appropriate token value is returned:
return(token);
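For example, a rule for a hypothetical NUMBER token declared in the grammar might pass the converted value to the parser through the external variable yylval (see Section 4.6) and return the token name:
[0-9]+    { yylval = atoi(yytext); return(NUMBER); }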
To find the names for tokens that
yacc
uses, compile the lexical analyzer (the
lex
output file)
as part of the parser (the
yacc
output file) by placing
the following line in the last section of the
yacc
grammar
file:
#include "lex.yy.c"
Alternatively,
you can include the
yacc
output (the
y.tab.h
file) into your
lex
program specification file, and use
the token names that
y.tab.h
defines.
For example, if
the grammar file is named
good
and the specification file
is named
better
, the following command sequence creates
the final program:
%
yacc good
%
lex better
%
cc y.tab.c -ly -ll
To get a
main
program that invokes the
yacc
parser, load the
yacc
library (-ly
in the preceding example) before the
lex
library.
You
can generate
lex
and
yacc
programs in
either order.
4.6 Creating a Parser with yacc
To generate a parser with
yacc
, you must write
a grammar file that describes the input data stream and what the parser is
to do with the data.
The grammar file includes rules describing the input
structure, code to be invoked when these rules are recognized, and a routine
to do the basic input.
The
yacc
tool uses the information in the grammar
file to generate
yyparse
, a program that controls the input
process.
This is the parser that calls the
yylex
input routine (the lexical analyzer) to pick up tokens from
the input stream.
The parser organizes these tokens according to the structure
rules in the grammar file.
The structure rules are called grammar rules.
When the parser recognizes a grammar rule, it executes the user code (action)
supplied for that rule.
Actions return values and use
the values returned by other actions.
In addition to the specifications that
yacc
recognizes
and uses, the grammar file can also contain the following functions:
main
-
A C language function that contains, as a minimum, a call to the
yyparse
function, which
yacc
generates.
A limited
form of this function is in the
yacc
library.
yyerror
-
A C language function to handle errors that can occur during parser operation.
A limited form of this function is in the
yacc
library.
yylex
- A C
language function to perform lexical analysis on the input stream and pass
tokens (with values, if required) to the parser.
The function must return
an integer that represents the kind of token that was read.
The integer is
called the token number.
In addition, if a value is associated with the token,
the lexical analyzer must assign that value to the external variable
yylval
.
See
Section 4.7.1.3
for more information on token numbers.
To build
a lexical analyzer that works well with the parser that
yacc
generates, use the
lex
tool (see
Section 4.3).
The
yacc
tool processes a grammar file to generate
a file of C language functions and data, named
y.tab.c
.
When compiled
using the
cc
command, these functions form a combined function
named
yyparse
.
This
yyparse
function calls
yylex
, the lexical analyzer, to get input tokens.
The analyzer continues providing input until the parser detects an error
or the analyzer
returns
an
endmarker
token to indicate the end of the operation.
If an error occurs and
yyparse
cannot recover,
yyparse
returns a value of 1 to the
main
function.
If it finds the
endmarker
token,
yyparse
returns a value of 0 to
main
.
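A minimal sketch of a main function that acts on these return values:
#include <stdio.h>

main()
{
    if (yyparse() == 0)
        printf("input accepted\n");
    else
        printf("errors found in input\n");
}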
Use the C programming language to write the action code and other subroutines.
The
yacc
program uses many of the C language syntax conventions
for the grammar file.
4.6.1 The main and yyerror Functions
You must provide
function routines named
main
and
yyerror
in the grammar file.
To ease the initial effort of using
yacc
,
the
yacc
library provides simple versions of the
main
and
yyerror
routines.
You can include these
routines by using the
-ly
option to the loader or the
cc
command.
The source code
for the
main
library function is as follows:
main()
{
    yyparse();
}
The source code for the
yyerror
library
function follows:
#include <stdio.h>

void yyerror(s)
char *s;
{
    fprintf(stderr, "%s\n", s);
}
The argument to
yyerror
is a string
containing an error message, usually the string "syntax error".
These are very limited programs.
You should provide more sophistication
in these routines, such as keeping track of the input line number and printing
it along with the message when a syntax error is detected.
You can use the
value of the external integer variable
yychar
.
This
variable contains the look-ahead token number at the time the error was detected.
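The following sketch adds that sophistication; the lineno counter is an assumption (your lexical analyzer must increment it at each newline), and yychar is the external variable just described:
#include <stdio.h>

extern int yychar;      /* look-ahead token number when the error was found */
int lineno = 1;         /* assumption: the lexical analyzer increments this
                           each time it reads a newline character */

void yyerror(s)
char *s;
{
    fprintf(stderr, "line %d: %s (look-ahead token number %d)\n",
            lineno, s, yychar);
}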
4.6.2 The yylex Function
The
yylex
program input
routine that you supply must be able to do the following:
Read the input stream
Recognize basic patterns in the input stream
Pass the patterns to
yyparse
along with
tokens that identify them
A token is a symbol or name that tells
yyparse
which
pattern is being sent to it by the input routine.
A symbol can be in one
of the following two classes:
Terminal symbols - Values returned by
yylex
to represent the primitive building blocks of the grammar,
as bricks are the primitive elements of a wall.
Nonterminal symbols - The composite symbols
that are used by the
yacc
grammar to describe more complex
orderings or aggregations of the terminal symbols, as a wall is an assembly
of bricks.
For example, if the lexical analyzer recognizes any numbers, names,
and operators, these elements are taken to be terminal symbols.
Nonterminal
symbols that the
yacc
grammar recognizes are elements like
EXPR
,
TERM
, and
FACTOR
.
Suppose
the input routine separates an input stream into the tokens of
WORD
,
NUMBER
, and
PUNCTUATION
.
Consider the input sentence "I have 9 turkeys." The analyzer
could pass the following strings and tokens to the parser:
String | Token |
I | WORD |
have | WORD |
9 | NUMBER |
turkeys | WORD |
. | PUNCTUATION |
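A hand-written C sketch of an input routine producing these tokens follows; it assumes that WORD, NUMBER, and PUNCTUATION are token names declared in the grammar file and defined in the y.tab.h header described below:
#include <stdio.h>
#include <ctype.h>
#include "y.tab.h"     /* token numbers generated by the yacc -d option */

int yylex()
{
    int c;

    do {                          /* skip white space */
        c = getchar();
    } while (c == ' ' || c == '\t' || c == '\n');

    if (c == EOF)
        return 0;                 /* endmarker: end of the input */
    if (isdigit(c)) {
        while (isdigit(c = getchar()))
            ;
        ungetc(c, stdin);         /* return the extra character */
        return NUMBER;
    }
    if (isalpha(c)) {
        while (isalpha(c = getchar()))
            ;
        ungetc(c, stdin);
        return WORD;
    }
    return PUNCTUATION;
}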
The
yyparse
function must contain
definitions for the tokens that the input routine passes to it.
The
yacc
command's
-d
option causes the program
to generate a list of tokens in a file named
y.tab.h
.
This list is a set of
#define
statements that let
yylex
use the same tokens as the parser.
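For the tokens in the preceding example, the generated y.tab.h file would contain entries of roughly the following form (the exact numbers depend on the order of declaration, starting at 257 as described in Section 4.7.1.3):
#define WORD 257
#define NUMBER 258
#define PUNCTUATION 259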
To avoid conflict with the parser, do not use names that begin with
the letters
yy
.
You can use
lex
to generate
the input routine, or you can write it in the C language.
See
Section 4.3
for information about using
lex
.
4.7 The Grammar File
A
yacc
grammar
file consists of the following three sections:
Declarations
Rules
Programs
Two percent signs
( %%
) that appear together separate the sections of the grammar
file.
To make the file easier to read, put the percent signs on a line by
themselves.
A grammar file has the following format:
[declarations]
%%
rules
[%%
programs]
Except
for the first pair of percent signs ( %%
),
which mark the beginning of the rules, and the rules themselves, all parts
of the grammar file are optional.
The minimum
yacc
grammar
file contains no definitions and no programs, as follows:
%%
rules
Except within names or reserved
symbols, the
yacc
program ignores blanks, tabs, and newline
characters in the grammar file.
You can use these characters to make the
grammar file easier to read.
Do not use blanks, tabs, or newline characters
in names or reserved symbols.
4.7.1 Declarations
The declarations section
of the
yacc
grammar file contains the following elements:
Declarations for any variables or constants used in other parts of the grammar file
The
#include
statements to call in other
files as part of this file (used for library header files)
Statements that define processing conditions for the generated parser
Declarations for variables or constants conform to the syntax of the C programming language, as follows:
type-specifier declarator;
In this syntax, type-specifier is a data type keyword and declarator is the name of the variable or constant. Names can be any length and can consist of letters, dots, underscores, and digits. A name cannot begin with a digit. Uppercase and lowercase letters are distinct. The names used in the body of a grammar rule can represent tokens or nonterminal symbols.
If
you do not declare a name in the declarations section, you can use that name
only as a nonterminal symbol.
Define each nonterminal symbol by using it
as the left side of at least one rule in the rules section.
The
#include
statements are identical to C language syntax and perform
the same function.
The
yacc
tool has a set of keywords, listed in
Table 4-3, that define processing conditions for the generated
parser.
Each of the keywords begins with a percent sign
( %
)
and is followed by a list of tokens.
Table 4-3: Processing-Condition Definition Keywords in yacc
Keyword | Description |
%left |
Identifies tokens that are left-associative with other tokens. |
%nonassoc |
Identifies tokens that are not associative with other tokens. |
%right |
Identifies tokens that are right-associative with other tokens. |
%start |
Identifies a name for the start symbol. |
%token |
Identifies the token names that
yacc
accepts.
Declare all token names in the declarations section. |
All tokens listed on the same line have the same precedence level and association; lines appear in the file in order of increasing precedence or binding strength. For example:
%left '+' '-'
%left '*' '/'
This example describes the precedence
and associativity of the four arithmetic operators.
The addition ( +
) and subtraction ( -
)
operators are left-associative and have lower precedence than the multiplication
( *
) and division ( /
)
operators, which are also left-associative.
4.7.1.1 Defining Global Variables
You can define global variables to be used by some or all parser
actions, as well as by the lexical analyzer, by enclosing the declarations
for those variables in matched pairs of symbols consisting of a percent sign
and a brace
(%{
and
%}
).
For example, to make the
var
variable available to all parts of the complete program, place
the following entry in the declarations section of the grammar file:
%{
int var = 0;
%}
4.7.1.2 Defining the Start Symbol
The parser recognizes a special symbol called the start symbol.
The start symbol is the name assigned to the grammar rule that describes
the most general structure of the language to be parsed.
Because it is the
most general structure, it is the structure where the parser starts in its
top-down analysis of the input stream.
You declare the start symbol in the
declarations section by using the
%start
keyword.
If you
do not declare a start symbol, the parser uses the name of the first grammar
rule in the file.
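For example, the following declaration names a hypothetical rule called program as the start symbol:
%start program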
For example, in parsing a C language procedure, the following is the most general structure for the parser to recognize:
main()
{
    code_segment
}
The start symbol should point to the rule that describes
this structure.
All remaining rules in the file describe ways to identify
lower-level structures within the procedure.
4.7.1.3 Token Numbers
Token numbers are nonnegative integers that represent the names of tokens. Because the lexical analyzer passes the token number to the parser instead of the actual token name, the programs must assign the same numbers to the tokens.
You can assign numbers to the tokens used in the
yacc
grammar file.
If you do not assign numbers to the tokens,
yacc
assigns numbers using the following rules:
A literal character is assigned the numeric value of the character in the ASCII character set.
Other names are assigned token numbers starting at 257.
Note
Do not assign a token number of 0 (zero). This number is assigned to the
endmarker
token. You cannot redefine it.
To assign a number to a token (including literals) in the declarations
section of the grammar file, put a nonzero positive integer immediately after
the token name in the
%token
line.
This integer is the
token number of the name or literal.
Each number must be unique.
Any lexical
analyzer used with
yacc
must return either 0 (zero) or
a negative value for a token when the end of the input is reached.
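For example, the following declarations (with hypothetical token names) assign explicit, unique token numbers:
%token WORD 300
%token NUMBER 301
%token PUNCTUATION 302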
4.7.2 Grammar Rules
The rules section of the
yacc
grammar file contains one or more grammar rules.
Each rule
describes a structure and gives it a name.
A grammar rule has the following
format:
nonterminal-name : BODY ;
In this syntax,
BODY
is a sequence
of zero or more names and literals.
The colon and the
semicolon are required
yacc
punctuation.
If there are several grammar rules with the same nonterminal name, use
the vertical bar
( |
) to avoid rewriting the left side.
In addition, use the
semicolon ( ;
) only at the end of all rules
joined by vertical bars.
The two following sets of grammar rules are equivalent:
Set 1
A : B C D ;
A : E F ;
A : G ;
Set 2
A : B C D
  | E F
  | G
  ;
4.7.2.1 Matching the Null String
To indicate a nonterminal symbol that matches the null string, use a semicolon by itself in the body of the rule, as follows:
nullstr : ;
4.7.2.2 The endmarker Token
When the lexical analyzer reaches the end of the input stream, it sends a special token, called endmarker, to the parser.
This token signals the end of the input and has a token value of 0.
When
the parser receives an
endmarker
token, it checks to see
that it has assigned all of the input to defined grammar rules and that the
processed input forms a complete unit (as defined in the
yacc
grammar file).
If the input is a complete unit, the parser stops.
If the
input is not a complete unit, the parser signals an error and stops.
The lexical analyzer must send the
endmarker
token
at the correct time, such as the end of a file, or the end of a record.
4.7.2.3 Actions in yacc Parsers
With each grammar rule, you can specify actions to be performed each time the parser recognizes the rule in the input stream. Actions return values and obtain the values returned by previous actions. The lexical analyzer can also return values for tokens.
An action is a C language statement that does input and output, calls
subprograms, and alters external vectors and variables.
You specify an action
in the grammar file with one or more statements enclosed in braces ( { }
).
For example, the following are grammar rules
with actions:
A : '(' B ')'
        { hello(1, "abc"); };

XXX : YYY ZZZ
        { printf("a message\n");
          flag = 25; }
An action can receive values generated by other actions
by using numbered
yacc
parameter
keywords ($1
,
$2
, and so on).
These
keywords refer to the values returned by the components of the right side
of a rule, reading from left to right.
For example:
A : B C D ;
When this code is executed,
$1
has the value returned by the rule that recognized
B
,
$2
the value returned by the rule that recognized
C
,
and
$3
the value returned by the rule that recognized
D
.
To return a value, the action sets the pseudovariable
$$
to some value.
For example, the following action returns a value of 1:
{ $$ = 1;}
By
default, the value of a rule is the value of the first element in it ( $1
).
Therefore, you do not need to provide actions for rules
that have the following form:
A : B ;
To get control of the parsing
process before a rule is completed, write an action in the middle of a rule.
If this rule returns a value through the
$
n
parameters, actions that come after it can use that value.
The action can use values returned by actions that come before it.
Therefore,
the following rule sets
x
to 1 and
y
to the value returned by
C
:
A : B
        { $$ = 1; }
    C
        { x = $2; y = $3; }
  ;
Internally,
yacc
creates
a new nonterminal symbol name for the action that occurs in the middle, and
it creates a new rule matching this name to the null string.
Therefore,
yacc
treats the preceding program as if it were written in the following
form, where
$ACT
is an empty action:
$ACT : /* null string */
        { $$ = 1; }
     ;

A : B $ACT C
        { x = $2; y = $3; }
  ;
4.7.3 Programs
The programs section of
the
yacc
grammar file contains C language functions that
can be used by the actions in the rules section.
In addition, if you write
a lexical analyzer
(yylex
, the input routine to the parser), include
it in the programs section.
4.7.4 Guidelines for Using Grammar Files
The following sections describe some general guidelines for using
yacc
grammar files.
They provide information on the following:
4.7.4.1 Using Comments
Comments in the grammar file explain what the program is doing.
You
can put comments anywhere in the grammar file that you can put a name.
However,
to make the file easier to read, put the comments on lines by themselves at
the beginning of functional blocks of rules.
Comments in a
yacc
grammar file have exactly the same form as comments in a C language
program; that is, they begin with a slash and an asterisk ( /*
) and end with an asterisk and a slash ( */
).
For example:
/* This is a comment on a line by itself. */
4.7.4.2 Using Literal Strings
A literal string is one or more characters
enclosed in apostrophes, or single quotation marks ( ' '
).
As in the C language, the backslash ( \
) is an escape character within literals, and all the C
language special-character sequences are recognized, as follows:
\n |
Newline character |
\r |
Return |
\' |
Apostrophe, or single quote |
\\ |
Backslash |
\t |
Tab |
\b |
Backspace |
\f |
Form feed |
\nnn |
The value nnn in octal |
Never
use
\0
or
0
(the null character) in
grammar rules.
4.7.4.3 Guidelines for Formatting the Grammar File
The following guidelines will
help make the
yacc
grammar file more readable:
Use uppercase letters for token names and lowercase letters for nonterminal symbol names.
Put grammar rules and actions on separate lines to allow for changing either one without changing the other.
Put all rules with the same left side together.
Enter the
left side once and use vertical bars ( |
)
to begin the rest of the rules for that left side.
For each set of rules with the same left side, enter the semicolon
( ;
) once on a line by itself following
the last rule for that left side.
You can then add new rules easily.
Indent rule bodies by two tab stops and action bodies by three tab stops.
4.7.4.4 Using Recursion in a Grammar File
Recursion is the process of using a function to define itself. In language definitions, these rules usually take the following form:
rule : end case
     | rule, end case
The simplest case of
rule
is the end case, but
rule
can also be made up of more than one occurrence of
end case.
The entry in the second line that uses
rule
in the definition of
rule
is the instance of
recursion.
Given this rule, the parser cycles through the input until the
stream is reduced to the final end case.
The
yacc
tool supports left-recursive grammar, not
right-recursive.
When you use recursion in a rule, always put the call to
the name of the rule as the leftmost entry in the rule (as it is in the preceding
example).
If the call to the name of the rule occurs later in the line, as
in the following example, the parser can run out of internal stack space and
crash:
rule : end case
     | end case, rule
4.7.4.5 Errors in the Grammar File
The
yacc
tool cannot produce
a parser for all sets of grammar specifications.
If the grammar rules contradict
themselves or require matching techniques different from those that
yacc
has,
yacc
will not produce a parser.
In
most cases,
yacc
provides messages to indicate the errors.
To correct these errors, redesign the rules in the grammar file or provide
a lexical analyzer to recognize the patterns that
yacc
cannot handle.
4.7.5 Error Handling by the Parser
When the parser reads an input stream, that input stream can fail to match the rules in the grammar file. If there is an error-handling routine in the grammar file, the parser can allow for entering the data again, skipping over the bad data, or for a cleanup and recovery action. When the parser finds an error, for example, it might need to reclaim parse tree storage, delete or alter symbol table entries, and set switches to avoid generating any further output.
When an error occurs, the parser stops unless you provide error-handling routines. To continue processing the input to find more errors, restart the parser at a point in the input stream where the parser can try to recognize more input. One way to restart the parser when an error occurs is to discard some of the tokens following the error, and try to restart the parser at that point in the input stream.
The
yacc
tool has a special token name,
error
, to use for error handling.
Put this
token in your grammar file at places where an input error might occur so that
you can provide a recovery routine.
If an input error occurs in a position
protected by the
error
token, the parser executes the action
for the
error
token rather than the normal action.
To prevent a single error from producing many error messages, the parser
remains in an error state until it successfully processes three tokens following
an error.
If another error occurs while the parser is in the error state,
the parser discards the input token and does not produce a message.
You can
also specify a point at which the parser should resume processing by providing
an argument to the
error
action.
For example:
stat : error ';'
This rule tells the parser
that, when there is an error, it should skip over the token and all following
tokens until it finds the next semicolon.
All tokens after the error and
before the next semicolon are discarded.
When the parser finds the semicolon,
it reduces this rule and performs any cleanup action associated with it.
4.7.5.1 Providing for Error Correcting
You can allow the person entering the input stream in an interactive environment to correct any input errors by reentering a line in the data stream. For example:
input : error '\n'
        { printf("Reenter last line: "); }
    input
        { $$ = $4; }
  ;
In the previous example the parser stays in the
error state for three input tokens following the error.
If an error is encountered
in the first three tokens, the parser deletes the tokens and does not display
a message.
Use the
yyerrok;
statement for recovery.
When
the parser encounters the
yyerrok;
statement, it leaves
the error state and begins normal processing.
Following is the recovery example:
input : error '\n'
        { yyerrok;
          printf("Reenter last line: "); }
    input
        { $$ = $4; }
  ;
4.7.5.2 Clearing the Look-Ahead Token
The look-ahead token is the next token to
be examined by the parser.
When an error occurs, the look-ahead token becomes
the token at which the error was detected.
However, if the error recovery
action includes code to find the correct place to start processing again,
that code must also change the look-ahead token.
To clear the look-ahead
token, include the
yyclearin;
statement in the error recovery
action.
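A sketch that combines these statements follows; resynch is a hypothetical user routine that advances the input to a point where parsing can safely resume:
stat : error
        { resynch();    /* hypothetical: skip input to a safe point */
          yyerrok;      /* leave the error state */
          yyclearin; }  /* discard the old look-ahead token */
    ;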
4.8 Parser Operation
The
yacc
program turns the grammar file into a C
language program that, when compiled and executed, parses the input according
to the grammar rules.
The parser is a finite state machine with a stack. The parser can read and remember the next input token (the look-ahead token). The current state is always the state that is on the top of the stack. The states of the finite state machine are represented by small integers. Initially, the machine is in state 0 (zero), the stack contains only 0 (zero), and no look-ahead token has been read.
The machine can perform one of the following four actions:
shift
n |
The parser pushes the current state onto the stack, makes n the current state, and clears the look-ahead token. |
reduce
r |
The r argument is a rule number. When the parser finds a token sequence matching rule number r in the input stream, it replaces that sequence with the nonterminal symbol on the left side of that rule. |
accept |
The parser has looked at all input, matched it to the grammar specification, and recognized the input as satisfying the highest level structure (defined by the start symbol). This action appears only when the look-ahead token is the end marker and indicates that the parser has successfully done its job. |
error |
The parser cannot continue processing the input stream and still successfully match it with any rule defined in the grammar specification. The input tokens it looked at, together with the look-ahead token, cannot be followed by anything that would result in a legal input. The parser reports an error and attempts to recover the situation and resume parsing. |
The parser performs the following actions during one process step:
Based on its current state, the parser decides whether it
needs a look-ahead token to decide the action to take.
If it needs one and
does not have one, it calls
yylex
to obtain the next token.
Using the current state and the look-ahead token if needed, the parser decides on its next action and carries it out. This can result in states being pushed onto the stack or popped off the stack and in the look-ahead token being processed or left alone.
4.8.1 The shift Action
The
shift
action is the most common action
the parser takes.
Whenever the parser does a
shift
, there
is always a look-ahead token.
Consider the following parser action rule:
IF shift 34
When the parser is in the state that
contains this rule and the look-ahead token is
IF
, the
parser performs the following steps:
Pushes the current state down on the stack
Makes state 34 the current state (puts it on the top of the stack)
Clears the look-ahead token
4.8.2 The reduce Action
The
reduce
action prevents the stack from growing
too large.
The parser uses reducing actions after it has matched the right
side of a rule with the input stream and is ready to replace the tokens in
the input stream with the left side of the rule.
The parser might have to
use the look-ahead token to decide if the pattern is a complete match.
Reducing actions are associated with individual grammar rules.
Because
grammar rules also have small integer numbers, you can easily confuse the
meanings of the numbers in the
shift
and
reduce
actions.
For example, the first of the two following actions refers
to grammar rule 18; the second refers to machine state 34:
reduce 18
IF shift 34
For example, consider reducing the following rule:
A : x y z ;
The parser pops off the top three states from
the stack.
The number of states popped equals the number of symbols on the
right side of the rule.
These states are the ones put on the stack while
recognizing
x
,
y
, and
z
.
After popping these states, the parser uncovers the state the parser was in
before beginning to process the rule (the state that needed to recognize rule
A
to satisfy its rule).
Using this uncovered state and the symbol
on the left side of the rule, the parser performs a
goto
action, which is similar to a
shift
of
A
.
A new state is obtained and pushed onto the stack, and parsing continues.
The
goto
action is different from an ordinary
shift
of a token.
The look-ahead token is cleared by a
shift
but is not affected by a
goto
.
When the
three states are popped in this example, the uncovered state contains an entry
such as the following:
A goto 20
This entry causes state 20 to be pushed onto the stack and become the current state.
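The following fragment is a minimal C sketch of this reduce-then-goto sequence for the rule A : x y z. It is illustrative only; the goto_table routine is a hypothetical stand-in for the goto entries that yacc generates:

#include <stdio.h>

#define MAXDEPTH 150
static int stack[MAXDEPTH] = { 0 };  /* state 0 on the bottom */
static int top;

/* Stub: the uncovered state plus the left-side symbol selects the new state. */
static int goto_table(int state, char lhs)
{
    return 20;                       /* e.g., the entry "A goto 20" */
}

static void reduce(int rhs_len, char lhs)
{
    int uncovered;

    top -= rhs_len;                  /* pop one state per right-side symbol */
    uncovered = stack[top];          /* state in effect before the rule began */
    stack[++top] = goto_table(uncovered, lhs);   /* push the goto state */
    /* Unlike a shift, this leaves the look-ahead token untouched. */
}

int main(void)
{
    /* Suppose states 8, 11, and 13 were pushed while recognizing x, y,
     * and z, and state 5 was current before the rule began. */
    stack[1] = 5; stack[2] = 8; stack[3] = 11; stack[4] = 13;
    top = 4;

    reduce(3, 'A');                  /* three symbols on the right side */
    printf("current state: %d\n", stack[top]);   /* prints 20 */
    return 0;
}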
The
reduce
action is also important in the treatment
of user-supplied actions and values.
When a rule is reduced, the parser executes
the code that you included in the rule before adjusting the stack.
In addition
to the stack holding the states, another stack running in parallel with it
holds the values returned from the lexical analyzer and the actions.
When
a
shift
takes place, the external variable
yylval
is
copied onto the value stack.
After executing the code that you provide, the
parser performs the reduction.
When the parser performs the goto action, it copies the external variable yyval onto the value stack.
The yacc variables whose names begin with a dollar sign ($) refer to the value stack.
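For example, in the calculator grammar shown later in this chapter (Example 4-1), the rule for addition uses these variables to take its operands from the value stack and leave its result there:

expr : expr '+' expr
         { $$ = $1 + $3; }
     ;

In the action, $1 and $3 refer to the values saved for the two operand expressions ($2 would refer to the value saved for the '+' token itself), and the value assigned to $$ is pushed onto the value stack when the rule is reduced.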
4.8.3 Ambiguous Rules and Parser Conflicts
A set of grammar rules is ambiguous if any possible input string can be structured in two or more different ways. For example:
expr : expr '-' expr
This rule forms an arithmetic expression by putting two other expressions together with a minus sign between them, but this grammar rule does not specify how to structure all complex inputs. For example:
expr - expr - expr
Using the preceding rule, a program could structure this input as either left-associative or right-associative:
( expr - expr ) - expr
or
expr - ( expr - expr )
These two forms produce different results when evaluated.
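For example, substituting 9, 5, and 2 for the three expressions, the left-associative grouping ( 9 - 5 ) - 2 yields 2, while the right-associative grouping 9 - ( 5 - 2 ) yields 6.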
When the parser tries to handle an ambiguous rule, it can become confused over which of its four actions to perform when processing the input. The following two types of conflicts develop:
Shift/reduce conflict - A rule can be evaluated correctly using either a shift action or a reduce action, with different results.

Reduce/reduce conflict - A rule can be evaluated correctly using one of two different reduce actions, with different results.
A
shift/shift
conflict is not possible.
These conflicts result when a rule is not as complete as it could be. For example, consider the following input and the preceding ambiguous rule:
a - b - c
After reading the first three parts of the input, the parser has the following:
a - b
This input matches the right side of the grammar rule. The parser can reduce the input by applying this rule. After applying the rule, the input becomes the following:
expr
This is the left side of the rule. The parser then reads the final part of the input, as follows:
- c
The parser now has the following:
expr - c
Reducing this input produces a left-associative interpretation.
However, the parser can also look ahead in the input stream. If, after receiving the first three parts of the input, it continues reading the input stream until it has the next two parts, it then has the following input:
a - b - c
Applying the rule to the rightmost three
parts reduces
b - c
to
expr.
The parser then has the following:
a - expr
Reducing the expression once more produces a right-associative interpretation.
Therefore, at the point where the parser has read the first three parts,
it can take one of two legal actions: a
shift
or a
reduce
.
If the parser has no rule by which to decide between the
actions, a
shift/reduce
conflict results.
A similar situation occurs if the parser can choose between two valid
reduce
actions.
That situation is called a
reduce/reduce
conflict.
When
shift/reduce
or
reduce/reduce
conflicts occur,
yacc
produces a parser by selecting a
valid step wherever it has a choice.
If you do not provide a rule to make
the choice,
yacc
uses the following rules:
In a
shift/reduce
conflict, shift.
In a reduce/reduce conflict, reduce by the grammar rule that appears earlier in the grammar specification.
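In many grammars you can remove the ambiguity explicitly rather than relying on these defaults. For example, declaring the operators to be left-associative, as the calculator grammar in Example 4-1 does, tells yacc to resolve each such shift/reduce conflict in favor of the left-associative (reduce) interpretation:

%left '+' '-'        /* '+' and '-' group from the left */
%%
expr : expr '-' expr
     ;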
Using actions within rules can cause conflicts if the action must be
done before the parser can be sure which rule is being recognized.
In these
cases, using the preceding rules leads to an incorrect parser.
For this reason,
yacc
reports the number of
shift/reduce
and
reduce/reduce
conflicts that it has resolved by applying its rules.
4.9 Turning on Debug Mode
For normal operation, the external integer
variable
yydebug
is set to 0.
However, if you set it to
any nonzero value, the parser generates a running description of the input
tokens that it receives and the actions that it takes for each token.
You
can set the
yydebug
variable in one of the following two
ways:
Set the
yydebug
variable by including the
following C language statement in the declarations section of the
yacc
grammar file:
yydebug = 1;
Use a debugger to execute the final parser, and set the
yydebug
variable on or off using debugger commands.
For further
details about using debuggers, such as
dbx
, see the reference
pages for the various debuggers.
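The following fragment sketches the first approach, using the calculator's main routine from Example 4-1 as a model. This is a sketch only: the extern declaration assumes the generated parser defines yydebug, and some implementations compile in the trace code only when the YYDEBUG macro is defined (for example, by compiling y.tab.c with -DYYDEBUG):

%{
extern int yydebug;        /* defined in the yacc-generated parser */
%}
%%
        /* ... grammar rules ... */
%%
main()
{
    yydebug = 1;           /* nonzero: trace each token and parser action */
    return(yyparse());
}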
4.10 Creating a Simple Calculator Program
You
can use the programs for a
lex
-generated lexical analyzer
and a
yacc
-generated parser, shown in
Example 4-1
and
Example 4-2, to create a simple desk calculator program
that performs addition, subtraction, multiplication, and division operations.
The calculator program also lets you assign values to variables (each designated
by a single lowercase letter) and then use the variables in calculations.
The files that contain the programs are as follows:
calc.l
- The
lex
specification file that defines the lexical analysis rules
calc.y
- The
yacc
grammar file that defines the parsing rules, and calls the
yylex
function created by
lex
to provide input
By convention,
lex
and
yacc
programs
use the letters
.l
and
.y
respectively
as file name suffixes.
Example 4-1
and
Example 4-2
contain the program fragments exactly as they should be entered.
The following processing instructions assume that the files are in your
current directory; perform the steps in the order shown to create the calculator
program using
lex
and
yacc
:
Process the
yacc
grammar file by using
the following command.
The
-d
option tells
yacc
to create a file that defines the tokens it uses in addition
to the C language source code.
% yacc -d calc.y
This command creates the following files:
y.tab.c
- The C language source file
that
yacc
created for the parser
y.tab.h
- A header file containing
#define
statements for the tokens used by the parser
Process the
lex
specification file by using
the following command:
% lex calc.l
This command creates the
lex.yy.c
file, containing the C language source file that
lex
created for the lexical analyzer.
Compile and link the two C language source files by using the following command:
% cc -o calc y.tab.c lex.yy.c
Use the
ls
command to verify that the following
files were created:
y.tab.o
- The object file for
y.tab.c
lex.yy.o
- The object file for
lex.yy.c
calc
- The executable program file
You can run the program by entering the
calc
command.
You can then enter numbers and operators in algebraic fashion.
After you
press Return, the program displays the result of the operation.
You can assign
a value to a variable as follows:
m=4
You can use variables in calculations as follows:
m+5
9
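A complete exchange might look like the following sample session, where the lines 9 and 7 are the program's responses:

% calc
m=4
m+5
9
2*3+1
7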
4.10.1 Parser Source Code
Example 4-1
shows the contents of the
calc.y
file.
This file has entries in all three of the sections of a
yacc
grammar file: declarations, rules, and programs.
The grammar
defined by this file supports the usual algebraic hierarchy of operator precedence.
Descriptions of the various elements of the file and their functions
follow the example.
Example 4-1: Parser Source Code for a Calculator
%{
#include <stdio.h>                         [1]
int regs[26];                              [2]
int base;
%}

%start list                                [3]
%token DIGIT LETTER                        [4]
%left '|'                                  [5]
%left '&'
%left '+' '-'
%left '*' '/' '%'
%left UMINUS    /* supplies precedence for unary minus */

%%      /* beginning of rules section */

list:   /* empty */
        | list stat '\n'
        | list error '\n'
          { yyerrok; }
        ;
stat:   expr
          { printf("%d\n",$1); }
        | LETTER '=' expr
          { regs[$1] = $3; }
        ;
expr:   '(' expr ')'
          { $$ = $2; }
        | expr '*' expr
          { $$ = $1 * $3; }
        | expr '/' expr
          { $$ = $1 / $3; }
        | expr '%' expr
          { $$ = $1 % $3; }
        | expr '+' expr
          { $$ = $1 + $3; }
        | expr '-' expr
          { $$ = $1 - $3; }
        | expr '&' expr
          { $$ = $1 & $3; }
        | expr '|' expr
          { $$ = $1 | $3; }
        | '-' expr %prec UMINUS
          { $$ = -$2; }
        | LETTER
          { $$ = regs[$1]; }
        | number
        ;
number: DIGIT
          { $$ = $1; base = ($1==0) ? 8 : 10; }
        | number DIGIT
          { $$ = base * $1 + $2; }
        ;
%%
main()
{
    return(yyparse());
}
yyerror(s)
char *s;
{
    fprintf(stderr," %s\n",s);
}
yywrap()
{
    return(1);
}
The declarations section contains entries that perform the following functions:
[1] Include the standard I/O header file
[2] Define global variables
[3] Define the rule list as the place to start processing
[4] Define the tokens used by the parser
[5] Define the operators and their precedence
The rules section defines the rules that parse the input stream.
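One detail worth noting is how the number rule assembles multidigit numbers one DIGIT at a time, with its first action choosing the base (a leading 0 selects octal):

number: DIGIT
          { $$ = $1; base = ($1==0) ? 8 : 10; }
        | number DIGIT
          { $$ = base * $1 + $2; }
        ;

For the input 17, the first reduction sets base to 10 and the second computes (10 * 1) + 7, or 17. For the input 017, the leading zero sets base to 8, so the result is (8 * ((8 * 0) + 1)) + 7, or 15.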
The programs section contains the following routines.
Because these
routines are included in this file, you do not need to use the
yacc
library when processing this file.
main( )
- The required main program
that calls
yyparse( )
to start the program
yyerror(s)
- The error-handling routine,
which prints a syntax error message
yywrap( )
- The wrap-up routine that
returns a value of 1 when the end of input occurs
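If you leave these routines out of the grammar file, you can usually obtain default versions by linking with the yacc and lex libraries instead. The library names can vary from system to system, but a typical command is:

% cc -o calc y.tab.c lex.yy.c -ly -ll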
4.10.2 Lexical Analyzer Source Code
Example 4-2
shows the contents of the
calc.l
file.
This file contains
#include
statements
for standard input and output and for the
y.tab.h
file, which is
generated by
yacc
before you run
lex
on
calc.l
.
The
y.tab.h
file defines
the tokens that the parser program uses.
Also,
calc.l
defines
the rules to generate the tokens from the input stream.
Example 4-2: Lexical Analyzer Source Code for a Calculator
%{
#include <stdio.h>
#include "y.tab.h"
int c;
extern int yylval;
%}
%%
" "     ;
[a-z]   {
            c = yytext[0];
            yylval = c - 'a';
            return(LETTER);
        }
[0-9]   {
            c = yytext[0];
            yylval = c - '0';
            return(DIGIT);
        }
[^a-z0-9\b] {
            c = yytext[0];
            return(c);
        }
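The first rule discards blanks. The next two rules return the LETTER and DIGIT tokens defined in y.tab.h, setting yylval to the letter's index (0 through 25) or the digit's value. The final rule returns any other character, such as an operator or the newline, to the parser as a literal single-character token. For example, the input line m=4 reaches the parser as the token sequence LETTER (with yylval set to 12), '=', DIGIT (with yylval set to 4), and the newline character.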