flex(1)
NAME
flex - Generates a C Language lexical analyzer
SYNOPSIS
flex [-bcdfinpstvFILT8] -C[efmF] [-Sskeleton] [file...]
OPTIONS
-b Generates backtracking information to lex.backtrack. This is a list of
scanner states that require backtracking and the input characters on
which they do so. By adding rules you can remove backtracking states.
If all backtracking states are eliminated and -f or -F is used, the
generated scanner will run faster.
-d Makes the generated scanner run in debug mode. Whenever a pattern is
recognized and the global yy_lex_debug is nonzero (which is the
default), the scanner writes to stderr a line of the form:
--accepting rule at line 53 ("the matched text")
The line number refers to the location of the rule in the file defining
the scanner (the input to flex). Messages are also generated when the
scanner backtracks, accepts the default rule, reaches the end of its
input buffer (or encounters a NUL), or reaches an End-of-File.
-f Specifies full table (no table compression is done). The result is
large but fast. This option is equivalent to -Cf.
-i Instructs flex to generate a case-insensitive scanner. The case of
letters given in the flex input patterns will be ignored, and tokens in
the input will be matched regardless of case. The matched text given
in yytext will have the original case (as read by the scanner).
-p Generates a performance report to stderr. This identifies features of
the flex input file that will cause a loss of performance in the
resulting scanner.
-s Causes the default rule (that unmatched scanner input is echoed to
stdout) to be suppressed. If the scanner encounters input that does
not match any of its rules, it aborts with an error.
-t Instructs flex to write the scanner it generates to standard output
instead of lex.yy.c.
-v Specifies that flex should write to stderr a summary of statistics
regarding the scanner it generates.
-F Specifies that the fast scanner table representation should be used.
This representation is about as fast as the full table representation
(-f), and for some sets of patterns will be considerably smaller (and
for others, larger). This option is equivalent to -CF.
-I Instructs flex to generate an interactive scanner; that is, a scanner
that stops immediately rather than looking ahead if it knows that the
currently scanned text cannot be part of a longer rule's match. Note,
-I cannot be used in conjunction with full or fast tables; that is, the
-f, -F, -Cf, or -CF options.
-L Instructs flex not to generate #line directives in lex.yy.c. The
default is to generate such directives so error messages in the actions
will be correctly located with respect to the original lex input file.
-T Makes flex run in trace mode. It will generate a lot of messages to
stdout concerning the form of the input and the resultant
nondeterministic and deterministic finite automata. This option is
mostly for use in maintaining flex.
-8 Instructs flex to generate an 8-bit scanner (which is the default).
-C[efmF]
Controls the degree of table compression. The default setting is -Cem,
which provides the highest degree of table compression. Faster
scanners can be obtained at the cost of larger tables, with the
following generally being true:
Slowest and smallest
-Cem
-Cm
-Ce
-C
-C{f,F}e
-C{f,F}
Fastest and largest
The -C options are not cumulative; whenever the option is encountered,
the previous -C settings are forgotten. The -f or -F and -Cm options
do not make sense together; there is no opportunity for meta-
equivalence classes if the table is not being compressed. Otherwise,
the options may be freely mixed.
-C A lone -C specifies that the scanner tables should be compressed
and neither equivalence classes nor meta-equivalence classes should
be used.
-Ce Directs flex to construct equivalence classes; that is, sets of
characters that have identical lexical properties. Equivalence
classes usually give dramatic reductions in the final table/object
file sizes (typically a factor of 2 to 5) and are inexpensive
performance-wise (one array look-up per character scanned).
-Cm Directs flex to construct meta-equivalence classes, which are sets
of equivalence classes (or characters, if equivalence classes are
not being used) that are commonly used together. Meta-equivalence
classes are often a big win when using compressed tables, but they
have a moderate performance impact (one or two "if" tests and one
array look-up per character scanned).
-Cf Specifies that the full scanner tables should be generated; flex
should not compress the tables by taking advantage of similar
transition functions for different states.
-CF Specifies that the alternative fast scanner representation should
be used.
-Sskeleton_file
Overrides the default skeleton file from which flex constructs its
scanners. This is useful for flex maintenance or development.
-c Specifies table-compression options. (Obsolescent)
-n Suppresses the statistics summaries that the -v option typically
generates. (Obsolete)
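The idea behind the equivalence classes built by -Ce can be sketched in C.
This is only an illustration under stated assumptions, not the algorithm flex
uses: the names build_equiv_classes and NCHARS and the flat table layout are
hypothetical. Two byte values receive the same class number exactly when they
belong to the same subset of the character sets used by the patterns, so a
scanner needs only one array look-up (class_of[c]) per character scanned.

```c
#include <stddef.h>

#define NCHARS 256

/*
 * sets holds nsets membership tables laid out back to back: byte c belongs
 * to set s when sets[s * NCHARS + c] is nonzero. On return, class_of[c] is
 * the class number of byte c. The return value is the number of classes.
 * Hypothetical helper for illustration; not part of flex.
 */
int build_equiv_classes(const unsigned char *sets, size_t nsets,
                        int class_of[NCHARS])
{
    int rep[NCHARS];   /* one representative byte per class found so far */
    int nclasses = 0;

    for (int c = 0; c < NCHARS; c++) {
        int found = -1;

        /* Look for an existing class whose representative agrees with c
           on membership in every set. */
        for (int k = 0; k < nclasses && found < 0; k++) {
            int same = 1;
            for (size_t s = 0; s < nsets && same; s++) {
                int in_c = sets[s * NCHARS + c] != 0;
                int in_rep = sets[s * NCHARS + rep[k]] != 0;
                same = (in_c == in_rep);
            }
            if (same)
                found = k;
        }

        /* No agreeing class: byte c starts a new one. */
        if (found < 0) {
            rep[nclasses] = c;
            found = nclasses++;
        }
        class_of[c] = found;
    }
    return nclasses;
}
```

For a scanner whose only character sets are [a-z] and [0-9], this yields
three classes (lowercase letters, digits, everything else), so the transition
tables need 3 columns rather than 256.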
DESCRIPTION
The flex command is a tool for generating scanners: programs which
recognize lexical patterns in text. The flex command reads the given input
files, or its standard input if no filenames are given or if a file operand
is - (dash), for a description of a scanner to generate. The description is
in the form of pairs of regular expressions and C code, called rules. The
flex command generates as output a C source file, lex.yy.c, which defines a
routine yylex(). This file is compiled and linked with the -ll library to
produce an executable. When the executable is run, it scans its input for
occurrences of the regular expressions in its rules, looking for the best
match (longest input). When it has selected a rule, it executes the
associated C code, which has access to the matched input sequence (commonly
referred to as a token). This process then repeats until input is exhausted.
The flex command treats multiple input files as one.
Syntax for Input
This section contains a description of the flex input file, which is
normally named with a .l suffix. The section provides a listing of the
special values, macros, and functions recognized by flex.
The flex input file consists of three sections, separated by a line with
just %% in it:
[ definitions ]
%%
[ rules ]
[ %%
[ user functions ]]
definitions
Contains declarations to simplify the scanner specification, and
declarations of start states which are explained below.
rules
Describes what the scanner is to do.
user functions
Contains user-supplied functions that are copied straight through to
lex.yy.c.
With the exception of the first %% sequence, all sections are optional.
The minimal scanner, %%, copies its input to standard output.
Each line in the definitions section can be:
name regexp
Defines name to expand to regexp. name is a word beginning with a
letter or an underscore (_) followed by zero or more letters, digits,
underscores or dashes (-). In the regular-expression parts of the rules
section, flex substitutes regexp wherever you refer to {name} (name
within braces).
%x state [ state ... ]
%s state [ state ... ]
Defines names for states used in the rules section. A rule may be made
conditionally active based on the current scanner state. Multiple lines
defining states can appear, and each can contain multiple state names,
separated by white space. The name of a state follows the same syntax
as that of regexp names except that dashes ('-') are not permitted.
Unlike regexp names, state names share the C #define namespace. In the
rules section states are recognized as <state> (state within angle
brackets).
The %x directive names exclusive states. When a scanner is in an
exclusive state, only rules prefixed with that state are active.
Inclusive states are named with the %s directive.
%{
%} When placed on lines by themselves, these symbols enclose C code to be
passed verbatim into the global definitions of the output file. Such
lines commonly include preprocessor directives and declarations of
external variables and functions.
space
tab Lines beginning with a space or tab in the definitions section are
passed directly into the lex.yy.c output file, as part of the initial
global definitions.
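The effect of exclusive (%x) and inclusive (%s) states on rule selection can
be sketched in C. This is a simplified model, not code taken from flex: the
struct start_state type and the rule_is_active function are hypothetical
names. It assumes the usual behavior that a rule with no <state> prefix is
active in the initial state and in inclusive states, but not in exclusive
states.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the scanner's current start state. */
struct start_state {
    int id;          /* 0 stands for the initial state */
    bool exclusive;  /* true for a %x state, false for %s or the initial state */
};

/*
 * rule_states lists the state ids named in a rule's <...> prefix;
 * nstates is 0 for a rule that has no prefix.
 */
bool rule_is_active(const int *rule_states, size_t nstates,
                    struct start_state current)
{
    /* An unprefixed rule is switched off only by an exclusive state. */
    if (nstates == 0)
        return !current.exclusive;

    /* A prefixed rule is active only in one of the states it names. */
    for (size_t i = 0; i < nstates; i++) {
        if (rule_states[i] == current.id)
            return true;
    }
    return false;
}
```

Under this model, entering a %x state hides every unprefixed rule, which is
what makes exclusive states convenient for scanning constructs such as
comments or quoted strings with their own small set of rules.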
The rules section follows the definitions, separated by a line consisting
of %%. The rules section contains rules for matching input and taking
actions, in the following format:
pattern [action]
The pattern starts in the first column of the line and extends until the
first non-escaped white space character. The generated scanner attempts to
find the pattern that matches the longest input sequence and executes the
associated action. If two or more patterns match the same input, the one
that appears first in the rules section is chosen. If no action exists, the
matched input is discarded. If no pattern matches the input, the default is
to copy it to standard output.
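The selection policy described above (longest match first, earliest rule on
a tie) can be sketched in C. The function pick_rule is a hypothetical helper
for illustration only, and it assumes that a match length of 0 means the
rule does not match at the current input position.

```c
#include <stddef.h>

/*
 * match_len[i] is the number of input characters rule i would match at the
 * current position (0 meaning no match); rules are listed in the order in
 * which they appear in the rules section. Returns the index of the rule to
 * run, or -1 if no rule matches and the default rule applies.
 */
int pick_rule(const size_t *match_len, size_t nrules)
{
    int best = -1;
    size_t best_len = 0;

    for (size_t i = 0; i < nrules; i++) {
        /* Only a strictly longer match replaces the current choice, so
           the earlier rule wins when two rules match the same length. */
        if (match_len[i] > best_len) {
            best_len = match_len[i];
            best = (int)i;
        }
    }
    return best;
}
```

For example, if a keyword rule and a general identifier rule both match the
same five characters, the one listed first is chosen, which is why keyword
rules are normally placed before the identifier rule.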
All action code is placed in the yylex() function. Text (C code or
declarations) placed at the beginning of the rules section is copied to the
beginning of the yylex() function and may be used in actions. This text
must begin with a space or a tab (to distinguish it from rules). In
addition, any input (beginning with a space or within %{ and %} delimiter
lines) appearing at the beginning of the rules section before any rules are
specified will be written to lex.yy.c after the declarations of variables
for the yylex() function and before the first line of code in yylex().
Elements of each rule are:
state
A pattern may begin with a comma separated list of state names enclosed
by angle brackets (< state [,state...] >). These states are entered
via the BEGIN statement. If a pattern begins with a state, the scanner
can only recognize it when in that state. The initial state is 0
(zero).
regexp
A regular expression to match against the input stream. The regular
expressions in flex provide a rich character matching syntax.
The following characters, shown in order of decreasing precedence, have
special meanings:
x Matches the character x.
" " (double quotes)
Enclose characters and treat them as literal strings. For example,
"*+" is treated as the asterisk character followed by the plus
character.
\str (backslash)
If str is one of the characters a, b, f, n, r, t, or v, then the
ANSI C interpretation is adopted (for example, \n is a newline).
If str is a string of octal digits it is interpreted as a character
with octal value str. If str is a string of hexadecimal digits with
a leading x it is interpreted as a character with that value.
Otherwise, it is interpreted literally with no special meaning. For
example, x\*yz represents the four characters x*yz.
[ ] (brackets)
Represents a character class in the enclosed range ([.-.]) or the
enclosed list ([...]). The dash character is used to define a range
of characters from the ASCII value or the 8-bit class of the
character that comes before it to the ASCII value or the 8-bit
class of the character that follows it. For example, [abcx-z]
matches a, b, c, x, y, or z.
The circumflex, when it appears as the first character in a
character class, indicates the complement of the set of characters
within that class. For example, [^abc] matches any character
except a, b or c, including special characters like newline.
( ) (parentheses)
Groups regular expressions. For example, (ab) will be considered as
a single regular expression.
{ } (braces)
When enclosing numbers, indicates a number of consecutive
occurrences of the expression that comes before it. For example,
(ab){1,5} indicates a match for from 1 to 5 occurrences of the
string ab.
When enclosing a name, the name represents a regular expression
defined in the definitions section. For example, {digit} is
replaced by the defined regular expression for digit. Note that the
expansion takes place as if the definition were enclosed in
parentheses.
. (period)
Matches any single character except newline.
? (question mark)
Matches zero or one of the preceding expressions. For example, ab?c
matches both ac and abc.
* (asterisk)
Matches zero or more of the preceding expressions. For example, a*
is zero or more consecutive a characters. The utility of matching
zero occurrences is more obvious in complicated expressions. For
example, the expression, [A-Za-z][A-Za-z0-9]* indicates all
alphanumeric strings with a leading alphabetic character, including
strings that are only one alphabetic character.
+ (plus sign)
Matches one or more of the preceding expressions. For example,
[a-z]+ is all strings of lowercase letters.
xy (concatenation)
Matches the expression x followed by the expression y.
| (vertical bar)
Matches either the preceding expression or the following
expression. For example, ab|cd matches either ab or cd.
x/y (slash)
Matches expression x only if expression y (trailing context)
immediately follows it. For example, ab/cd matches the string ab
but only if followed by cd. Only one trailing context is permitted
per pattern.
^ (circumflex)
When it appears at the beginning of a pattern, matches the
beginning of a line. For example, ^abc will match the string abc if
it is found at the beginning of a line.
$ (dollar sign)
When it appears at the end of a pattern, matches the end of a line.
It is equivalent to /\n. For example, abc$ will match the string
abc if it is found at the end of a line.
<<EOF>>
Matches an End-of-File.
<x> (angle bracket)
Identifies a state name (see above) and may only appear at the
beginning of a pattern. For example, <done><<EOF>> matches an
End-of-File, but only if the scanner is in state done.
In addition, the following rules apply for bracket expressions:
Equivalence class expressions
These represent the set of collating elements in an equivalence
class and are enclosed within bracket-equal delimiters ([= =]). An
equivalence class generally is designed to deal with primary-
secondary sorting; that is, for languages like French that define
groups of characters as sorting to the same primary location, and
then have a tie-breaking, secondary sort. For example, if a, à, and
â belong to the same equivalence class, then [[=a=]b], [[=à=]b],
and [[=â=]b] are each equivalent to [aàâb].
Character class expressions
These represent the set of characters in the current locale
belonging to the named ctype class. These are expressed as a ctype
class name enclosed in bracket-colon delimiters ([: :]).
In the C or POSIX locale, this operating system supports the
following character class expressions: [:alpha:], [:upper:],
[:lower:], [:digit:], [:alnum:], [:xdigit:], [:space:], [:print:],
[:punct:], [:graph:], [:cntrl:].
Other locales may define additional character classes.
Letters and digits never have special meanings. A character such as ^
or -, which has a special meaning in particular contexts, refers simply
to itself when found outside that context. Spaces and tabs must be
escaped to appear in a regular expression; otherwise they indicate the
end of the expression.
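The backslash rules listed earlier in this section can be sketched as a small
C decoder. This is an illustration under stated assumptions rather than the
code flex uses: decode_escape is a hypothetical name, the argument is the
text that follows the backslash, and the sketch does not limit the number of
octal or hexadecimal digits it consumes.

```c
#include <ctype.h>
#include <stdlib.h>

/*
 * Decode the text following a backslash in a pattern and return the
 * character value it stands for. Hypothetical helper; not part of flex.
 */
int decode_escape(const char *str)
{
    switch (str[0]) {
    /* ANSI C escapes */
    case 'a': return '\a';
    case 'b': return '\b';
    case 'f': return '\f';
    case 'n': return '\n';
    case 'r': return '\r';
    case 't': return '\t';
    case 'v': return '\v';
    case 'x':
        /* Leading x followed by hexadecimal digits */
        if (isxdigit((unsigned char)str[1]))
            return (int)strtol(str + 1, NULL, 16);
        return 'x';
    default:
        /* A string of octal digits */
        if (str[0] >= '0' && str[0] <= '7')
            return (int)strtol(str, NULL, 8);
        /* Anything else stands for itself, with no special meaning */
        return (unsigned char)str[0];
    }
}
```

With this sketch, the text after the backslash in \* decodes to the asterisk
character itself, and on an ASCII system both \101 and \x41 decode to the
letter A.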
action
Each pattern in a rule has a corresponding action, which can be any
arbitrary C statement. The pattern ends at the first non-escaped white
space character; the remainder of the line is its action. If the action
is empty, then when the pattern is matched the input which matched it
is discarded.
If the action contains a {, then the action extends until the balancing }
is found, and the action may cross multiple lines. Using a return
statement in an action returns from yylex().
An action consisting solely of a vertical bar (|) means the same as the
action for the next rule.
The flex variables which can be used within actions are:
yytext
A string (char *) containing the current matched input. It cannot
be modified.
yyleng
The length (int) of the current matched input. It cannot be
modified.
yyin
A stream (FILE *) that flex reads from (stdin by default). It may
be changed but because of the buffering flex uses this makes sense
only before scanning begins. Once scanning terminates because an
End-of-File was seen, void yyrestart (FILE *new_file) may be called
to point yyin at a new input file. Alternatively, yyin may be
changed whenever a new or different buffer is selected (see
yy_switch_to_buffer()).
yyout
A stream (FILE *) to which ECHO output is written (stdout by
default). It can be changed by the user.
YY_CURRENT_BUFFER
Returns the current buffer (YY_BUFFER_STATE) used for scanner
input.
The flex command macros and functions that may be used within actions
are:
ECHO
Copies yytext to the scanner's output.
BEGIN state
Changes the scanner state to be state. This affects which rules
are active. The state must be defined in a %s or %x definition.
The initial state of the scanner is INITIAL or 0 (zero).
REJECT
Directs the scanner to proceed immediately to the next best pattern
that matches the input (which may be a prefix of the current
match). yytext and yyleng are reset appropriately. Note that
REJECT is a particularly expensive feature in terms of scanner
performance; if it is used in any of the scanner's actions, it will
slow down all of the scanner's pattern matching operations. REJECT
cannot be used if flex is invoked with either -f or -F options.
yymore()
Indicates that the next matched text should be appended to the
currently matched text in yytext (rather than replace it).
yyless(n)
Returns all but the first n characters of the current token back to
the input stream, where they will be rescanned when the scanner
looks for the next match. yytext and yyleng are adjusted
accordingly.
yywrap()
Returns 0 (zero) if there is more input to scan or 1 if there is
not. The default yywrap() always returns 1. Currently it is
implemented as a macro; however, in future implementations it may
become a function.
yyterminate()
Can be used in lieu of a return statement in an action. It
terminates the scanner and returns a 0 (zero) to the scanner's
caller.
yyterminate() is automatically called when an End-of-File is
encountered. It is a macro and may be redefined.
yy_create_buffer(file, size)
Returns a YY_BUFFER_STATE handle to a new input buffer large enough
to accommodate size characters and associated with the given file.
When in doubt, use YY_BUF_SIZE for the size.
yy_switch_to_buffer(new_buffer)
Switches the scanner's processing to scan for tokens from the given
buffer, which must be a YY_BUFFER_STATE.
yy_delete_buffer(buffer)
Deletes the given buffer.
YY_NEW_FILE
Enables scanning to continue after yyin has been pointed at a new
file to process.
YY_DECL
Controls how the scanning function, yylex() is declared. By
default, it is int yylex(), or, if prototypes are being used, int
yylex(void). This definition may be changed by redefining the
YY_DECL macro. This macro is expanded immediately before the {...}
(braces) that delimit the scanner function body.
YY_INPUT(buf,result,max_size)
Controls scanner input. By default, YY_INPUT reads from the file-
pointer yyin. Its action is to place up to max_size characters in
the character array buf and return in the integer variable result
either the number of characters read or the constant YY_NULL to
indicate EOF. Following is a sample redefinition of YY_INPUT, in
the definitions section of the input file:
%{
#undef YY_INPUT
#define YY_INPUT(buf,result,max_size)\
{\
int c = getchar();\
result = (c == EOF) ? YY_NULL : (buf[0] = c, 1);\
}
%}
When the scanner receives an End-of-File indication from YY_INPUT,
it checks the yywrap() function. If yywrap() returns zero, it is
assumed that the yyin has been set up to point to another input
file, and scanning continues. If it returns non-zero, then the
scanner terminates, returning zero to its caller.
YY_USER_ACTION
Redefinable to provide an action which is always executed prior to
the matched pattern's action.
YY_USER_INIT
Redefinable to provide an action which is always executed before
the first scan.
YY_BREAK
Is used in the scanner to separate different actions. By default,
it is simply a break, but may be redefined if necessary.
The user functions section consists of complete C functions, which are
passed directly into the lex.yy.c output file (the effect is similar to
defining the functions in separate .c files and linking them with
lex.yy.c). This section is separated from the rules section by the %%
delimiter.
Comments, in C syntax, can appear anywhere in the user functions or
definitions sections. In the rules section, comments can be embedded
within actions. Empty lines or lines consisting of white space are ignored.
The following macros are not normally called explicitly within an action,
but are used internally by flex to handle the input and output streams.
input()
Reads the next character from the input stream. You cannot redefine
input().
output()
Writes the next character to the output stream.
unput(c)
Puts the character c back onto the input stream. It will be the next
character scanned. You cannot redefine unput().
The libl.a library contains default functions to support testing or quick use
of a flex program without yacc; these functions can be linked in
through -ll. They can also be provided by the user.
main()
A simple wrapper that calls setlocale() and then calls the
yylex() function.
yywrap()
The function called when the scanner reaches the end of an input
stream. The default definition simply returns 1, which causes the
scanner in turn to return 0 (zero).
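The End-of-File protocol between the scanner and yywrap() can be sketched in
C with in-memory strings standing in for input files. Everything here is a
hypothetical model, not flex code: the struct input_queue type, wrap() (which
plays the role of a user-supplied yywrap()), and next_char() are names
invented for the illustration.

```c
#include <stddef.h>

/* Hypothetical stand-in for the scanner's input: a queue of in-memory
   "files", each a NUL-terminated string. */
struct input_queue {
    const char **sources;  /* the inputs, in the order they are scanned */
    size_t count;          /* number of inputs */
    size_t current;        /* index of the input being read */
    size_t pos;            /* offset within the current input */
};

/* Plays the role of yywrap(): returns 0 after switching to another input,
   or 1 when there is nothing left to scan. */
static int wrap(struct input_queue *q)
{
    if (q->current + 1 < q->count) {
        q->current++;
        q->pos = 0;
        return 0;
    }
    return 1;
}

/* Returns the next character, or -1 once every input is exhausted. */
int next_char(struct input_queue *q)
{
    if (q->count == 0)
        return -1;

    for (;;) {
        char c = q->sources[q->current][q->pos];
        if (c != '\0') {
            q->pos++;
            return (unsigned char)c;
        }
        /* End of the current input: consult wrap() and stop only when
           it reports that no further input is available. */
        if (wrap(q) != 0)
            return -1;
    }
}
```

A wrap() that always returned 1 would correspond to the default yywrap()
described above: scanning would stop at the end of the first input.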
NOTES
· Some trailing context patterns cannot be properly matched and generate
the warning message:
Dangerous trailing context
These are patterns where the ending of the first part of the rule
matches the beginning of the second part, such as zx*/xy*, where the
x* matches the x at the beginning of the trailing context.
· For some trailing context rules, parts that are actually fixed length
are not recognized as such, leading to a loss of performance (variable
trailing context is more expensive). In particular, patterns using {n}
(such as test{3})
are always considered variable length.
Combining trailing context with the special | (vertical bar) action
can result in fixed trailing context being turned into the more
expensive variable trailing context. This happens in the following
example:
%%
abc|
xyz/def
· Use of unput() invalidates the contents of yytext and yyleng within
the current flex action.
· Use of unput() to push back more text than was matched can result in
the pushed-back text matching a beginning-of-line (^) rule even though
it did not come at the beginning of the line.
· Pattern matching of NULs is substantially slower than matching other
characters.
· The flex command does not generate correct #line directives for code
internal to the scanner; thus, bugs in flex.skel yield invalid line
numbers.
· Due to both buffering of input and read-ahead, you cannot intermix
calls to <stdio.h> routines such as getchar() with flex rules and
expect it to work. Call input() instead.
· The total number of table entries listed by the -v option excludes the number of
table entries needed to determine what rule was matched. The number
of entries is equal to the number of deterministic finite-state
automaton (DFA) states if the scanner does not use REJECT, and
somewhat greater than the number of states if it does.
· REJECT cannot be used with the -f or -F options.
EXAMPLES
1. The following command processes the file lexcommands to produce the
scanner file lex.yy.c:
flex lexcommands
This is then compiled and linked by the command:
cc -oscanner lex.yy.c -ll
This produces a program scanner.
2. The scanner program converts uppercase to lowercase letters, removes
spaces at the end of a line, and replaces multiple spaces with single
spaces. The lexcommands file contains:
%{
#include <ctype.h>
%}
%%
[A-Z] putchar(tolower(yytext[0]));
[ ]+$
[ ]+ putchar(' ');
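For comparison, the behavior of this scanner can be approximated by a
hand-written C function. This is only a sketch of what the three rules and
the default rule do, not the code flex generates; the name squeeze is
hypothetical, and the sketch assumes the C locale, in which tolower() changes
only the letters A to Z.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Copies in to out, applying the rules of Example 2. The caller must
 * provide room in out for at least strlen(in) + 1 bytes.
 */
void squeeze(const char *in, char *out)
{
    size_t i = 0;
    size_t o = 0;

    while (in[i] != '\0') {
        if (in[i] == ' ') {
            /* Consume the whole run of spaces, as [ ]+ would. */
            while (in[i] == ' ')
                i++;
            /* A run followed by a newline is discarded ([ ]+$);
               any other run becomes a single space. */
            if (in[i] != '\n')
                out[o++] = ' ';
        } else {
            /* [A-Z] is lowercased; all other characters are copied
               unchanged, as the default rule would do. */
            out[o++] = (char)tolower((unsigned char)in[i]);
            i++;
        }
    }
    out[o] = '\0';
}
```

Because $ is equivalent to trailing context of a newline, a run of spaces at
the very end of the input with no newline after it is reduced to one space
rather than removed, both in the scanner and in this sketch.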
FILES
flex.skel
Skeleton scanner.
lex.yy.c
Generated scanner C source.
lex.backtrack
Backtracking information generated from -b option.
SEE ALSO
Commands: yacc(1), sed(1), awk(1)
Files: locale(4)