flex Tutorial

1.0 Flex and Bison

Flex and Bison are software tools used for processing structured input. Flex stands for “fast lexical analyzer” and is used for scanning the input and breaking it into recognizable chunks of text, called tokens. A token, normally, has a special meaning in an application. A token may be a reserved word, an operator or a punctuation mark. The process of scanning the input for tokens is called lexical analysis and the software for doing it is lexical analyzer or scanner. Next, these tokens need to be analyzed or parsed to see whether these tokens collectively match the grammatical rules for input to the application. This process of parsing the input stream of tokens is called syntax analysis and can be done using the Bison tool.

Flex and Bison have traditionally been used for building compilers. A compiler takes a source program file as input and produces an executable file as output. The process of compilation is broken down into multiple phases. The first phase is lexical analysis, which involves scanning the input source file and producing a stream of tokens. This can be done by using flex. The next phase is syntax analysis to check whether the sequence of tokens are as per the syntax of the programming language. This work can be done by using Bison. Of course, these tasks can also be done by writing a program in a language like C and not using flex and Bison. However, the source code for lexical and syntax analysis is much simpler when flex and Bison tools are used.

The problem of scanning the input for tokens and checking the relationship between tokens vis-a-vis application rules occurs quite often and flex and Bison can be used effectively for building software for these applications.

The functionality of flex and Bison is similar to that of UNIX programs lex and yacc respectively. lex and yacc were important programs in the UNIX programming environment. yacc, which stands for “yet another compiler-compiler”, came first and was developed by Stephen C. Johnson in the early 1970s. lex was developed by Mike Lesk and Eric Schmidt around 1975. Bison, which is an implementation of yacc, was developed by Robert Corbett and Richard Stallman around 1985. Flex has been developed by Jef Poskanzer, Vern Paxson and Van Jacobson around 1987.

2.0 Flex

Making a lexical analyzer, or scanner, is essentially a two step process. First, we run flex on the specification of tokens. Flex takes in a definition of tokens as input and produces C language code of the scanner as the output. The input to flex is a .l file. Flex produces a file named lex.yy.c which contains the scanner code. There is a function named yylex () inside the file, lex.yy.c. yylex () contains the scanner code and can be called from the main function of the final program. Second, the scanner program containing the main function, along with the lex.yy.c file can be compiled to get the final scanner executable file. Or, we can add the main function at the end of the lex.yy.c file and compile it to get the scanner executable file.

2.1 Input .l file

The input .l file has three sections. These are: the definition section, rules section and the user code section. The sections are separated by a line having just two percentage characters, that is, %%.

<Definition section>

%%

<Rules section>

%%

<User code section>

2.1.1 Definition Section

The first section is the definition section. It contains definitions in the form,

name definition

The “name” can be used subsequently as {name}, and, it is expanded to (definition). For example, consider the definitions,

DIGIT [0-9].
LETTER [A-Za-z]

After this, the usage,
{LETTER}{LETTER|DIGIT}*
is identical to
([A-Za-z])([A-Za-z]|[0-9])*
and matches a letter followed by zero or more times a letter or a digit.

All comments in the definition section, indented or unindented, are copied verbatim to the output lex.yy.c file. Comments are text enclosed between /* and */ character sequences. Comments must start and end on lines just by themselves. They must not share a line with a definition or a code block between %{ and %} or a %top block described below.

Text enclosed between %{ and %} is copied verbatim to the output lex.yy.c file. Both %{ and %} must be present at the beginning of lines by themselves. C code for initialization work can be put between the lines containing %{ and %} respectively.

You can also put a %top block, which is similar to the block between %{ and %}. However, a %top block is guaranteed to be placed at the top of the output file, before any flex definitions. The text inside a %top block is demarcated with braces. There can be multiple %top blocks and their order is preserved. %top blocks are useful for including header files, preprocessor macro definitions, etc. For example,

%top {
#include <stdio.h>
#include <stdlib.h>
}

2.1.2 Rules Section

The Rules Section is the main part of input to the flex program. It has a sequence of pattern – action pairs. Pattern and action are separated by whitespace.

A pattern is based on extended regular expressions. A pattern must start from the first column of a line in the Rules Section. Or, in other words, a pattern must not be indented. If there are any comments in the Rules Section, they must be indented. That is, the comments in the Rules Section must have one or more spaces before the initial “/*” characters.

An action comprises of one or more C language statements and must be on the same line as its pattern. If the code for an action spans multiple lines, it must be enclosed in braces. For each matched pattern, the generated scanner would execute the corresponding action. The string matching the pattern is stored and is pointed by the global pointer variable, yytext (declared as, extern char *yytext).

2.1.3 User code Section

The user code section is optional. If present, it is simply copied verbatim to the output lex.yy.c file. It can be used for additional functions required for the scanner code. The rest of the scanner code can be fully, or in part, placed here. If this section is missing in a .l file, the second %% symbol in the file can be omitted too.

2.2 Example: Using flex

As an example, we take a small problem of lexical analysis of C language variable declarations. To keep the example small, we put certain restrictions on the declarations. We can only have declarations using the basic types, char, int and float. Also, our declarations do not have arguments to functions. Some of the examples of the declarations are,

char **argv; // argv: pointer to pointer to char
int *p ();   // p: function returning pointer to int
int (*p) (); // p: pointer to function returning int
int * (*p) (); // p: pointer to function returning pointer to int
...

Thinking of tokens, the first that come to the mind are reserved words, char, int and float. Then, we have the (identifier) NAME. And, to complete the declarations, the other tokens are ASTERISK, LEFT_PARENTHESIS, RIGHT_PARENTHESIS, LEFT_BRACKET, RIGHT_BRACKET and SEMICOLON.

The next step is to make a .l file, which is as follows.

/* decl.l */
/*  Definition section  */

%top {
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
}

%{

/* token types */
enum yytokentype {
   CHAR = 258,
   INT,
   FLOAT,
   ASTERISK,
   LEFT_PARENTHESIS,
   RIGHT_PARENTHESIS,
   LEFT_BRACKET,
   RIGHT_BRACKET,
   NAME,
   SEMICOLON,
};

int num_lines = 0;

%}

%%

   /* Rules section  */
[\t ]+         /* ignore whitespace */
char    return CHAR;
int     return INT;
float   return FLOAT;
\*      return ASTERISK;
\(      return LEFT_PARENTHESIS;
\)      return RIGHT_PARENTHESIS;
\[      return LEFT_BRACKET;
\]      return RIGHT_BRACKET;
[a-zA-Z][a-zA-Z0-9]* return NAME;
;       return SEMICOLON;
\n      num_lines++;
%%

/* User code section */

int main (int argc, char **argv)
{
    int token;

    while (token = yylex ()) {
        switch (token) {
            case CHAR: printf ("CHAR\n");
                       break;
            case INT: printf ("INT\n");
                       break;
            case FLOAT: printf ("FLOAT\n");
                       break;
            case ASTERISK: printf ("ASTERISK\n");
                       break;
            case LEFT_PARENTHESIS: printf ("LEFT_PARENTHESIS\n");
                       break;
            case RIGHT_PARENTHESIS: printf ("RIGHT_PARENTHESIS\n");
                       break;
            case LEFT_BRACKET: printf ("LEFT_BRACKET\n");
                       break;
            case RIGHT_BRACKET: printf ("RIGHT_BRACKET\n");
                       break;
            case NAME: printf ("NAME: %s\n", yytext);
                       break;
            case SEMICOLON: printf ("SEMICOLON\n\n");
                       break;
            default:  printf ("Unexpected token\n");
        }
    }
    printf ("Number of lines scanned: %d\n", num_lines);
}

We have added the enumeration, yytokentype for token types, with CHAR having the value 258. The scanner function, yylex (), returns the token type and also sets the token value, where relevant, in the location pointed by the yytext pointer. yylex () returns zero at the end of input. The value of the first constant in enumeration is chosen as 258, so that the token type returned is outside the range of one byte character values and, in future, yylex () can return characters in addition to the above token types.

We can run flex on the above decl.l file and get the lex.yy.c file as output. Then, we can compile the lex.yy.c file. We need to link with the flex library using the -lfl option. And, finally, we can execute the scanner program on an input file containing declarations. In this example, the declarations are typed line by line on the standard input.

$ flex decl.l
$ gcc lex.yy.c -o decl -lfl
$ ./decl
int x;
INT
NAME: x
SEMICOLON

char **argv;
CHAR
ASTERISK
ASTERISK
NAME: argv
SEMICOLON

int * myfunction ();
INT
ASTERISK
NAME: myfunction
LEFT_PARENTHESIS
RIGHT_PARENTHESIS
SEMICOLON

Number of lines scanned: 3
$

Shell Scripting Tutorial

awk Tutorial

The making of UNIX Operating System

grep Command in Linux

POSIX message queues in Linux