Flex and Bison – Lets use it!


For last few days, a topic is itching us and always come under great discussions with my friends and teachers that should we use tools, to generate lexers and parsers or write our own. As many large programming languages and softwares such as

  • The Ruby programming language (YARV)
  • The PHP programming language (Zend Parser)
  • The Go programming language (GC)
  • Bash shell uses a yacc grammar for parsing the command input.
  • LilyPond
  • PostgreSQL
  • OpenSCAD.

use tools – Flex and Bison.

But starting with these tools to write a parser is a lot complex. A syntax looks like a crap with everything start with ‘yy’, long horrible regular expressions, unpredictable scope of variables and above all writing grammar rules in bison so that it can generate syntax tree is completely a mess. With all these, still you need to write a lot of  C/C++ code for performing different actions. While reading blogs to get the views of different people about using tools, some even said that these tools doesn’t generate the parser the way they wanted and with a hand-coded parser, it is relatively straight forward to modify the tree structure to whatever you want at parsing time.

After all these problems, what makes these tools worth using by so big communities.  I have been exploring different features of these tools since last week, so as to find an optimum conclusion of these discussions.

First of all, lets have an overview about these tools.

FLEX: Flex is a tool for generating scanners: programs which recognized lexical patterns in text. It reads the given input files, or its standard input if no file names are given, for a description of a scanner to be generated. The description is in the form of pairs of regular expressions and C code, called rules. A C source file, `lex.yy.c’ as output is generated. This file is compiled and linked with the `-lfl’ library to produce an executable. When the executable is run, it analyzes its input for occurrences of the regular expressions. Whenever it finds one, it executes the corresponding C code.

BISON: Bison is a general-purpose parser generator that converts a grammar description for an LALR(1) context-free grammar into a C program to parse that grammar. Generally flex is used with bison to divide input text into tokens and return them to bison so that further actions can be taken by the bison according to the defined grammar rules in the bison.

Features

Regular Expression in lexer

The regular expression patterns that looks horrible in first run, start looking pleasant soon they start solving your problems. A tiny pattern of regex can be used to match diversity of texts in input file and general rules can be defined for different input files.
For Example:
[A-Za-z]* will match all the combinations of characters of any length in the input file. This pattern also defines that characters can be case insensitive.
[0-9]* will match all digits and combination of digits of any length.
‘r/s’ – a regex for matching character ‘r’ but only if it is followed by character ‘s’
Likewise, there is a long list with so many possibilities that saves a lot of C/C++ code if written otherwise to match different set of inputs.

Inbuilt Functions

There is a long list of useful inbuilt function defined in the tools, calling them simply solves big issues, such as
yylex() function is called to bring regular expression patterns in action.
yyparse() function is called to bring defined grammar rules in action.
yyin is a FILE* pointer that holds the input file.
Likewise there is a bunch of such useful functions that you just need to know about, once and thenuse forever.

Context Free Grammar (CFG)

A standard LALR(1) i.e. Look Ahead Left to Right parsing one token at a time, context free grammar is used to define different actions for corresponding matching text in the input file. CFG is a grammar in which rules are defined for each token irrespective to its context.

For example: V –> w
Here V is a non-terminal (variable) and w is a string of terminals (constant). According to this rule under CFG, V will always be replaced by w, no matter what symbols surround it.

CFGs have three important characteristics:

  • They provide a precise mathematical definition that clearly rules out certain types of language.
  • The formal definition means that context free grammars are computationally tractable i.e. it is possible to write a computer program which determines whether sentences are grammatical or not.
  • They provide a convenient visual notation for the structure of sentences (the tree).

So, in short we can say, Bison accepting CFG helps us to define certain general rules for our input files.

Unambiguous Rules

Flex generates warning if there is any ambiguity in the regex pattern rules and also tells us the location.

For example: If there are rules such as
[A-Za-z]*  { /*action 1*/}
[a-z0-9]* { /*action 2*/ }

So, flex generates warning at second pattern as the action for [a-z] is already defined so, it can generate ambiguity while returning token to parser to struct tree.
Bison also generates warning or error if there is ambiguity in the grammar rules.

Start Conditions

 Flex provides a mechanism for conditionally activating rules. Any rule whose pattern is prefixed with “<sc>” will only be active when the scanner is in the start condition named “sc”. Start conditions are declared in the definitions (first) section of the input using unindented lines beginning with either ‘%s’ or ‘%x’ followed by a list of names.

For example: To extract filename from #include <file1.h>, the rules will be written as

%x incl
%%
include<    { BEGIN(incl);}
<incl>[^\t\n>]*    { filename = yytext; cout << filename; }
<incl>">"       { BEGIN(INITIAL); }
. ;
%% 

Here BEGIN(incl) begins the start condition ‘incl’ and any rule started with <incl> will be under the condition “include<” unless BEGIN(INITIAL) is defined.

Operator Precedence

 According to Bison Parser Algorithm, when bison gets the list of tokens, it keeps on putting tokens one by one into a stack. This process is known as Shifting. Suppose we have a list of tokens that are ‘Z’, ‘=’, ‘1’, ‘+’, ‘5’, ‘*’, ‘3’. So in the stack, it will put first 3, then * and 5. Now it has two options, either it will shift the next token into the stack or solve the expression already there in stack. This solving of expression is known as Reduction. This choice is made on the base of next token that is yet to shift into the stack and is known as look ahead token. In this example, as ‘+’ operator is having lower precedence than ‘*’, so it will reduce the expression i.e. pop up 3, *, 5 and push 15 into the stack. Now the question is, how can we define the precedence of different operators in our program. For this, bison uses a very simple approach i.e.

The relative precedence of different operators is controlled by the order in which they are declared. The first %left or %right declaration in the file declares the operators whose precedence is lowest, the next such declaration declares the operators whose precedence is a little higher, and so on, such as

%left '<'
%left '-'
%left '*'

Here ‘*’ is having highest precedence and ‘<‘ is having lowest precedence.

Operator Associativity

 Operator associativity in any programming language is defined as a property of an operator that determines how operator of same precedence are grouped in the absence of parentheses. Bison handles the associativity very nicely.

For example:
If we have input such as `1 – 2 – 5′; should this be used as `(1 – 2) – 5′ or  `1 – (2 – 5)’? For most operators we prefer the former, which is called left association. The latter alternative,right association, is desirable for assignment operators.

Bison allows you to specify these choices with the operator precedence declarations %left and %right. Each such declaration contains a list of tokens, which are operators whose precedence and associativity is being declared. The %left declaration makes all those operators left-associative and the %right declaration makes them right-associative.

Suppressing Conflict warning

As bison is very sensitive for conflicts in the grammar if any, but most real grammars have harmless shift/reduce conflicts which are resolved in a predictable way and would be difficult to eliminate. Bison is flexible enough to allow us to suppress these warning unless the number of conflicts changes with the %expect declaration.

The declaration looks like this:

%expect n

Here n is a decimal integer. The declaration says there should be no warning if there are n shift/reduce conflicts. The usual warning is given if there are either more or fewer conflicts.

Multiple Parsers in the same program

With the help of bison, we can parse more than one language with single program by avoiding the name conflicts between different definition of yylex, yyparse, yylval etc., which are global.

The easy way to do this is to use the option `-p prefix‘ . This renames the interface functions and variables of the Bison parser to start with prefix instead of `yy’. You can use this to give each parser distinct names that do not conflict.

Error Recovery

Bison handles errors in the statement very gracefully. Normally, it is required, that the program should not terminate at the parse error, but keep parsing the rest of the input for finding more errors until the end of file. For this, you can define how to recover from syntax error by writing rules to recognize the special token ‘error’. This is a terminal symbol that is already defined and reserved for error handling.
For example:

stmnts:  /* empty string */
        | stmnts '\n'
        | stmnts exp '\n'
        | stmnts error '\n'

The fourth rule in this example says that an error followed by a newline makes a valid addition to any stmnts.

Buffers

Some scanners require reading from several input streams, which can’t be done by using simple YY_INPUT() function. For that, we need to buffer each input. Flex provides a mechanism for creating and switching between multiple input buffers. An input buffer is created by using:
yy_create_buffer ( FILE *file, int size )
which takes a FILE pointer and a size and creates a buffer associated with the given file and large enough to hold size characters.
More such useful function for buffering is explained here.

A simple example of buffering for reading from two input files is here

In the nut shell, flex and bison are there to make your work easy and so called TOOLS to generate lexer and parser even for highly complex languages. Moreover it will be blazing fast and far easier to work with once you understand the tools. This is true, that you have to make more efforts at the starting to understand the tools, as you have to dive in various resources and no line to line tutorial is available.

Major Source of Information -> http://dinosaur.compilertools.net/


Lexertl - a lexer generator GSOC Reunion 2014 at US (a brief)