Lexers and Parsers


Whenever we deal with a computer language, say making a compiler, translator, interpreter or converter. Our first requirement is some tool or engine that can read our code file and divide it into tokens such as identifiers, operators, etc. This is known as Lexical Analysis and is done by a program known as lexer. For generating a lexer, here we are using a tool, Flex. Flex stands for Fast Lexical Analyzer. We can also write lexer manually but using tool is more convenient and provide better results.

Now, we have a list/table of tokens that lexer returned. This list is given to another program known as parser. What parser does is, It groups the token according to certain grammar rules of the language and produce desired output. The tool that we are using here to generate parser is known as Bison.

Bison Parser Algorithm

As bison gets the list of tokens. It keeps on putting tokens one by one into a stack. This process is known as Shifting. Suppose we have a list of tokens that are ‘Z’, ‘=’, ‘1’, ‘+’, ‘5’, ‘*’, ‘3’. So in the stack, it will put first 3, then * and 5. Now it has two options, either it will shift the next token into the stack or solve the expression already there in stack. This solving of expression is known as Reduction. This choice is made on the base of next token that is yet to shift into the stack and is known as look ahead token. In this example, as ‘+’ operator is having lower precedence than ‘*’, so it will reduce the expression i.e. pop up 3, *, 5 and push 15 into the stack.

Grammars

The set of rules, that each computer language has on the basis of which parser groups the tokens and perform desired action is known as Grammar. Chomsky (name of a scientist) classified Grammar into four categories:

  1. Type 0 (General Grammar)
  2. Type 1 (Context Sensitive Grammar)
  3. Type 2 (Context Free Grammar)
  4. Type 3 (Regular Grammar)

General grammar has no restriction as it is and all other grammars are subset of general grammar.
Context Sensitive Grammars has rules like
bcAad → ao
Here A is non terminal (variable) and all others (in small letters) are known as terminal symbols(constant)
According to CSG rule, A will be replaced by ‘ao’ only if surrounded by bc at left (known as left context) and ‘ad’ at right (known as right context).
Context free grammar has rules like
A → as or A → AbbB
Here reduction of A into ‘as’ or ‘AbbB’ doesn’t depend on any other surrounding context. That’s why it is known as context free grammar. The right hand side can consist of any combination of terminals and non terminals.
Regular Grammar has rules like:
A → Ab or A → bA
Here the non terminal (here A) can be either at right most or left most of R.H.S of the expression. If it is at left most side, it is known as left regular grammar, and if at right most, it is known as right regular grammar.
Regular grammar is a subset of Context free grammar. Suppose, if we want to write a grammar for palindromes, (a string that can be read from either side giving same result like (accbcca)), then regular grammar won’t help and we have to use CFG.

Regular Expressions

Next think, that need to be understood before writing input files for flex and bison are regular expressions.
Regular expressions are set of characters that forms patterns. Normally it consists of two types of characters that are

  1. Meta characters – These have special meaning like | for OR, && for AND, * for repeating previous expressions [a-zA-Z] for all text etc.
  2. Regular Characters – These have literal meaning like anything in quotes (“ “) have literal meaning. “Point” “Star” etc.

Regular expressions are basically used in searching applications where searching is done by string match.
For example, we use command GREP to search a particular string from a file or group of files. GREP stands for Global Regular Expression Print and it used regular expression to provide desired results.

Finite State Automaton

Finite State Automaton is a concept in theory of computation, defined as a machine which transforms from one state to another depending upon some input.
Here we’ll use FSA to represent how a regular expression accepts a string while searching from a file.
Suppose a file has: My name is shaina and I want to search “na” from this file.
So initial state of automaton is defined as  —>(q0)

On getting input ‘n’ it should move to the next state (q1)

The final state is represented as ((qf)) and (q1) state will be changed to final state only upon getting ‘a’ as input.

So it’s automaton will look like:
        —>(q0)—-n—->(q1)—–a—–>((qf))
Here the condition, whether you got the desired pattern into the target file is that string will only be accepted if it changes the state of automaton from initial to final. It the automaton doesn’t reach to final state even at the end of file, then it means string is not there in the file.

Lexer.l

It is the file that is given as input to the flex. It mainly contains regular expression that matches the strings/text/operator into the target file. Flex on getting lexer.l file generates a lexer file named as lex.yy.c.

Parser.y

It is the file that is given as input to the Bison. It contains the declaration of tokens and actions that should be performed on getting a particular token. Also it contains C/C++ code to read input file and write into another output file. Bison on reading this file generates parser files, i.e. parser.tab.h and parser.tab.c

Compilation process

$bison -d parser.y
$flex lexer.l
$g++ parser.tab.c lex.yy.c -lfl -o out
$./out input.html

Demo

https://github.com/shaina7837/flex_bison


GSOC Reunion 2014 at US (a brief) Better Educate than force