There are two phases in traditional language recognition that are relevant to sid: the first is lexical analysis (breaking the input up into terminal symbols); the second is syntax analysis or parsing (ensuring that the terminal symbols in the input are in the correct order).
sid currently does very little to help with lexical analysis. It is possible to use sid to produce the lexical analyser, but sid provides no real support for it at present. For now, the programmer is expected to write the lexical analyser, or use another tool to produce it.
The lexical analyser should break the input up into a series of terminals. Each of these terminals is allocated a number. In sid, these numbers range from zero to the maximum terminal number (specified in the grammar description). The terminals may also have data associated with them (e.g. the value of a number), known as the attributes of the terminal, or the results of the terminal.
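As a rough illustration of the division of labour described above, the following sketch shows a hand-written lexical analyser for comma-separated integers. It is written in Python purely for illustration and is not sid output or part of sid. The terminal numbers INTEGER, COMMA and EOF, the Terminal type and the lex function are all hypothetical names, and the use of an explicit end-of-input terminal is an assumption.

```python
import re
from typing import Iterator, NamedTuple, Optional

# Hypothetical terminal numbers, counted from zero up to a maximum,
# as the text describes. None of these names come from sid itself.
INTEGER, COMMA, EOF = 0, 1, 2
MAX_TERMINAL = EOF


class Terminal(NamedTuple):
    number: int                    # which terminal this is
    result: Optional[int] = None   # data attached to the terminal, if any


# Optional whitespace, then either a run of digits or a comma.
_TOKEN = re.compile(r"\s*(?:(\d+)|(,))")


def lex(text: str) -> Iterator[Terminal]:
    """Break text into terminals, finishing with an EOF terminal."""
    text = text.rstrip()
    pos = 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected input at offset {pos}")
        if match.group(1) is not None:
            # An integer terminal carries its value as its result.
            yield Terminal(INTEGER, int(match.group(1)))
        else:
            yield Terminal(COMMA)
        pos = match.end()
    yield Terminal(EOF)
```

Calling list(lex("1, 22,3")) yields three INTEGER terminals carrying their values, separated by COMMA terminals, and a final EOF terminal.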
sid generates the parser. The parser is a program that reads the sequence of terminals from the lexical analyser, and ensures that they form a valid input in the language being recognised. If the input is not valid, then the parser will fail (sid provides mechanisms to allow the parser to recover from errors).
This section provides a brief introduction to grammars. It is not intended to be an in-depth guide to grammars, more an introduction which defines some terminology.
sid works with a subset of what are known as context free grammars. A context free grammar consists of a set of input symbols (known as terminals), a set of rules (descriptions of what are legal forms in the language, also known as non-terminals), and an entry point (the rule from which the grammar starts).
A rule is defined as a series of alternatives (throughout this document the definition of a rule is called a production; this may conflict with some other uses of the term). Each alternative consists of a sequence of items. An item can be a terminal or a rule. As an example (using the sid notation, which now looks unlike the conventional syntax for grammars), a comma-separated list of integers could be specified as:
    list-of-integers = {
        integer ;
    ||
        integer ; comma ; list-of-integers ;
    } ;
This production contains two alternatives: the first matches the terminal integer; the second matches the sequence of terminals integer followed by comma, followed by another list of integers.
There is much documentation available on context free grammars (and other types of formal grammar). The reader is advised to find an alternative source for more information.
sid grammars are based upon a subset of context free grammars, known as LL(1) grammars. The main property of such grammars is that the parser can always tell which alternative it is going to parse next by looking at the current terminal symbol.
sid applies a number of transforms to turn grammars that are not in LL(1) form into that form (although it cannot do this for every possible grammar). It also provides facilities that allow the user to alter the control flow of the parser.
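To make the LL(1) property concrete, here is a minimal hand-written recursive descent sketch for the list-of-integers production. It is illustrative Python, not the code that sid generates. As written, both alternatives of the production begin with integer, so the current terminal alone cannot choose between them; the sketch assumes the common prefix has been factored out by hand, so that the only decision left is whether the current terminal is a comma. The names Parser, ParseError and parse, the terminal numbers, and the representation of a terminal as a (number, result) pair are all hypothetical.

```python
from typing import List, Optional, Tuple

INTEGER, COMMA, EOF = 0, 1, 2          # hypothetical terminal numbers
Terminal = Tuple[int, Optional[int]]   # (terminal number, result)


class ParseError(Exception):
    """Raised when the terminals do not form a valid input."""


class Parser:
    def __init__(self, terminals: List[Terminal]) -> None:
        # Assumes the list always ends with an EOF terminal.
        self._terminals = terminals
        self._pos = 0

    @property
    def current(self) -> Terminal:
        return self._terminals[self._pos]

    def _advance(self) -> None:
        # Never move past the final EOF terminal.
        if self.current[0] != EOF:
            self._pos += 1

    def _expect(self, number: int) -> Optional[int]:
        found, result = self.current
        if found != number:
            raise ParseError(f"expected terminal {number}, found {found}")
        self._advance()
        return result

    def list_of_integers(self) -> List[Optional[int]]:
        # Factored by hand: an integer, then either nothing or
        # a comma followed by another list of integers.
        values = [self._expect(INTEGER)]
        if self.current[0] == COMMA:   # chosen from the current terminal only
            self._advance()
            values.extend(self.list_of_integers())
        return values


def parse(terminals: List[Terminal]) -> List[Optional[int]]:
    parser = Parser(terminals)
    values = parser.list_of_integers()
    parser._expect(EOF)                # the whole input must be consumed
    return values
```

Every choice in list_of_integers is made by inspecting the current terminal alone, without backtracking, which is the property the text describes.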
sid makes the following changes to the context free grammars described above:
There may be more than one entry point to the grammar.
As well as being a rule or a terminal, an item may be an action, a predicate or an identity. An action is just a user-supplied function. A predicate is a cross between a terminal and an action (it is a user-defined function, but it may alter the control flow). An identity is like an assignment in a conventional programming language.
Rules may take parameters and return results (as may actions; terminals may only return results). These may be passed between items using names.
Each rule can have an exception handler associated with it. Exception handlers are used for error recovery when the input being parsed does not match the grammar.
Rules may be anonymous.
Rules may have non-local names associated with them. These names are in scope for that rule and any rules that are defined inside it. The value of each non-local name is saved on entry to the rule, and restored when the rule is exited.
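The save-and-restore behaviour of non-local names can be pictured with the following Python sketch. It models only the behaviour stated above (the value is saved on entry to a rule and restored on exit) and makes no claim about how sid implements it. The non_locals table, the rule_scope helper, the nested function and the name "depth" are all hypothetical.

```python
from contextlib import contextmanager
from typing import Dict, Iterator, List

# Hypothetical table of non-local names and their current values.
non_locals: Dict[str, int] = {"depth": 0}


@contextmanager
def rule_scope(*names: str) -> Iterator[None]:
    """Save the named non-locals on entry and restore them on exit."""
    saved = {name: non_locals[name] for name in names}
    try:
        yield
    finally:
        non_locals.update(saved)


def nested(levels: int, trace: List[int]) -> None:
    """Stand-in for a rule that owns "depth" and may invoke an inner rule."""
    with rule_scope("depth"):
        non_locals["depth"] += 1
        trace.append(non_locals["depth"])
        if levels > 1:
            nested(levels - 1, trace)
        # Any change made by the inner rule has been undone by this point.
        trace.append(non_locals["depth"])
```

After nested(2, trace) returns, non_locals["depth"] is back at its original value of 0, because each level restored the value it saved on entry.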