wlex: a lexer generator for large encodings, derived from ocamllex

runtime system: 
  Copyright (C) 2000, 2001, 2002, 2003 Alain Frisch 
  distributed under the terms of the LGPL : see LICENSE

lexer generator patch:
  distributed under any license that suits your needs and complies
  with the restrictions of the QPL

email: Alain.Frisch@ens.fr
web:   http://www.eleves.ens.fr:8080/home/frisch

--------------------------------------------------------------------------
Important Notice

I have decided to stop working on wlex. Maintaining a patch is painful,
and the design of wlex was inherently flawed.

Instead, I started working on ulex, a lexer generator for OCaml with
Unicode support. ulex is written from scratch, which makes it easier
to distribute, maintain, install, and use. Also, its design is much
cleaner than that of wlex. I recommend switching from wlex to ulex
as soon as possible. wlex must die!

Please let me know if you have a piece of code that depends on
wlex, and if you are not willing to make the transition to ulex.


You can find ulex here:

http://www.cduce.org/download


--------------------------------------------------------------------------
Overview

This package consists of a lexer generator and the associated runtime
system.

NOTE
 *  The lexer generator is derived from ocamllex; due to licence issues
 *  (this part of the OCaml system is distributed under the terms of the
 *  QPL), I have to distribute it as a patch. This implies that you will
 *  need the whole OCaml source tree to build wlex. If you have any
 *  problem applying the patch, please don't hesitate to contact me.


The lexer architecture of wlex adds an extra layer (classification)
between the lexbuf and the lexer. This layer extracts "character
classes" from the lexbuf, and the lexer itself works with these classes,
not directly with characters.

Usually, the number of classes is small (<< 256) and the
classification may consume more than one byte to produce the next
class. This makes it possible to parse wide character encodings
such as UTF-8 efficiently (the main motivation for wlex).
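
To make the "more than one byte per class" point concrete, here is a
minimal sketch in plain OCaml. It is not taken from the wlex sources:
decode_utf8, class_of_code_point and the class numbers other than eof = 0
are hypothetical names chosen for illustration, and malformed input is
not checked.

```ocaml
(* Hypothetical helper, not part of wlex: decode the code point starting
   at byte offset [i] of [s]. Returns the code point and the number of
   bytes consumed. Malformed sequences are not detected. *)
let decode_utf8 s i =
  let byte k = Char.code s.[i + k] in
  let cont k = byte k land 0x3f in
  let b0 = byte 0 in
  if b0 < 0x80 then (b0, 1)
  else if b0 < 0xe0 then (((b0 land 0x1f) lsl 6) lor cont 1, 2)
  else if b0 < 0xf0 then
    (((b0 land 0x0f) lsl 12) lor (cont 1 lsl 6) lor cont 2, 3)
  else
    (((b0 land 0x07) lsl 18) lor (cont 1 lsl 12) lor (cont 2 lsl 6)
     lor cont 3, 4)

(* Assumed class numbers; only eof = 0 comes from wlex itself. *)
let letter = 1
let digit = 2
let other = 3

(* A deliberately crude classification: ASCII letters and U+00C0..U+024F
   count as letters, ASCII digits as digits, the rest as "other". *)
let class_of_code_point cp =
  if (cp >= 0x41 && cp <= 0x5a) || (cp >= 0x61 && cp <= 0x7a)
     || (cp >= 0xc0 && cp <= 0x24f && cp <> 0xd7 && cp <> 0xf7)
  then letter
  else if cp >= 0x30 && cp <= 0x39 then digit
  else other

(* The sequence of classes an automaton would see for a UTF-8 string. *)
let classes_of_string s =
  let rec loop i acc =
    if i >= String.length s then List.rev acc
    else
      let cp, n = decode_utf8 s i in
      loop (i + n) (class_of_code_point cp :: acc)
  in
  loop 0 []
```

With these definitions, the two bytes of an accented letter yield a single
"letter" class, so the automaton never sees the individual bytes.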

Classes form a partition of accepted characters.

A typical example is to group all the letters into a class "letter",
all the digits into a class "digit", and so on. The regexps in the
lexer specification (the .mll file) are built on classes. For
instance:

let ident = letter (letter | digit)*

In some cases, it is necessary to change the design of the lexer.
For instance, it is a good idea to have a single rule for identifiers
and keywords, the distinction between them being made in the semantic
action of the rule. Another possibility is to declare all the
characters from the keywords as single classes. This is a very bad idea.
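
The single-rule approach can be sketched as follows. This is only an
illustration: the token type, the keyword list and ident_or_keyword are
hypothetical names, and the function stands for what the semantic action
of the identifier rule would call on the matched lexeme.

```ocaml
(* Hypothetical token type for illustration. *)
type token = IF | THEN | ELSE | IDENT of string

(* Keyword table filled once at start-up. *)
let keywords : (string, token) Hashtbl.t = Hashtbl.create 16

let () =
  List.iter
    (fun (name, tok) -> Hashtbl.add keywords name tok)
    [ ("if", IF); ("then", THEN); ("else", ELSE) ]

(* Called on the lexeme matched by the single ident rule: keywords are
   recognised by lookup, everything else is an ordinary identifier. *)
let ident_or_keyword lexeme =
  try Hashtbl.find keywords lexeme
  with Not_found -> IDENT lexeme
```

This keeps the class set small: the letters of "if" or "then" stay in the
"letter" class instead of each needing a class of its own.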


During lexing, the classification is handled by an "engine".
Some generic engines are provided (a C implementation for speed;
an ML implementation if you want pure bytecode), especially to
support UTF-8. Working with such an encoding with ocamllex
would introduce a *lot* of "waiting" states and a *lot* of
duplicated transitions in the automaton (avoiding these is the
motivation for wlex).


Note: the current release of wlex does not implement the recent binding feature
of ocamllex ("as" keyword in regexp).


--------------------------------------------------------------------------
Requirements

You need recent releases of:

- the OCaml compiler 3.07, !! including the source tree
   http://caml.inria.fr/ocaml/index.html

  (older versions may work)

- Findlib (package manager for OCaml)
   http://www.ocaml-programming.de/packages/documentation/findlib/


--------------------------------------------------------------------------
Lexer specification (.mll file)

The syntax of .mll files is modified:

- before the header, there is a new section which declares classes.
  It starts with the keyword *classes*, followed by class
  declarations. An ident declares a class with this name.
  A literal character 'x' declares a class named char_ff,
  where ff is the hexadecimal code of x.
  A literal string "xyz" is equivalent to 'x' 'y' 'z'.

  Classes are assigned sequential numbers, starting with 1.
  Class 0 is predefined as eof.

- entry points accept extra arguments. Ex:
  rule token arg1 arg2 = ....

- in a regexp, "_" means any class;
  an ident is interpreted as a regexp or as a class name
 
- in a "[ ... ]" regexp, the dash is forbidden; a literal char 'x' or
  an ident references the corresponding class, which must be declared;
  a string "xyz" is equivalent to 'x' 'y' 'z'
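
The naming and numbering rules for the *classes* section can be modelled
with a few lines of OCaml. This is a sketch of the rules as described
above, not the code used by wlex: the decl type and the function names
are hypothetical, and the treatment of a character declared twice is not
modelled.

```ocaml
(* Hypothetical model of a class declaration: an ident, or a literal
   string standing for one class per character. *)
type decl = Ident of string | Chars of string

(* 'x' declares char_ff, where ff is the hexadecimal code of x. *)
let char_class_name c = Printf.sprintf "char_%02x" (Char.code c)

(* The characters of a string, in order. *)
let chars_of_string s =
  let rec go i acc = if i < 0 then acc else go (i - 1) (s.[i] :: acc) in
  go (String.length s - 1) []

(* Class 0 is eof; declared classes get 1, 2, 3, ... in order.
   Repeated characters are NOT handled specially in this sketch. *)
let number_classes decls =
  let names =
    List.concat
      (List.map
         (function
           | Ident name -> [ name ]
           | Chars s -> List.map char_class_name (chars_of_string s))
         decls)
  in
  ("eof", 0) :: List.mapi (fun i name -> (name, i + 1)) names
```

For example, number_classes [Ident "letter"; Chars ".-"] pairs "letter"
with 1, "char_2e" with 2 and "char_2d" with 3, after the predefined eof.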

--------------------------------------------------------------------------
Output of wlex

The output file is very close to the one generated by ocamllex.
The "empty token" error message specifies which lexer entry is involved.

Also, by default, the output file begins with class declarations.

For the declarations:
---
classes
  encoding_error

  xml_char      (* used only in negations [^ ...]  (i.e : literal) *)
  base_char
  ascii_digit   (* 0..9       "subset" Digit *)
  ideographic
  combining_char
  xml_digit
  extender

  ".-_:'*/()[]@,|=!<>+-$"  '"' " \t\n\r"
---
wlex produces:

---
let eof = 0
let encoding_error = 1
let xml_char = 2
let base_char = 3
let ascii_digit = 4
let ideographic = 5
let combining_char = 6
let xml_digit = 7
let extender = 8
let char_2e = 9
let char_2d = 10
let char_5f = 11
let char_3a = 12
...
let one_char_classes = [
  (0x2e, 09);
  (0x2d, 10);
  (0x5f, 11);
  (0x3a, 12);
...
 ]

let nb_classes = 34
---

It is possible to output these declarations to a separate .ml file
with the command:
wlex <.mll file>  -cf <.ml file>


--------------------------------------------------------------------------
Running the lexer

An engine is a function: Lexing.lex_tables -> int -> Lexing.lexbuf -> int
It runs the automaton starting at the initial state (second
argument). The engine's job is to extract bytes from the lexbuf
(by calling the refill_buff function when needed), to classify them
into classes, and to run the automaton (the transitions are labelled
with class numbers).
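
As a rough picture of "running an automaton whose transitions are labelled
with class numbers", here is a toy model. It does not use the compressed
Lexing.lex_tables format and does not read from a lexbuf; the arrays, the
class numbers and the run function are all hypothetical, and the input is
simply a list of already classified characters.

```ocaml
(* Assumed class numbers; 0 is eof. *)
let letter = 1
and digit = 2

(* Toy automaton for: ident = letter (letter | digit)*
   trans.(state).(cls) is the next state, or -1 if there is none.
   Columns: eof, letter, digit. Other class numbers are out of range. *)
let trans = [|
  [| -1; 1; -1 |];  (* state 0: initial *)
  [| -1; 1; 1 |];   (* state 1: inside an identifier *)
|]

(* accept.(state) is the action number, or -1 if not accepting. *)
let accept = [| -1; 0 |]

(* Run from [start] over a list of classes. Returns the action number and
   the number of classes consumed for the longest match, if any. *)
let run start classes =
  let rec loop state consumed best = function
    | cls :: rest when trans.(state).(cls) >= 0 ->
        let state' = trans.(state).(cls) in
        let consumed' = consumed + 1 in
        let best' =
          if accept.(state') >= 0 then Some (accept.(state'), consumed')
          else best
        in
        loop state' consumed' best' rest
    | _ -> best
  in
  loop start 0 None classes
```

A real engine does the same walk, but pulls bytes from the lexbuf and
classifies them on the fly instead of receiving a ready-made list.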

The entry points in the file generated by wlex accept as their
first argument (before the lexbuf) an engine.

Generic engines are provided. See wlex_engines.mli.

Usually, when there are fewer than 256 classes, you will use:

val engine_tiny_8bit:
  string
  -> lex_tables -> int -> lexbuf -> int

for 8-bit encodings like Latin-1 (ISO-8859-1), and:

val engine_tiny_utf8:
  string -> (int -> int)
  -> lex_tables -> int -> lexbuf -> int

for UTF-8.

NOTE:
 *  The same table works for UTF-8 and Latin-1 encodings.

Both engines work with a table compacted into a string. Each code point
is looked up in this table to determine which class it belongs to.
The UTF-8 engine must also be given a function converting code points
outside the table (codepoint >= String.length table) to class numbers.
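
The lookup rule just described can be sketched like this. The class
numbers, the table contents and the fallback function are made up for the
example, and classify is only a model of the rule, not the code of the
provided engines.

```ocaml
(* Assumed class numbers; 0 is eof. *)
let letter = 1
and digit = 2
and other = 3

(* A table compacted into a string: byte number cp holds the class of
   code point cp. Here it covers code points 0..255. *)
let table =
  let b = Bytes.make 256 (Char.chr other) in
  let set lo hi cls =
    for cp = lo to hi do
      Bytes.set b cp (Char.chr cls)
    done
  in
  set (Char.code 'a') (Char.code 'z') letter;
  set (Char.code 'A') (Char.code 'Z') letter;
  set (Char.code '0') (Char.code '9') digit;
  Bytes.to_string b

(* Hypothetical fallback for code points >= String.length table:
   CJK unified ideographs count as letters, the rest as "other". *)
let fallback cp = if cp >= 0x4e00 && cp <= 0x9fff then letter else other

(* Model of the rule: table first, fallback function beyond it. *)
let classify cp =
  if cp < String.length table then Char.code table.[cp] else fallback cp
```

Going by the signatures shown above, such a pair would presumably be
passed as "engine_tiny_utf8 table fallback", and the resulting engine
given to a generated entry point as its first argument; check
wlex_engines.mli for the exact conventions.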

--------------------------------------------------------------------------
Building and installation

There are two parts in this package: wlex itself, and the runtime
support library. Building wlex may be problematic since the OCaml source
tree is required. It is not necessary to do it if you only need the
runtime support (for instance, PxP includes a pre-compiled wlex lexer).

To build and install wlex:

- edit the Makefile and fill out the line OCAMLLEX_SRC =
- make wlex
- make install_wlex
        (by default, wlex is installed in the same directory as ocaml)

To build and install the runtime support library:

- make runtime
- make runtime.opt
- make install_runtime
        (by default, the findlib package name is wlexing)


Other goals:

- make all
- make all.opt
- make install
- make uninstall
- make uninstall_wlex
- make uninstall_runtime
- make tester : compiles a trivial test program


For a more realistic example, see the Xpath package:
http://www.eleves.ens.fr:8080/home/frisch/soft
(I wrote wlex to support UTF-8 in Xpath).
