Using Sedlex with Menhir
One of my side projects involves parsing a very simple custom language that encodes some data. “Great!” I thought, “A reason to try out OCamllex and Menhir!”. These are a lexer generator and a parser generator, respectively, for the OCaml language. OCaml normally ships with OCamllex and OCamlyacc, OCaml versions of the lex and yacc tools from the C ecosystem. Menhir is an improvement over OCamlyacc.
One shortcoming of OCamllex is that it does not support Unicode: it operates on bytes and does not have a notion of encodings. I would like my tool to be able to work with Unicode characters though so I had to find a replacement for OCamllex. I don’t need to replace Menhir because it does not care about the contents of strings: it works directly over the tokens handed to it by the lexer.
A quick search for “unicode ocamllex” points to Sedlex. Apart from handling of Unicode, one of its other perks is that contrary to OCamllex, it does not define its own syntax; instead it is implemented as a PPX rewriter, a program which hooks into the OCaml parser and modifies the AST there to generate code. Sedlex however cannot work with Menhir out of the box though, because it does not use the same abstraction of a buffer.
When OCamllex and OCamlyacc, the lexer’s state is stored in a
Lexing.
record. This wouldn’t be an issue if our lexer and
compiler were built to be used in a pipeline, where we get the lexer
to lex everything, and then get the parser to iterate over a list of
tokens. This however is not how code generated by OCamllex and
Ocamlyacc operates. Rather, the compiler receives a Lexing.
and a lexer function (of type Lexing.
) and lazily
produces the tokens as needed by the parser. This is to accomodate
cases where the Lexing.
is an abstraction over something
other than a plain in-memory buffer, allowing for example to read
from a file while keeping only chunks of it in
memory. Lexing.
operates on bytes, whereas
Sedlexing.
operates on Unicode codepoints, rendering it
incompatible with our parser.
Thankfully, the maintainers of Menhir have thought about the case of
a lexer which does not operate on Lexing.
. The
MenhirLib.
function lets us
wrap our parser into a more convenient interface. I initially had
trouble making sense of how to use it because I was looking for a
way to adapt my byte lexbuf into a Unicode lexbuf, whereas the API
actually adapts the Parser to give it a lexer-agnostic interface.
let ast_of_string string = let lexbuf = Sedlexing.Utf8.from_string string in let revised_lexer () = Lexer.token lexbuf in let revised_parser = MenhirLib.Convert.Simplified.traditional2revised Parser.main in revised_parser revised_lexer