Victor Schubert’s personal page

Using Sedlex with Menhir

One of my side projects involves parsing a very simple custom language that encodes some data. “Great!” I thought, “A reason to try out OCamllex and Menhir!”. These are a lexer generator and a parser generator, respectively, for the OCaml language. OCaml normally ships with OCamllex and OCamlyacc, OCaml versions of the lex and yacc tools from the C ecosystem. Menhir is an improvement over OCamlyacc.

One shortcoming of OCamllex is that it does not support Unicode: it operates on bytes and does not have a notion of encodings. I would like my tool to be able to work with Unicode characters though so I had to find a replacement for OCamllex. I don’t need to replace Menhir because it does not care about the contents of strings: it works directly over the tokens handed to it by the lexer.

A quick search for “unicode ocamllex” points to Sedlex. Apart from handling of Unicode, one of its other perks is that contrary to OCamllex, it does not define its own syntax; instead it is implemented as a PPX rewriter, a program which hooks into the OCaml parser and modifies the AST there to generate code. Sedlex however cannot work with Menhir out of the box though, because it does not use the same abstraction of a buffer.

When OCamllex and OCamlyacc, the lexer’s state is stored in a Lexing.lexbuf record. This wouldn’t be an issue if our lexer and compiler were built to be used in a pipeline, where we get the lexer to lex everything, and then get the parser to iterate over a list of tokens. This however is not how code generated by OCamllex and Ocamlyacc operates. Rather, the compiler receives a Lexing.lexbuf and a lexer function (of type Lexing.lexbuf -> token) and lazily produces the tokens as needed by the parser. This is to accomodate cases where the Lexing.lexbuf is an abstraction over something other than a plain in-memory buffer, allowing for example to read from a file while keeping only chunks of it in memory. Lexing.lexbuf operates on bytes, whereas Sedlexing.lexbuf operates on Unicode codepoints, rendering it incompatible with our parser.

Thankfully, the maintainers of Menhir have thought about the case of a lexer which does not operate on Lexing.lexbuf. The MenhirLib.Convert.Simplified.traditional2revised function lets us wrap our parser into a more convenient interface. I initially had trouble making sense of how to use it because I was looking for a way to adapt my byte lexbuf into a Unicode lexbuf, whereas the API actually adapts the Parser to give it a lexer-agnostic interface.

let ast_of_string string =
  let lexbuf = Sedlexing.Utf8.from_string string in
  let revised_lexer () = Lexer.token lexbuf in
  let revised_parser =
    MenhirLib.Convert.Simplified.traditional2revised Parser.main
  revised_parser revised_lexer