Hercules Lemke Merscher

Posted on • Originally published at open.substack.com

Tsonnet #0 - From Zero to Interpreter

This is the first post in which I dive into the process of developing Tsonnet. Welcome to the series!

I'm writing as I go. The more I work on the implementation of Tsonnet, the more I'll write about it. I hope it will be an educational journey for curious minds on how to implement a compiler, including a type checker. Tooling may come at some point in the future.

I'm going to cover the aspects of writing a lexer, a parser, and eventually the type checker implementation. A basic understanding of what lexers and parsers are will help while reading, but I will try to provide links to related topics whenever relevant.

If you haven't read my previous post on Tsonnet, you can do so on

TLDR, I'm adding types to Jsonnet.

On the tech stack of choice

The language must be fast enough and compile to native code. I thought of a few options:

The go-jsonnet project is done in Go and even though I think Go is amazing in many aspects, writing interpreters is not one of them, IMHO. I've been there, done that! It's not terrible, but it is far from ideal. Ruled out.

Haskell is another great choice for writing interpreters. I've done some parsing using parser combinators in the past and I must say, the ergonomics are quite amazing! But Haskell turns out not to be a friendly language for many people, and I eventually want people from different backgrounds contributing, without feeling overwhelmed by having to learn too many things before making a tiny contribution. Ruled out.

Rust is surfing the hype wave lately. Everybody wants to write Rust for some reason. It's a nice language, to be honest. I played with it many years ago and I like it. It certainly can do the job, but do we need all its guarantees to write a compiler? Ruled out.

Then we have OCaml. I'm not a professional OCaml programmer, but I know enough to know that OCaml has great tooling for writing interpreters and compilers, similar to Haskell. It's fast and ergonomic, with good development tooling, and after hearing so many good things about it, it seemed a good excuse to give it a try. It checks all the boxes, so why not?! I'm enjoying it!

OCaml, I choose you!

Ocaml I choose you, Pokemon reference

Getting started

We'll set up the project structure and proceed to parsing integer literals.

I'm using opam as the package manager and dune as the build system.

The utop REPL is a must-have for development. Highly recommended!

I'm using VSCode as my editor of choice. The ocamlformat and ocaml-lsp-server packages are not to be missed.

Let's install them:

$ opam install dune utop ocamlformat ocaml-lsp-server menhir

I also added Menhir (the parser generator) to the list. We'll talk more about it later.

With dune available, we can initialize the project. This command will create the project scaffold in the current directory:

$ dune init proj tsonnet .

It generates some files and three directories: bin for binaries, lib for libraries, and test for automated tests.

I will not go over each file generated here. They are not relevant at this moment. I will cover updates or additions as we go.

Here is the dune-project:

diff --git a/dune-project b/dune-project
new file mode 100644
index 0000000..a11fce7
--- /dev/null
+++ b/dune-project
@@ -0,0 +1,32 @@
+(lang dune 3.16)
+
+(name tsonnet)
+
+(generate_opam_files true)
+
+(source
+ (gitlab bitmaybewise/tsonnet))
+
+(authors "@bitmaybewise")
+
+(maintainers "@bitmaybewise")
+
+(license LICENSE)
+
+(documentation https://gitlab.com/bitmaybewise/tsonnet)
+
+(package
+ (name tsonnet)
+ (synopsis "A typed Jsonnet")
+ (description "Like Jsonnet, but gradually typed")
+ (depends
+  (ocaml
+   (>= 5.1.1))
+  (dune
+   (>= 3.16.0))
+  (menhir
+   (= 20240715)))
+ (tags
+  (jsonnet interpreter compiler)))
+
+; See the complete stanza docs at https://dune.readthedocs.io/en/stable/reference/dune-project/index.html

Yes, the dune configuration language is a Lisp dialect. Not super relevant, as we'll be editing it sporadically to update the project metadata and its dependencies.

I made sure to specify all the relevant dependency versions.

The main file used to generate the binary has only a Hello, World! for now:

diff --git a/bin/main.ml b/bin/main.ml
new file mode 100644
index 0000000..7bf6048
--- /dev/null
+++ b/bin/main.ml
@@ -0,0 +1 @@
+let () = print_endline "Hello, World!"

Let's make something interesting now.

Parsing int literals

Here's the simplest Jsonnet sample file containing just an int:

diff --git a/samples/literals/int.jsonnet b/samples/literals/int.jsonnet
new file mode 100644
index 0000000..d81cc07
--- /dev/null
+++ b/samples/literals/int.jsonnet
@@ -0,0 +1 @@
+42

We are going to use ocamllex to parse the file into tokens. Our lexer will live in the library directory. The library will also be home for the parser, implemented by Menhir.

Let's configure dune for that:

diff --git a/lib/dune b/lib/dune
index fef8386..9660313 100644
--- a/lib/dune
+++ b/lib/dune
@@ -1,2 +1,7 @@
 (library
  (name tsonnet))
+
+(menhir
+ (modules parser))
+
+(ocamllex lexer)

This is the specific dune file for the lib folder. It defines that the ocamllex file will be named lexer and the Menhir parser will be named parser -- so obvious, right?! XD

And the dune-project file, to include the Menhir syntax version we want to use:

diff --git a/dune-project b/dune-project
index a11fce7..cf8c03f 100644
--- a/dune-project
+++ b/dune-project
@@ -29,4 +29,6 @@
  (tags
   (jsonnet interpreter compiler)))

+(using menhir 3.0)
+
 ; See the complete stanza docs at https://dune.readthedocs.io/en/stable/reference/dune-project/index.html

Disclaimer: covering what lexers and parsers are in more detail is out of scope for this post. I might write more about them in the future, however. Suffice it to say that ocamllex and Menhir are generators of lexers and parsers, respectively. There's no need to write them from scratch yet.

We need to parse the Jsonnet file into an abstract syntax tree (AST). This AST will be a tree of expressions.

For now, we'll only have integers as an expression -- let's do baby steps.

We need to define our AST, the lexer, and the parser module. Each one depends on the others, so it's difficult to expose one at a time in a linear line of thought, but bear with me.

The high-level idea is: the lexer receives the input content as a string or a buffer, identifies the relevant chains of characters (which we call tokens), such as values, identifiers, and reserved words, and ignores the ones not meaningful for the program, such as whitespace and line breaks. The tokens are then passed to the parser, which goes over them and turns them into expressions, represented as an AST. The interpreter is the glue that binds the lexer, the parser, and any other relevant steps after that, such as type checking, and the final representation -- in our case, JSON output.
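The flow above can be sketched as a toy, hand-rolled pipeline -- purely illustrative, just to show the shape of what ocamllex and Menhir will automate for us (this snippet is not part of the project):

```ocaml
(* toy stand-ins for the generated token type and the AST *)
type token = INT of int | EOF
type expr = Int of int

(* "lexer": split the input into tokens, skipping whitespace *)
let lex (s : string) : token list =
  let words =
    String.split_on_char ' ' s
    |> List.map String.trim
    |> List.filter (fun w -> w <> "")
  in
  List.map (fun w -> INT (int_of_string w)) words @ [ EOF ]

(* "parser": turn the token stream into an AST *)
let parse (tokens : token list) : expr =
  match tokens with
  | [ INT i; EOF ] -> Int i
  | _ -> failwith "parse error"

let () =
  match parse (lex "42") with
  | Int i -> Printf.printf "Int %d\n" i
```

The real pipeline works on a Lexing.lexbuf instead of a token list, but the division of labor is the same.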

The AST expr type:

diff --git a/lib/ast.ml b/lib/ast.ml
new file mode 100644
index 0000000..fa75c4b
--- /dev/null
+++ b/lib/ast.ml
@@ -0,0 +1,2 @@
+type expr =
+  | Int of int

As we want to represent different types of expressions, I'm defining expr here as a sum type. We will add more constructors in upcoming changes.
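To give a feel for how the sum type might grow -- the constructors beyond Int are hypothetical examples here, not the actual next step -- and how pattern matching forces us to handle every case:

```ocaml
(* a sketch of a grown expr type; Float and Bool are hypothetical *)
type expr =
  | Int of int
  | Float of float
  | Bool of bool

(* the compiler warns if any constructor is left unhandled *)
let describe (e : expr) : string =
  match e with
  | Int i -> Printf.sprintf "Int %d" i
  | Float f -> Printf.sprintf "Float %g" f
  | Bool b -> Printf.sprintf "Bool %b" b

let () = print_endline (describe (Int 42))
```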

The lexer:

diff --git a/lib/lexer.mll b/lib/lexer.mll
new file mode 100644
index 0000000..41f1494
--- /dev/null
+++ b/lib/lexer.mll
@@ -0,0 +1,16 @@
+{
+  open Lexing
+  open Parser
+}
+
+let white = [' ' '\t']+
+let newline = '\r' | '\n' | "\r\n"
+let digit = ['0'-'9']
+let int = '-'? digit+
+
+rule read =
+  parse
+  | white { read lexbuf }
+  | newline { new_line lexbuf; read lexbuf }
+  | int { INT (int_of_string (Lexing.lexeme lexbuf)) }
+  | eof { EOF }

First, a few considerations:

  1. OCaml files have the .ml extension. Ocamllex files are preprocessed, so they require a .mll extension -- during compilation, the .mll file generates a .ml file, but this is transparent to the programmer when using dune build.
  2. We are opening the module Parser, which is not defined yet, but we are going to get there. Hang tight!

We specify the tokens using a regex-like syntax. The rule is the interesting bit here. This is where the lexer rules are cobbled together as a recursive function. It matches the patterns one by one, skipping whitespace and newlines first. When a token pattern is found, such as int, it wraps the lexeme for the parser to consume in the next step, using the parser's defined token types.
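For instance, the int rule matches an optional minus sign followed by one or more digits, and int_of_string converts the matched lexeme. A tiny snippet demonstrating just the conversion, outside the lexer:

```ocaml
(* int_of_string handles the same shapes the [int] regex accepts:
   an optional '-' followed by digits *)
let () =
  List.iter
    (fun lexeme -> Printf.printf "%s -> %d\n" lexeme (int_of_string lexeme))
    [ "42"; "-7"; "0" ]
```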

The parser:

diff --git a/lib/parser.mly b/lib/parser.mly
new file mode 100644
index 0000000..d931e03
--- /dev/null
+++ b/lib/parser.mly
@@ -0,0 +1,14 @@
+%token <int> INT
+%token EOF
+
+%start <Ast.expr> prog
+
+%%
+
+prog:
+  | e = expr; EOF { e }
+  ;
+
+expr:
+  | i = INT { Int i }
+  ;

Similar to the lexer, the parser file will have a different extension: .mly. Menhir will preprocess it during compilation in this case.

Here the parser first declares its tokens: the token name (INT), optionally preceded by the equivalent OCaml type (<int>).

The EOF token carries no value, so it needs no type annotation.

The %start declaration establishes that the prog rule will produce an Ast.expr -- our AST type definition from before. The prog rule is an expression followed by EOF. The expr rule parses an i = INT into Int i (the only kind of Ast.expr we have currently).

Any other input will raise a parsing error.
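For reference, the interface Menhir generates for this grammar (parser.mli) looks roughly like this -- a sketch; the exact generated file may differ:

```ocaml
(* sketch of the generated parser.mli *)
type token =
  | INT of int
  | EOF

exception Error

val prog : (Lexing.lexbuf -> token) -> Lexing.lexbuf -> Ast.expr
```

This is why the lexer opens Parser: the INT and EOF constructors it produces are defined by the generated parser module.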

Equipped with this, we can now define our Tsonnet library that will instrument the lexer and the parser:

diff --git a/lib/tsonnet.ml b/lib/tsonnet.ml
new file mode 100644
index 0000000..c3b7bf1
--- /dev/null
+++ b/lib/tsonnet.ml
@@ -0,0 +1,10 @@
+open Ast
+
+(** [parse s] parses [s] into an AST. *)
+let parse (s: string) : expr =
+  let lexbuf = Lexing.from_string s in
+  let ast = Parser.prog Lexer.read lexbuf in
+  ast
+
+let print ast = match ast with
+  | Int i -> Printf.printf "Int %d\n" i

The parse function receives a string as input, lexes it into a buffer, and passes it down to the parser, which returns our AST.

The print function will come in handy for printing the result.

And finally, the program binary:

diff --git a/bin/main.ml b/bin/main.ml
index 7bf6048..80e8b0f 100644
--- a/bin/main.ml
+++ b/bin/main.ml
@@ -1 +1,15 @@
-let () = print_endline "Hello, World!"
+let usage_msg = "tsonnet <file1> [<file2>] ..."
+let input_files = ref []
+let anonymous_fun filename = input_files := filename :: !input_files
+let spec_list = []
+
+let run_parser filename =
+  let input_channel = open_in filename in
+  let content = really_input_string input_channel (in_channel_length input_channel) in
+  close_in input_channel;
+  let ast = Tsonnet.parse content in
+  Tsonnet.print ast
+
+let () =
+  Arg.parse spec_list anonymous_fun usage_msg;
+  List.iter run_parser !input_files

It parses the CLI arguments -- only file name inputs, for now -- and passes them to the library to be parsed and printed.

Voilà!

$ dune exec -- tsonnet samples/literals/int.jsonnet
Int 42

Concluding

This post is already long enough due to the introduction of some crucial concepts and details of the packages used in the implementation. With the foundational part covered, we can proceed to more complex stuff.

In the next issue, we'll cover the rest of the JSON literals.

Please say in the comments if you'd like me to cover certain topics in more detail.


Thanks for reading Bit Maybe Wise! Subscribe to receive new posts about the Tsonnet series.
