
Hercules Lemke Merscher

Originally published at open.substack.com

Tsonnet #1 - JSON literals

In the previous post, I bootstrapped the Tsonnet project and implemented the parsing of integer literals.

In this post, let's continue parsing other literal values.

Let's start with the most boring and dreaded value: null.

Parsing nulls

We need to add null to our expr type:

diff --git a/lib/ast.ml b/lib/ast.ml
index fa75c4b..c9b1836 100644
--- a/lib/ast.ml
+++ b/lib/ast.ml
@@ -1,2 +1,3 @@
 type expr =
   | Int of int
+  | Null

The lexer:

diff --git a/lib/lexer.mll b/lib/lexer.mll
index 41f1494..982a22d 100644
--- a/lib/lexer.mll
+++ b/lib/lexer.mll
@@ -7,10 +7,12 @@ let white = [' ' '\t']+
 let newline = '\r' | '\n' | "\r\n"
 let digit = ['0'-'9']
 let int = '-'? digit+
+let null = "null"

 rule read =
   parse
   | white { read lexbuf }
   | newline { new_line lexbuf; read lexbuf }
   | int { INT (int_of_string (Lexing.lexeme lexbuf)) }
+  | null { NULL }
   | eof { EOF }

And the parser:

diff --git a/lib/parser.mly b/lib/parser.mly
index d931e03..0440015 100644
--- a/lib/parser.mly
+++ b/lib/parser.mly
@@ -1,4 +1,5 @@
 %token <int> INT
+%token NULL
 %token EOF

 %start <Ast.expr> prog
@@ -11,4 +12,5 @@ prog:

 expr:
   | i = INT { Int i }
+  | NULL { Null }
   ;

The OCaml compiler now complains that the pattern matching in the print function is missing the Null case. We need to fix that:

diff --git a/lib/tsonnet.ml b/lib/tsonnet.ml
index c3b7bf1..6462a03 100644
--- a/lib/tsonnet.ml
+++ b/lib/tsonnet.ml
@@ -8,3 +8,4 @@ let parse (s: string) : expr =

 let print ast = match ast with
   | Int i -> Printf.printf "Int %d\n" i
+  | Null -> print_endline "null"
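To see the pieces fit together, here is a standalone mirror of what we have so far — the two-constructor expr type and its print function (a sketch; the real definitions live in lib/ast.ml and lib/tsonnet.ml):

```ocaml
(* A standalone mirror of the expr type so far: ints and null. *)
type expr =
  | Int of int
  | Null

let print = function
  | Int i -> Printf.printf "Int %d\n" i
  | Null -> print_endline "null"

let () =
  print (Int 42);
  print Null
```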

With the null implementation complete, let's move on to handling floating-point numbers.

Parsing floats

Again, adding float to the expr type:

diff --git a/lib/ast.ml b/lib/ast.ml
index c9b1836..17abdd7 100644
--- a/lib/ast.ml
+++ b/lib/ast.ml
@@ -1,3 +1,4 @@
 type expr =
   | Int of int
+  | Float of float
   | Null

The lexer needs a little more attention now:

diff --git a/lib/lexer.mll b/lib/lexer.mll
index 982a22d..b0c87ba 100644
--- a/lib/lexer.mll
+++ b/lib/lexer.mll
@@ -7,6 +7,9 @@ let white = [' ' '\t']+
 let newline = '\r' | '\n' | "\r\n"
 let digit = ['0'-'9']
 let int = '-'? digit+
+let frac = '.' digit*
+let exp = ['e' 'E']['-' '+']? digit+
+let float = digit* frac? exp?
 let null = "null"

 rule read =
@@ -14,5 +17,6 @@ rule read =
   | white { read lexbuf }
   | newline { new_line lexbuf; read lexbuf }
   | int { INT (int_of_string (Lexing.lexeme lexbuf)) }
+  | float { FLOAT (float_of_string (Lexing.lexeme lexbuf)) }
   | null { NULL }
   | eof { EOF }

We add a few helper patterns (frac and exp) to simplify the float token definition. This allows us to write floats in multiple formats, such as 4.2, .42, 4.e2, and 4e-10.

In the read rule, Lexing.lexeme lexbuf returns the matched text as a string, which we then convert with float_of_string.
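A quick standalone check of the conversion — float_of_string accepts the same textual formats our float pattern matches:

```ocaml
(* float_of_string converts the lexeme text into an OCaml float. *)
let () =
  assert (float_of_string "4.2" = 4.2);
  assert (float_of_string "4e2" = 400.);
  assert (float_of_string ".42" = 0.42);
  print_endline "conversions ok"
```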

The parser doesn't need to do much after that:

diff --git a/lib/parser.mly b/lib/parser.mly
index 0440015..0be9c98 100644
--- a/lib/parser.mly
+++ b/lib/parser.mly
@@ -1,4 +1,5 @@
 %token <int> INT
+%token <float> FLOAT
 %token NULL
 %token EOF

@@ -12,5 +13,6 @@ prog:

 expr:
   | i = INT { Int i }
+  | f = FLOAT { Float f }
   | NULL { Null }
   ;

And the print function:

diff --git a/lib/tsonnet.ml b/lib/tsonnet.ml
index 6462a03..2d7207b 100644
--- a/lib/tsonnet.ml
+++ b/lib/tsonnet.ml
@@ -8,4 +8,5 @@ let parse (s: string) : expr =

 let print ast = match ast with
   | Int i -> Printf.printf "Int %d\n" i
-  | Null -> print_endline "null"
+  | Float f -> Printf.printf "Float %f\n" f
+  | Null -> print_endline "Null"

Now that we've handled numeric types, let's implement boolean values.

Parsing booleans

This is getting repetitive, and booleans are boring — you get the idea from the previous examples — so here's the entire diff:

diff --git a/lib/ast.ml b/lib/ast.ml
index 17abdd7..f92acfc 100644
--- a/lib/ast.ml
+++ b/lib/ast.ml
@@ -2,3 +2,4 @@ type expr =
   | Int of int
   | Float of float
   | Null
+  | Bool of bool
diff --git a/lib/lexer.mll b/lib/lexer.mll
index b0c87ba..a1e331e 100644
--- a/lib/lexer.mll
+++ b/lib/lexer.mll
@@ -11,6 +11,7 @@ let frac = '.' digit*
 let exp = ['e' 'E']['-' '+']? digit+
 let float = digit* frac? exp?
 let null = "null"
+let bool = "true" | "false"

 rule read =
   parse
@@ -19,4 +20,5 @@ rule read =
   | int { INT (int_of_string (Lexing.lexeme lexbuf)) }
   | float { FLOAT (float_of_string (Lexing.lexeme lexbuf)) }
   | null { NULL }
+  | bool { BOOL (bool_of_string (Lexing.lexeme lexbuf)) }
   | eof { EOF }
diff --git a/lib/parser.mly b/lib/parser.mly
index 0be9c98..0ea03a0 100644
--- a/lib/parser.mly
+++ b/lib/parser.mly
@@ -1,6 +1,7 @@
 %token <int> INT
 %token <float> FLOAT
 %token NULL
+%token <bool> BOOL
 %token EOF

 %start <Ast.expr> prog
@@ -15,4 +16,5 @@ expr:
   | i = INT { Int i }
   | f = FLOAT { Float f }
   | NULL { Null }
+  | b = BOOL { Bool b }
   ;
diff --git a/lib/tsonnet.ml b/lib/tsonnet.ml
index 2d7207b..c4984cc 100644
--- a/lib/tsonnet.ml
+++ b/lib/tsonnet.ml
@@ -6,7 +6,8 @@ let parse (s: string) : expr =
   let ast = Parser.prog Lexer.read lexbuf in
   ast

-let print ast = match ast with
+let print = function
   | Int i -> Printf.printf "Int %d\n" i
   | Float f -> Printf.printf "Float %f\n" f
   | Null -> print_endline "Null"
+  | Bool b -> Printf.printf "Bool %b\n" b
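The diff also sneaks in a small refactor: let print = function is shorthand for let print ast = match ast with. A tiny standalone illustration of the equivalence:

```ocaml
(* These two definitions are equivalent; `function` pattern-matches
   directly on the final argument. *)
let describe_match b = match b with
  | true -> "yes"
  | false -> "no"

let describe_function = function
  | true -> "yes"
  | false -> "no"

let () =
  assert (describe_match true = describe_function true);
  assert (describe_match false = describe_function false)
```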

With primitive types (numbers and booleans) in place, it's time to tackle string parsing.

Parsing strings

diff --git a/lib/ast.ml b/lib/ast.ml
index f92acfc..9a9358c 100644
--- a/lib/ast.ml
+++ b/lib/ast.ml
@@ -3,3 +3,4 @@ type expr =
   | Float of float
   | Null
   | Bool of bool
+  | String of string

Strings need to be handled with extra care when tokenizing them:

diff --git a/lib/lexer.mll b/lib/lexer.mll
index a1e331e..b2eb1b4 100644
--- a/lib/lexer.mll
+++ b/lib/lexer.mll
@@ -1,6 +1,7 @@
 {
   open Lexing
   open Parser
+  exception SyntaxError of string
 }

 let white = [' ' '\t']+
@@ -21,4 +22,22 @@ rule read =
   | float { FLOAT (float_of_string (Lexing.lexeme lexbuf)) }
   | null { NULL }
   | bool { BOOL (bool_of_string (Lexing.lexeme lexbuf)) }
+  | '"' { read_string (Buffer.create 16) lexbuf }
+  | _ { raise (SyntaxError ("Unexpected char: " ^ Lexing.lexeme lexbuf)) }
   | eof { EOF }
+and read_string buf =
+  parse
+  | '"' { STRING (Buffer.contents buf) }
+  | '\\' '/'  { Buffer.add_char buf '/'; read_string buf lexbuf }
+  | '\\' '\\' { Buffer.add_char buf '\\'; read_string buf lexbuf }
+  | '\\' 'b'  { Buffer.add_char buf '\b'; read_string buf lexbuf }
+  | '\\' 'f'  { Buffer.add_char buf '\012'; read_string buf lexbuf }
+  | '\\' 'n'  { Buffer.add_char buf '\n'; read_string buf lexbuf }
+  | '\\' 'r'  { Buffer.add_char buf '\r'; read_string buf lexbuf }
+  | '\\' 't'  { Buffer.add_char buf '\t'; read_string buf lexbuf }
+  | [^ '"' '\\']+
+    { Buffer.add_string buf (Lexing.lexeme lexbuf);
+      read_string buf lexbuf
+    }
+  | _ { raise (SyntaxError ("Illegal string character: " ^ Lexing.lexeme lexbuf)) }
+  | eof { raise (SyntaxError ("String is not terminated")) }

Here we define a new exception, SyntaxError, to handle typos and unexpected syntax; we were missing it before. Error handling deserves careful attention: nobody likes using a programming language with cryptic error messages. But we are not there yet, so I'm postponing friendlier error messages to a future iteration.

As soon as the read rule matches an opening double quote, it calls the supplementary rule read_string, passing it a Buffer of initial size 16 (the buffer grows automatically if the string needs more room).

The read_string rule calls itself recursively, handling escape sequences, until it finds the closing double quote signalling the end of the string. If it encounters an illegal character, or reaches the end of the file before the closing quote, it raises an exception.
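A standalone sketch of the same buffer-accumulation pattern read_string relies on:

```ocaml
(* Accumulate pieces into a Buffer, as read_string does while scanning
   a string literal, then extract the final contents. *)
let () =
  let buf = Buffer.create 16 in          (* grows beyond 16 bytes if needed *)
  Buffer.add_string buf "Hello, ";
  Buffer.add_char buf 'w';
  Buffer.add_string buf "orld!";
  assert (Buffer.contents buf = "Hello, world!")
```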

Moving on to the parser and the print function, the implementation is straightforward:

diff --git a/lib/parser.mly b/lib/parser.mly
index 0ea03a0..2e1adc4 100644
--- a/lib/parser.mly
+++ b/lib/parser.mly
@@ -2,6 +2,7 @@
 %token <float> FLOAT
 %token NULL
 %token <bool> BOOL
+%token <string> STRING
 %token EOF

 %start <Ast.expr> prog
@@ -17,4 +18,5 @@ expr:
   | f = FLOAT { Float f }
   | NULL { Null }
   | b = BOOL { Bool b }
+  | s = STRING { String s }
   ;
diff --git a/lib/tsonnet.ml b/lib/tsonnet.ml
index c4984cc..301f3a9 100644
--- a/lib/tsonnet.ml
+++ b/lib/tsonnet.ml
@@ -11,3 +11,4 @@ let print = function
   | Float f -> Printf.printf "Float %f\n" f
   | Null -> print_endline "Null"
   | Bool b -> Printf.printf "Bool %b\n" b
+  | String s -> Printf.printf "\"%s\"\n" s

Disclaimer: it's worth noting that ocamllex is not Unicode-friendly. I want Tsonnet to support Unicode, but I'm postponing this complexity for later — perhaps using the sedlex package for that.

Having implemented all the basic JSON types, we can now move on to composite types, starting with arrays. Arrays present our first challenge in handling collections of values, and they'll require us to make our expression type recursive to support nested structures.

Parsing arrays

Arrays deserve a bit more attention. During parsing, each value will be an expression. Fortunately, OCaml's type system is expressive enough to allow recursive type definitions, so expr can reference itself. And since we want Tsonnet arrays to be immutable, let's use a list of expr:

diff --git a/lib/ast.ml b/lib/ast.ml
index 9a9358c..734e487 100644
--- a/lib/ast.ml
+++ b/lib/ast.ml
@@ -4,3 +4,4 @@ type expr =
   | Null
   | Bool of bool
   | String of string
+  | Array of expr list
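Because expr now references itself, arbitrarily nested values type-check. A minimal standalone illustration of the same pattern, using a local copy of a few constructors:

```ocaml
(* A self-referential variant type, mirroring Tsonnet's expr. *)
type expr =
  | Int of int
  | Bool of bool
  | Array of expr list

(* The value [1, [true, 2]] represented as an AST: *)
let nested = Array [Int 1; Array [Bool true; Int 2]]

(* Pattern matching can destructure the nesting directly. *)
let () =
  match nested with
  | Array [_; Array [Bool true; Int 2]] -> print_endline "nested ok"
  | _ -> assert false
```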

The lexer already has code to tokenize the expressions, but we are missing the square brackets and comma tokens:

diff --git a/lib/lexer.mll b/lib/lexer.mll
index b2eb1b4..9ff73b9 100644
--- a/lib/lexer.mll
+++ b/lib/lexer.mll
@@ -23,6 +23,9 @@ rule read =
   | null { NULL }
   | bool { BOOL (bool_of_string (Lexing.lexeme lexbuf)) }
   | '"' { read_string (Buffer.create 16) lexbuf }
+  | '[' { LEFT_SQR_BRACKET }
+  | ']' { RIGHT_SQR_BRACKET }
+  | ',' { COMMA }
   | _ { raise (SyntaxError ("Unexpected char: " ^ Lexing.lexeme lexbuf)) }
   | eof { EOF }
 and read_string buf =

Menhir has the helper function separated_list to facilitate parsing values separated by a token. It fits like a glove for arrays!

diff --git a/lib/parser.mly b/lib/parser.mly
index 2e1adc4..979f43d 100644
--- a/lib/parser.mly
+++ b/lib/parser.mly
@@ -3,6 +3,9 @@
 %token NULL
 %token <bool> BOOL
 %token <string> STRING
+%token LEFT_SQR_BRACKET
+%token RIGHT_SQR_BRACKET
+%token COMMA
 %token EOF

 %start <Ast.expr> prog
@@ -19,4 +22,8 @@ expr:
   | NULL { Null }
   | b = BOOL { Bool b }
   | s = STRING { String s }
+  | LEFT_SQR_BRACKET; values = list_fields; RIGHT_SQR_BRACKET { Array values }
   ;
+
+list_fields:
+  vl = separated_list(COMMA, expr) { vl };

The print function now needs to be converted to a recursive function to handle arrays:

diff --git a/lib/tsonnet.ml b/lib/tsonnet.ml
index 301f3a9..7eae3c8 100644
--- a/lib/tsonnet.ml
+++ b/lib/tsonnet.ml
@@ -6,9 +6,10 @@ let parse (s: string) : expr =
   let ast = Parser.prog Lexer.read lexbuf in
   ast

-let print = function
-  | Int i -> Printf.printf "Int %d\n" i
-  | Float f -> Printf.printf "Float %f\n" f
-  | Null -> print_endline "Null"
-  | Bool b -> Printf.printf "Bool %b\n" b
-  | String s -> Printf.printf "\"%s\"\n" s
+let rec print = function
+  | Int i -> Printf.sprintf "%d" i
+  | Float f -> Printf.sprintf "%f" f
+  | Null -> Printf.sprintf "null"
+  | Bool b -> Printf.sprintf "%b" b
+  | String s -> Printf.sprintf "\"%s\"" s
+  | Array values -> Printf.sprintf "[%s]" (String.concat ", " (List.map print values))
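The String.concat/List.map combination used in the Array branch works like this in isolation:

```ocaml
(* Render a list by mapping each element to a string and joining with ", ",
   the same shape as the Array branch of print. *)
let render values =
  Printf.sprintf "[%s]" (String.concat ", " (List.map string_of_int values))

let () =
  assert (render [1; 2; 3] = "[1, 2, 3]");
  assert (render [] = "[]")
```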

Building on our array implementation, we can now handle objects, the most complex JSON type. Objects are similar to arrays in that they're collections, but they introduce the additional complexity of key-value pairs and require some thought about lookup performance.

Parsing objects

An object is a collection of string keys mapped to expression values. I'm keeping it as a list of (key, expr) tuples for now, for the sake of simplicity:

diff --git a/lib/ast.ml b/lib/ast.ml
index 734e487..4ceb13d 100644
--- a/lib/ast.ml
+++ b/lib/ast.ml
@@ -5,3 +5,4 @@ type expr =
   | Bool of bool
   | String of string
   | Array of expr list
+  | Object of (string * expr) list

Later on, it might be worth using a Hashtbl to improve lookup performance for attributes and methods. We don't have that need yet; I'll refactor the object definition when the time comes.
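For reference, lookup over a (key, expr) list is linear, which is fine at this scale. A standalone sketch with hypothetical keys:

```ocaml
(* Attribute lookup over an association list is O(n); a Hashtbl would
   make it constant-time on average. Keys here are hypothetical. *)
let obj = [ ("a", 1); ("b", 2) ]

let () =
  assert (List.assoc "b" obj = 2);
  assert (List.assoc_opt "c" obj = None)
```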

The lexer and parser only need new tokens for the curly brackets and the colon, and the print function is almost the same as for arrays:

diff --git a/lib/lexer.mll b/lib/lexer.mll
index 9ff73b9..a581fcd 100644
--- a/lib/lexer.mll
+++ b/lib/lexer.mll
@@ -25,7 +25,10 @@ rule read =
   | '"' { read_string (Buffer.create 16) lexbuf }
   | '[' { LEFT_SQR_BRACKET }
   | ']' { RIGHT_SQR_BRACKET }
+  | '{' { LEFT_CURLY_BRACKET }
+  | '}' { RIGHT_CURLY_BRACKET }
   | ',' { COMMA }
+  | ':' { COLON }
   | _ { raise (SyntaxError ("Unexpected char: " ^ Lexing.lexeme lexbuf)) }
   | eof { EOF }
 and read_string buf =
diff --git a/lib/parser.mly b/lib/parser.mly
index 979f43d..6872459 100644
--- a/lib/parser.mly
+++ b/lib/parser.mly
@@ -6,6 +6,9 @@
 %token LEFT_SQR_BRACKET
 %token RIGHT_SQR_BRACKET
 %token COMMA
+%token LEFT_CURLY_BRACKET
+%token RIGHT_CURLY_BRACKET
+%token COLON
 %token EOF

 %start <Ast.expr> prog
@@ -23,7 +26,14 @@ expr:
   | b = BOOL { Bool b }
   | s = STRING { String s }
   | LEFT_SQR_BRACKET; values = list_fields; RIGHT_SQR_BRACKET { Array values }
+  | LEFT_CURLY_BRACKET; attrs = obj_fields; RIGHT_CURLY_BRACKET { Object attrs }
   ;

 list_fields:
   vl = separated_list(COMMA, expr) { vl };
+
+obj_field:
+  k = STRING; COLON; v = expr { (k, v) };
+
+obj_fields:
+    obj = separated_list(COMMA, obj_field) { obj };
diff --git a/lib/tsonnet.ml b/lib/tsonnet.ml
index 7eae3c8..98ac680 100644
--- a/lib/tsonnet.ml
+++ b/lib/tsonnet.ml
@@ -13,3 +13,10 @@ let rec print = function
   | Bool b -> Printf.sprintf "%b" b
   | String s -> Printf.sprintf "\"%s\"" s
   | Array values -> Printf.sprintf "[%s]" (String.concat ", " (List.map print values))
+  | Object attrs ->
+    let print_key_val = function
+      | (k, v) -> Printf.sprintf "\"%s\": %s" k (print v)
+    in
+    Printf.sprintf "{%s}" (
+      String.concat ", " (List.map print_key_val attrs)
+    )

Formatting the JSON output with Printf is becoming annoying and doesn't scale well, but we'll come back to it soon.

Concluding

Let's run Tsonnet against a sample object:

$ dune exec -- tsonnet samples/literals/object.jsonnet
{
  "int_attr": 1,
  "float_attr": 4.2,
  "string_attr": "Hello, world!",
  "null_attr": null,
  "array_attr": [ 1, false, {} ],
  "obj_attr": { "a": true, "b": false, "c": { "d": [ 42 ] } }
}

So far, so good! We can now handle plain JSON.

I've been ignoring an important part, one that's crucial to guaranteeing the language meets its spec: tests. We can't proceed without them. I'm excited to share how I'm using cram tests to write automated tests.

Until then!


Thanks for reading Bit Maybe Wise! Subscribe to receive new posts about the Tsonnet series.
