Saturday, 28 January 2012

Scheme as an External DSL in Clojure

This is a follow-up post to my previous "Scheme in Clojure" post.

This time we implement a Scheme interpreter as an external DSL. This means that we consider the DSL as completely foreign to the host language, so we need to write our own parser (or reader as it's called in Clojure) and interpreter. I have to admit that this is a bit of an academic exercise because the internal DSL version I wrote about previously is both smaller (less code) and faster (as fast as any other Clojure code). However, this can serve as an example of how to write parsers in Clojure and it also highlights how elegant and succinct such a parser/interpreter can be. And of course, it's pretty darn fun :-)

All source code is available on github.

Parsing

The reader is made up of two parts; a tokenizer and a parser.

The tokenizer reads a string and produces a list of tokens. I use a "tagged list" as described by Abelson / Sussman in SICP to distinguish between the types of tokens. This is a flexible technique for dynamic languages where we can't express token types like we can in a typed language like F#. Here is an example;
Next up is the parser which takes the list of tokens as input and produces a list of expressions (or combinations as they are called in SICP). Please note that the parse functions in this example first calls tokenize on the provided string.
One big benefit of parsing a "simple" language like a Lisp is how clean and simple the parser becomes. The whole thing is about 50 lines of code, and very elegantly expressed in Clojure (if you ask me :-).

Both the parser and interpreter relies heavily on Clojure's "destructing" feature to pick elements out of strings, lists etc. This is loosely related to pattern matching found in other languages (or in Clojure via core.match for instance). In my F# implementation of the Scheme interpreter, it's indeed this destructing feature of it's pattern matching I rely most on. Here is an example of extracting the head and tail (which happens to be the operator and it's arguments!) of a combination returned by the parser;

Eval-ing
Now that we have our list of expressions we can evaluate it (or run it). I use the mutually recursive eval/apply as described in SICP. Everytime I write this I am amazed by how simple yet powerful this is, here it is in all it's glory;
One trick I use is to store all built-in functions (as closures) in an environment map. So when we evaluate a combination like [:+ 1 1] the head of that vector (:+) is looked up in the environment and a fn is returned and punted over to apply. User defined functions are represented by lists in the environment and become a separate cond-case in the code above.

In the interpreter then environment is a stack of maps, with "roots" at the bottom containing all the built-in functions. When evaluating a let statement for instance, the locals are added to a new environment map at the top of the stack, in this way bindings can be overloaded in the local context. This is also how the arguments to functional calls are bound (and how recursion can terminate).

The bulk of the interpreter code is the implementations of the built-in functions, but then again none of them are especially large. All in all the interpreter clocks in at about 200 lines, giving us an entire solution (parser, interpreter and all) in about 300 lines!

Conclusion
Even though we wrote an entire tokenizer/parser/evaluator it ended up a small and readable. It's quite a bit smaller than the F# implementation, mainly because of the lack of any type declarations. How fast is it though? The embedded DSL implementation runs (fact 10) about 1.5 times faster than the external one. The F# version runs roughly as fast as the Clojure embedded one although is doing much more work (running interpreted).