How To Make a Fast Dynamic Language Interpreter
37 points by m0th
When I compare my latest interpreter to Zef, I see that language design decisions have an impact on performance. The first draft of my interpreter was much faster than the Zef first draft, and I'm considering why.
The one "sophisticated" choice I thought I was making was to use NaN boxing, like Zef does. Otherwise, it's a simple dynamic language with a tree walking interpreter.
Zef has both 32-bit ints and 64-bit floats as separate cases of tagged immediate data in its NaN-box implementation. I don't see why: the range of a 32-bit int is entirely contained in a 64-bit float, and this design choice increases code complexity and decreases performance. I decided to support just one numeric type for simplicity (like JavaScript). If a boxed value is a number, then the representation is just a valid 64-bit float. Other values (including pointers) are represented as NaNs and require tag checking. To add two numbers on the fast path, I just directly sum the NaN boxes; if the result is not NaN, then I got lucky and added two floats, so that's the result, already in boxed form. Otherwise, I enter the slow path and check tags to support array math, and also to check whether I legitimately got NaN from adding two numbers (e.g. inf + -inf). So this makes arithmetic fast.
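The fast path described above can be sketched in a few lines of C. This is a minimal illustration, not code from either interpreter; the names and the slow-path stub are mine:

```c
#include <stdint.h>
#include <string.h>
#include <math.h>

/* NaN-box: any bit pattern that is a valid double is a number;
 * NaN payloads carry tags for pointers and other values. */
typedef uint64_t Value;

static double as_double(Value v) {
    double d;
    memcpy(&d, &v, sizeof d);  /* bit-cast; avoids aliasing UB */
    return d;
}

static Value from_double(double d) {
    Value v;
    memcpy(&v, &d, sizeof v);
    return v;
}

/* Stand-in for the slow path: a real interpreter would check tags
 * here (array math, pointer operands, legitimate NaN results). */
static Value add_slow(Value a, Value b) {
    (void)a; (void)b;
    return from_double(NAN);
}

Value add(Value a, Value b) {
    double sum = as_double(a) + as_double(b);
    if (!isnan(sum))
        return from_double(sum);  /* both operands were numbers: done */
    return add_slow(a, b);        /* tag checks happen off the hot path */
}
```

Note that the fast path never inspects tags at all: a non-number operand is itself a NaN, so the sum comes out NaN and control falls through to the slow path automatically.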
For local variable lookup, instead of hashing a string (the local variable name) and walking a chain of hash tables at run time, I allocate a stack frame object for each function call. The stack frame contains slots for local variables. Reading or writing a local variable at runtime is just indexing into the slot array. You can't get any faster than that for a simple interpreter. The code for assigning indexes to local variables was not difficult to implement in the first draft. I did use two passes: first parse to create a concrete syntax tree, then perform "semantic analysis" to generate an AST, which is then interpreted. Zef now speeds up local variable access using JIT-style inline caching, which is more complicated than what I would personally choose for a simple interpreter.
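A stack frame with slot-indexed locals can be sketched like this in C; the layout and names are illustrative, assuming slot indexes were already assigned during semantic analysis:

```c
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t Value;  /* NaN-boxed value */

/* One frame per call; locals live in a flat slot array. */
typedef struct Frame {
    struct Frame *caller;  /* dynamic link to the calling frame */
    size_t nslots;
    Value slots[];         /* C99 flexible array member */
} Frame;

static Frame *frame_new(Frame *caller, size_t nslots) {
    Frame *f = malloc(sizeof(Frame) + nslots * sizeof(Value));
    f->caller = caller;
    f->nslots = nslots;
    return f;
}

/* Variable access is a single array index -- no hashing, no chains. */
static inline Value load_local(const Frame *f, size_t i) {
    return f->slots[i];
}
static inline void store_local(Frame *f, size_t i, Value v) {
    f->slots[i] = v;
}
```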
Zef allows any construct to be arbitrarily nested in any other construct, and so does my language. Thus I have to implement closures. In my VM, a closure has an array of slots for non-local, non-constant variable references. At runtime, non-local variables are referenced by indexing into this array (which is very fast), rather than the Zef technique of walking a chain of hash tables and speeding this up using inline caching and run-time compilation. The run-time cost of my closure representation is allocating a chunk of memory to represent the closure object, and populating the non-locals array by copying NaN boxes into this array. My closure representation also doesn't support any form of monkey-patching, which is okay because Curv is a mostly functional language with immutable data. However, I note that Zef also claims to be functional and not support monkey-patching.
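The closure representation described above might look roughly like this in C. Again a hypothetical sketch, not the actual Curv code; the one runtime cost is the allocation plus the memcpy of the captured boxes:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint64_t Value;  /* NaN-boxed value */

/* A closure pairs an entry point with a flat array of captured
 * values, copied in when the closure is created. Capture indexes
 * are assigned at compile time. */
typedef struct Closure {
    void (*code)(struct Closure *self);  /* function body entry point */
    size_t ncaptures;
    Value captures[];                    /* copied NaN boxes */
} Closure;

Closure *closure_new(void (*code)(struct Closure *),
                     const Value *vals, size_t n) {
    Closure *c = malloc(sizeof(Closure) + n * sizeof(Value));
    c->code = code;
    c->ncaptures = n;
    memcpy(c->captures, vals, n * sizeof(Value));  /* capture by copy */
    return c;
}

/* A non-local read at runtime is one array index. */
static inline Value load_capture(const Closure *c, size_t i) {
    return c->captures[i];
}
```

Because captures are copied rather than shared, this only works cleanly when the captured variables are immutable, which is exactly the trade-off the comment describes.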
I appreciate all the performance analysis and reading the rationale for the various performance improvements.
This was a great article. I appreciate the table form showing each step before getting into the details, as well as the check-in at the end of each step showing the overall progress relative to other interpreters. On top of that, there's some great info in here! In particular, the NaN tagging reference led me to read https://arxiv.org/html/2411.16544v1 today, which was very interesting and relevant to some optimization I've been doing.
Fil's not only been doing a ton of great work in the compiler space, his articles, gists, and comments have been incredibly helpful for me.
I would also not pick Rust, since a garbage-collected language requires a heap representation that has global mutable state and cyclic references.
You can have cyclic references in Rust, it’s just more complicated (e.g. requires pinning).
Then there are high-level crates that implement exactly the kind of things you need for a garbage collected language runtime.
You can implement cyclic, garbage collected smart pointers in Rust that are concurrently collected without any pinning that have the same perf overhead as Arc, although you do need some way to trace the inner type's references. I don't think that's much of a caveat though, considering you'd need that same functionality for tracing GC.
Shameless plug: this is done in my Scheme implementation; see the smart pointer code and the collection code.
I wouldn't use Rust either, for the VM of an interpreted language, even if it is theoretically possible to do so. Memory safety in a VM is guaranteed by the language's compiler, its runtime and virtual machine, not by the type system of the implementation language. And as you said, the code is less complicated if you use C++. The availability of multiple memory-safe runtimes for C++, such as the author's use of Fil-C++, is very handy for debugging memory safety issues in the unavoidably unsafe code that you must write for a VM.
I'm missing the JIT section, but I enjoyed the benchmarks and in-depth optimisation showcase. I will definitely steal / port the benchmark test suite to my own language and attempt to optimise around it. Great article.