Wasm is not quite a stack machine
50 points by LesleyLai
If I had to guess, the underlying reasoning for not making it a full stack machine is that Wasm was always intended to be JIT-compiled to register machines, so a "swap" operation would likely have complicated things.
But it's just a guess.
Overall, a very nice article and a good overview of stack machines.
If I recall correctly from digging through design issues on GitHub when Wasm was first released, this is pretty much correct. The debate was "stack machine vs SSA", and the conclusion was "use a stack machine because the resulting code is more compact and it needs to be sent over the network" -- you can losslessly turn the stack code back into SSA at the other end, which is cheaper than shoving more bytes through the wire. Thus I consider Wasm's stack-based nature to be basically a serialization format for an SSA bytecode that will then be fed into a JIT.
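As a toy illustration of that round trip (invented opcode names, not Wasm's actual instruction set or any engine's actual algorithm): simulating the operand stack symbolically and giving each pushed value a fresh name recovers SSA-style three-address code from stack bytecode.

```python
# Toy sketch: recover SSA-style three-address code from stack bytecode
# by tracking value *names* on a simulated operand stack.

def stack_to_ssa(code):
    stack = []    # holds SSA value names instead of runtime values
    ssa = []      # resulting three-address instructions
    counter = 0
    for op, *args in code:
        if op == "const":
            name = f"v{counter}"; counter += 1
            ssa.append(f"{name} = const {args[0]}")
            stack.append(name)
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            name = f"v{counter}"; counter += 1
            ssa.append(f"{name} = add {a}, {b}")
            stack.append(name)
    return ssa

# const 1; const 2; add  ->  named SSA values
print(stack_to_ssa([("const", 1), ("const", 2), ("add",)]))
# → ['v0 = const 1', 'v1 = const 2', 'v2 = add v0, v1']
```

Because each stack slot is deterministically renamed, no information is lost going in either direction, which is the sense in which the stack encoding is "just" a serialization.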
The debate was "stack machine vs SSA", and the conclusion was "use a stack machine because the resulting code is more compact and it needs to be sent over the network"
Weird, given that Dalvik showed that a register architecture can be as dense as a stack machine.
Thanks for inspiring me, 'cause it looks like I misremembered a little. I was recalling this: https://github.com/WebAssembly/design/issues/755#issuecomment-238745151 but they were discussing the tradeoff of a tree-structured bytecode vs a stack machine and saying both were just serializations of SSA. I actually can't find much discussion at all about stack vs register machine, which seems pretty odd now...
It was not a very well-interrogated decision. I think they had some Java veterans advising. I really wish they had taken the traditional route of multiple competing designs.
It's unfortunate, given the size of the WASM ecosystem.
This article compares WASM to stack-based languages, but not to other stack-based virtual machines. There's some conceptual overlap between the two, but when talking about stack-based virtual machines, my understanding is that "dup" and "swap" aren't really common, and "load" and "store" instructions are used instead to freely move stuff between registers and the stack.
Quoting from Crafting Interpreters:
Take this little Lox script:
var a = 1;
var b = 2;
var c = a + b;
In our stack-based VM, the last statement will get compiled to something like:
load <a>  // Read local variable a and push onto stack.
load <b>  // Read local variable b and push onto stack.
add       // Pop two values, add, push result.
store <c> // Pop value and store in local variable c.
In a register-based instruction set, instructions can read from and store directly into local variables. The bytecode for the last statement above looks like:
add <a> <b> <c> // Read values from a and b, add, store in c.
edit: To add a real-world example, in CPython (stack-based):
a = 1
b = 2
c = a + b
disassembles to:
6 10 LOAD_FAST 1 (a)
12 LOAD_FAST 2 (b)
14 BINARY_OP 0 (+)
18 STORE_FAST 3 (c)
I was mostly comparing Wasm to JVM, since that's what I'm most familiar with.
The JVM has load/store instructions, but rarely introduces them in the middle of calculations, preferring dup/swap and the like instead. Both the JVM and your CPython example demonstrate that accesses to locals present in the source code are compiled to local accesses, which is true and necessary for performance. However, they don't do so for compiler-generated logic. For example:
>>> def f():
... return [x for x in [1, 2, 3]]
...
>>> dis.dis(f)
1 RESUME 0
2 LOAD_CONST 1 ((1, 2, 3))
GET_ITER
LOAD_FAST_AND_CLEAR 0 (x)
SWAP 2
L1: BUILD_LIST 0
SWAP 2
L2: FOR_ITER 4 (to L3)
STORE_FAST_LOAD_FAST 0 (x, x)
LIST_APPEND 2
JUMP_BACKWARD 6 (to L2)
L3: END_FOR
POP_ITER
L4: SWAP 2
STORE_FAST 0 (x)
RETURN_VALUE
-- L5: SWAP 2
POP_TOP
2 SWAP 2
STORE_FAST 0 (x)
RERAISE 0
ExceptionTable:
L1 to L4 -> L5 [2]
Note the liberal application of SWAP. And here's an example for COPY, i.e. dup:
>>> def f(arr, idx):
... arr[idx] += 1
...
>>> dis.dis(f)
1 RESUME 0
2 LOAD_FAST_BORROW_LOAD_FAST_BORROW 1 (arr, idx)
COPY 2
COPY 2
BINARY_OP 26 ([])
LOAD_SMALL_INT 1
BINARY_OP 13 (+=)
SWAP 3
SWAP 2
STORE_SUBSCR
LOAD_CONST 1 (None)
RETURN_VALUE
In contrast, Wasm would require locals here, even with an optimizing compiler that could theoretically emit better code for a true stack VM.
Interesting! To be honest I was mostly going by Crafting Interpreters on this, whose interpreter doesn't even have dup and swap instructions. I guess that also makes it a not-quite-stack VM. TIL, thanks.
You're talking about stack-based (as opposed to register-based) VMs and how load and store are used to move stuff between the stack and registers.
Based on my (limited) fact-finding mission, the JVM is a stack VM and doesn't really have registers (except the program counter) or a virtual register file. Each frame (function) has an operand stack and a fixed-size local variable array. Load and store are used to move stuff between the local variable array and the stack. Instructions like add pop values from the stack, compute, and push the result back onto the stack, not into registers as in a register-based VM.
Here’s an amusing example of what ~tcard is describing: The transputer has a (shallow) stack for evaluating expressions, and a “workspace pointer” that refers to an array of local variables in memory. The transputer instruction set is basically a stack machine bytecode; it can access the first 16 locals with only a single byte instruction. In the t9000 these local variables were implemented as a register file.
For stack manipulation the transputer has swap, dup and pop; dup was added relatively late in the t800 and pop didn’t turn up until the t9000 (which never shipped). see also.
IIUC we're describing the same thing. When I talk about "registers" in the context of a stack-based VM, I'm referring to what you call "local variable array". (Depending on the language, there will be ways to load/store from/to global and closure variables too.)
Put another way, the key difference between stack-based and register-based isn't the presence of registers, but whether there is an intermediate step of moving data between registers and a stack before/after operating on it.
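That distinction can be sketched in a few lines, with invented opcodes (these are not JVM or Wasm instructions): both toy VMs keep a locals dictionary, but the stack VM routes every value through an operand stack, while the register VM's add names its source and destination locals directly.

```python
# Toy sketch of stack-based vs register-based execution of `c = a + b`.

def run_stack_vm(code, locals_):
    stack = []
    for op, *args in code:
        if op == "load":      # push a local onto the operand stack
            stack.append(locals_[args[0]])
        elif op == "store":   # pop the stack into a local
            locals_[args[0]] = stack.pop()
        elif op == "add":     # pop two operands, push the sum
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)

def run_register_vm(code, locals_):
    for op, *args in code:
        if op == "add":       # add <src1> <src2> <dst>, no stack involved
            locals_[args[2]] = locals_[args[0]] + locals_[args[1]]

env1 = {"a": 1, "b": 2}
run_stack_vm([("load", "a"), ("load", "b"), ("add",), ("store", "c")], env1)

env2 = {"a": 1, "b": 2}
run_register_vm([("add", "a", "b", "c")], env2)

assert env1["c"] == env2["c"] == 3
```

Same locals on both sides; the difference is the four-instruction detour through the operand stack versus one instruction that addresses locals directly.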
I went through and read the four "WebAssembly Troubles" posts starting from the link at the bottom of this post and, checking on Lobsters, quite appreciated this comment by @kylewlacy on the Microwasm post regarding the state of Microwasm in 2025.
That Microwasm post (Jan 2019) gives a good bit of credence to the idea that transitioning Wasm to use something like a standard control flow graph would be blocked by the V8 team at Chrome:
the main answer is that V8 (Chrome’s JavaScript and WebAssembly engine) cannot easily support arbitrary CFGs - which are one of the two most important components of this format. In order to consume this format, V8 would have to either change their internal representation of control flow or implement the Relooper/Stackifier algorithm in the engine itself, and V8’s engineers have made it very clear that they have no interest in doing so.
However, this led to me remembering the Land Ahoy announcement on the V8 blog (March 2025) that they were transitioning away from their Sea of Nodes representation towards a more traditional CFG-based internal representation. So maybe there could be hope for such a proposal now?
WASM Troubles Part 1 is somewhat out of date because WASM blocks and functions do now support receiving and returning multiple values, so value lifetimes can be communicated more clearly by an optimising compiler. WASM does still support local variables, though.
We know lossless compression doesn't exist, though, so what expressive power is lost by making indices implicit?
This is probably being too pedantic but lossless compression definitely does exist? Not really sure what is meant here.
That was a brain fart, sorry. I meant always-shrinking lossless compression: there's no way to losslessly compress arbitrary data that always produces an output smaller than the input, and at first glance it seems like stack-based machines can drop indices "for free".
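The pigeonhole counting behind that claim is easy to check: there are more bit strings of length n than there are strictly shorter bit strings, so any always-shrinking "compressor" over arbitrary inputs must map two inputs to the same output.

```python
# Pigeonhole check: 2**n inputs of length n, but only 2**n - 1 possible
# outputs of length strictly less than n (lengths 0 .. n-1), so an
# always-shrinking lossless compressor can't exist for arbitrary data.
n = 8
inputs = 2 ** n
shorter = sum(2 ** k for k in range(n))
print(inputs, shorter)  # → 256 255 -- one output short of enough
```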
I was also a bit confused at that part. It's true that [u8] → [u8] always-shrinking lossless compression isn't possible, but that shouldn't make us conclude a priori that something like SSA → stack-based always-shrinking lossless compression isn't possible, since we're changing representations and know more about the structure of the domain and codomain.
(Good article, thanks for writing it!)
You're right, perhaps it's not the best analogy, but I couldn't think of anything better. Hopefully it makes some intuitive sense in context at least.
(Good article, thanks for writing it!)
Glad you enjoyed it!
Fun to see a shoutout to the Minecraft mod Hex Casting.
Well, she was forced to learn it by ~liferooter, who put her in a hole and made her learn how to use spells to get out of said hole.