Rust should have stable tail calls
46 points by LesleyLai
Cool to see this gain momentum! Some thoughts on the missing arch support:
Whenever a function in a different object (e.g. a dynamic library) is called, the call sequence will swap out the caller's TOC for the callee's TOC, call the function, and then switch back to the caller's TOC. But, with a tail call, the callee never returns to the caller and restoring the TOC value never happens.
Might it be possible, when calling one of these functions, to generate something that looks like:
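Roughly, in illustrative pseudo-PowerPC assembly (the exact mnemonics and stack offsets here are guesses on my part, not verified ABI code):

```
# Tail caller reuses its own frame, but keeps two save slots so the
# cross-module call can go through the normal TOC-switching stub.
mflr  r0
std   r0, 16(r1)      # save link register in our (reused) frame
std   r2, 24(r1)      # save our TOC pointer
bl    callee          # cross-module call; stub sets up callee's TOC
ld    r2, 24(r1)      # restore our TOC after the callee returns
ld    r0, 16(r1)
mtlr  r0
blr                   # return to the original caller
```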
That is, we're preserving the most important aspects of a tail call (the fact it re-uses a stack frame) while maintaining correctness of the external call. This slight extra instruction cost would also only be paid when crossing the dynamic linking boundary, which should hopefully be rare.
Now, I know nothing about powerpc or the compiler backends for it, so this may well be unworkable in practice for a multitude of reasons :P
Here's an article that goes more in depth about the TOC in PowerPC and talks about tail calls too: https://maskray.me/blog/2023-02-26-linker-notes-on-power-isa It doesn't seem like a very good design; it's as if the ISA/ABI designers completely neglected tail calls. And the whole song and dance between compiler and linker required to maintain the TOC doesn't look fun to deal with...
The problem with your sketch is that it needs a stack frame for (at least) the saved TOC and link register, so a sequence of tail calls fails to run in constant space as required.
The article’s description of the TOC manipulation is a bit too brief for me to be sure how it works. I guess the caller has to set the callee’s TOC before the call because the caller knows where the callee is, and the callee doesn’t know where it is without its TOC. If the caller sets the TOC then I would expect it to restore the TOC too, so that the callee behaves the same for intra-object and inter-object calls, and intra-object calls don’t do any TOC manipulation. If so then a tail call ought to be easy: set up the tail call target’s TOC and jump to it, and let the original caller restore its TOC when the tail call target returns. But it must be more complicated than that for it to be an obstacle :-) Perhaps the tail caller has to fish out the original caller’s TOC and LR from “somewhere” and put them where the tail caller’s TOC and LR would normally be, before jumping to the tail call target; I imagine “somewhere” might be an awkwardly movable feast.
Anyway, I would have thought that this PowerPC TOC awkwardness could be avoided if tail calls are only supported within a crate, which seems like a reasonable limitation to me.
> The problem with your sketch is that it needs a stack frame for (at least) the saved TOC and link register, so it fails to run in constant space as required.
Ah, I was originally thinking about using some of the callee-saved registers for this purpose, but now I see that's just passing the buck: we'd have to reserve stack space to save and restore those ourselves too.
A way to become truly stackless would be to emit a small amount of per-function thread-local storage to save these registers, but that also seems more complex than we'd want.
> Anyway, I would have thought that this PowerPC TOC awkwardness could be avoided if tail calls are only supported within a crate, which seems like a reasonable limitation to me.
I agree; it's not a great limitation, but it'll be enough for 80% of use cases. The alternative, "cross-crate tail calls on PowerPC are strange/aren't tail-called at all", is worse.
On this topic, don't other targets like RISC-V also have a global table in a register of some kind? How does this differ from the PPC TOC and how are they handled in tail calls on those platforms? Is the difference because the PPC TOC is for reaching far-away functions while the GOT on other platforms is mainly for data?
This talks about the performance implications, but tail calls can also be required for correctness, e.g. this trivial function (because I can't be bothered remembering the better examples):
    len_tail [] acc = acc
    len_tail (h:t) acc = len_tail t (acc + 1)
    len l = len_tail l 0
is only correct if tail recursion is guaranteed.
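In Rust terms, a rough equivalent would look like this (a sketch; slice patterns stand in for the Haskell list pattern match):

```rust
// Tail-recursive length, mirroring the Haskell example. Without
// guaranteed tail calls, each recursive call may consume a fresh
// stack frame, so a long enough list risks overflowing the stack.
fn len_tail(list: &[u32], acc: usize) -> usize {
    match list {
        [] => acc,
        [_, t @ ..] => len_tail(t, acc + 1),
    }
}

fn len(list: &[u32]) -> usize {
    len_tail(list, 0)
}

fn main() {
    println!("{}", len(&[10, 20, 30])); // prints 3
}
```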
What happens to this function without tail calls?
It risks a stack overflow if the list is too long.
If you're following a linked list that deep you've arguably got more problems than just stack overflows (all that pointer chasing can really hurt performance), but tail call optimization is good for lots of other use cases, like encoding state machines with mutually recursive functions. It makes function calling more expressive.
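For example, a toy state machine as two mutually recursive functions (a sketch; with guaranteed tail calls these run in constant stack space, without them a large `n` can overflow):

```rust
// Mutually recursive "state machine": each state hands off to the
// other in tail position. Without tail calls each hop deepens the
// stack; with them the whole chain reuses one frame.
fn is_even(n: u64) -> bool {
    if n == 0 { true } else { is_odd(n - 1) }
}

fn is_odd(n: u64) -> bool {
    if n == 0 { false } else { is_even(n - 1) }
}

fn main() {
    println!("{}", is_even(10)); // prints true
}
```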
"This may exhaust some resource if not optimized" is not normally what I'd understand by "required for correctness". Any optimization that reduces memory footprint might be the difference between OOM and non-OOM, but I wouldn't call it required for correctness.
Consider a tailcall-based interpreter. If it uses proper tail calls it will work indefinitely. If tail calls don’t work properly the interpreter will run out of stack space in a few milliseconds and crash.
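As a minimal sketch of the shape of such an interpreter (names and opcodes are made up for illustration):

```rust
// Toy dispatch loop written recursively: each opcode handler calls
// the dispatcher again in tail position. Without proper tail calls,
// every executed instruction deepens the stack, so a long-running
// program eventually crashes; with them it runs indefinitely.
#[derive(Clone, Copy)]
enum Op {
    Inc,
    Dec,
    Halt,
}

fn dispatch(code: &[Op], pc: usize, acc: i64) -> i64 {
    match code[pc] {
        Op::Inc => dispatch(code, pc + 1, acc + 1), // tail call
        Op::Dec => dispatch(code, pc + 1, acc - 1), // tail call
        Op::Halt => acc,
    }
}

fn main() {
    let prog = [Op::Inc, Op::Inc, Op::Dec, Op::Halt];
    println!("{}", dispatch(&prog, 0, 0)); // prints 1
}
```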
And is that incorrect? Very few language specs actually make forward progress guarantees, and the ones that do usually have a carve-out that permits trapping/aborting on resource exhaustion.
It is surprisingly hard to specify in a language spec that you require tail call optimization without accidentally banning regular OOM, or without resorting to annoyingly low-level semantics (e.g. admitting that stack frames exist in memory, yuck). Scheme et al. just hand-wave it and say "okay, you know what I mean though, please just do tail call optimization".
> And is that incorrect? Very few language specs actually make forward progress guarantees, and the ones that do usually have a carve-out that permits trapping/aborting on resource exhaustion.
The point is that an unbounded number of tail calls must not cause resource exhaustion, so it doesn’t make sense to quibble by asking what happens if it doesn’t work properly. In that case it’s incorrect! It’s the same kind of failure as an implementation that crashes if any while loop iterates more than a hundred thousand times (e.g. by transforming loops into stackful recursion): it makes the compiler nearly useless for its intended purpose.
R6RS is actually quite precise on this topic: https://www.r6rs.org/final/html/r6rs/r6rs-Z-H-8.html#node_sec_5.11
(ETA: If you haven't seen it, the cited paper https://dl.acm.org/doi/10.1145/277650.277719 does an outstanding job of laying out the issues with precision and clarity.)
Not just Scheme (as tonyg pointed out) and other functional languages; Lua also precisely defines tail calls: https://lua.org/manual/5.5/manual.html#3.4.10 So it's definitely a common thing for interpreted languages.
Defining tail calls is easy, defining the spec for optimization is the hard part. Lua goes the machine way by just explicitly saying there is a stack and stack frames. Systems languages like C, C++ and Rust have so far eschewed that and instead base their specifications on an abstract machine.
I'd agree with you for things like pointer compression, that improve memory footprint by some constant factor.
Tail call optimization is different because it's there to uphold a certain big-O guarantee. It's similar to how a language might promise that goto is O(1), or that unreachable objects will eventually be collected. If the language guarantees that tail calls are O(1), you rely on that to reason about the big-O of programs in that language.
Sure, but then I don't know what the example function is trying to show. If tail call optimization itself is the thing guaranteed by the language, then the optimization is required for correctness at any tail call.
It overflows the stack. Obviously in practice something this trivial gets lowered to a while loop, but that's essentially a special case of tail call optimization.
So while many languages do treat tail calls as an optimization, in languages with "proper tail calls" it's a correctness feature: debug builds have to tail call as well. That said, tail call "optimization" is nowhere near as complicated as actual perf optimizations. The downside for debug builds is that you do lose the stack trace, so in less trivial functions it can make debugging harder.
This is mentioned in the article.
> A tail call is a function call that re-uses its caller's stack frame. That means that tail calls can be used for recursion without the risk of a stack overflow.
> The compiler will attempt to turn normal calls in tail position into tail calls. But sometimes it is useful, or even required for correctness, to have the compiler guarantee that a particular call is compiled into a tail call (and error when a tail call is not possible).
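A minimal sketch of what that guarantee would look like, using the `become` keyword proposed for Rust (the nightly `explicit_tail_calls` feature; shown in a comment here since it's not stable):

```rust
// Counting via self-recursion. Under the proposal, writing the
// recursive call as `become count(n - 1, acc + 1)` would make the
// compiler guarantee a tail call, and error if it couldn't.
fn count(n: u64, acc: u64) -> u64 {
    if n == 0 {
        acc
    } else {
        count(n - 1, acc + 1) // proposed: `become count(n - 1, acc + 1)`
    }
}

fn main() {
    println!("{}", count(1_000, 0)); // prints 1000
}
```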
> or even required for correctness
I was trying to find this, but I think I was looking for a whole section, and the start of that paragraph reads like a "here's how the compiler decides / what it tries" explanation, so I skimmed past the side point :-/