Which programming languages are most token-efficient?
13 points by diktomat
This seems to assume the tokenizer is set in stone. Until the hypertrainers run out of money, it seems much more feasible to retrain than to shift language preferences (not to mention you'd lose all the training materials!).
I would expect that if this were a serious problem, now that we know how important the coding assistant use case is, somebody's going to retrain with a more efficient code tokenization.
After the bubble bursts and we have to work with just tuning whatever giant models we're left with in the fossil record, things might be different.
Uiua
Token efficiency is only one of the measures that AI effectiveness depends on. Another, possibly more relevant one, is how strict the syntax is. A stricter syntax will improve the output, because it squeezes the target space into something more easily worked with – the training data will look more alike than with a more forgiving syntax. Locality of information is also paramount: AI models have a hard time keeping things straight over hundreds or even thousands of lines of code.
Rust optimizes for both those measures, having a very strict syntax that nudges programmers towards keeping information local.
Which language is the most token efficient? In my experience, Chinese is more token efficient (~1.5x more?) than English (I don't know other languages).
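For what it's worth, this is easy to spot-check with a tokenizer library. A minimal sketch using tiktoken (the cl100k_base encoding and the sample strings are just illustrative assumptions; other models use other tokenizers, so the ratio will vary):

    import tiktoken

    # One of OpenAI's public encodings; swap in whatever encoding your model uses.
    enc = tiktoken.get_encoding("cl100k_base")

    samples = {
        "english": "The quick brown fox jumps over the lazy dog.",
        "chinese": "敏捷的棕色狐狸跳过了懒惰的狗。",
    }

    for name, text in samples.items():
        tokens = enc.encode(text)
        print(f"{name}: {len(text)} chars -> {len(tokens)} tokens")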
I used to do F# like 10 years ago. I actually quite enjoyed it as a language, maybe I should revisit it with vibecoding.
I wonder if there's any benefit to having a multi-step process where (1) a more expensive model writes token-efficient pseudocode in comparably fewer tokens, which is then (2) fed to a cheaper model to write the actual code in many tokens. This is not unlike a common pattern of using a more capable but expensive model to create a plan that's then implemented by a less capable and less expensive model. But I'm imagining multiple layers where each layer is more concrete, more tokens, and cheaper.
Human brain > planner model > pseudocoder model > coder model
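A rough sketch of what that cascade could look like, purely illustrative: call_model is a hypothetical placeholder for whatever client you actually use, and the model names and prompts are made up.

    # Hypothetical three-stage cascade: each stage is more concrete, emits more
    # tokens, and runs on a cheaper model. call_model() is a stand-in, not a real API.
    def call_model(model: str, prompt: str) -> str:
        raise NotImplementedError("wire this up to your provider of choice")

    def cascade(task: str) -> str:
        # Stage 1: expensive model produces a terse plan.
        plan = call_model("big-expensive-model", f"Write a short plan for: {task}")
        # Stage 2: mid-tier model expands the plan into token-efficient pseudocode.
        pseudo = call_model("mid-model", f"Turn this plan into pseudocode:\n{plan}")
        # Stage 3: cheap model expands the pseudocode into the actual, verbose code.
        return call_model("cheap-model", f"Implement this pseudocode:\n{pseudo}")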
Interesting that Go is doing badly on this benchmark but LLMs have been said to do pretty well with Go codebases. IIRC @mitsuhiko said something like this too?
Could a custom tokenizer that’s adapted to Go make LLMs do better with Go? I guess the main cause of token bloat is the boilerplate idioms that are unnecessary in a language like Clojure or Haskell, and a Go-specific tokenizer couldn’t eliminate that.
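You can get a rough feel for how much of Go's footprint is boilerplate by tokenizing an error-handling snippet against a terser, exception-style equivalent. A sketch with tiktoken (the snippets and encoding choice are just for illustration):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    go_snippet = """
    data, err := os.ReadFile(path)
    if err != nil {
        return nil, fmt.Errorf("read %s: %w", path, err)
    }
    """

    terse_snippet = 'data = open(path, "rb").read()'

    for name, code in [("go", go_snippet), ("terse", terse_snippet)]:
        print(f"{name}: {len(enc.encode(code))} tokens")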
I just don't think token efficiency matters that much compared to some other issues. Locality of behavior is the biggest one – if Claude can figure out what a piece of code is doing without having to look at other files, it seems way more likely to interpret it correctly.
Token efficiency hardly matters, at least for the code being generated. Token efficiency matters for tool turns, and there Go is great (because, for instance, compiler errors are brief and detailed, and tests are cached).
Because the benchmarks focus on small contexts, not a huge codebase. Golang certainly is very verbose, which is pretty bad for LLMs once you want them to understand a lot and not just a single file.