LLM Neuroanatomy: How I Topped the AI Leaderboard Without Changing a Single Weight
54 points by knl
This is a fascinating read about how the models are structured. Even if you are tired of all the vibecoding articles lately, this one is correctly tagged as ai, because it gets much deeper into how these things work and what structural changes to a model ended up doing to it.
It bothers me that the Transformer architecture spends the same amount of compute on a yes/no answer to an arbitrarily complex riddle as it does on "The " in any other message.
It's fascinating that layers can be looped. Perhaps the next step would be to have a model dynamically select the number of loops, or choose to skip groups of layers MoE-style?
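A minimal sketch of that dynamic-loop idea: apply one shared block repeatedly and stop when the hidden state stops changing. The "block" here is a made-up toy (residual linear map plus tanh) and the convergence rule is a crude stand-in for a learned halting mechanism like ACT — nothing here is from the article's implementation.

```python
import numpy as np

def looped_block(x, W, max_loops=8, tol=1e-3):
    """Run a single shared block up to max_loops times, exiting early
    once the hidden state stops moving. x: (d,) hidden state,
    W: (d, d) toy block weights. Returns (final state, loops used)."""
    for i in range(1, max_loops + 1):
        x_next = np.tanh(x + x @ W)   # toy residual block
        if np.linalg.norm(x_next - x) < tol:  # "easy" input: converged
            return x_next, i
        x = x_next
    return x, max_loops
```

The appeal is that the loop count becomes an inference-time knob: easy tokens exit after one pass, hard ones get more iterations through the same weights.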
Inference-time compute (chain-of-thought) allows models to dynamically use more compute on harder problems. There's an active area of research around whether making them write down their "thoughts" is actually useful (i.e. it potentially recompresses a rich representation down into token space). Meta looked at feeding the last hidden layer back into the first layers of the model with Coconut, allowing the model to "think" in latent space rather than with words.
CoT is orthogonal to this. It still doesn't remove the inference bottleneck (which is memory-bound by a ridiculous factor).
Using a smaller set of layers in a loop could make them cacheable, or skipping layers conditionally could save memory bandwidth on easy tokens: https://arxiv.org/html/2507.10524v1
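The bandwidth argument can be made concrete with a sketch: if a per-layer gate decides a token is "easy," the layer is skipped entirely and its weights never need to be streamed from memory. The gating rule and names below are invented for illustration — real routers are learned, not thresholded scores.

```python
import numpy as np

def skip_aware_forward(x, layers, gates, threshold=0.5):
    """Run a stack of residual layers, skipping any layer whose gate
    score falls below threshold. x: (d,) hidden state, layers: list of
    (d, d) weight matrices, gates: per-layer scores in [0, 1].
    Returns (output, number of layers skipped)."""
    skipped = 0
    for W, g in zip(layers, gates):
        if g < threshold:
            skipped += 1          # weights for this layer never touched
            continue
        x = x + np.tanh(x @ W)    # toy residual block
    return x, skipped
```

Every skipped layer is a full weight matrix that never crosses the memory bus for that token, which is where the savings come from on a memory-bound workload.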
Meta looked at feeding the last hidden layer back into the first layers of the model with Coconut, allowing the model to "think" in latent space rather than with words.
Why hasn’t this approach become more widespread?
I suspect a few reasons:
My guess is that going through human-language tokens isn't lossy enough to be a problem, and the models can recover the fuzzy latent space meaning with a few layers (as demonstrated by conversations in base64), so the gain from using latent space directly is incremental.
We don't have any training data in that latent space, and it would be freaky if LLMs developed a "neuralish" language we don't understand.
I'd guess you could get even better results by including this layer duplication at training time. Let the model optimize what params/circuits it puts in this explicitly repeated section. Hard to imagine that wouldn't be even more effective.
Seems to somewhat correspond to the way that thought tokens are used in modern models. I remember reading somewhere that a lot of the benefit from those might just be that they give the model more trips through the transformer to process advanced concepts before generating its output while keeping param count constant. This is a pretty compelling idea in that it can make that extra compute power more explicit.
It seems like a simple idea in retrospect, but a lot of things in AI tend to be, I guess. I wouldn't be surprised if this or something similar was already in use by some of the big labs, though.
I think first training with sparse looping, then progressively looping more and more of the chunk would improve trainability.
Also initially loop it once, then add more iterations as long as it proves beneficial. All while training.
But plausibly one could go the other way, too. Try to figure out exactly what circuits were involved and add them as hardened units.
I still think that in the upcoming decades we will see more and more of this. Even hardening proven nets as semi-analog circuits. Especially since we see quantization working out decently. Once we stop streaming weights and they just sit there as unclocked transistors, the inference speeds will be incredible.
I personally think the anatomy described here, like previous studies that try to get at truthiness or the structure of intermediate layers, rests on a fragile signal. There are many unfalsifiable conjectures, communicated with an exuberance that makes it hard to tell what is inferred versus actually measured.
I'm not particularly familiar with how transformer models are actually structured beyond what the article described, but I can't help but wonder if it would be possible, by a similar method, to identify layers that all serve the same or similar purposes and replace all but one of them with pointers. It seems that it might be possible to reduce memory requirements pretty dramatically, depending on the number of "redundant" layers. I know pruning layers is already relatively common. This feels similar, maybe without some of the potential performance sacrifices, though.
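The pointer idea can be sketched as weight sharing (this is roughly what ALBERT did across all layers): keep one canonical copy of each distinct weight matrix and have "redundant" layers reference it. The similarity test below — raw L2 distance between weight matrices — is a placeholder; a real system would compare what the layers compute, not their raw weights.

```python
import numpy as np

def dedupe_layers(layers, tol=1e-2):
    """Replace near-identical layers with references to one canonical
    copy (Python object identity stands in for pointer sharing).
    Returns (layer list with shared objects, number of unique layers)."""
    canon = []   # one stored copy per distinct layer
    out = []
    for W in layers:
        for C in canon:
            if np.linalg.norm(W - C) < tol:
                out.append(C)     # point at the existing copy
                break
        else:
            canon.append(W)       # genuinely new layer: store it
            out.append(W)
    return out, len(canon)
```

Memory for the stack then scales with the number of unique layers rather than the total depth.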
It sounds similar to a Mixture of Experts architecture, but with experts stacked serially instead of in parallel.
Any gene that can be duplicated once, can be duplicated many times. What's the marginal improvement from one more duplication?
Depends on the selection pressure. Elephants have ~40 copies of p53, the "guardian of the genome", vs the two copies Homo sapiens has.