The Open/Closed Problem in AI
12 points by mempko
I don't see a problem with baking LLM inference into hardware.
Our capital overlords decided they're going to stuff LLMs into everything, even if it boils rivers dry, so any efficiency improvement is helpful.
The GPUs we have are ridiculously ill-suited for LLM inference. They have way too much compute compared to RAM bandwidth. Inference at large scale is a complicated slicing of models across GPUs with very fast interconnects, and even that only improves throughput, not latency and not efficiency.
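To put rough numbers on the compute vs. bandwidth imbalance, here is a back-of-envelope sketch. The hardware figures are made-up round numbers, not the specs of any real GPU, and `decode_time_bounds` is a hypothetical helper. It relies on the usual approximation that decoding one token costs about 2 FLOPs per parameter and requires reading every weight once.

```python
def decode_time_bounds(params_billion, bytes_per_param, mem_bw_gbs, peak_tflops, batch=1):
    """Lower bounds (in seconds) for one decode step of a dense model.

    Assumes every weight is read once per step regardless of batch size,
    and roughly 2 FLOPs per parameter per generated token.
    """
    params = params_billion * 1e9
    weight_bytes = params * bytes_per_param
    flops = 2 * params * batch
    t_mem = weight_bytes / (mem_bw_gbs * 1e9)      # time to stream the weights
    t_compute = flops / (peak_tflops * 1e12)       # time to do the math
    return t_mem, t_compute


# Illustrative assumption: 70B parameters at 2 bytes each,
# 2000 GB/s memory bandwidth, 1000 TFLOP/s peak compute.
t_mem, t_compute = decode_time_bounds(70, 2, 2000, 1000)
ratio = t_mem / t_compute  # how many times longer the memory side takes
```

With these assumed numbers the memory side takes hundreds of times longer than the math at batch size 1. Batching more requests fills the idle compute, which helps throughput, but each step still waits on the same weight reads, so latency doesn't move.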
I don't think model architecture changes will obsolete such ASICs before advances in the chip manufacturing process do. We haven't moved past the Transformer architecture since the original GPT. All the inference improvements boil down to more matrix multiplications. The ASICs aren't literally a single model, just dumb GPGPUs without any pixels or triangles, maximising matmul and memory.
The ASICs aren't literally a single model
Only partially true! TPUs and GPUs aren't, but Taalas chips are literally single-model ASICs (and crazy fast).
Also, matmuls plus plain transformers no longer get you all the way there. Modern LLMs, because of the computational cost of self-attention, only use full self-attention on a handful of layers, with the others using sliding-window attention, recurrent blocks, or convolutional blocks. (SWA is just attention, so that's fine, but the others are quite different.)
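For anyone wondering why SWA is "just attention": the only difference is the mask. A minimal pure-Python sketch, where the function names and the window size are my own illustrative choices:

```python
def causal_mask(n):
    """Full causal attention: position i may attend to every j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def sliding_window_mask(n, window):
    """Sliding-window attention: position i sees only the last `window` positions."""
    return [[(j <= i) and (i - j < window) for j in range(n)] for i in range(n)]


def allowed_pairs(mask):
    """Count the (query, key) pairs the mask lets through."""
    return sum(sum(row) for row in mask)


full = allowed_pairs(causal_mask(8))             # grows quadratically with n
local = allowed_pairs(sliding_window_mask(8, 3)) # grows linearly with n
```

The matmuls are the same either way, which is why matmul-focused hardware handles SWA without trouble. Recurrent and convolutional blocks don't reduce to a mask change like this.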
It is not 100% clear to me whether Taalas allows at least fine-tuning (for the exact same structure); some fragments of their website say yes, while other parts of the site literally contain «Lorem ipsum», so I am not sure of anything.
I see your argument that the 'deciders' made their decision around LLMs. I tried to make the claim that the decision is wrong. But you are right that if they won't change their minds, we might as well make it more efficient.
The AI part of GPUs is already just a bunch of matrix multiplication units, and e.g. Google have been using (and leasing to others) their TPU ASICs for ages now.
Our brains use a closed loop to learn. Our brains have a model of the outside world; they make predictions on what our senses should sense, and then check our senses to see how far off the prediction is. If the prediction is wrong, the brain is surprised and updates the model to make a more accurate prediction. In other words, there is no outside process for our brains to accumulate knowledge. It's done all inside our brain, a closed loop.
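The predict, compare, update loop described above can be sketched as a toy error-driven update. This is only an illustration of the loop's shape, not a claim about how brains actually implement it; `closed_loop_learn` and the learning rate are hypothetical.

```python
def closed_loop_learn(observations, learning_rate=0.5):
    """Toy predictive loop: predict, compare with the senses, update the model."""
    estimate = 0.0  # the internal "model" is a single number here
    for sensed in observations:
        predicted = estimate                  # what the model expects to sense
        surprise = sensed - predicted         # prediction error
        estimate += learning_rate * surprise  # update proportional to surprise
    return estimate


# Assumed input: the senses keep reporting the same value, 10.0.
final_estimate = closed_loop_learn([10.0] * 20)
```

Each pass halves the remaining error with this learning rate, so the estimate approaches the sensed value and the surprise shrinks toward zero.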
Wrong in a very fundamental way. The world is the outside process. Kant's division of phenomena and noumena forgets that we have immediate access to the objects of our contemplation and can make direct changes to them.
Interesting.
Could you make the same point about the stack register? Back in the day there used to be a lot of different programming paradigms, but now we all just use stacks. But stacks seem fine so it's not a problem?
C++ got coroutines in C++20, which are a generalization of function calls. The stack is OK, but languages today still have to work around it to do interesting things.
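To show what "generalization of function calls" means, here is a sketch using a Python generator as a stand-in. This is an analogy, not the C++20 coroutine API. A plain function runs once from entry to return; this one can suspend, keep its local state, and be resumed with new input.

```python
def running_total():
    """Suspends at each yield and resumes with the value passed to send()."""
    total = 0
    while True:
        value = yield total  # hand back the total, wait to be resumed
        total += value


gen = running_total()
first = next(gen)      # run up to the first yield
second = gen.send(5)   # resume with 5, suspend again at the next yield
third = gen.send(7)    # local variable `total` survived between resumptions
```

The local `total` outlives each suspension, so it can't sit in an ordinary stack frame that is popped on return. That is the kind of working around the stack mentioned above.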