Opensource AI Must Win
30 points by Yogthos
Open source "AI" doesn't exist. Locally-runnable LLMs are opaque blobs provided through the largesse of megacorporations who spend millions of dollars to train each revision and exercise total control over what goes in. As an individual, non-independently-wealthy person you cannot inspect everything about how they're constructed, tinker with their original training set, and rebuild them from scratch at will. Running a necessarily pre-compiled gratis LLM on your personal computer is an act of cultivated dependence upon centralized infrastructure and an endorsement of technology which- in our present, real world- structurally cannot be controlled by individuals. You might as well write a manifesto about how cold fusion must win.
You are largely right, but there are actually models out there funded by public money whose weights, training data, methods, etc are released openly.
Granted, for the most part those models are not on the same level as frontier models if you just look at their technical capabilities (development and all that). Some of them don't aim to be: the EU has funded development of models aimed specifically at better representing the languages of the EU, for things like translation. I also remember reading various news articles in the past year or so about other initiatives like the EU one.
With that in mind, I feel like you read past the message on the website a bit.
Moving on from that, I also don't think that open source models (truly open source like the EU examples) by definition should be models that can be run locally. I also don't get the impression that the manifesto on the website thinks that either, though it is a bit sparse on information. If the model is truly open, the risk of a company behind it going belly up is greatly reduced, as the basis for further development of that model isn't locked in that company.
I guess what I am getting at is that OSS doesn't necessarily mean "cheap to do on your local machine".
There is a spectrum of openness in locally-runnable LLMs: some really are opaque blobs with little information on training data and methods, while for others the training datasets and source code are published.
Two recent, fairly open LLMs are
which means, in theory anyone can reproduce a similar model, decide which data is used for training or modify the training recipe. Your point still stands, for a non-wealthy individual pre-training is still out of reach.
There are absolutely fully open source models. These are not frontier models, but they very much do exist. OLMo is one of the models explicitly mentioned as having passed the OSI's validation phase. Pythia was also validated by the OSI as meeting its requirements for an open-source AI system. Lucie-7B is a multilingual model and one of the first LLMs compliant with the OSI AI definition. Its creators explicitly state that the training dataset, data preparation code, and model weights are all publicly available under open licenses.
The OSI AI definition is controversial because it doesn't require that all training data be made open source as well.
I personally think that's a pragmatic choice, because you can't scrape a big chunk of the web (basically every usable model out there does that) and slap an open source license on that data. If the OSI AI definition required that, it would be a definition which applied to nothing at all.
(The OSI FAQ sidesteps that issue by talking about things like medically confidential training data.)
A whole lot of people will not accept the OSI AI definition as "open source" as a result of that decision.
Even the Lucie-7B training data includes a sizable chunk of public data scraped from the web:
The Lucie Training Dataset is a curated collection of text data in English, French, German, Spanish and Italian culled from a variety of sources including: web data, video subtitles, academic papers, digital books, newspapers, and magazines
In particular, FineWebEdu is described as "15 trillion tokens of curated data from 96 Common Crawl dumps".
For practical purposes, if the weights and code are open and the model is reproducible, I think that's a pretty reasonable definition for it being genuinely open.
you cannot inspect everything about how they're constructed, tinker with their original training set, and rebuild them from scratch at will.
What's true is that not all models marketed as "open" are truly transparent and open about data, training protocol, etc. (best would be: reproducible). It's also true that it would cost a fortune to train one from scratch.
That said, would you also call using the Linux kernel "cultivated dependence"? Because the same caveats apply to it.
you could theoretically train a model and produce a ZKSNARK or similar proof that you trained it the way you said you did
the ZK part isn't really important. but those ZKS* proofs have a nice property of not getting larger as you're proving a longer thing. it'd be saying something like "starting with this data set, and updating with this procedure, there exists an ordering of parameter updates that produces this model"
(i don't see a world where this is feasible, unless someone found out a very cheap way to do it)
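To make the quoted statement concrete, here is a toy sketch of what would be proven, checked the expensive way (by replaying training) rather than with a succinct proof. Everything in it is a made-up stand-in: the one-weight "model", the `update` rule, and the helper names are illustrative assumptions, and nothing here is a real SNARK library or construction. A ZK-SNARK-style system would aim to replace `verify_by_replay` with a check whose cost does not grow with the length of the training run.

```python
import hashlib
from fractions import Fraction
from itertools import permutations

LR = Fraction(1, 2)  # hypothetical learning rate; exact arithmetic keeps replay deterministic

def update(params, example):
    """Toy 'training step': move a single weight toward one example."""
    return params + LR * (example - params)

def train(dataset, ordering, init=Fraction(0)):
    """Apply the update rule to the examples in the given order."""
    params = init
    for i in ordering:
        params = update(params, dataset[i])
    return params

def commit(value):
    """Stand-in for a published model: a hash of the final weight."""
    return hashlib.sha256(repr(value).encode()).hexdigest()

def verify_by_replay(dataset, ordering, claimed):
    """Naive verifier: redo the whole training run and compare commitments."""
    is_permutation = sorted(ordering) == list(range(len(dataset)))
    return is_permutation and commit(train(dataset, ordering)) == claimed

def find_ordering(dataset, claimed):
    """Brute-force search for *some* ordering that produces the claimed model."""
    for ordering in permutations(range(len(dataset))):
        if verify_by_replay(dataset, ordering, claimed):
            return ordering
    return None

dataset = [Fraction(1), Fraction(2), Fraction(4)]
secret_ordering = (2, 0, 1)                      # what the trainer actually did
claimed = commit(train(dataset, secret_ordering))

witness = find_ordering(dataset, claimed)
print("an ordering that reproduces the model:", witness)
```

Note that the statement is only "there exists an ordering": the search may return a different ordering than the one the trainer used, as long as it yields the same model. Replay costs as much as training, and brute-force search is factorial in the dataset size, which is the feasibility problem the comment ends on.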
AI is a civilizational infrastructure for work, education, science, software, creativity, public services, and national capacity.
No, it isn't. The people controlling it would like it to be and they're desperately trying to make it so but it isn't.
Or you could, you know, not outsource your thinking to a resource-hungry copyright-infringing hallucination machine 😘
EDIT: yes, not concentrating this power in the hands of a few powerful corporations solves one of the big problems with the current crop of AI, but it does nothing to solve the other problems.
Or you could, you know, not outsource your thinking to a resource-hungry copyright-infringing hallucination machine 😘
"Resource-hungry" is an accurate claim for training frontier models, or for letting hundreds of millions of people use the frontier models. But if someone made this claim about local AI, then they would be either getting the math wrong, or they would be promoting a fairly extreme environmental position. (Which would be a philosophically consistent position, to be clear!)
The smallest genuinely useful local coding agent is Qwen3.6 27B. This runs acceptably in occasional bursts of 280-300W on power-limited NVIDIA cards. To put this in perspective, a day of coding with Qwen3.6 27B is going to use less electricity than a couple of hours of playing Subnautica 2 on a desktop gaming machine. This is partly because you can't outsource as much thinking to smaller models. And so there will be more time when the AI is doing nothing and the human is thinking.
Training costs are larger! But if you only wanted to train a few 27B-sized models per year, that would be easily lost in the background of industrial civilization. I worked through the math at one point (using the best numbers I could get for small models and public training runs for research 7-8B models, IIRC) and it came out to something like "add the equivalent of one more geothermally-powered aluminum foundry in Iceland, and we could easily train a few 27B-sized models per year." It isn't free, but it's a rounding error.
So the inference-side power usage is less than 3 incandescent lightbulbs, and then only when the model is actually generating. The training-side power usage is equivalent to a single major industrial facility, but that could be done almost entirely renewably. (Iceland's neat like that.)
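The inference-side comparison can be sketched as back-of-the-envelope arithmetic. Only the 280-300 W burst figure and "a couple of hours" come from the comment above; the 8-hour day, the 400 W gaming desktop, the 100 W bulb, and the duty cycles are assumptions picked for illustration, and idle power draw is ignored.

```python
def energy_kwh(watts: float, hours: float, duty_cycle: float = 1.0) -> float:
    """Energy in kWh for a load that draws `watts` during `duty_cycle` of `hours`."""
    return watts * hours * duty_cycle / 1000.0

GPU_BURST_W = 300   # upper end of the 280-300 W bursts quoted in the comment
WORKDAY_H = 8       # assumption: length of a day of coding
GAMING_W = 400      # assumption: whole-system draw of a gaming desktop under load
GAMING_H = 2        # "a couple of hours"
BULB_W = 100        # assumption: one incandescent bulb

gaming_kwh = energy_kwh(GAMING_W, GAMING_H)
full_day_kwh = energy_kwh(GPU_BURST_W, WORKDAY_H)

# Fraction of the day the GPU could spend generating before it matches the gaming session.
break_even_duty = gaming_kwh / full_day_kwh

for duty in (0.1, 0.2, 0.5):  # assumed share of the day spent actually generating
    coding_kwh = energy_kwh(GPU_BURST_W, WORKDAY_H, duty)
    print(f"duty {duty:.0%}: {coding_kwh:.2f} kWh coding vs {gaming_kwh:.2f} kWh gaming")

print(f"break-even duty cycle: {break_even_duty:.0%}")
print(f"peak draw in bulbs: {GPU_BURST_W / BULB_W:.1f}")
```

Under these assumptions the comparison holds as long as the model is generating for less than about a third of the day, which is the "more time when the AI is doing nothing" point. Different wattages or a longer gaming session shift that threshold.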
"Copyright infringing" does not seem to be true under current US case law, and I would be reluctant to further increase the powers of copyright. I have been fighting copyright expansion since the 90s. So while I have no love lost for the way LLMs get trained, this seems to have moved out of the realm of copyright law (except when Anthropic just pirated books) and into the realm of politics and legislation.
But the "outsource your thinking" thing, yeah, that one is getting messy fast out there. Lots of people are trying to make themselves into meat puppets for their machine god, and it's scary.
"Copyright infringing" does not seem to be true under current US case law, and I would be reluctant to further increase the powers of copyright. I have been fighting copyright expansion since the 90s. So while I have no love lost for the way LLMs get trained, this seems to have moved out of the realm of copyright law (except when Anthropic just pirated books) and into the realm of politics and legislation.
Thank you for writing this. (And for fighting copyright expansion!)
Copyright law is a human invention. God didn't hand it to us, nor is it a cultural tradition stretching back millennia. We just made it up one day! We made it up because we thought that it would promote the arts and sciences. And maybe it was a good idea at the time; maybe it still is.
But we should stay open-minded that laws that worked centuries ago (or even decades ago) might not be optimally tuned for the current era.
Copyright infringement frequently gets brought up as a reason why LLMs and AI companies are bad. But I think it's a distraction to rally around that issue.
It's like saying murder is bad because it's illegal. Yes, it's bad, but it's a false statement because that's not why it's bad. And laws can change. If we build momentum around the idea that AI is bad because of copyright reasons, and then Congress passes a law to legalize it… is everyone suddenly going to be a-okay with AI? Or maybe we had other objections more core to our actual concerns, ones that would have been more productive to rally around?
Alexandra Elbakyan is also a copyright infringer. Yet she is a saint, and there should be a statue of her at every university.
We will soon reach peak data, beyond this I mostly see advances in agentic coagulation of a SOTA LLM. The last open source model to be released will probably be used for years to come as a foundation for ever-changing open-source agentic (or other) superstructures.
Close. More like the public must become aware of the immense public funding these tech giants have received, especially over the last two decades, and claim rights to the infrastructure their tax dollars paid for. These firms would not, and could not, exist without massive infusions of public money, not to mention that the training data is largely drawn directly from the public commons.
These models are not artifacts of private effort, they are the end result of a massive collective effort, and they must be legally recognized as a public commons.
There is no future in truly open and just machine learning and large language models without consideration of the full supply chain. I urge folks to read https://time.com/6247678/openai-chatgpt-kenya-workers/
Important or not, bubble or not, hallucinating token predictor or not, it still is super important for every country to have a legal framework to force every "Frontier lab" to open source (open weights, open training data, methods, etc.) all but their latest models. Or to force them all to open source each model after 10-15 years. For the development of human knowledge and to prevent segregation of the "haves" and the "have-nots", every model should be open sourced after some time has passed. You might think that AGI isn't close, but the intent of these frontier labs is to be the first one to reach AGI and to keep it to themselves behind a paywall. That, probable or not, needs to be prevented for the greater human good.