Using LLMs at Oxide
77 points by steveklabnik
This seems like a very reasonable take.
Additional factors that affect my personal decision of whether to use this tech for the listed use cases:
Oxide employees all make $235k. Call it about $20k a month. So a 1% productivity increase is worth on the order of $200 per month. Personally I suspect the boost is more like 5%, at least in the short term, and if that’s true, it’s a bargain. On a per-task basis, when I do a task in an hour with little effort that would have otherwise taken me a day or two of high effort, and it costs $20 of LLM tokens, that feels like an incredible deal.
I am definitely worried about the community atrophy you mention, which would be a cost that only shows up over a longer period. I am not really worried about inefficiency (efficiency has improved continually for years and isn’t slowing down) or dependency on the internet — I was already highly dependent on internet connectivity before LLMs.
but going from $0 to $200 per month for an unnecessary expenditure is a big deal, particularly when you consider whether we want that flow of money to take place on a grand scale.
I think that can be explained by the cost for most people is anywhere between free and 100 USD. I’m sure most don’t pay 200 USD. Also if you pay 200 you probably don’t consider it unnecessary.
What I have found is that maintaining two small subscriptions at $20 per month each kinda gives me just the right amount of tokens to get things done at $day_job.
It also comes with the added benefit of being able to switch to a different competitive model based on different labs' release cycles.
You definitely don't need to go up to $200 per month unless you start to rely on LLMs to automate a lot of things. In that case, I think a pay-as-you-go model might work out better.
If one takes the ideas in https://www.benkuhn.net/10x/ to their logical conclusion, then a working programmer with sufficient autonomy can easily justify paying ten times the cost if the more intensive use of LLMs allows them to focus on the "highest-order bits" of a problem.
For the past several months I’ve been working on a skunkworks project with one other part-time (very) senior developer. It’s gone from crazy idea to prototype to an important part of the 2026 roadmap for a medium-size company. We’re seriously considering whether or not we even need more than the 1.5 of us to get it to MVP — maybe 2.5.
I’m used to being part of early product exploration like this, but the level of productivity this time is an entirely new experience. $200/month is wildly underpriced for what we’re doing. The ability to deeply and rapidly iterate on the thing is incredibly valuable.
The market for that mode is pretty darn small, so I still have grave doubts about the bigger picture of LLM assistants as a product category, but personally I’m in a sweet spot for it right now.
This article is interesting in that it identifies that what matters is first of all what the organization as a whole is doing. It doesn’t really identify whether the individuals matter that much, or if anyone adequate in those contexts would do just as well.
I mean, meso-level productivity (which is the realm in which firms live and are modelled in economics) is going to be a function of micro-level productivity and institutional/org/mechanism design.
Your assertion about "what matters is first of all" is slightly unclear: are you saying that "the highest order bit" is org (dys)functionality?
"whether the individuals matter that much," There are some attempts at collecting individual data in this direction (not LLM-specific): http://www.knosof.co.uk/ESEUR/.
"anyone adequate in those contexts" I think this heavily hinges on your definition of adequate. Adequate for what task? Some correlate more than others with the bundle of traits we have come to expect in folklore with the "10x programmer" or "100x programmer", or the more popular (at least on Lobste.rs) "thought leader in backend/SRE/PLDI/FM that is affiliated with the Recurse Centre", and some do not.
I pay 25 USD per month to Kagi and have access to quite a few LLMs via their Assistant interface. Maybe the "real cost" is higher, but if they don't push that to the user, how will the user know about it? I have all the privacy concerns you mention; that's why I test local LLMs every now and then to get a grip on which tasks I can offload to local models.
there’s a lot here that i find pretty agreeable but this is a bad idea, even with the qualifier:
More concretely: while LLMs can be a useful tool to assist in the evaluating of candidate materials per [rfd3], their use should be restricted to be as a tool, not as a substitute for human eyes (and brain!).
…and while i agree that this is true in theory, it has been astonishingly false in practice:
Engaging an LLM late in the creative process [..] allows for LLMs to provide helpful feedback on structure, phrasing, etc. — all without danger of losing one’s own voice.
there are ways that you can incorporate LLMs into the process of reviewing and refining your prose, but the danger of losing one’s own voice is a constant factor.
if you approach them with this understanding then they can be used like any other tool that presents a hazard to the user, but that’s not the framing chosen here.
This is missing the two elephants in the room: ethics and environment.
LLMs do not respect the licences of the codebases they have consumed and digested. No attribution for *BSD style licences, and no respect for *GPL sourced content.
On the environment front, the oceans of fresh water required for cooling is going to directly contribute to severe water shortage issues in states like Arizona and Virginia, the noise pollution is horrific for people who have no choice but to stay in their devalued datacentre-adjacent houses, and the use of gas, coal, and other fossil fuel sources for the prodigious energy consumption is fundamentally obscene.
Overall, general LLM usage is a literal existential threat for humanity.
This article attempts to portray a reasoned case for LLM usage by avoiding the actually important issues.
I’d like to see Oxide do better.
The energy usage and CO2 footprint issues are very real. The water issue is not nearly as credible, and I think is arguably a distraction from the more important energy issues.
ok, i've finally sat down to read that whole article, as well as his older one that was linked within
here is GP, specifically the claim you are presumably responding to
the oceans of fresh water required for cooling is going to directly contribute to severe water shortage issues in states like Arizona and Virginia
here is how the linked EA rationalist guy addresses this specific claim (these are quotes from that first article, bolded headers mine)
obviously i framed each header to satirize the assumptions underlying each point made by our favorite rationalist. i left out any points that did not relate to GP's claim, which again is
the oceans of fresh water required for cooling is going to directly contribute to severe water shortage issues in states like Arizona and Virginia
i'm not even an AI hater, but please, i just spent a lot of time reading an article (or two) hoping to learn a thing and every part that might possibly contradict GP's claim just. doesn't hold water. if it's not meant to intentionally mislead, then it's editorialized ai slop. i'm willing to be proved wrong, but otherwise please stop sharing this guy's links...
Hank Green just posted an independently sourced exploration of the same issue which came to effectively the same conclusion - that the water issue is far less concerning than the coverage of it would have you believe: https://youtu.be/H_c6MWk7PQc
I'll admit that I wasn't responding to the specific argument here about Virginia and Arizona: I saw a mention of water usage as an environmental concern and effectively thought "here we go again", since it's such a commonly miscommunicated part of the story here - see also Hank Green's expressed frustration about how poorly explained it is.
the only reason i went to re-explore this argument (i.e. read that article) was because i watched hank green's video. i'm not sure if we watched the same video, as i remember his conclusion was very clearly not that the water issue was less concerning than reporting, it was that the nature of the problem makes it easy for people talking about the issue to mislead others (in either direction of the debate)
okay, i rewatched the video just now, yes hank is very adamant about his point about misleading people, he says it at the start and at the end. quote from the 15 minute mark:
My point here overall is that this is complex and that means to me two things.
One, it's really easy to mislead about it. It's easy to misrepresent by leaving out training and making the water use seem really small. It's easy to misrepresent by including water flowing through a power plant to make water use seem even bigger than it is.
But the second thing that I hope you walk away from this video with is that like I know next to nothing about this. Like I have a master's degree in environmental studies that's old, by the way. This is pre-AI. So, I have like some basis from which to start doing the research necessary that I did to write this video, but I still know basically nothing. Like, if an expert watches this video, they're going to see a ton of holes in it. They'll be like, "That thing's kind of wrong and like, oh, he forgot about this." Like, this is the reason why expertise is so important because the discourse is terrible at expertise.
But there are people who are actually involved in this conservation who are actually involved in trying to like mitigate this to decrease the water use and they are aware that AI is just one of many industrial uses of municipal water and that that is very different from water evaporating at a power plant that isn't coming from municipal water...
i included that last appeal to authority just to mention that he goes on to address something disjoint from "the impact of datacenters in water-stressed locations"
by the way, the argument hank personally makes towards this issue towards the end of his video is the same argument rationalist guy has regarding golf and water parks, it's that corn uses a bajillion times more water, we should be knocking corn instead. which obviously doesn't address the data centers at all. here's a line i deleted from my previous comment since it wasn't one of the points that addresses GP's claim:
i'm honestly here to understand this issue and looking for sources (pro or against) i can reference to my friends in good conscience. normally i wouldn't be replying if not for the fact that i follow your blog and look up to your (very good, very well-meaning) work. but we all have the duty to hold our idols to a high standard, and hiding behind others' longform, which don't actually dismiss the issue you're trying to dismiss, just ain't it chief
My frustration here is that I frequently see the water issue as the thing people find most upsetting - like "how could you use LLMs when they're destroying the water supply" - and I see that as a result of extremely poor explanations of water usage. Any article that complains about "millions of gallons of water per day" without even attempting to help people understand what that number means is either accidentally or deliberately misleading IMO.
I like Andy Masley's piece because, for several months, it was pretty much the only well sourced piece of writing that pushed back against this narrative!
I'm very glad that Hank Green has produced a video on this now. "It's more complicated than people let on and other industrial uses of water are far more severe" feels like the right message to me.
look, you aren't going to change the fact that 'the LLMs are draining the lakes' is going to evoke more reaction than graphs of the CO₂e impact of data centers. it's basic marketing. it's why the labcoat people in early pepto bismol commercials claim that "it coats your stomach" and never say the word 'antacid'.
my point is that if you want to hand-wring about "extremely poor explanations of water usage" then there needs to exist airtight explanations of water usage. i am trying to find these explanations, and it doesn't help when the explanations you link are also 'extremely poor' when held to a basic level of inspection. it doesn't matter how much research and how well sourced it is if the resulting argument is full of holes. Andy's entire argument is predicated on the faulty assumption that data centers are an equivalent industry to all other water-guzzling industries (if not better) because of the massive (largely construction) tax revenue they bring in. Hank's video is fine but you can't simply cite it as a full shutdown / dismissal of the topic since again, he makes a completely different (but valid) point that does not address the water concern
edit: i personally also want to push back on water usage being a big deal, because i really do think energy and carbon is the more important piece, to act against that marketing bias. but if people are only sharing shoddy arguments regarding water usage then i don't really think i can do that anymore, not in good faith
to be extremely crystal clear, because of aforementioned marketing, people Care More about water than CO₂e whether you like it or not, and if it really isn't such a big deal, then there should exist a way to push back without appealing to greater water usage evils. using Hank's (very valid) point of "It's more complicated than people let on and other industrial uses of water are far more severe" to address the preferential treatment of data centers in water-stressed locations, is completely out of place. it's like saying that food waste isn't a big deal, and that if you still care about food waste then you might as well complain about restaurants, not {policy, storage, logistics, etc}
LLMs do not respect the licences of the codebases they have consumed and digested. No attribution for *BSD style licences, and no respect for *GPL sourced content.
It’s very rare for LLMs to recall licensed content verbatim these days. I doubt that you would be able to get it to produce GPLed snippets that cross the threshold they would require annotating it with a license.
I tried repeatedly in the last year to make it spit out copyrighted content and failed.
IANAL but something doesn’t have to be verbatim for it to be plagiarised. Our threshold has often been to require clean-room implementations to assure no license violation. Even if you take the most optimistic view of what LLMs are doing, an LLM can’t be said to write a clean room implementation of something in its training data.
IANAL but I don't think this would hold up in any court. Consider an algorithm described in a paper. Someone implements it and releases the code under GPL. Later, someone else implements it without ever seeing the GPL code. There definitely would be similarities between the two, and one might argue that the second implementation plagiarised the first if they only look at the code. I think the same holds for LLMs. It doesn't have to be 100% verbatim, but there has to be a considerable portion that is extremely close to the GPL source to have any basis for the claim. And despite "everyone knows they train on FLOSS code" it's still fairly hard to get close reproductions of source code. And even then you'd have to prove that the LLM was trained on that specific code and not on anything substantially similar but, say, in the public domain or closed source licensed specifically for training purposes, which is basically impossible.
This is just a money laundering argument. Mixing dirty money into a clean business doesn’t make the dirty money clean, if anything it makes the whole business dirty.
But the “business” here wasn’t even clean to begin with. GPL is not the only issue, as permissive licenses require attribution. If I were to read someone else’s implementation and make my own version incorporating their ideas, I would feel compelled to credit them, honestly even if their license didn’t require it.
This tech is all about resynthesizing existing work. And that’s a cool and useful tech! But if those working on it were concerned with being good citizens, they’d work on tools to better identify the sources behind generated text, instead of trying to pretend that’s not what’s happening, and hiding behind cheap tricks that prevent verbatim regurgitation
intarga and yourself are approaching this from two entirely different perspectives.
They are asking 'is it ethical?' to which the answer is obviously 'no, with caveats'.
You are asking 'can I get away with doing the unethical thing?' to which the answer is 'yes, probably, because IP law just isn't designed to handle this sudden technological mode shift'.
I can see that. However, prefacing with IANAL puts their response in a legal plane, not an ethical one. I wouldn't even have responded if they had just made an ethical argument.
TBH, I see this kind of mixed signal from FLOSS all the time. It made me uneasy but I couldn't quite put my finger on it. Thank you for giving me the right framing.
I don’t like when people put words in my mouth, it was a legal argument, whether or not I also added my own feeling of justice.
Money laundering was also difficult to punish under existing laws, partly because it was intentionally designed to be so. But judges recognised it as against the spirit of those existing laws, so broadened their scope and interpretation, and eventually lawmakers made new laws to specifically target money laundering, making it easier to punish.
I say IANAL because I’m not qualified to give legal advice, but telling people to take a shot on “this is illegal, but hard to prove in court” seems irresponsible in my opinion.
Personally I think the ultimate legality of this will come down to economic factors. If everyone’s pension fund was built on the cocaine trade, I don’t think it would be illegal, and that’s clearly what the companies that make these are going for by trying to shove LLMs and their output into everything. Lawmakers and judges might look differently on that if by the time they’re making decisions it has crashed the global economy so YMMV.
https://lobste.rs/s/d7wdhw/fsf_considers_large_language_models#c_twveil
I'm assuming the purpose of this link was the regurgitation of o3-mini for the shader: I'm not claiming that recall is impossible, I know it can happen. But at the same time my personal experience in the last year has been that it's impossible for me, even when trying maliciously, to get an LLM to recall copyrighted content. This might be possible for some edge cases but it definitely is not the norm.
I encourage you to actively try it, it's not easy with a SOTA model.
I do not care if the sausage machine is really good at producing sausages that don't look like pigs. I care that the sausage machine ingested pigs to begin with. This is a perfect example of whitewashing, and you're just dismissing it because 'nobody will notice'.
The OP is concerned about the ethical question, not whether we can convince the lawyers to let us get away with ignoring the ethical question.
I think OP was concerned with the machine spitting out any pig meat, instead of just one that spits out meat from organically fed, free-roaming pigs. Maybe such a machine can be made, but I think practically the machine saw a lot of meat in the past and now actually generates artificially grown meat.
I'm not claiming that recall is impossible, I know it can happen
Going back to the ethics elephant, this is where a significant amount of people (imho fairly) draw the line at "I won't use it".
I think, if you're saying to somebody who cares about the ethics discussion of LLMs (or generative AI): "try it out actively, it almost never happens!" then you're going to be interpreted as dismissive.
Maybe there are models out there that actively try to only use MIT (I think Zed has one? can't find the link tho), or try to be energy efficient. A list of those would probably be more actively helpful if your goal is to convince people that are ethically against LLMs.
Going back to the ethics elephant, this is where a significant amount of people (imho fairly) draw the line at "I won't use it".
Which is perfectly fine and acceptable.
Ad MIT: there is fundamentally not much of a difference between GPL and MIT etc. when it comes to LLM generated code. If it surfaces copyrighted code, you as a user have an issue. Which is why there is quite a bit of a motivation on the side of the foundation model providers to ensure that this does not happen because it would put the whole thing on shaky legal grounds for their enterprise customers.
Which is perfectly fine and acceptable.
Just to be safe, in case I expressed myself wrong, I didn't mean to imply you think otherwise! Just clarifying from my viewpoint.
I'm also unsure what "ad" stands for here in your MIT comment, but nonetheless it's good to know that the user would have an issue, I would make for a terrible lawyer!
I'm also unsure what "ad" stands for here in your MIT comment
It is meant as "in regards to".
FWIW, as another employee, I appreciate that our policy does not require me to make a choice between continued employment and discarding my personal ethical and moral framework -- at least with respect to my own work.
There's also the macroeconomic aspect. It is never mentioned. Am I crazy to think that this is something that should be taken into account?
ChatGPT and Claude are awesome for coding if you disregard important factors like ethics, environment, or even the atrophying of beginners. But they are also drawing in hundreds of billions in not just investment but also critical infrastructure development. What is gonna happen if these shops cannot make a profit at the scale that they promise right now?
I never felt guilty about contributing to the atrophying of beginners when I answered questions on StackOverflow or released open source libraries.
If a beginner is going to outsource everything to an LLM such that it damages their own learning process I'm afraid that's on them. I will very happily engage with and coach them to help them not do that, but I don't buy it as a blanket argument against the whole category of LLM technology.
What is gonna happen if these shops cannot make a profit at the scale that they promise right now?
They'll go bust, and their investors will lose their shirts. See also many of the railway companies of the 1800s.
Those of us who use this technology will have to rent hardware (now available at rock bottom prices since the bubble will have burst) that can run the best of the surviving open weight models.
If a beginner is going to outsource everything to an LLM such that it damages their own learning process I'm afraid that's on them.
This is an incredibly callous and naive take on our collective responsibility when creating new things and foisting them on society. Would you say the same thing about addictive drugs and drug addicts? What about teaching kids to read, even when they're not suitably excited about it at a young age? Predatory lending to low income folks, or bail bonds?
The creation and adjustment of the environment around us, in which people learn and grow and live and work, is absolutely our shared responsibility to maintain. Not all new things are a net positive, and pushing it all back on "personal responsibility" is just gross.
I don't buy that comparing drug addiction to a beginner cheating themselves with an LLM is reasonable.
I argue that LLMs could be a huge net benefit to our shared environment, provided we can help people use them as effectively as possible.
It's rare that any new technology emerges that doesn't have both positive and negative effects. I like both trusting and actively encouraging people to use their agency to make use of that technology in ways that are net beneficial to both themselves and others.
I've coached a lot of beginner programmers (in the before-LLM-times though, I'd love to coach someone new now and see how it impacts their learning journey first hand). The main thing I learned from that experience is how brutally frustrating learning to program is. Forget a semicolon... lose an hour of time.
You might argue that the frustration is a requirement for learning. I think there can be a better balance than losing an hour over a semicolon.
A lot of people give up on learning to program thanks to that frustration - they never manage to climb that first miserable six month learning curve.
Anecdotally I've talked to a bunch of people over the past year who previously quit learning to program, and thanks to LLMs are now learning with great enthusiasm and producing working, useful software for themselves. I love that!
I don't buy that comparing drug addiction to a beginner cheating themselves with an LLM is reasonable.
I argue that LLMs could be a huge net benefit to our shared environment, provided we can help people use them as effectively as possible.
I wouldn't discard the comparison so quickly. Many people in the early to mid 20th century thought stimulant drugs were the productivity technology that was going to make everyone's lives better.
It's the second time in a few weeks I see the LLM = evil drug false equivalence, can we not?
Stimulants are useful, and help millions of people overcome debilitating issues like depression, narcolepsy, ADHD, binge-eating disorder, etc etc
At least we're above the previous thread and not saying LLMs are fentanyl (which, as I pointed out, also has extensive legitimate medical uses), but this level of discourse is honestly lazy and discredits both real ethical and environmental concerns about LLMs, and perpetuates weird stigmatization around psychoactive substances.
You're the one making that equivalence. I was merely pointing out their role in a prior technology trend. Please don't ascribe to me and my comment the misgivings you have about other people's comments.
I'll give you the benefit of the doubt, but would like to point out the context from the original comment
[...] responsibility when creating new things and foisting them on society. Would you say the same thing about addictive drugs and drug addicts?
So you can claim your reply was a non sequitur, but the objection stands with respect to the global setting.
You might have conveniently jumped to "stimulants" at large (heavily stigmatized drug outside of prescription uses), but a lot of criticisms conveniently skip over cigarettes or caffeine, drugs (though rarely labelled as such) which have experienced religious and governmental bans throughout the world and are mostly regulated to protect children, and are probably the most apt comparison (if one is arguing honestly and not seeking to draw some kind of Huxleyan "soma" demonization).
Caffeine has also been pushed as a productivity drug, and is objectively incredibly addictive (although with low harm potential). Are we being "incredibly callous and naive" about the place caffeine occupies in Western society?
I understand that Simon and Joshua were having a wider ranging discussion. If I was replying to any of that wider discussion, I wouldn't have quoted such a small subset.
As I thought would be clear from my quotation, I was focused specifically on the need to weigh societal benefits versus costs that Simon was bringing into the conversation. The creators of Benzedrine, Pervitin, etc. also described their drugs as great benefits to society while remaining naive about the costs.
As I was speaking specifically about early use of these drugs prior to any stigmatization, any comment about modern day stigmatization of drugs is irrelevant. Please preach to someone else.
I don't buy that comparing drug addiction to a beginner cheating themselves with an LLM is reasonable.
Yes, that's pretty clear.
What about a child who doesn't want to learn to read at an appropriate age because it's more dull than playing outside? Is that "on them", or is it appropriate that we adjust the environment to help them through a problem that will obviously haunt them later?
No, it's on their parents. Or maybe their teachers, but I spend enough time watching frustrated teachers talk about this on TikTok that I'm leaning parents.
That said, I would hope that beginner programmers have more agency than a child learning to read.
Or maybe it is a systemic problem due to things like chronically underfunded school systems, parents not having ready access to parenting knowledge, and a lack of support system because of atomisation. Seems strange to me to blame a widespread lack of learning on individual choices.
I'd love to understand more about this. We gave virtually every human unlimited access to the sum total of human knowledge, why did that lead to less parenting knowledge?
Is it because we gave them access to the sum total of human misinformation and superstition at the same time?
Because access to information isn't knowledge?
People need time (and willpower!) to consume, process, understand the information put at their disposal.
Are we really asking why there are still bad parents even though LibGen provides every book on parenting for free?
There is an equal amount of misinformation as well. Why do people who "do their own research", as they say, end up believing that vaccines cause autism or something like that? If your next question is why people just can't tell which of the information they are being presented is correct, then I don't have an answer to that.
On the aspect of parenting, I wouldn't say that there is less knowledge than before. It's not like there was a golden age of parenting and we are past that. The world just evolves too fast for empirical research to keep up. I am not a parent myself but I take care of my nephew extensively. Taking care of a child for the first time kinda uprooted all the assumptions I had about the experience, even though they weren't dogmatic and deep-rooted. Let's just say if I didn't have Bandit from Bluey to take cues from I would have been completely lost.
I think both of these things can be true at once. Ultimately I buy the argument that individuals have agency and can recognize if they're not actually learning because of LLM dependence. But, to @jclulow's point, I do think LLMs are so easy to misuse in this way, without thinking about it, that we would do well to introduce some guardrails to help people see this choice in the first place. Don't forget that many of us here are speaking with the bias of a pre-LLM world; it's easy to assume that younger people will see the technology in the same nuanced way, and not as a ubiquitous thing that's just always there.
It's not that hard for me to imagine someone being asked to explicitly consider whether they're learning to independently code or just relying on the LLM's capabilities, and them saying, "I guess I'm just relying on the LLM's capabilities, I just realized I'll be screwed in a couple years when I hit the LLM's limits, thanks for pointing that out."
The solution for the water usage is to only self-host models at home, eliminating the water waste of a datacenter entirely. That’s what I do and it works reasonably well. I wouldn’t hence generalize the water usage of DC driven LLMs to all LLMs.
It is an open question for me whether running models at home is a net benefit in terms of energy and water usage.
If I run a prompt through a hosted model I'm sharing that energy and water burden with thousands or even millions of other people.
If I run it at home that power usage (and related water from electricity generation) is being consumed exclusively by me.
It is an open question for me whether running models at home is a net benefit in terms of energy and water usage.
It makes little sense to me why a local model should be more energy efficient. I would have assumed that it's at least one order of magnitude less efficient than tokens produced by H100s or better in a data center, if not more.
Especially if anyone is buying new or more powerful hardware especially to run inference, on their low personal duty cycle of use.
It probably depends on what you mean by efficient, no? If you mean raw energy use, intuitively I'd expect it to be more efficient in a DC.
But maybe you live in a region with cheap renewable power, but the data center you'd be connecting to runs on coal, so CO2 impact is lower at home. Or, maybe both your municipal power and the data center run on dirty power, but by shifting your usage home, you're easing burden on a (in much of the US at least) aging power grid that cannot physically handle the expected demands of EV chargers, AI data centers, etc. - so you're making blackouts less likely by homelabbing.
Who knows.
In my case, it's getting on toward winter and my home is electrically heated. I might as well get some compute out of the power that's heating this room.
Note that even on electrical heating, heat pumps have coefficients of performance > 1 (easily 2-3x and up to 5x under ideal conditions, so 1Wh of electricity can provide 5Wh of heat) vs resistive heat. Using your machine as a space heater doesn't necessarily make sense.
If it's all cheap renewable electricity on the other hand, what's a few joules among friends?
Datacenters use more water than just for the energy production, the ACs in use or direct server cooling use up a surprising amount of water (but also less than most people think, the numbers I see thrown around by lay persons are IMO off by an order of magnitude, if not more). I don’t think there is an efficiency argument really, because if you run it at home, it’s running on your devices. Offloading to a DC means someone has to provision and manage an entire server to handle the workload (even if it’s shared), the resource load is higher as a result, even if effective energy per token might be lower for comparable models (most people use the big non-local models, which aren’t comparable)
It is unclear how local popularity of a model affects incentives for training (which would still happen in data centers), but indeed for the inference part the decentralisation does make cooling (although not the inference calculations) objectively more resource-efficient.
Training isn’t something that requires many datacenters to be active at a time or be on standby, even now training isn’t the big energy bill for AI models, from what I understand, so the cost problem would be much lower, and a smaller model capable of running locally wouldn’t require near the resources a big model needs. So local models “solve” the problem almost entirely. Plus we don’t make ourselves dependent on some big cloud providers running stuff for us.
That last thing is kind of the first consideration for me: local model means no sudden rug pull. Energy-wise I guess I am comparing local-XYZ vs hosted-XYZ, but yeah, using the actual big stuff needs more energy no matter the efficiency differential with my iGPU.
(I do use local models but not the hosted ones, although I guess sometimes I still use DeepL which is hosted LLM-ish but on the small side)
Always within the thread of reason there is true lunacy like:
"LLMs are superlative at reading comprehension".
Ok guys this is a document about risks. You just said "LLMs can read any amount of text instantly and understand it perfectly." That's just not even close to being true, so it's already corrupting trustworthiness and judgement as far as I can tell.
Why gloss over real risks in a risk analysis? There should be no need to save face or do promo work here.
That part stood out to me too because of this news story from a couple months ago:
Professional journalists from participating PSM [public service media] evaluated more than 3,000 responses from ChatGPT, Copilot, Gemini, and Perplexity against key criteria, including accuracy, sourcing, distinguishing opinion from fact, and providing context.
…
- 45% of all AI answers had at least one significant issue.
- 31% of responses showed serious sourcing problems – missing, misleading, or incorrect attributions.
- 20% contained major accuracy issues, including hallucinated details and outdated information
- …
https://www.bbc.com/mediacentre/2025/new-ebu-research-ai-assistants-news-content
I take these studies to show the opposite of what the headlines say: some models perform dramatically better than others, and they’ve all gotten better consistently over time. What it shows is not that no model can summarize a document but that you need to use the best models.
This study in particular doesn't even mention particular models, so it's really quite useless for the purpose of informing our expert practice — and note it is not designed for that. It is designed to give a picture of what the average user is seeing. Completely different thing.
To be fair, summarising a single given document is an easier task than synthesis from multiple documents of different opinion/reporting status, and search is an even more demanding task.
(LLMs make semantic errors in summarising / direct translation too, but the rates are better)
Ok what was the control group though? Human journalists also garble any and all specialist subjects too.
My takeaway from this is to be very careful with AI summaries. It is also consistent with my personal experience. I’ve seen both Claude and Google web search summaries literally make up stuff or base claims on questionable sources.
Agreed, it's usually plausible-sounding enough that if you didn't have a skeptic's view (or enough subject matter expertise) you would think it's correct. However, if you spend any time double-checking the information, it's usually pretty easy to find out in a hurry that it was hallucinating.
There's some irony that in attacking the reading comprehension of LLMs, you surreptitiously replaced "superlative" with "perfectly".
I would not say LLMs read superlatively, I would say they read passably. But as I see every time I log on to Hacker News or Lobsters, sometimes humans don't even read passably.
That was purposeful, as I think the meaning in this case is equivalent. The dictionary definition of superlative says, "of the highest quality or degree," which you might say means "flawless," which you might say means "perfect"
That makes it worse.
While superlative can mean perfect, it definitely doesn't require that. Even the definition you provide doesn't suggest it. Would a wine of the highest quality be perfect? Of course not. Is Magnus Carlsen's chess "of the highest quality"? Debatable, I suppose, but it's not definitionally false because he's made a blunder. I link some real world examples below.
In the case of an obvious absurdity, unless there is clear textual evidence that a person means the absurd formulation, you have to criticize the weaker form, or at the very least note the ambiguity.
Searching google for "superlative academic performance" leads to many examples. Just 3 below:
Those are puff news pieces. This is an engineering risk analysis. I do not feel I was incorrect to say that the wording of the risk analysis suggested that the author saw no risk there, a feeling supported by the fact that they did not go on to specify any risks associated with lack of reading comprehension.
specific content aside, I am pleasantly impressed by a company appealing to its values like this
my experience across many companies has been that "values" get printed on the side of a mug and then mostly forgotten.
Does the mechanics-focused "LLMs at Oxide" document hold the juicy parts of how to actually use LLMs as a tool?
It seems that increasingly at $work and in daily life, LLMs and their usage is something that I simply can't just excuse myself out of; so practical tips on how to avoid the pitfalls the RFD mentions are appreciated.
Edit: here is the below as a gist: https://gist.github.com/david-crespo/5c5eaf36a2d20be8a3013ba3c7c265d9
I wrote that internal document. Most of it is about the boring mechanics of getting an API key and getting set up. The high-level point is: if you don't know which tool to use, just use Claude Code in the terminal. Then there is a bit of guidance for people who haven't used these tools before:
What to try first?
Run Claude Code in a repo (whether you know it well or not) and ask a question about how something works. You'll see how it looks through the files to find the answer.
The next thing to try is a code change where you know exactly what you want but it's tedious to type. Describe it in detail and let Claude figure it out. If there is similar code that it should follow, tell it so. From there, you can build intuition about more complex changes that it might be good at.
And here is the practical advice section:
Use Sonnet 4.5 or Opus 4.5
Sonnet 4.5 is the default and it is very solid. Opus 4.5 is pretty new as of Dec 2025 and it is clearly even better. They cut the price of Opus by 2/3 with 4.5 ($5/M input, $25/M output), so it is no longer absurdly priced compared to Sonnet ($3/M input, $15/M output). They claim the higher price may actually net out because it uses fewer tokens, perhaps because it is less likely to waste time on wrong directions. In practice, I'm not sure whether this is true. I have been spending more because it feels like Opus can do more.

Claude Code will sometimes automatically use the cheaper, faster Haiku 4.5 for subtasks like exploring a codebase. You can try setting it as the main model for the chat with /model, but the speed/intelligence tradeoff isn't worth it. There is also Sonnet 4.5 with a 1M context window available, but long context weakens performance substantially, so you are almost certainly better off paring the context to make it fit in 200k.

Prompt with as much detail as possible
We've learned through decades of experience with search engines to be very careful about what we type into a prompt. This is the opposite of what you want with LLMs — they are capable of pulling nuance out of everything you say. So instead of figuring out the shortest prompt that will do the thing, ramble about the problem, tradeoffs, your hopes and fears, etc.

Track cost in real time
Spending too much is a good sign that Claude is spinning its wheels and you should think about how to prompt it better. By default, the TUI does not want to show you what you're spending in real time — you have to run /cost manually to see it. Add this to ~/.claude/settings.json for a statusline at the bottom showing real-time session cost (ccusage):

"statusLine": { "type": "command", "command": "npx ccusage@latest statusline" }

(screenshot of ccusage)

Run npx ccusage in the terminal to see daily/weekly/monthly usage tables.

Don't argue, don't compact. Just start over.
As conversation length grows, each message gets more expensive while Claude gets dumber. That's a bad trade! Use /context and /cost or the statusline trick above to keep an eye on your context window. CC natively gives a percentage but it's sort of fake because it includes a large buffer of empty space to use for compacting.

Run /reset (or just quit and restart) to start over from scratch. Tell Claude to summarize the conversation so far to give you something to paste into the next chat if you want to save some of the context.
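For reference, here is a minimal sketch of what that ~/.claude/settings.json ends up looking like, assuming the ccusage statusline entry is the only thing in the file (any keys you already have would sit alongside it):

{
  "statusLine": {
    "type": "command",
    "command": "npx ccusage@latest statusline"
  }
}

With that in place the real-time session cost shows up at the bottom of the TUI, and running npx ccusage in a regular terminal still gives the daily/weekly/monthly tables.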
Finally there is a list of links I like:
Run /reset (or just quit and restart) to start over from scratch. Tell Claude to summarize the conversation so far to give you something to paste into the next chat if you want to save some of the context.
I'd been under the impression that /compact does exactly that: generates a summary, resets the conversation, and injects the summary into the new conversation. Is that not the case?
In my experience compacting keeps way more of the existing context than is necessary. The less full the context is, the better Claude performs, so it’s better to cut the context down to the bare minimum necessary to keep going. What I often do is start over and use !jj diff -r @- to tell it what it did so far. For a more complex change, if you have it keep a markdown plan document as it’s working, it can just read that and keep going. If CC is maintaining that doc as it works, you don’t have to do anything before hitting /new. And even if the plan isn’t kept up to date with task status, the combination of the plan and the diff is sufficient.
I don’t actually do the summary thing most of the time — I will update the doc to talk about the diff and the plan markdown too.
Anthropic affirm this approach in their great doc on prompting the gen 4 models:
Starting fresh vs compacting: When a context window is cleared, consider starting with a brand new context window rather than using compaction. Claude 4.5 models are extremely effective at discovering state from the local filesystem. In some cases, you may want to take advantage of this over compaction.
I’m going to add these details to the doc — they helped a colleague yesterday too.
I agree with Crespo. What /compact does is take all of your previous conversation history, plus any arguments you give it, feed it to a prompt that asks it to summarize things, and use that as the new context. I find that it doesn't do a particularly amazing job of this.
I’d rather either come up with the new context myself, or emulate /compact by prompting it myself (“I’m about to reset your context, can you give me a prompt to continue <thing>”) and then I can tweak it before I feed it back in after /clear.
Context is your most precious resource, controlling it carefully is important.
I find it hard to reconcile that you cite empathy as one of your values, and yet come to the conclusion that using LLMs is reasonable and acceptable.
It is possible for two reasonable people with the same set of information to come to different conclusions. It requires a lot of empathy to understand how.
I liked the distinction between 'as readers' and 'as writers'. Despite being pretty personally hostile to this brave new world, I'm reasonably happy to use LLMs in 'read mode' and this gives me a nice way to explain why.