Don't Let AI Write For You
21 points by kerollmops
It undermines my credibility ...
This.
The other day I encouraged an intern to write a report of what he had done, for others in the group to see. Interns are fragile, so I slacked:
"Nice email. I have to ask.... AI?"
I want to be open minded. It's interesting how from the creator's point of view, the way we compare ourselves with others, it's so wonderful to have a buddy that helps you measure up to the "norm" in communications. As a consumer of said creations though, it has the opposite effect.
It's interesting to me that as I talk with others about their use of these tools to elevate their "social output filter", they always minimize how much they used them. They might admit it, even embrace it, but they're always seeking to reassure me that it was "only for xyz". I've taken to disclaiming all the ways I used AI when I author an MR/PR. Even then, I find myself subconsciously trying to curate my avowed uses so that it seems like I used it a little less than I did.
It's an interesting tension. We use the tools to hopefully elevate our value to others. In admitting our use, we realize we're devaluing ourselves (though not necessarily the creation).
LLMs are useful for research and checking your work.
Are they? Are they really? I thought it was common for people to be double checking evidence and research that comes out of an LLM since it has been known to reference nonexistent studies for example.
I disagree. I see nothing wrong with a mathematician using a calculator. Likewise, I see nothing wrong with an author or researcher using LLMs. Using a calculator doesn't mean you're not thinking deeply about the math.
This is such a tired argument to me, as it is simply a false equivalence. There is a fundamental difference between the output of a calculator and that of an LLM. Calculator output is deterministic, while LLM output is very much probabilistic.
A calculator applies mathematical rules to arrive at an answer. An LLM works by predicting the next most likely token based on its training data (yes, modern models are much more complex; that doesn't matter for the point here). An LLM does not know facts, nor does it apply rigid logic. That is one of the reasons they are prone to hallucinating and will confidently output false information. A calculator, on the other hand, will never invent a new number just because it statistically fits the pattern.
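To make that contrast concrete, here is a toy sketch (Python, with an invented vocabulary and made-up probabilities): the calculator-style function maps the same input to the same output every time, while the next-token sampler draws from a distribution and can answer differently on every run.

```python
import random

# Calculator-style tool: pure rule application, same input -> same output.
def calculate(a, b):
    return a + b

# Toy "next-token predictor": a distribution over candidate tokens
# (vocabulary and probabilities invented for illustration).
NEXT_TOKEN_DIST = {"Paris": 0.7, "Lyon": 0.2, "Berlin": 0.1}

def predict_next_token(rng=random):
    tokens, weights = zip(*NEXT_TOKEN_DIST.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

print(calculate(2, 2))        # always 4
print(predict_next_token())   # usually "Paris", but not always
```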
The fact that LLMs can and will hallucinate already means that the role of the user is drastically different, certainly from the bottom up, when someone is newly learning about the subject. A student learning math can rely on the calculator being correct and focus on other principles. A student learning to program can't rely on an LLM in the same way.
Put differently, if you want to make a comparison, an LLM is more like an enthusiastic but careless research assistant who also struggles with short-term memory loss.
A student learning math can rely on the calculator being correct and focus on other principles. A student learning to program can't rely on a LLM in the same way
I'm also reminded of a parable/koan I read once: "I know a boy who never became a mathematician because he relied on the answers at the back of his textbooks." "Oh, were the answers wrong?" "Even worse - they were right."
The analogy works because both an LLM and a calculator are just tools. They help you reach a result, but they do not own the outcome; you do. Whether the output is deterministic or probabilistic is irrelevant. A calculator can still lead you astray if you set up the wrong equation, and AI can be perfectly accurate or completely wrong. Either way, the responsibility to verify the result is still on you. TL;DR: blame the user, not the tool; always check your math (pun intended).
"Tool" isn't just "thing you use to achieve an outcome", although that is one definition. It's much more useful to think of "toolness" as a spectrum, ranging from a rock or a chisel - where quality and efficacy are proportional to the user's skill and effort respectively - all the way through to services, where you pay someone else to achieve the outcome for you (and would be foolhardy not to verify their work). Hopefully you can agree that a calculator and an LLM occupy separate places in the toolness spectrum, the difference being that you shouldn't have to verify a calculator's output because it is an appliance developed and sold with the promise that it will behave as expected in accordance with the inputs it receives; the only thing you would be verifying is your inputs.
appliance developed and sold with the promise that it will behave as expected
I agree that toolness exists on a spectrum.
You're trippin' if you truly believe that an "appliance developed and sold with the promise that it will behave as expected" doesn't need to be verified. I think the QA department in this hypothetical company that makes these appliances would disagree with you. (lol)
Companies that make tools that operate deterministically still have QA departments. (The snark was implied in the previous statement, in case nobody caught it.)
Plenty of tools marketed as "reliable appliances" still require verification in practice.
We don't stop verifying just because something is deterministic; we verify based on stakes.
It's a mistake to assume that "shouldn't have to verify" is ever true in serious work. It isn't.
It's not "simple deterministic tools don't require verification", it's "low stakes tools don't require verification". Everything else absolutely requires verification and scrutiny.
Oh dear. My apologies that I didn't specify the calculator in question was functioning correctly and that the manufacturer has a QA department that did their job before it was sold to the user. I didn't realise this variable was part of the discussion as it hadn't been mentioned at all. (Or did someone move the goalposts?)
low stakes tools don't require verification
It's not about stakes. We have entire categories of high-stakes tools that must be trusted absolutely for whatever reason, or that are themselves the method of verification. You don't measure a part with a micrometer and then eyeball it just to make sure; you trust the measurement because you trust the tool works within spec. (Yes, metrology must be re-verified every so often, etc.)
Frankly, a tool that you have to verify every time because it's within spec for it to blurt out horseshit every so often is worthless, and antithetical to the purpose of having tools in the first place.
micrometer
Calling a tool "worthless" because it requires judgment misses how most real-world tools actually work. Compilers, static analyzers, search engines, and even theorems all produce outputs that still need to be interpreted and validated.
P.S. Don't let microwaves cook for you.
I would add that I think an important difference is how raw or final the output is. The rawer it is, the more you have to actively think about it, and the more chance you have to notice problems with it.
Meanwhile, people very often just copy-paste LLM output as is (which can be fine in certain cases, but its quality can be very misleading compared to text written by another human - we are better judges of the latter).
A calculator can still lead you astray if you set up the wrong equation
Exactly, whereas an LLM can lead you astray regardless. Which is a key difference as far as I am concerned.
Either way, the responsibility to verify the result is still on you.
See the student analogy I put in there. Again, there is a difference. Even if they are all tools, that is such a broad category that it is meaningless in itself.
where an LLM can lead you astray regardless
You're over-indexing on how tools work internally, rather than on how they're used.
Both kinds of tools can produce wrong outcomes in the hands of someone who doesn't understand what they're doing.
Your student example doesn't break the analogy; it proves it. Beginners shouldn't blindly trust any tool. That's exactly why learning matters. Speaking of learning, teachers aren't infallible either, but you wouldn't create a post saying "Don't let teachers instruct you".
We rely on non-deterministic systems every day: search engines, compilers, etc. None are infallible.
to what degree are most search engines or compilers non-deterministic? short of cosmic rays, a compiler that isn't deterministic seems pretty useless. and search engine results might change over time, but if the same user searches the same thing twice in one day and gets wildly different results, that seems like a usability fault.
but then, your commentary has enough non-sequiturs to make me suspect you disagree with the OOP to such an extent that you aren't even bothering to share any of your own original thoughts in this discussion, and are instead relying too heavily on a tool to do a job for which it isn't suited.
Let's reset.
The premise seems to be: delegating writing means you lose the cognitive work required for real understanding.
I disagree, with my own mind.
Here's the textual representation of the exact original thought that went through my mind when I saw this post:
You can think deeply about a concept and still use an LLM to help structure, refine, and/or communicate it. Those aren't mutually exclusive.
Writing is one way to process ideas, not the only way. Using tools to assist with expression doesn't mean the thinking didn't happen.
delegating writing means you lose the cognitive work required for real understanding.
I think we are operating on different premises here. The article, right at the start, very specifically mentions LLM-generated documents directly filling in answers to the questions asked, skipping the thinking process entirely.
You can think deeply about a concept and still use an LLM to help structure, refine, and/or communicate it. Those aren't mutually exclusive.
If you are doing this, you are already writing a lot yourself. In fact, I'd say that is necessary to be able to claim that an LLM is communicating your thoughts. It can only do so if you have done most of the heavy lifting, which, at the very least, means a lot of back and forth to arrive at something that truly reflects your thoughts, rather than the LLM having taken you on a ride to the average of its training data. That, I agree, can be a valid way of using them. Though they might still set you on a false path or direction depending on, for all we know, the position of the planets.
So on that basis I'd agree, but that is also not what I replied to originally. That was the tired one-line trope comparing it to the use of a calculator. That is still a false equivalence: while they are both tools, with valid use cases, the way those tools are used is entirely different.
With a calculator you put in a calculation, you get an answer. That is the extent of the tool usage there. And that one-shot usage is effectively what you are advocating against when using LLMs as a tool.
There are all sorts of analogies that can be made about the use of LLMs that can be useful. I simply don't think that a one-line argument about them being tools like calculators is one of them. They are both tools, but that is about the extent of it.
No, it doesn't. A calculator is a deterministic tool that takes an input and will always give the same output. LLM AI just randomly chooses the next token based on probability and I could ask the same question 5 times and get 5 wrong answers.
Mostly a nit-pick, but there is nothing inherent in LLMs that makes them fundamentally random. You can just make them work from e.g. a fixed seed and get the same answer 5 times.
With modern models setting temperature to 0 may not be enough, but the fundamental architecture is deterministic: it returns a probability distribution, from which we decide what to choose, and we add randomness at that step because it gives better results.
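A minimal sketch of that point (Python, toy distribution with invented numbers): greedy decoding always picks the most likely token, and even sampling becomes reproducible once the seed is fixed. (In real serving stacks, batching and floating-point reduction order can still leak nondeterminism back in, which is presumably why temperature 0 alone may not be enough.)

```python
import random

# Toy next-token distribution (numbers invented for illustration).
dist = {"blue": 0.6, "green": 0.3, "red": 0.1}

# Greedy decoding (temperature -> 0): deterministic by construction.
print(max(dist, key=dist.get))  # always "blue"

# Sampled decoding with a fixed seed: random, but reproducible.
rng = random.Random(42)
tokens, weights = zip(*dist.items())
print(rng.choices(tokens, weights=weights, k=1)[0])  # same token on every run
```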
I agree with the article that you generally shouldn't use LLMs for writing, but this is a really bad reason why.
Loads of tools are nondeterministic and will hallucinate utter nonsense if you don't use them correctly or validate their work. That doesn't make them not useful, it just means that they need to be used in their proper context.
I used to work for a company that sold a device that told you the thickness of a sheet of metal. You'd place the metal sheet between two sensors and it would measure the thickness in nanometers and report back. Except it didn't really do that: instead it generated a bunch of electric currents, measured the resistance it encountered while generating them, and also measured the current in an unconnected circuit. Those results were always very noisy, so it did a whole bunch of averaging and then ran the results through a function that converted the raw data into a thickness.

That function was itself generated via calibration - you'd put a handful of known samples between the sensors, tell it what values it should be measuring, and let it interpolate the right formula out. Except there is no generalised formula in this case; the relationship between the induced currents and different widths of metal is a complex pattern that can't easily be reduced to a single formula. So instead you assume that all the samples you've got will lie in a certain region of the pattern that is easy to interpolate, and generate a formula based on that assumption.

So what you get out is a very accurate thickness, assuming that the thing you're measuring is what you've calibrated for, and that the calibration was done correctly, and that the sensors are still working as expected, and that the humidity or temperature of the room hasn't changed too much, and that everything remains exactly the right distance apart, etc, etc. And even then, it had its limits, and could only give you approximate numbers. You could measure the same sample twice in the space of five minutes and get slightly different readings.
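Roughly, the calibrate-then-interpolate step looked like this sketch (Python; all readings and thicknesses invented, the real conversion was far messier):

```python
import numpy as np

# Hypothetical calibration data: averaged raw sensor signal for samples
# of known thickness (all numbers invented for illustration).
raw_signal      = np.array([0.12, 0.35, 0.61, 0.88])      # processed reading
known_thickness = np.array([100.0, 250.0, 420.0, 600.0])  # nanometres

def thickness_from_raw(raw):
    # Piecewise-linear interpolation between calibration points. Only
    # trustworthy if the sample resembles what was calibrated for; outside
    # that region the "measurement" is an extrapolated guess.
    return float(np.interp(raw, raw_signal, known_thickness))

print(thickness_from_raw(0.50))  # plausible: 0.50 sits between calibration points
print(thickness_from_raw(0.99))  # silently clamped to the edge of the range
```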
This was a really useful piece of kit that companies would pay a lot of money for, but it was also a hallucinating random number generator that would tell you whatever you wanted to hear if you didn't understand how to use it correctly.
And, to be clear, I don't think it was a particularly unusual piece of equipment either in this regard. There are loads of similar tools out there. They're difficult to use properly, so they often end up in the hands of experts rather than the general public, but they're still very valuable tools, because if you do use them correctly, they can be very powerful.
Now this is still an analogy. Just because an LLM also hallucinates randomly doesn't mean it's necessarily as useful as the measuring devices I used to work on. But it also demonstrates that experts can and do rely on things that don't give deterministic results. Replace @iamalnewkirk's example with a systems engineer measuring the reliability of their production line, or a materials scientist analysing a sample of an alloy, and you're back in the world of the probabilistic.
I don't think you can make a good argument about the usefulness of LLMs in this "first principles" kind of way. Yes, LLMs guess the most likely tokens rather than doing any internal logical reasoning. But if the result looks like logical reasoning - if we can interpolate logical reasoning out of the LLM output, just like we interpolated thicknesses out of the currents and resistances of our measuring device - and if that logical reasoning is sound and correct enough in enough cases to be useful, then that can still be a useful tool.
The only tool I have ever used that spits out non-stop utter nonsense without user error is LLM-based AI. I can't think of another tool that does that where we wouldn't call it a bug, file a report, and expect it fixed. Yet nobody seems to care when it's LLM AIs failing (worst uptime of any company the size of the big players) or being entirely inaccurate.
My screwdriver doesn't do that. My calculator doesn't do that. Excel doesn't do that. My compiler doesn't do that. And if any of them did I would not just accept it.
So on the one hand, that's just not true. You've probably done COVID tests before, and I'm sure you've had other medical checkups that were statistical tests that do not produce accurate results in every case, for a variety of reasons. Generally, anything that moves into the realm of statistics - which is a huge category of tools - runs these risks.
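As a toy illustration of that point (all numbers invented), even a test that works exactly as specified is often wrong about an individual when the condition is rare:

```python
# Base-rate arithmetic for a hypothetical statistical test.
prevalence  = 0.01   # 1% of people tested actually have the condition
sensitivity = 0.99   # P(positive | condition)
specificity = 0.95   # P(negative | no condition)

true_pos  = prevalence * sensitivity              # 0.0099
false_pos = (1 - prevalence) * (1 - specificity)  # 0.0495

# Chance that a positive result is actually correct:
ppv = true_pos / (true_pos + false_pos)
print(f"{ppv:.0%}")  # ~17%: most positives here are false positives
```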
On the other hand, "entirely inaccurate" seems like a significant mischaracterisation of what the output of an LLM is actually like. The AI tools I use certainly have flaws, and I'm very sceptical of people who write significant amounts of code with them, but when used properly, I very rarely find them being inaccurate. Not never, but still seldom enough that they can make for a useful tool.
But if that's your experience and ground truth, then fair enough, there's not much point having a discussion about it because clearly we're just experiencing two different worlds.
The compiler and Excel cases are really stretching «without user error», and, in the case of Excel, «utter nonsense».
I have seen Excel try to treat basically 555-01-99 as «555th January, 1999».
A C compiler optimised a «technically UB, but limited impact on all supported platforms» mistake of «read / check bounds / use the read value / write update» in Linux code by dropping the bounds check, turning a mild DoS risk into a fully exploitable code-execution vulnerability.
I am not saying people shouldn't use them. That's far from the argument I am making.
I wasn't saying you were! I think I just wasn't very clear on that point. I was rather trying to say that I don't think the distinction between a tool like a calculator and a tool like an LLM is as big as you're making out here. Lots of useful tools are stochastic or do strange interpolations to get their results, or have to be interpreted and used correctly. That LLMs fit in these categories is not really related to the question of whether it's a useful tool.
of whether it's a useful tool.
Ah, that isn't what I have a problem with though. I do think LLMs can be a useful tool. But, I also strongly believe they are a different kind of tool compared to calculators entirely. See also my comment here.
A dismissive one-line analogy in that context simply isn't useful to the conversation, because it completely sidesteps the whole discussion based on what I still consider a false equivalence. Even more so as the debate and exploration about how LLMs can best be employed, and to what capacity, is still ongoing.
Would you hire a lawyer who used an LLM? (and there are plenty of other examples of this happening). What about a doctor who uses such technology? Good luck if you do ...
Honestly, yes. I assume the best doctors are using the best tools available, including AI systems that help analyze medical scans. Same question, inverted: would you trust a doctor who refuses to use AI-assisted imaging tools?
Did you even read the links? The doctor link was to a report of botched surgeries due to AI imaging tools. So yes, I would trust a doctor who refuses to use AI-assisted imaging tools.
What do you consider a good example of writing authored by an LLM?
TL;DR: The quality of documentation in GitHub projects in the past year.
All the content I've consumed in the past year (at least) I assume was produced by an LLM in some way. In the past 6 months (at least) there has been a proliferation of new GitHub software projects, all (as far as I've seen) with rich documentation, which seems indicative of LLM involvement, because years in open source have (anecdotally) taught me that nobody likes writing documentation and it is common for projects not to have any. That has changed.
Can you link your favorite example?
Which documentation are you referring to? I couldn't find any for the project beyond the README, which while lengthy doesn't actually contain much useful information.
I am just going to guess that you will ignore the fact that all of that documentation comes from the code comments.
A calculator is deterministic and only gives you accurate answers. AI lies, flops around using terrible sources, and constantly outputs text in a way no human would write, easily spotted from a football field away. It's a probability machine and regularly chooses the wrong path. Your comparison is not particularly reasonable given that.
interestingly (for this site), this article seems to have a narrow focus on technical or prose writing, rather than writing code. i agree with the thrust of the article in this domain.
i have seen lobsters users expand the scope of this line of thinking to also being against letting an LLM write your code… and to be sure, that seems compelling as well.
however, the “average” lines in the sand seem to be drawn in very different places, which i suppose reflects the utility of the writing involved: non-code writing is (usually) the object in which you are interested, whereas code (usually) is used to produce some secondary object. exceptions apply to both, of course; at least in fiction, i’m thinking of the mysterious “sermon” from Annihilation/the Area X series, but you could maybe argue that overly perfunctory technical documents written purely to satisfy some bureaucratic process might also qualify. for code, plenty of people also write code-as-craft or -art (and notably these often seem to be the people most acutely anxious about the rapid changes in the domain of software… i guess i count myself among them, but i’m beginning to question the futility of remaining in this camp)
Where the line is drawn does differ per person. But I think it is good if people stop to think about where they draw the line and why. For me it comes down to my ability to judge the LLM output properly. I can only do that when I have an active and maintained knowledge and skill set to do so. The more I let AI generate code for me (or written text, for that matter), the less I practice those skills myself. The less I practice those skills, and the more I lean on the generative part, the more likely it will start to go wrong somewhere down the line.
So in my mind there are two very distinct ways of using LLMs. Or maybe more accurately, there are two sides of a spectrum with a lot of area in between them:
- The lazy approach: letting the LLM generate the end product (code or text) directly and handing the thinking over to it.
- The engaged approach: keeping the LLM external to the process, using it to question, check, and explore while still doing the writing and thinking yourself.
In my personal, so very much anecdotal, experience the last method still requires people to be very much engaged. But, as I said before, I think a lot of people who start using LLMs sort of drift to the lazy approach very easily. Because it is very convenient and easy to slowly hand over more thinking to an LLM.
The latter I have seen happen all too often around me: an alarming increase of lazy, non-critical use of LLM tools by people who should know better. People who might have been a bit slower before but delivered excellent results have now started delivering trash. Code spanning dozens of lines trying to solve something that should only take one line. Code that completely ignores any conventions or design paradigms put in place. Code that goes directly against security practices. Suddenly downgraded dependency versions (because the model's training data doesn't include the latest version).
So to me, it very much makes sense to keep writing myself. Both text, like comments here, documentation, or mails, as well as code. And again, that doesn't mean I will not use LLMs during that process. But I very much make an active effort to apply them as tools external to the process, and to actively engage with them in ways that don't hand me answers on a silver platter.
thank you for sharing your thoughts! i agree that i do see, at least from a high level view, two main “modes” of working with these tools: a person in control, making creative (or reductive! or destructive!) choices, utilizing a tool to enhance certain well-defined parts of their process; or the tail wagging the dog
Thanks for writing this! It's good to know I'm not alone in keeping LLMs as a tool external to the process, instead of going full on "agentic programming" (or whatever it gets called today).
The goal of writing is not to have written.
We-ell, pragmatically we need many things to be written.
It is to have increased your understanding
Anything that is relevant to my understanding is at best a hard-to-linearise fluffy tree. At worst the load-bearing cross-links admit no way of avoiding cycles. Any written text readable by others is extremely inefficient as an intermediate tool of thought! (Funnily enough, I heard that fine-tuning LLMs towards efficient use of their CoT scratch tokens sometimes pushes them towards an unreadable mix of languages; I might be closer to empathising with LLMs than with the author on that point, I guess.)
then the understanding of those around you.
I have experienced cases where an LLM rewrite of my notes (requested by someone else, then given to me to check) was better at that than I was. The context was not worth a bigger effort than checking and endorsing the rewrite, though.
The second order goal of writing is to become more capable.
It's only valuable to become more capable if this capability is worth it, though. As I liked to say about computing things manually, I still need to do maths well enough to debug my computational code, but this is not that much.
Now, the thing that is problematic with LLM writing is the non-trivial, different-from-previous-kinds effort needed to maintain vigilance when checking that nothing went wrong. But among all the arguments, this is the one the author doesn't care to fully articulate. Would I be justified in suspecting that the author avoids the inauthentic more than the incorrect, and in distrusting the author based on that?
I hope it's appropriate to share this meme. It just seemed so apropos! "How to get your AI to not hallucinate" https://x.com/davj/status/2038639975206470046