AI Assistance Reduces Persistence and Hurts Independent Performance
56 points by krig
The rapid rise of AI chatbots promises immediate and effective help with reasoning-intensive tasks such as studying, writing, coding, and brainstorming.
[...]
They were then presented with a series of 12 fraction problems with an AI assistant (GPT-5) available in a sidebar. The AI assistant was pre-prompted with each problem and its solution, allowing participants to receive immediate, accurate answers with minimal effort.
I like how it starts out with "we know this is super effective" and then the methodology shows that the "effective" help was ... they pre-seeded the answers into the AI because they didn't trust it to solve basic fraction problems.
Yet these findings need not be cause for pessimism. Rather, they point toward a clear design imperative: AI systems should optimize for long-term human capability and autonomy, a goal that cannot be achieved by surface-level interventions.
How naive do you have to be to believe that OpenAI or Anthropic would care about this? It's in their direct interest to ensure that users become reliant on their systems; why would they do anything to prevent that? Never mind that they don't even hint at how this could possibly be achieved with LLM technology even assuming the vendors were inclined to do so.
I disagree with this take. First, because when running an experiment, you want as consistent a setting as you can achieve. They wanted the assistant to know "the answer" was "1/5" — not 0.2, not "one fifth", not "here's a Python program to calculate it". There are good reasons to ground some things in research to avoid randomness, even if you expect the answer to be correct either way.
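As a minimal illustration (not taken from the paper) of why pre-seeding one canonical answer reduces variance, Python's `fractions` module shows how several surface forms an assistant might emit all collapse to the same value:

```python
from fractions import Fraction

# The same value can surface in several forms an LLM might emit:
# "1/5", "2/10", "0.2". Grounding the assistant with one canonical
# answer removes that source of randomness from the experiment.
answers = [Fraction(1, 5), Fraction(2, 10), Fraction("0.2")]

# All three normalize to the same canonical fraction.
assert all(a == Fraction("1/5") for a in answers)

# Fraction reduces to lowest terms on construction, so the canonical
# display form is stable regardless of how the value was written.
print(Fraction(2, 10))  # prints "1/5"
```

The point is only that "correct" admits many representations; fixing one in advance is standard experimental control, not distrust of the model.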
Second, because what LLMs do depends hugely on what you ask. You can ask for a solution, you can ask for verification, you can ask for a tutorial — and current benchmarks test most of those options. There's no reason for any of the big labs to nerf the model in that way; it would be noticed very quickly, so the incentive is not there. Also, the paper said "AI systems", not LLMs. The harness is part of the system, and there are educators doing the right thing given that they can't prevent students from using LLMs: https://gist.github.com/1cg/a6c6f2276a1fe5ee172282580a44a7ac
Imagine stretching the cognitive difference from the study across a longer timespan. Current use of AI assistance in programming may not be representative of its long-term use, because the current cohort of engineers know how to code: they are more able to spot flaws in LLM-generated software and guide it toward their desired outcomes. The new wave of "engineers" who have never gone through the mental gymnastics will show the true power of LLMs as a coding tool.
At the same time, good engineers who use LLMs to offload all of their mental tasks will not be able to keep pace, since they're no longer using their coding muscles.
Think of a great piano player who realizes that hitting play on a Spotify playlist sounds pretty close to playing the piano, does it for 10 years and then wants to play something specific but can't remember how
If you give people a mentor who tells them the answer, people will rely on the mentor.
This is absolutely true, and programmers will soon have to battle the same dynamic as other industries, such as commercial aviation. How do you keep your skills sharp when a computer is better at your job 99% of the time, but the remaining 1% is extremely difficult problem solving that demands great skill?
Spot on. IIRC pilots do both: train on the failure modes of modern aircraft, and practice flying simpler airplanes to keep their hand-flying instincts sharp.
Likely, we'll do something similar. That's why I've always voted to keep the coding interview simple but real, and AI-free.
How would that work? 30% of paid work time for code katas, like astronauts doing their PE on the ISS? At that point, you might as well forgo the LLM entirely and use 100% of the time for "classical" development.
What I think will be more likely will be companies trying to "outsource" that by round-robin-hiring the last remaining self-thinking developers until those are used up.
Developers will never get paid to train. Pilots do because the costs of failure are incredibly high. In IT, projects failing and effort going to waste is just another Tuesday.
These results suggest the need for AI model development to prioritize scaffolding long-term competence alongside immediate task completion.
Oh really, is that what these results suggest? Because I can think of some alternative suggestions.
I'm not sure this test is actually testing anything... useful? Like, if you did s/AI/calculator/, such that a calculator gave the answers instead,
I might be inclined to also skip? This is a pretty short problem set that some people might not really care about, and by the time the assistant is removed you're already mostly done. It's not clear to me that you can take much of note from this. (To be entirely frank, there have also been better, and more concerning, studies on the effect on thinking, done in a much more intricate way.)
I agree that it's unclear how well this result generalizes and that the authors try to inflate the importance of the study. As always! Also, this study is still in review, which is worth noting. I think there is a conclusion to be made that the result supports, though.
One argument that I have heard from AI proponents is that because they spend less time and effort writing code, they can now spend that time and effort on architecture / planning / reviewing etc. I think this study shows quite clearly that mental effort is not quantitative in that sense: reduced effort in one area of solving a problem instead creates an expectation of low effort, and that in turn creates a tendency to avoid effort in other areas.
Anecdotally, this is something that I see in others who adopt LLMs, to any degree. You would think that people would spend more time reviewing and testing code that they themselves did not write, but I am seeing the opposite. There is a tendency to accept larger changes or rewrites that would not have been accepted before, and a large number of generated tests replaces even a minimal amount of manual testing, so that I see PRs that completely do not work at all get sent on to review.
I think this study shows quite clearly that mental effort is not quantitative in that sense: reduced effort in one area of solving a problem instead creates an expectation of low effort, and that in turn creates a tendency to avoid effort in other areas.
This does make sense, but I think they needed to extend the study a bit more to show this concretely — in particular by bringing in some other form of question entirely (to demonstrate the effort reduction extending outward) or by imposing some actual penalty for skipping. (If you claim "oh, but I review the LLM output for correctness" and then don't, that's a bit different from someone telling you "you don't have to bother reviewing it at all".)
But calculators don't habitually confabulate and cover up mistakes. If faced with a task you can do with a calculator, you need to verify the algorithm one time - or trust the manufacturer - and you're good to go. In contrast, using LLMs requires constant, intense skepticism. I think they are qualitatively different in that way.
If faced with a task you can do with a calculator, you need to verify the algorithm one time - or trust the manufacturer - and you're good to go.
Right but my whole issue with the study is that it's not actually testing this, because:
They also tested reading comprehension and found the same effects; see "Experiment 3: Convergent evidence from reading comprehension".
They did, but I'd argue it runs into the same issue of "skip" not being a useful indicator. (Although yes, in that case you certainly couldn't compare it to a calculator!)
Indeed, testing the unique effects of LLMs by offering them as assistance on tasks that even some exam-allowed calculators can do sounds at best like «a second intervention condition (fraction-capable calculator) needed to calibrate the results».
Am I getting this wrong, or could they have just used a calculator instead of an LLM? What I am trying to say is that this hardly feels like a new phenomenon. It seems like moving from a physical dictionary to a digital one when studying a new language, or from sifting through books in a library to using a search engine and Wikipedia. The process went from labor-intensive to more immediate. Are we really that much worse off after these technological innovations?
As I mentioned in my other comment, they also tested reading comprehension and found the same effects; see "Experiment 3: Convergent evidence from reading comprehension".