Which Programming Language Is Best for Claude Code?
17 points by MatheusRich
It is a weird benchmark focused on Time & Cost. Not code quality or code complexity or maintainability.
I actually really appreciate the fact that it only attempts to measure cost (okay, time and cost).
That makes this research have a clear, unambiguous result which I can use as long as I keep in mind that it is ONLY measuring cost to generate, not cost to maintain.
Trying to estimate code complexity, code quality, or cost to maintain the code would require a very different, probably much bigger experiment. And because these things are not well defined, it would result in a conclusion that was much less clear.
Or like a performance/resources usage benchmark of implementation? These are metrics best for getting cheap/fast/throwaway slop :|
Measuring the objective things rather than the subjective ones is not weird at all. This is 300 small codebases. You'd need judges who know each language's ecosystem well enough to evaluate the results, and then you'd have to normalise the scores between the judges somehow. All of that would take a serious amount of time. And no matter how well you design the comparison, someone will say "but Ruby is by design less complex and more maintainable than C, so it should be penalised to normalise for the generation skill".
The effort is not happening, and any attempt will just lead to disagreements.
The outputs are 200-LoC programs.
I sympathize with the author that "designing a large-scale benchmark that's fair across 15 languages is quite challenging," but you have to be honest about the conclusions that can be drawn from the experiment you actually did. "At least for prototyping-scale tasks, Ruby, Python, and JavaScript appear to be the best fit" is not remotely supported by the data here. What percentage of useful prototypes are 200 lines of code?
Headings like "What causes the speed/cost differences?" "Doesn't lack of types mean more bugs?" "A 2× difference isn't that big, is it?" and "Isn't ecosystem and runtime performance more important for language choice?" could each have one sentence under them: "At the scale of this experiment, we can't draw any meaningful conclusions about this."
Maybe I missed it, but this doesn't seem to include the execution speed of the generated code.
I would expect the compact, cheap-to-write OCaml and Haskell programs to execute much faster than the Ruby and Python ones. Though JavaScript/TypeScript is going to be fast in Node.js as well.
And of course C and Rust should also produce fast programs, though in some cases if the extra lines of code include implementing library-style functionality then that might not be as well-coded as the built-in libraries in Ruby and Python.
It warms my heart to see Ruby come in so well. I've always loved it as a general-purpose scripting language, well ahead of Python with its arbitrarily non-orthogonal and annoying syntax. Ruby is also a much more natural upgrade from Perl / shell / awk / sed while being an actually good language.
I'm a bit sad that we're missing both Java and C#, which are a different class of statically typed languages than Rust, Go and C.
Apart from TypeScript, none of the static languages had exceptions, which means error handling works differently.
I'd also love to see a comparison of code quality. My own vibecoding experience shows that nailing down strict rules and static analysis definitely improves the runtime crash behaviour of programs.
I'm surprised TypeScript sits where it does compared to JavaScript; there is often discussion about types and their impact on Claude, so it's a very interesting benchmark.
Types and TypeScript are far more useful for larger projects. If I was building a single index.js to process some JSON for me I'd reach for JavaScript. For everything else it's TypeScript. So I do wonder how these results shift as you scale.
A 2× difference isn't that big, is it?
You can argue a type system forces the coder/AI to build the code (at least conceptually) twice. And if you take this data at face value that looks true.
I'm a little surprised that this matters. If "AI" is as good as is claimed about interpreting meaning, then "AI" should be able to use any language, Brainfuck, even, with good results. Here, "good" would mean "accomplishing the assigned task swiftly".
Been using Rust almost exclusively, it’s been working great.
I'm mildly salty that Bash wasn't included :c
I'm surprised they always ignore PHP; it likely has a good enough amount of training data available for writing CLI scripts.
The test suite is very bare-bones and happy-path focused. This validates that agents are good at coding relatively simple projects in scripting languages if you can write a test suite. Big win for test suites that don't depend on API details, and we should probably all be investing in these types of test harnesses more.
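To make the happy-path criticism concrete, here is a minimal sketch of the kind of edge-case-oriented test harness being argued for. The `word_count` function is a hypothetical stand-in for a generated program; none of these names come from the benchmark itself.

```python
def word_count(text):
    """Count whitespace-separated words in a string (toy program under test)."""
    return len(text.split())

# Happy-path check: roughly what a bare-bones benchmark suite covers.
assert word_count("hello world") == 2

# Edge cases a happy-path suite typically skips:
assert word_count("") == 0                  # empty input
assert word_count("   ") == 0               # whitespace-only input
assert word_count("one\ntwo\tthree") == 3   # mixed separators
assert word_count("a " * 10_000) == 10_000  # large input
```

The point is that the edge-case assertions, not the happy-path one, are where typed and untyped implementations would be most likely to diverge.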
The big surprise for me here is that OCaml comes out looking very good. Comparable to Go, with very little dispersion in time taken to achieve the task.
It would be interesting given that some languages force you to handle many errors explicitly even in a toy app. But don't worry about the capability itself - you can tell LLMs to generate/test edge cases and pathological environments and you'll get a comprehensive result. I've reimplemented things that way and the generated coverage is seriously impressive.
I was disappointed that Elixir fared poorly (he reported a rudimentary result in the comments).
I have a working hypothesis that LLM code generation success is loosely a product of “amount of training material” x “syntax/concept density”.
Elixir would be decent on the second, low on the first.
So we can't show the benefit of static typing in this context, a context limited by an unrealistic experiment. And it's hard to design a realistic experiment that makes a fair comparison, as is usual for such experiments. Research is difficult.
The test suite is very happy-path focused. Retrying with a more comprehensive test suite that tries to shake out errors would be much more informative.