Agentic Pelican on a Bicycle: Gemini 3 Pro
7 points by youngbrioche
When it comes to Pelican benchmarks I trust the source, that is Simon Willison's blog.
For this specific kind of test, it's actually great that someone other than Simon can successfully run the trial. I much prefer OP's resulting SVG to Simon's new one.
Also, to help others, here's the relevant section in that post https://simonwillison.net/2025/Nov/18/gemini-3/#and-a-new-pelican-benchmark
I get the meme (though at this point it's getting trite for me), but aside from that, is there any value in the "pelican on a bike" benchmark? I'm guessing that how well the SVG turns out is a proxy for the model's spatial reasoning skills, but it seems like a very poor proxy. All the results look more or less like crap, so it's difficult to tell "this one is better than that one", or to quantify how much better one model is.
For the normal benchmark, there's some value in it as an anti-hype "these models still really suck at some basic tasks" reminder.
This iterated modification also surprised me with how consistently bad all the other models were at improving the SVG. They took "improve" to mean "add extraneous details" rather than "fix things". That Gemini 3 Pro actually fixed things like the bike geometry here is an interesting deviation from all the earlier attempts, and hints that progress is being made on actually reflecting on output quality.
It's just one sample of course, not a proper benchmark run on a lot of samples, but it's worth more than nothing.
Yeah, one of the main reasons I keep doing it is that it's amusing when some company comes out with a hyped new AI model and it turns out it still draws a pelican riding a bicycle in the style of a five-year-old.