cursed_browser: A web browser with no rendering engine — the VLM reads the HTML and hallucinates the page
80 points by aoeu
This is great! Fun idea.
I was hoping for worse results in the examples though. All of those in the cursed browser are actually... not that bad? There's gotta be some real crazy results out there.
The idea is funny, but also fails to deliver on the premise? "Every page load is a surprise. Every render is a work of art." - the demos on the page don't really make the case. The only surprising parts appear to be the images, which I assume aren't fetched, so it just comes up with something plausible.
I think the models have gotten pretty good at putting text in images, so there's not a whole lot that can go sideways if you're basically just telling it "render several paragraphs of text", especially not for websites the LLM is already visually familiar with, like HN, Wikipedia, or the Acid test. They're probably recreating the general vibe from training data, without needing to interpret the CSS perfectly right (if that CSS is fetched at all).
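To make the "are images and CSS fetched at all" question concrete, here is a minimal sketch that lists what a page's raw HTML only points to rather than contains. It assumes the model is handed nothing but the HTML text, which is my guess and not something the project confirms. The names `ExternalResourceLister`, `list_external_resources`, and `SAMPLE_HTML` are made up for illustration.

```python
from html.parser import HTMLParser


class ExternalResourceLister(HTMLParser):
    """Collects URLs that the markup references but does not contain."""

    def __init__(self):
        super().__init__()
        self.images = []
        self.stylesheets = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "img" and attr.get("src"):
            # Only the URL is in the HTML; the pixels live elsewhere.
            self.images.append(attr["src"])
        elif tag == "link" and attr.get("href"):
            # rel is a space-separated token list, e.g. "stylesheet preload".
            rel_tokens = (attr.get("rel") or "").lower().split()
            if "stylesheet" in rel_tokens:
                self.stylesheets.append(attr["href"])


def list_external_resources(html):
    """Return the image and stylesheet URLs referenced by an HTML string."""
    parser = ExternalResourceLister()
    parser.feed(html)
    parser.close()
    return {"images": parser.images, "stylesheets": parser.stylesheets}


# Hypothetical page, invented for this example.
SAMPLE_HTML = """
<html>
  <head><link rel="stylesheet" href="style.css"></head>
  <body>
    <h1>Hello</h1>
    <img src="logo.png" alt="Site logo">
  </body>
</html>
"""

print(list_external_resources(SAMPLE_HTML))
```

Anything that shows up in those lists would have to be invented by the model unless the browser fetches it separately, which would fit the images being the only surprising part.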
They're also not very good at prompt adherence if you tell them not to use tools, so even though the prompt says "don't use a HTML renderer", I wouldn't trust that the instruction is actually followed.
The prompt adherence thing is interesting. I guess if it were in fact using an "HTML renderer", where would it get that from? Either way, whether it compiles and invokes one (Servo, for instance) or fires up Chrome and takes a picture, wouldn't the result be exactly the same as using a real browser? (Minus any very small differences because CSS)
They have access to a wide range of undisclosed internal tools simply as a way to reduce cost if there's a simpler / more reliable way of accomplishing certain tasks. For example, for math problems, they may have access to a solver or just write some Python code to confirm the result. I don't know if OpenAI actually has some backend HTML rendering or text layout tools, but the prompt in the cursed_browser specifically instructs the LLM not to use a renderer:
But I'm guessing that the browser itself is vibe-coded, so this prompt may be talking out of its hindquarters.
Ok, that's a fun idea. It would be fun to make them for more file types. JPEG and MP3 come to mind.