How I wrote JustHTML using coding agents
35 points by simonw
I really like this as a case study in how responsible, professional engineering processes plus coding agents can produce high quality code.
In this case it's 3,000 lines of dependency-free Python implementing a parser that correctly handles the full 9,200-test HTML5 conformance suite. That's a major achievement!
It's also an example of how much you can get done with coding agents if you set them up to brute force a problem with a robust existing test suite.
I wrote a bit more about this here.
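For anyone curious what "brute forcing against a test suite" looks like in practice: the html5lib-tests tree-construction files are plain text records with #data / #errors / #document sections (roughly - I'm going from memory on the format). Here's a minimal harness sketch, with the parser and tree dumper left as placeholders for whatever implementation is under test:

    import pathlib

    def load_dat_tests(path):
        # Roughly: each test starts at a "#data" line; other "#"-prefixed lines
        # ("#errors", "#document", "#document-fragment", ...) open new sections.
        tests, current, section = [], None, None
        for line in pathlib.Path(path).read_text(encoding="utf-8").splitlines():
            if line == "#data":
                current = {"#data": []}
                tests.append(current)
                section = "#data"
            elif line.startswith("#") and current is not None:
                section = line
                current[section] = []
            elif current is not None:
                current[section].append(line)
        return tests

    def run(parse, dump, path):
        # parse() and dump() are placeholders for the parser under test; dump()
        # must produce the same "| "-indented tree format as the #document section.
        failures = 0
        for t in load_dat_tests(path):
            # Blank separator lines between tests end up in the last section, hence rstrip.
            html = "\n".join(t["#data"]).rstrip("\n")
            expected = "\n".join(t.get("#document", [])).rstrip("\n")
            if dump(parse(html)) != expected:
                failures += 1
        return failures

Point the agent at that failure count and tell it to drive it to zero, and you have the brute-force loop.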
I think the thing I'm most intrigued about is whether this really is high quality code, or whether it is just code that works (well) right now, but will very quickly become unmaintainable if anything changes (e.g. bugs fixed, new features added, etc). It sounds like the author is not at all familiar with the codebase, for example, despite saying that they've reviewed all the code. In my experience, someone feeling unsure about code despite having reviewed it often means that they've not actually reviewed it in a lot of detail. I've had this several times with human code review, and it's come up again more recently now that I've been trying out AI and reviewing agent-written code.
The other side of this is that part of the value of using a library is delegating the responsibility for understanding a problem to someone else. They dive deep into, say, the gritty details of the HTML spec and provide an abstraction layer over it, and I can trust their abstraction layer and know that the underlying details will be surfaced to me as and when needed. Here, it doesn't sound like the author of the library really does understand the underlying complexity. If neither I nor they understand how HTML5 chooses how to parse some weird construct, then aren't we just building on sand here?
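To make the "weird construct" point concrete: misnested formatting tags are one of the places where the spec's answer is genuinely non-obvious (this is the adoption agency algorithm mentioned elsewhere in the thread). You can see what a spec-compliant parser does with one using html5lib, for example:

    import xml.etree.ElementTree as ET
    import html5lib  # pip install html5lib

    # Misnested <b>/<i>: the spec dictates exactly how this gets repaired,
    # and the resulting tree is not what most people would guess by eye.
    tree = html5lib.parse("<p>1<b>2<i>3</b>4</i>5</p>", namespaceHTMLElements=False)
    print(ET.tostring(tree, encoding="unicode"))

If neither the library author nor the consumer can explain why that output looks the way it does, the test suite is doing all of the understanding for both of them.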
This feels like the best-case scenario - the author copied an existing library and had a very complete test suite to work against - and even then, the results feel very mixed.
the author copied an existing library
Exactly! This is, by the author's own admission, a derived work that strips the authorship/license because it was laundered through an LLM. They do not have the technical knowledge (or presumably desire) to keep it updated. This is far from responsible.
I don't believe a reimplementation of an existing algorithm typically counts as a derived work, at least not in many jurisdictions. I know there's occasionally some fuss made about e.g. clean-room reimplementations, but I believe that mostly happens out of an abundance of caution, rather than because it is legally necessary. So in that sense, I don't think this is laundering necessarily, and it seems largely within the intended scope of use of an Apache/MIT licensed project.
The issue to me is more whether this technique is broadly practical. The code I write at $DAYJOB typically doesn't have a set of reference tests, or an existing implementation that I can just copy from. This presents an interesting case study in porting from one language to another (although I still have my doubts about how practical it is for that purpose, given the author points out how little they understand the generated code and algorithms), but otherwise it feels very limited.
It's hardly just laundered. It's rewritten in a different language, and it's more compliant.
Laundered because they stripped the license & author info. The Apache 2 license should still apply, as this is a derivative work, & the original authors' names should still be on it. There is case law around this that has been enforced over lesser violations. (edit: a point about MIT dual licensing was made below and is probably valid; this may not violate the letter of the license, but it surely violates the spirit)
(https://en.wikipedia.org/wiki/Whelan_v._Jaslow is a good starting point here.)
Think about this critically, if this is valid, then I can ask an LLM to rewrite any code I encounter & claim it isn't a derived work, and subvert the license. What would that mean for the ecosystem?
Looks to me like servo/html5ever is dual-licensed under both Apache and MIT. Here's their MIT license: https://github.com/servo/html5ever/blob/main/LICENSE-MIT
EmilStenstrom/justhtml is also MIT: https://github.com/EmilStenstrom/justhtml/blob/main/LICENSE
What's not clear to me is how authorship attribution should work. The JustHTML README includes this:
Acknowledgments: JustHTML started as a Python port of html5ever, the HTML5 parser from Mozilla's Servo browser engine. While the codebase has since evolved significantly, html5ever's clean architecture and spec-compliant approach were invaluable as a starting point. Thank you to the Servo team for their excellent work.
But the MIT license from html5ever says "Copyright (c) 2014 The html5ever Project Developers", where the MIT license file in justhtml says "Copyright (c) 2025 Emil Stenström".
Is there a known pattern for how this kind of attribution should work for ports from one language to another?
Responding to the licensing issue in this specific example is missing the forest for the trees, and based on your prior pro-LLM comments, not surprising to me. It is still disappointing that you refuse to engage with the fundamental question being asked here, simply because the only logical answer (that laundering code through LLMs is clearly an ethical violation, and possibly an illegal one, with severe and significant negative downstream effects on almost any software) is one that is very clearly anti-LLM and anti-coding-agent.
Under Apache, if it is a derived work, the original license should be left intact; a thank-you in the README isn't the same thing. I can anticipate two arguments:
If we put the nebulous legal issues aside (especially because the courts are far from figuring out how to handle both software & LLMs in a productive way), I think the moral issue & potential harm to the ecosystem are much clearer. This is a violation of the spirit of the license, and would likely be demoralizing to those who put their time into building something for the community, asking for nothing more than credit, only to see this.
It isn't licensed under Apache though. The original work is dual-licensed under Apache and MIT, consumer's choice. That means that you can choose how you license the work from the original authors, and do not need to comply with both licenses as long as you comply with at least one. (This is quite common in the Rust ecosystem, and I believe solves some issue with patent clauses but I can't remember the details.)
So assuming the author needs a legal pathway to licensing the work, they can choose to take the original work under the MIT license, and they comply with that by not using copies or substantial portions of the original work in their own.
With regards to the ethical argument, I think the bigger issue to the ecosystem is someone putting out a library handling a complex topic that they don't understand. I think the philosophy that any reimplementation of existing code must count as a derived work is both legally and ethically very shaky, and I don't think the use of an LLM changes that at all.
I believe solves some issue with patent clauses
That is the idea; people want MIT for the GPLv2 compatibility and Apache 2.0 for the explicit patent grant. IIRC, anyway - it's been a while and it's a little late here :)
I don't mean to argue over the minutiae, & I think we're mostly in agreement on whether this is a "good idea" or not. & I don't disagree that it'd be shaky to establish that line at "any reimplementation"; I think that would take it too far.
This is more than that, though. The line is fuzzy, and the court cases I cite in my replies have tried to find a line with some success, but this is a reimplementation directly based on the code from another tool, which to me is over the line.
I'll admit my viewpoint may be colored by spending a portion of my weekend dealing with students who submitted stolen code as part of an assignment, and wondering about the example that those of us who are more established are setting for these kids.
I think for me the key distinction is not "did you copy this or not?" (copying is broadly a Good Thing™). It's "did you understand this or not?". In the case of your students, the point of the assignment is to learn, so if they're copying things to get out of having to learn and understand, then that's a problem. Similarly here, if the author doesn't understand what they've produced, that feels like a problem to me.
That's why I'm approaching it more as an issue of a lack of understanding, because to me that feels like the core underlying problem here. If the author had translated the code by hand, that's still ultimately copying, but it feels like "good" copying - an open-source success. This feels like "bad" copying - or at least "potentially dangerous" copying - primarily because of that lack of understanding.
EDIT: But yeah, I agree that we're both broadly in agreement here, and I sympathise with having to deal with lazy students in the AI era! These tools can be used very well, but they also make it very easy to take shortcuts, and miss out on important parts of the learning process.
The question about what this means for the ecosystem is deep.
I'm reminded of the thing in the 1980s where Compaq cloned the IBM BIOS using a clean-room system. One set of engineers reverse-engineered the BIOS and wrote a detailed spec. They then handed that spec over to another team who built the implementation.
Emil's port of html5ever involved the LLM looking at the original source code. It's not hard to imagine an automated version of that IBM/Compaq thing - one LLM reads source code and transforms that into a detailed spec, then a second LLM session takes that spec and ports it to code.
IANAL but that seems to me like it might be the same mechanism as the IBM/Compaq thing, which apparently held up legally. Whether that's a moral thing to do is a different issue!
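A sketch of what that automated two-stage approach could look like. The complete() function here is hypothetical - a stand-in for whatever agent or LLM API you'd actually use - and the whole point of the structure is that the second session never sees the original source:

    def clean_room_port(original_source: str, complete) -> str:
        # Stage 1 (the "reverse engineers"): sees the original code, writes a spec.
        spec = complete(
            "Write a detailed behavioural specification of this code. "
            "Describe inputs, outputs and edge cases only - no code:\n\n"
            + original_source
        )
        # Stage 2 (the "implementers"): sees ONLY the spec, never the source.
        return complete(
            "Implement the following specification in Python, from scratch:\n\n"
            + spec
        )

Whether a court would treat two sessions of the same model as sufficiently separated is, of course, an open question.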
I'm not saying you're wrong. But it's nothing to do with the LLM. People rewrite and translate libraries without understanding the license implications surprisingly often. They should do the right thing here of course whether or not an LLM was involved
I've read quite a bit of the code now and it looks good to me. There are some quite long methods in the core tree builder but nothing that feels unreadable - honestly it's hard to go too badly wrong at ~3,000 total lines.
For me the trust comes from that existing 9,200-test suite, combined with the fact that the exposed external API design for JustHTML - which I've reviewed in full - is very simple and clean. I feel like I understand exactly what it's doing for me.
I suggested a new feature yesterday and Emil landed a great implementation within a few hours, which suggests that making changes is currently still very productive: https://github.com/EmilStenstrom/justhtml/issues/1
I think the test suite is doing a lot of the legwork here, and this is kind of a unique situation. Essentially it means that both I, as a potential library consumer, and the library author are just delegating our need to understand the complexities of HTML5 to the people who wrote that test suite. That's great for problems like this that have such a test suite, but doesn't feel particularly transferable to a wide range of problems. For example, much like HTML5 parsing, datetimes are really hard to handle correctly, and require a lot of in-depth knowledge of the subject matter to cover all sorts of different cases. But unlike HTML5, where there is a spec with clear right-or-wrong answers, there is no "correct" way to understand time, just better and worse approaches.
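The datetime point is easy to make concrete: the same string parses "correctly" three different ways depending on which convention you assume, and no test suite can tell you which one the user meant:

    from datetime import datetime

    s = "01/02/03"
    print(datetime.strptime(s, "%m/%d/%y"))  # 2003-01-02 (US convention)
    print(datetime.strptime(s, "%d/%m/%y"))  # 2003-02-01 (European convention)
    print(datetime.strptime(s, "%y/%m/%d"))  # 2001-02-03 (ISO-ish convention)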
I guess in the specifics, this demonstrates a case where AI can work really well, but also a case where the benefits feel relatively nebulous (was another HTML parsing library really needed here?). But I'm struggling to see any way that this generalises, given that it feels like what made this project work well is stuff that is not typically available to most projects.
Regarding code quality, a proper code review is a slow process and I've not written enough Python recently to do this project justice, but the things that jump out at me are the lack of documentation in various areas of the code, and the lack of type hints. In many ways those aren't that serious (the project works, the tests show that) and for an internal module in a project, I probably wouldn't bat an eyelid at that, but in this context I'm a bit more concerned about those sorts of issues.
I also don't think the example change there shows much either. Your suggestion was that the CLI could use an existing API to also allow the user to make queries. But the API already existed, so the change was mostly plugging existing parts together. This is something I would expect almost any developer to be able to do correctly, and with basically no knowledge of the codebase. But I'm more concerned about more complex changes — these are the ones where being able to understand the codebase is really important.
That's great for problems like this that have such a test suite, but doesn't feel particularly transferable to a wide range of problems.
I've found that most of the problems where I've gotten a large benefit from coding agents have been edge cases like this, but they're all edge cases in different ways. Examples:
I don't find LLMs to be super useful for "normal" engineering, but there are enough edge cases in a year to more than justify the current costs.
Yeah, I think type hints would be a good improvement here. I filed an issue.
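To illustrate why the hints matter for review (function and type names here are invented for the example, not taken from the JustHTML codebase): the difference is whether the contract is visible in the signature at all.

    # Without hints, a reviewer has to read the body to learn what "parent"
    # and "data" are allowed to be:
    def append_text(parent, data, collapse_whitespace=False):
        ...

    # With hints, the contract is part of the review surface:
    def append_text(parent: "Node", data: str, collapse_whitespace: bool = False) -> None:
        ...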
That's great for problems like this that have such a test suite, but doesn't feel particularly transferable to a wide range of problems.
That's entirely true. It's very, very rare for any problem to come with an existing test suite like this. Emil called that out as the reason he chose this project:
When picking a project to build with coding agents, choosing one that already has a lot of tests is a great idea. HTML5 is extremely well-specified, with a long specification and thousands of treebuilder and tokenizer tests available in the html5lib-tests repository.
It's still a useful account of how to use coding agents effectively for large-scale projects, but the exact same trick won't be applicable to most of them.
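The tokenizer half of html5lib-tests is even easier to wire up - the .test files are JSON with an input string and the expected token stream (format from memory, so treat the field names as approximate). A minimal runner sketch, with the tokenizer under test as a placeholder:

    import json, pathlib

    def run_tokenizer_tests(tokenize, path):
        # tokenize() is a placeholder for the tokenizer under test; it should
        # return tokens in the same shape the fixtures use, e.g.
        # [["StartTag", "a", {...}], ["Character", "x"]].
        data = json.loads(pathlib.Path(path).read_text(encoding="utf-8"))
        failures = 0
        for test in data.get("tests", []):
            if tokenize(test["input"]) != test["output"]:
                failures += 1
                print("FAIL:", test.get("description", test["input"]))
        return failures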
Am I reading it correctly that the text in this PR description is all it took for the first iteration of https://tools.simonwillison.net/justhtml? Or are you giving further feedback/iteration through a different channel?
It was almost just that prompt - I had a couple of follow ups around button text contrast which I included in this screenshot: https://github.com/simonw/tools/pull/156#issuecomment-3649661059
Then a second prompt and PR later to add the export to text/markdown feature: https://github.com/simonw/tools/pull/162
Not using html5ever seems like a mistake, and it's brushed over here with an "I don't like binaries" that neglects the other advantages of using maintained code. I also don't see any attribution or respect for the original license, despite the fact that this is a derivative work of html5ever (albeit one that won't benefit from upstream fixes!). This actively undermines the FOSS social contract.
While the author can do what they want, releasing this as a library means countless downstream repos relying on an AI slop version instead of a well-maintained repository by people with their nose in the standards. It opens them up to legal liability and serious technical debt.
This will ultimately punish users that adopt such a library: they'll need to migrate eventually when this stops being maintained since the author can't read the code. We should push back on people actually releasing libraries like this, they'll do untold harm to the FOSS ecosystem.
when this stops being maintained since the author can’t read the code
I think you're misinterpreting that note from the author a bit. Elsewhere they confirm that they reviewed it closely:
I spent a lot of time reviewing code, making design decisions, and steering the agent in the right direction. The agent did the typing; I did the thinking.
I think the author has a different definition of code review to me, because they also explicitly say "I still don't know how to solve [the problems caused by the adoption agency algorithm]", "I still don't know HTML5 properly", and "I didn't understand all the algorithmic choices". Partway through the process, they also suddenly find out that most of the code written is not used, which again suggests that their code review was not particularly thorough.
When I'm working with another human, that usually indicates that I've only reviewed their code at a very superficial level. That might be okay if I can trust them to be responsible for maintaining the section of the code, but it's not okay if I'm now responsible for maintaining code that I don't understand.
I don't see how that would be different if an AI wrote the code compared to a colleague or a fly-by contributor.
I look at the code (parser.py) and see a long docstring on the StrictModeError class about why it's derived from SyntaxError, and on the main class … nothing.
Not a fan already, since this tells me that nobody really read the code.
Not even a technical thing, just the fact that nobody said: maybe the technical details on the exception should go into a source comment, and the class that people actually will interact with should have a docstring?
I didn't understand all the algorithmic choices, but I understood when it didn't do the right thing.
I don't understand how this works. At least there was a robust test set including fuzzing to test the code against.
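For what it's worth, the fuzzing side of that is cheap to reproduce for any parser - the main property being checked is simply "never crashes on arbitrary input". A rough sketch, with parse() and serialize() as placeholders for whatever implementation you're poking at:

    import random

    FRAGMENTS = list("<>/=\"'& abc") + ["<p", "</p>", "<!--", "<![CDATA[", "&amp", "<b><i>", "</b>"]

    def fuzz(parse, serialize, rounds=10_000, seed=0):
        rng = random.Random(seed)
        for i in range(rounds):
            sample = "".join(rng.choice(FRAGMENTS) for _ in range(rng.randint(1, 200)))
            try:
                serialize(parse(sample))  # the property: never raises, whatever we feed it
            except Exception as exc:
                raise AssertionError(f"round {i} crashed on {sample!r}") from exc

It tells you nothing about whether the output tree is right, of course - that's what the conformance suite is for - but it's a useful backstop against the parser blowing up on inputs the suite never covers.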
JustHTML was partially a port of the html5ever Rust library to Python.
I just built justjshtml, as a direct port of JustHTML from Python to JavaScript. It took about 4.5 hours using Codex CLI and GPT-5.2 and I got to decorate a Christmas tree and watch the latest Knives Out movie at the same time.
Here's my full write-up of the process: https://simonwillison.net/2025/Dec/15/porting-justhtml/