I hate compilers
62 points by cadey
62 points by cadey
Clang silently runs
wasm-optfrom$PATHbehind your back
TIL; madness. This prompted me to go verify that this doesn't affect zig cc, which thankfully it turns out not to because this only runs when using clang as the linker driver.
Clang relies on address layout for ordering things
I would personally consider this a bug and report it as such, assuming it still occurs on the latest release.
Elsewhere Xe said that they will be reporting upstream, it is definitely an LLVM determinism bug and there's been efforts over the years to get rid of those.
I've been trying (and failing) to get a minimal reproduction case, but I am going to file an issue with LLVM over this. This is my first time ever hitting a compiler bug. Kinda weird for the axiom of "assume the compiler is bug-free and your inputs are wrong somehow" to finally turn out to be false.
assume the compiler is bug-free and your inputs are wrong somehow
This holds for a great swath of engineers, but not systems programmers trying out relatively ambitious things (like yourself.) ANY attempt to cross-compile, especially when Wasm is also involved, will mean the compiler can be just as wrong as you!
Anyway thanks for the writeup: it's my favorite genre of blog posts here :)
You're welcome! I've been meaning to post more low-effort-ish posts like this outlining development progress and weird things I run into.
This holds for a great swath of engineers, but not systems programmers trying out relatively ambitious things (like yourself.) ANY attempt to cross-compile, especially when Wasm is also involved, will mean the compiler can be just as wrong as you!
Even then, I still like to start from the assumption that my inputs are wrong, not the compiler. The compiler being an inerrant oracle of truth is a nice place to start from because I have much easier control of my inputs than I do the compiler. I assumed I forgot a -Fdeterministic-iteration-order call or something in my build script. Turns out disabling ASLR at least makes things deterministic on the same machine (it may be even per-boot, it's kinda weird), so that may be good enough to leave in as a known issue in Anubis' repo and file it upstream as an LLVM issue.
Part of this debugging process has made me use GLM 5.2 as a somewhat context-aware fuzzer of clang flags, build system flags, etc in order to try and diagnose a root cause and/or generate a minimal reproduction case. After about 2048 test binaries it gave up. I can't blame it. It seems that the problem is only reproducible with a somewhat complicated input (binaryen). I may just be hitting the edge case of edge cases here because something binaryen does with exception handling pushes something over the edge in just the right way.
I don't know, compilers are weird.
Wait until you try to reliably use clang.exe on Windows as a cross-compiler. You're going mad, because there are like 500 ways that clang just assumes you're going to build for your native system anways
I don't mean to be critical, because this is open source and the OP is offering a popular service for free. Please be assured that I have a lot of respect for projects like this.
But at the same time, I want to voice a possibly unpopular opinion: I hate seeing the web become like this. It is common for me to visit a website and get a flash of the Anubis loading page. Is this the web we want? One where every popular website presents a proof-of-work splash page? I don't like it one bit, but with the constant influx of AI crawlers, I don't know what else can be done.
Is there any evidence that proof-of-work measures actually prevent AI crawlers from crawling a website? Don't they have gazillions of dollars to just solve the proof-of-work challenge for every website they come across? I mean, it takes only a fraction of a second to solve the challenge, and these crawlers are already doing far more computation while reading each page, so solving the proof-of-work challenge seems like a drop in the bucket.
Is there any evidence that proof-of-work measures actually prevent AI crawlers from crawling a website?
Yes, there have been multiple posts here about it.
Anubis has unquestionably been an effective deterrent to unwanted traffic over the duration of the pilot. With our current rule configurations (mostly, the defaults defined in the core software), Anubis is consistently blocking about 90% of requests to all three applications, with some variation by app (DDR: 71.0%; ArcLight: 94.6%; Catalog: 92.4%).
[...]
We provide the following details of our experience resuscitating our catalog in June to illustrate what an aggressive bot harvest does to an application, and show the extent to which Anubis mitigates the problem. For the catalog application, bot traffic sharply increased Fri, May 30 and essentially knocked the catalog out of service until Anubis could be applied on Tue, Jun 3. At the apex on Jun 1:
• 3.4 million HTTP requests (about 40/sec) were made by 2.1 million unique IPs (avg 1.65 per IP).
• Pageload speed had swelled from a typical 2-3 seconds to over 70 seconds.
• Staff attempted to thwart the bots by adding IP-, subnet-, and fingerprint-based blocking rules to existing RackAttack middleware configuration, but the distribution of traffic rendered that strategy ineffective.
[...]
On Jun 4, with Anubis fully in place, the catalog was back online for users. On that day:
• 1.3M requests were denied outright as they included a user agent including a string like MSIE or Trident (a tactic used by a noted recent wave of aggressive Brazilian scrapers).
[...]
• The application only had to handle 125K total HTTP requests (i.e., 1.45/second). Pageload speed improved to 2.12 sec.
https://lobste.rs/s/ncyfcp/anubis_pilot_project_report_june_2025
Long story short, deploying Anubis immediately solved our issues. In fact, you can see the exact time in our monitoring. I didn’t get a single notification afterward. The server load has never been lower. The attack itself is still ongoing at the time of writing this article. To me, Anubis is not only a blocker for AI scrapers. Anubis is a DDoS protection.
https://lobste.rs/s/67ijih/day_anubis_saved_our_websites_from_ddos
Really cool. Thanks for digging up those links. Exactly the kind of vouching I was looking for. Appreciate it!
No doubt adding obstruction to accessing the website is gonna be effective. However, we've observed that anything that merely verifies that JavaScript is being ran at all (by e.g. setting a opensesame=yes cookie and reloading the page) is just as effective in practice.
I'll start calling Anubis effective when it becomes noticeably better than such a simpler, less invasive solution.
Anubis is also evolving into a WAF, it was used to mitigate a GitLab billion laughs attack last year. The client challenges are just the most visible sign of it.
I appreciate you sharing this, but it does strike me that this is a red queen's race, and I wonder what the odds are that the techniques have changed in the past year.
There's more use of headless chrome. I've been trying to mitigate it on Anubis' end, but I am recovering from surgery and its associated burnout. I really need a vacation or something.
Is there any evidence that proof-of-work measures actually prevent AI crawlers from crawling a website?
You don't need what it does. I just deployed a blocker on https://shithub.us that's stateless, JS-free, and probably at least as expensive to work around as simple proof of work for scrapers.
That's a nice approach! But if you were targeted specifically (because you have a lot of content or because this gate mechanism is used in many places), it is trivial to workaround.
It requires you to make things pretty stateful on the client, since you can't just distribute the cookies among scrapers. I have a few other tricks up my sleeve as well, specifically around forcing a certain ordering of loads.
The philosophy here is to make it difficult to just have a queue of stateless workers, and enforce state management in the client side in a way that's easy to rate limit.
No one knows what the crawlers are or what they're for but it seems like many of them are very lazy and can't handle anything unusual, some people defeat them even with a meta refresh tag or a button to push. So yes, Anubis works but not because of the PoW just because of the unexpected nature of it.
Right but in contrast to a simple "unexpected" defence, Anubis can continue to work even if it becomes the expected thing.
If a hypothetical Anubis existed with a simple button and it became popular, bots would be adapted.
But maybe the diversity of simple solutions is enough?
Any bot using eg a headless browser can pass through Anubis. It raises their costs by some amount, but it's not clear that they actually care or need to care. And there are also clearly cheaper means of passing it than a headless browser.
Is this the web we want? One where every popular website presents a proof-of-work splash page?
It’s certainly not the web I want.
It is getting to be even more miserable than it was trying to use a browser with Javascript disabled. I just want the web to be about documents, and now I have to pass through Cloudflare and Anubis and captcha gates everywhere.
It's been known since the start that any APT can accelerate computation of Anubis responses. Quoting a proof-of-concept from last year:
Bots will always find a way to bypass WAFs while real users are forced to waste CPU cycles on a loading screen.
I hate it myself, but it's probably either that or resorting to recaptcha / hcaptcha to keep the site from going offline, which is even worse. There's other solutions such as iocaine which serve markov generated poison to crawlers, but that comes with its own issues, and I'm not too sure myself if that's much better than proof of work based mechanisms.
It seriously pisses me off how AI companies have forced admins into an antibot arms race which negatively impacts everyone.
Not a surprise, sadly. Compiler toolchains have a looooooong history of relying on absolutely batshit amounts of local context that Just Has To Be Right. Funny to see this from clang though, since LLVM has been one of the leaders in weeding it out. This has let us, for example, have a rustc with no concept of separate a "cross-compiler".
Try bootstrapping an OS without relying on existing build tooling. Go ahead, just try. Build a kernel, build the libc and compiler for the kernel so you can run stuff, boot it, and rebuild all the above on the new OS. It's an absurdly complicated and fiddly process full of unspoken assumptions, and literally everyone on the planet builds their own boutique set of tools for doing it that work differently and patch over different holes in the process in different ways. Some of this is the fault of the operating systems, some of it is inevitable system complexity, but a lot of it is that it's just really fucking easy to add implicit state, really hard to know that it's there, and a lot of work to manage sanely.
It's a rare problem that only really touches OS and compiler devs, so there are no good tools or best practices for it. There's probably 5 people in the world per compiler+OS pair who actually understands how it all works.
Yeah I was surprised to hear this? I thought LLVM was designed for cross-compilation (host and target abstracted), unlike GCC
And I thought that's basically where the Zig toolchain got some of those features, although I understand they did a lot of work to even more cleanly separate it, and they no longer use LLVM (?)
But Clang has those same problems? They didn't inherit a cleaner architecture from LLVM?
IIRC, you do use Nix; I'm sincerely wondering why you didn't seem to mention and/or use Nix for mitigating at least part of the variability of the environment? (E.g. the "wasm-opt from PATH" part.) Or you did and I missed it?