Wikipedia blacklists Archive.today, starts removing 695,000 archive links

38 points by videah


pushcx

This is not super-topical, but it's a good excuse for a meta note. When I read this last night I also removed archive.today from Lobsters.

If you'd like to contribute, two great opportunities for archive-related improvements to the codebase are:

  1. A background job to send all Links to Archive.org's Wayback Machine. The Links model already tracks all external links from stories and comments.
  2. A recurring background job to poll the stories currently on the front page and toggle the is_unavailable flag if they're unresponsive, which would start serving the copy of the linked post that we have from Diffbot.
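Neither job exists yet. As a rough sketch of the logic only, here is one way the two ideas could look. It is written in Python purely for illustration and is not Lobsters code: `Story`, `wayback_save_url`, `http_status`, and `refresh_availability` are hypothetical names, the "2xx/3xx counts as responsive" rule is an assumption, and the Wayback Machine "Save Page Now" URL prefix should be checked against Archive.org's current documentation and rate limits before relying on it.

```python
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional

# Assumption: Save Page Now accepts the target URL appended to this prefix.
WAYBACK_SAVE_PREFIX = "https://web.archive.org/save/"


@dataclass
class Story:
    """Hypothetical stand-in for a front-page story record."""
    url: str
    is_unavailable: bool = False


def wayback_save_url(url: str) -> str:
    """Build the URL a background job would request to ask for a capture."""
    return WAYBACK_SAVE_PREFIX + url


def http_status(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the HTTP status for url, or None if it could not be reached."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx/5xx).
        return err.code
    except (urllib.error.URLError, OSError, ValueError):
        # DNS failure, refused connection, timeout, or malformed URL.
        return None


def is_responsive(status: Optional[int]) -> bool:
    """Assumption: any 2xx or 3xx answer means the page is still up."""
    return status is not None and 200 <= status < 400


def refresh_availability(
    stories: Iterable[Story],
    fetch_status: Callable[[str], Optional[int]] = http_status,
) -> List[Story]:
    """Toggle is_unavailable on each story; return the ones that changed."""
    changed = []
    for story in stories:
        unavailable = not is_responsive(fetch_status(story.url))
        if story.is_unavailable != unavailable:
            story.is_unavailable = unavailable
            changed.append(story)
    return changed
```

Passing `fetch_status` in as a parameter keeps the toggle logic testable without touching the network. A real job would probably also want retries so that a single slow response doesn't flip the flag.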

~kline suggested standing up an ArchiveBox instance. I'd be open to trying that after these two features, if someone with Docker expertise would like to help me work through setting it up on DigitalOcean.

dzwdz

Previously. The Wikipedia talk page is hard to browse because half the links are broken, so here's some of the more damning evidence:

I have another piece of evidence of tampering: this is a Megalodon archive of an archive.ph archive of a post. The original post is now dead. Patokallio mentions this post in his blog – he would surely mention if the post mentioned him, in the way the archived version does. He quoted the original "[N.P.] was a woman[...]", while the archive.ph reads "Jani Patokallio was a woman[...]"

Some sort of find and replace indeed - Too bad that doesn't work on their own screen grab software. ([23] vs [24])


If you're wondering why everyone is redacting Nora's last name, gyrovague's article was edited yesterday to say:

It appears increasingly likely that the identity of “Nora” has been appropriated from an actual person, whose only connection to archive.today was a request to take down some content. As a courtesy, I have redacted their last name from this post.

The Wikipedia admins have scrubbed the name from the talk page for similar reasons. Presumably the a.t admin just wanted to follow suit.


Additional goodies from the a.t blog:

when we finally started communicating—around 2020—Mark mentioned that they [archive.org] come from a background of left-wing activism (this isn’t a secret; their biographies are public; I just hadn’t looked into them until it was brought to my attention).

By that time, Gamergate and various other scandals had already occurred. With few small exceptions, the right tended to preserve pages, while the left wanted to delete them. That was my aha moment: no collaborations were possible here. And so we became a kind of dialectical pair: we won’t delete what they delete, and vice versa, even when politics isn’t involved.

...and here I was criticizing 404 Media for bringing Gamergate into this.

There's one last interesting post on that blog which I'll just paste in full.

About Wikipedia, I promised to write when the referendum there ended so as not to influence it:

  1. Kiwi Farms set up an on-premise mirror of all archive.xx links from their forum (the volume is comparable). Wikimedia… never even had such an idea. That’s all you need to know about linkrot, contingency plans, and who to blame when links disappear.
  2. The value of the archive for Wikipedia was not in this, but in the ability to offload copyright issues. This is not about paywalls. This is, for example, about copyright trolls writing claims to stock photos, about articles deleted or changed for political reasons but pursued under the guise of copyright, etc. It is precisely these links that become dead, then got replaced with archive.xx, and we become the sink for all the attacks, legal and illegal. The need for fast-flux hosting, pseudonyms, and other pirate attributes stems largely from this. Do we really need this kind of “social burden”? Build your own toilet thing, you have millions.

Sorry for the giant quote dump; I just wanted to highlight what I think are the most important parts. All of the emphasis in quotes was added by me to make this easier to skim.

proctrap

This is very sad to see. Archive.today has so far been the most reliable source for JS-heavy websites, especially news outlets.

Garbi

archive.org is good. archive.today is bad. What about archive.is and the other archive.TLD domains? Does anyone have a list?

thasso

I manually archive and upload all pages that I link to on my blog. I still link to the original website, but there's a floppy disk icon next to the link, which brings you to the archived version.

This way, I don't have to deal with stuff like this or with a provider running out of funding, etc. My blog has only existed for 2.5 years, and even in that time I've already experienced significant link rot, which is why I started archiving everything. Either my blog is up and all pages that I link to can be retrieved, or it's gone itself. (Except, of course, the archiving goes only one layer deep.)

It's pretty easy to do because I usually link to other blog posts, which are static pages without a paywall. And, as I'm doing it manually anyway, I could probably work around interactive and paywalled pages too. Monolith makes archiving static pages really easy:

  monolith --isolate --no-js --output static/archive/%title%-%timestamp%.html <URL>
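If you wanted to run that over a whole list of links instead of one at a time, a minimal sketch could look like the following. It only wraps the monolith invocation above. The function names and the `archive_dir` default are made up for illustration, and it assumes `monolith` is on your PATH.

```python
import subprocess
from typing import Callable, Iterable, List


def monolith_command(url: str, archive_dir: str = "static/archive") -> List[str]:
    """Build the same monolith invocation as above for a single URL."""
    # %title% and %timestamp% are left for monolith itself to fill in.
    output = f"{archive_dir}/%title%-%timestamp%.html"
    return ["monolith", "--isolate", "--no-js", "--output", output, url]


def archive_all(urls: Iterable[str], run: Callable = subprocess.run) -> List[str]:
    """Archive each URL in turn and return the ones monolith failed on."""
    failed = []
    for url in urls:
        # check=False: one dead link should not abort the whole batch.
        result = run(monolith_command(url), check=False)
        if result.returncode != 0:
            failed.append(url)
    return failed
```

Calling `archive_all(["https://example.com/some-post"])` returns an empty list if everything was saved. Anything that comes back is a link to retry or archive by hand.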

I'm not sure about the copyright situation, but I advertise the archive on my home page, and I have a big disclaimer on the page that describes the archive. It explicitly states that the archived materials were not produced by me, and the authors should contact me if they want archived material to be removed. Of course I never had any issues; I don't think anyone cares at this scale.

I haven't seen this done on many (any?) other personal websites, but it would be nice if it became more commonplace. It's so frustrating to read an amazing post from 10 or 20 years ago only to find out that all links are broken.

WilhelmVonWeiner

A story of spite and escalation. There was no good reason for the original attack on archive.today, and there was no good reason for archive.today to respond. I doubt archive.today is going anywhere, and I don't personally care that they changed someone's name in archives (I'll still use it over archive.org, which goes out of its way to scrape things to archive, which I consider malicious and wrong), but that does fatally damage its credibility in the punk-academic niche it's become so useful for.