toomuchtodo 3 hours ago

Archive.org is the archiver, rotted links are replaced by Archive.org links with a bot.

https://meta.wikimedia.org/wiki/InternetArchiveBot

https://github.com/internetarchive/internetarchivebot

jsheard 3 hours ago | parent [-]

Yeah, for historical links it makes sense to fall back on IA's existing archives, but going forward Wikipedia could take its own snapshots of cited pages and substitute them in if/when the original rots. That would be more reliable than hoping IA grabbed it.

toomuchtodo 3 hours ago | parent [-]

Not opposed. Wikimedia tech folks are very accessible in my experience; ask them to make a GET or POST to https://web.archive.org/save whenever a link is added via the wiki editing mechanism. Easy peasy. Example CLI tools: https://github.com/palewire/savepagenow and https://github.com/akamhy/waybackpy
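For what it's worth, the trigger is tiny. A minimal sketch, assuming the plain `GET https://web.archive.org/save/<url>` capture endpoint (the function names here are made up for illustration):

```python
import urllib.request

SAVE_ENDPOINT = "https://web.archive.org/save/"

def save_request_url(target: str) -> str:
    """Build the Wayback Machine capture URL for a newly added link."""
    return SAVE_ENDPOINT + target

def archive(target: str) -> str:
    """Fire the capture request; the response URL points at the snapshot."""
    with urllib.request.urlopen(save_request_url(target), timeout=120) as resp:
        return resp.url
```

The authenticated Save Page Now API (with an S3-style key) is the politer option for bulk use, but the bare endpoint works for one-off captures.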

A shortcut is to consume the Wikimedia changelog firehose and make these HTTP requests yourself, performing a CDX lookup first to see if a recent snapshot was already taken before issuing a capture request (to be polite to the capture worker queue).
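The dedup half of that is a one-call CDX lookup. A sketch, assuming the public CDX search API; the 30-day freshness window and function names are my own choices (the firehose itself would be Wikimedia's EventStreams `recentchange` feed, not shown here):

```python
import json
import time
import urllib.parse
import urllib.request

CDX_API = "https://web.archive.org/cdx/search/cdx"

def cdx_query(url: str, days: int = 30) -> str:
    """Build a CDX lookup URL for snapshots newer than `days` ago."""
    cutoff = time.strftime("%Y%m%d", time.gmtime(time.time() - days * 86400))
    params = {"url": url, "output": "json", "from": cutoff, "limit": "1"}
    return CDX_API + "?" + urllib.parse.urlencode(params)

def has_recent_snapshot(url: str, days: int = 30) -> bool:
    """Check the Wayback Machine before queueing a fresh capture."""
    with urllib.request.urlopen(cdx_query(url, days)) as resp:
        rows = json.loads(resp.read())
    return len(rows) > 1  # row 0 of the JSON output is the column header
```

Only when `has_recent_snapshot` returns False would you hit the capture endpoint, which keeps redundant requests out of IA's worker queue.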

Gander5739 2 hours ago | parent | next [-]

This already happens. Every link added to Wikipedia is automatically archived on the wayback machine.

RupertSalt an hour ago | parent | prev [-]

[citation needed]

Gander5739 36 minutes ago | parent [-]

Ironic, I know. I couldn't find where I originally heard this years ago, but the InternetArchiveBot page linked above says "InternetArchiveBot monitors every Wikimedia wiki for new outgoing links" which is probably referring to what I said.

jsheard 3 hours ago | parent | prev | next [-]

I didn't know you could just ask IA to grab a page before their crawler gets to it. In that case, yeah, it would make sense for Wikipedia to ping them automatically.

ferngodfather 2 hours ago | parent | prev | next [-]

Why wouldn't Wikipedia just capture and host this themselves? Surely it makes more sense to DIY than to rely on a third party.

huslage 2 hours ago | parent | next [-]

Why would they need to own the archive at all? The archive.org infrastructure is built to do this work already. It's outside of WMF's remit to internally archive all of the data it has links to.

RupertSalt 3 hours ago | parent | prev [-]

Spammers and pirates just got super excited at that plan!

toomuchtodo 2 hours ago | parent [-]

There are various systems in place to defend against them. I'd recommend against trying; poor form against a public good is not welcome.