What exactly is archiveteam's contribution? I don't fully understand.

Edit: Like they kinda seem like an unnecessary middle-man between the archive and archivee, but maybe I'm missing something.

creatonez 3 days ago | parent | next [-]

What ArchiveTeam mainly does is provide hand-made scripts to aggressively archive specific websites that are about to die, prioritizing whatever the community deems most endangered and most important. They provide a bot you can run to grab these scripts automatically and run them on your own hardware, to join the volunteer effort.

This is in contrast to the Wayback Machine's built-in crawler, which is just a broad-spectrum internet crawler without any specific rules, prioritizations, or supplementary link lists.

For example, one ArchiveTeam project had the goal of saving as many obscure wikis as possible, using the MediaWiki export feature rather than just grabbing page contents directly. This came in handy for thousands of wikis that were affected by Miraheze's disk failure and happened to have backups created by this project. Thanks to the domain-specific technique, the backups were high-fidelity enough that many users could immediately restart their wiki on another provider as if nothing had happened.
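To make the "export rather than scrape" idea concrete, here's a rough sketch of building an export request against a wiki's MediaWiki Action API. This is only an illustration, not ArchiveTeam's actual tool: the `export_url` helper and the `wiki.example.org` endpoint are made up, and it only asks for current revisions.

```python
from urllib.parse import urlencode


def export_url(api_base: str, titles: list[str]) -> str:
    """Build an Action API URL that returns the pages' wikitext as export XML.

    `api_base` is the wiki's api.php endpoint (hypothetical here).
    `export` asks for export XML of the current revisions, and
    `exportnowrap` returns that XML without the usual API wrapper.
    """
    params = {
        "action": "query",
        "titles": "|".join(titles),  # the API separates titles with "|"
        "export": "1",
        "exportnowrap": "1",
    }
    return f"{api_base}?{urlencode(params)}"


if __name__ == "__main__":
    print(export_url("https://wiki.example.org/w/api.php", ["Main Page", "Help:Contents"]))
```

The point is that the result is the raw wikitext in an importable XML format, not rendered HTML, which is why a wiki can be restored from it elsewhere.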

They also try to "graze the rate limit" when a website announces a shutdown date and there isn't enough time to capture everything. They actively monitor for error responses and adjust the archiving rate accordingly, to get as much as possible as fast as possible, hopefully without crashing the backend or inadvertently archiving a bunch of useless error messages.
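A toy sketch of what "grazing the rate limit" could look like, assuming a simple additive-increase / multiplicative-decrease rule. I don't know the exact logic ArchiveTeam uses; the `AdaptiveRate` class, its default numbers, and the `worth_saving` check are all made up for illustration.

```python
class AdaptiveRate:
    """Speed up slowly on success, back off sharply on error responses."""

    def __init__(self, rate=1.0, min_rate=0.1, max_rate=50.0, step=0.5, backoff=0.5):
        self.rate = rate          # requests per second (assumed unit)
        self.min_rate = min_rate  # never slow down below this
        self.max_rate = max_rate  # never speed up above this
        self.step = step          # added to the rate after each success
        self.backoff = backoff    # rate is multiplied by this after an error

    def record(self, status: int) -> float:
        """Update the rate from one HTTP status code and return the new rate."""
        if status == 429 or status >= 500:
            # The server is pushing back or struggling: cut the rate.
            self.rate = max(self.min_rate, self.rate * self.backoff)
        else:
            # Things look healthy: creep back up towards the limit.
            self.rate = min(self.max_rate, self.rate + self.step)
        return self.rate

    def delay(self) -> float:
        """Seconds to wait before the next request at the current rate."""
        return 1.0 / self.rate


def worth_saving(status: int) -> bool:
    """Skip error responses so the archive isn't filled with error pages."""
    return 200 <= status < 400


if __name__ == "__main__":
    limiter = AdaptiveRate(rate=4.0)
    for status in (200, 200, 429, 200, 503):
        print(status, limiter.record(status), worth_saving(status))
```

The asymmetry is the whole trick: climbing in small steps keeps you near the limit, while halving on errors gives the backend room to recover.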

	dkh 3 days ago | parent | next [-]
		I just made a root comment with my experience seeing their process at work, but yeah it really cannot be overstated how efficient and effective their archiving process is
	iamacyborg 3 days ago | parent | prev [-]
		Their MediaWiki tool was also invaluable in helping us fork the Path of Exile wiki from Fandom.

wongarsu 3 days ago | parent | prev | next [-]

> Like they kinda seem like an unnecessary middle-man between the archive and archivee

They are the middlemen who collect the data to be archived.

In this example the archivee (goo.gl/Alphabet) is simply shutting the service down and has no interest in archiving it. Archive.org is willing to host the data, but only if somebody brings it to them. ArchiveTeam writes and organises crawlers to collect the data and send it to Archive.org.

wlonkly 3 days ago | parent | prev | next [-]

Archive Team is carrying books in a bucket brigade out of the burning library. Archive.org is giving them a place to put the books they saved.

1gn15 3 days ago | parent | prev | next [-]

ArchiveTeam delegates tasks to volunteers (and to its own members) running the Archive Warrior VM, which does the actual archiving. The resulting archives are then centralized by ArchiveTeam and uploaded to the Internet Archive.

(Source: ran a Warrior)

notpushkin 3 days ago | parent | next [-]

Sidenote, but you can also run a Warrior in Docker, which is sometimes easier to set up (e.g. if you already have a server with other apps in containers).

	kalleboo 3 days ago | parent [-]
		Yep, I have my archiveteam warrior running in the built-in Docker GUI on my Synology NAS. Just a few clicks to set up and it just runs there silently in the background, helping out with whatever tasks it needs to.

gunalx 2 days ago | parent | prev [-]

Ran Archive Warrior a while back but had to shut it down, as I started seeing that the VM was compromised and trying to spam SSH and other login attempts on my local network.

	mdaniel 2 days ago | parent [-]
		This smells like a one-click bringup gone wrong, and not that the Warrior software was compromised. Is that the story, or are you saying that the machine was secured correctly but that running Warrior somehow exposed your network to risk?

diggan 3 days ago | parent | prev | next [-]

> What exactly is archiveteam's contribution? I don't fully understand.

If Internet Archive is a library, ArchiveTeam is the people who run around collecting stuff and give it to the library for safekeeping. Stuff that is estimated or announced to be disappearing soon tends to be the focus, too.

debesyla 3 days ago | parent | prev | next [-]

They gathered up the links for processing, because Google doesn't just hand over a list of the short links in use. So the links have to be gathered by brute force first.
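Roughly, brute-forcing here means generating every possible short code and checking which ones resolve. A minimal sketch of the generation half, assuming codes are drawn from letters and digits (the alphabet, the code lengths, and the `candidate_codes` helper are my assumptions, not the project's real parameters):

```python
from itertools import product
import string

# Assumed alphabet for short codes: a-z, A-Z, 0-9.
ALPHABET = string.ascii_letters + string.digits


def candidate_codes(alphabet: str, min_len: int, max_len: int):
    """Yield every possible code of each length from min_len to max_len."""
    for length in range(min_len, max_len + 1):
        for combo in product(alphabet, repeat=length):
            yield "".join(combo)


if __name__ == "__main__":
    # Tiny alphabet so the output stays readable.
    print(list(candidate_codes("ab", 1, 2)))
    # The search space grows fast: 62**n codes of length n.
    print(len(ALPHABET) ** 4)
```

Each candidate then has to be requested to see where (or whether) it redirects, which is why the work gets split across many volunteers.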

horseradish7k 3 days ago | parent | prev [-]

liability shield