@jackyan Some bits of information:
- ArchiveTeam ArchiveBot - that's how it identifies itself unless told otherwise - is not to be confused with the bots operated by the Internet Archive. See my message above for a short summary of the *different* IA-related bots out there.
- ArchiveBot is operated manually: for each website it crawls, an actual human has pasted the address, decided on the parameters, and hit <Enter>. Sometimes humans make mistakes and grab too much or too fast. IRC is the best way to reach the human in question ASAP.
- ArchiveBot reads robots.txt but doesn't abide by it. It's the job of the human operator to decide on crawl speed, forbidden URLs, and so on. Returning 429 or 503 doesn't have any immediate effect on the bot, but it hints to the human (if one is paying attention) that the speed should be dialed down - see the sketch after this list. Please do not return 403 or 404: those are considered permanent and won't be retried. Blackholing the IP for a day is suboptimal, but acceptable in cases of serious overload.
- ArchiveBot was originally designed for grabbing dying websites ASAP, so its default parameters are a bit skewed towards emergency use. However, its main use today is grabbing websites that look valuable and would be a shame to lose - because even if a site is alive today, nobody knows what will happen tomorrow. Yours probably falls into this category. So, if you don't want to be hit by AB, don't publish anything useful :)
- Other sites are probably getting hit because they're linked from your main site. ArchiveBot tries to preserve some "context" of the target website by making a shallow copy of everything it links to.
- Crawling you observe over a long period is unlikely to be ArchiveBot-related, unless your website is very popular. It's probably some Internet Archive crawler walking your site slowly and carefully.
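To make the status-code point concrete, here's a minimal sketch of a server-side throttle in plain Python (stdlib WSGI, no dependencies): answer ArchiveBot with 429 + Retry-After when overloaded, rather than a 403/404 that the bot would treat as permanent. The substring match on "ArchiveBot" (based on the default "ArchiveTeam ArchiveBot" user agent) and the `server_is_overloaded()` hook are assumptions - check your own logs for the exact user-agent string and plug in your own load check:

```python
from wsgiref.simple_server import make_server

def app(environ, start_response):
    # Placeholder application; stands in for your real site.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello\n"]

def server_is_overloaded():
    # Hypothetical hook: replace with your own load check
    # (request rate, CPU, queue depth, etc.).
    return False

def throttle_archivebot(wrapped_app):
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if "ArchiveBot" in ua and server_is_overloaded():
            # 429 signals the human operator to dial the speed down,
            # and the URL stays retryable - unlike 403/404.
            start_response(
                "429 Too Many Requests",
                [("Content-Type", "text/plain"), ("Retry-After", "120")],
            )
            return [b"Crawling too fast; please slow down.\n"]
        return wrapped_app(environ, start_response)
    return middleware

if __name__ == "__main__":
    make_server("", 8000, throttle_archivebot(app)).serve_forever()
```

The same idea works in any server or reverse proxy config; the essential bit is the status code, since that's what the human operator sees in the dashboard.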
Does that make sense?