@jackyan Some bits of information:
- ArchiveTeam ArchiveBot - that's how it identifies itself unless told otherwise - is not to be confused with the bots operated by the Internet Archive. See my message above for a short summary of the *different* IA-related bots out there.
- ArchiveBot is operated manually: for each website it crawls, an actual human has pasted the address, decided on the parameters, and hit <Enter>. Sometimes humans make mistakes and grab too much or too fast. IRC is the best way to reach the human in question ASAP.
- ArchiveBot reads robots.txt but doesn't abide by it. It's the job of the human operator to decide on crawl speed, forbidden URLs, and so on. Returning 429 or 503 doesn't have any immediate effect on the bot, but it hints to the human (if one is paying attention) that the speed should be dialed down - see the sketch after this list. Please do not return 403 or 404: those are considered permanent and won't be retried. Blackholing the IP for a day is suboptimal, but acceptable in cases of serious overload.
- ArchiveBot was originally designed for grabbing dying websites ASAP, so its default parameters are a bit skewed towards emergency use. However, its main use today is grabbing websites that look valuable and would be a shame to lose - because even if a site is alive today, nobody knows what will happen tomorrow. Yours probably falls into this category. So, if you don't want to be hit by AB, don't publish anything useful :)
- Other sites are probably getting hit because they're linked from your main site. ArchiveBot tries to preserve some "context" of the target website by making a shallow copy of everything it links to.
- Crawling you observe over a long period is unlikely to be ArchiveBot-related, unless your website is very popular. It's probably some Internet Archive crawler walking your site slowly and carefully.
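To make the status-code point concrete, here's a minimal sketch of a server-side throttle in plain Python (stdlib WSGI, no dependencies): answer ArchiveBot with 429 + Retry-After when overloaded, rather than a 403/404 that the bot would treat as permanent. The substring match on "ArchiveBot" (based on the default "ArchiveTeam ArchiveBot" user agent) and the `server_is_overloaded()` hook are assumptions - check your own logs for the exact user-agent string and plug in your own load check:

```python
from wsgiref.simple_server import make_server

def app(environ, start_response):
    # Placeholder application; stands in for your real site.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello\n"]

def server_is_overloaded():
    # Hypothetical hook: replace with your own load check
    # (request rate, CPU, queue depth, etc.).
    return False

def throttle_archivebot(wrapped_app):
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if "ArchiveBot" in ua and server_is_overloaded():
            # 429 signals the human operator to dial the speed down,
            # and the URL stays retryable - unlike 403/404.
            start_response(
                "429 Too Many Requests",
                [("Content-Type", "text/plain"), ("Retry-After", "120")],
            )
            return [b"Crawling too fast; please slow down.\n"]
        return wrapped_app(environ, start_response)
    return middleware

if __name__ == "__main__":
    make_server("", 8000, throttle_archivebot(app)).serve_forever()
```

The same idea works in any server or reverse proxy config; the essential bit is the status code, since that's what the human operator sees in the dashboard.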
Does that make sense?