there is currently a bot inside MIT IP space, address 18[.]4[.]38[.]176, scanning fedi at large. i have confirmed this with 5+ unrelated instance admins, large and small instances, across mastodon/misskey/pleroma/akkoma.

the bot is poorly behaved. i have observed it making repeated requests, multiple times per second, for the exact same paths (the paths being, generally: user profiles, specific posts, and sometimes following links in posts). returning 403s does not stop this activity. one of my domains received hundreds of additional requests despite replying with 403 to all of them. i have also seen it make requests for paths containing html tags - seems like a badly written parser. the purpose of these requests and what data is being gathered is unclear.
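if you want to check whether it hit your instance too, here's a rough sketch (demo log lines inline; read your real access log instead — and the regex for the malformed paths is just my guess at what the broken parser emits):

```python
import re

# demo lines in combined log format; read your real access log instead
LOG_LINES = [
    '18.4.38.176 - - [29/Nov/2023:07:08:27 +0000] "GET /users/alice HTTP/1.1" 403 0 "-" "Python-urllib/3.9"',
    '18.4.38.176 - - [29/Nov/2023:07:08:28 +0000] "GET /users/alice%3C/a%3E HTTP/1.1" 404 0 "-" "Python-urllib/3.9"',
    '203.0.113.9 - - [29/Nov/2023:07:09:00 +0000] "GET /about HTTP/1.1" 200 948 "-" "Mozilla/5.0"',
]

# all hits from the scanner's address
hits = [line for line in LOG_LINES if line.startswith("18.4.38.176 ")]

# the tell-tale malformed requests: paths containing literal or
# percent-encoded html tags, from the badly written parser
mangled = [line for line in hits if re.search(r'"GET [^"]*(</a>|%3C/a%3E)', line)]

print(len(hits), len(mangled))
```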

PTR on the ip returns sts-drand03.mit.edu. a quick web search for "mit drand" brings back https://mitsloan.mit.edu/faculty/directory/david-g-rand and his personal website: https://davidrand-cooperation.com/ (note: other IPs in the /24 also have PTR names matching other MIT faculty, but only the .176 IP appears to be involved in this activity).
seems he's doing research into "misinformation" and "fake news" on social media. he also appears to be on fedi! so
@Drand, given this activity is sourced from an IP with your name on it, could you share the purpose of this traffic? what data is being collected and how is it being used? do you plan to respect robots.txt or identify yourself in your useragent? is there a process for instance admins to opt out of this activity other than blocking the source IP?
for those who have checked logs on their instances, could you share the dates when the activity started? on this instance, the first record i have is from 2023/11/29, steadily ramping up since then
@natalie first access I have from this IP on my public instance is
[24/Nov/2023:19:39:06 +0100] "GET /api/v1/instance HTTP/1.1"
at ~4 minute intervals, 3 times; then it seems to have loaded some js/css/fonts on the 25th, scraped the public timeline multiple times on the 28th, and started scraping users on the 29th
more or less the same behavior on this instance
@natalie it started here on 2023-11-24; the first couple of days look like normal fedi usage.

the account crawling started here on 2023-11-29 and has been going ever since.

blocked the ip now. thanks for the heads up
@natalie (this includes mkab and fab together btw)
@natalie I see as early as Nov 25th, but it's been mostly quiet other than a sizable burst in requests for the public timeline on that day
@natalie whilst my logs roll over and i don't have anything that far back, he went absolutely wild on dec 2nd
@FloatingGhost @natalie that is why blob.cat had hiccups the other day... sadly i couldn't check it since i didn't have my laptop with me

@natalie
Earliest one here is from 03/Dec/2023:03:33:07 going up until I blocked it just now. Like 4k requests.

@natalie First request of IP 18.4.38.176 that I find is

18.4.38.176 - - [24/Nov/2023:21:08:06 +0100] "GET /api/v1/instance HTTP/1.1" 200 7022 "-" "python-requests/2.31.0"

Then a whole bunch of nothing, until

18.4.38.176 - - [27/Nov/2023:17:07:19 +0100] "GET /about HTTP/2.0" 200 948 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"

I rotate logs; the oldest request I have is from 22/Nov/2023:23:57:40 +0100, so if they made calls before that, I won't know.

For those interested, these are the main commands I used (zgrep greps through gzipped files)

ls /var/log/nginx/ilja.space-access.log*
zgrep '18.4.38.176' /var/log/nginx/ilja.space-access.log* | less

If interested, I dumped it in a file for future reference. I'm unsure if there's information in there I don't want to share, but if you're interested, I could go through it and share it if it's clean.

@natalie btw, for anyone interested, I now redirect the requests to an audio stream. I hope the bot just keeps downloading it as random data.

With nginx, let's say you have an audio stream, or just a big file, on https://example.com/big_thingy, then you can redirect as such

if ( $remote_addr ~* '18\.4\.38\.176' ) {
    return 302 https://example.com/big_thingy;
}

I tested it with my own IP, it worked \o/
@ilja @natalie One that could be interesting as well is a gzip bomb; for example, 10GB of zeroes compresses down to 10MB via gzip -9.
More modern algorithms might be even more efficient at compressing zeroes.
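scaled down, the idea looks like this (a stdlib sketch; sizes are illustrative):

```python
import gzip

# 10 MB of zeroes compresses to roughly 10 KB at the highest level;
# scale the payload up for a real bomb (the ~1000:1 ratio holds)
payload = bytes(10 * 1024 * 1024)
bomb = gzip.compress(payload, compresslevel=9)
print(f"{len(payload)} bytes -> {len(bomb)} bytes")
```

the server only ever stores and sends the small file; the client pays the full decompression cost, provided it's served with a Content-Encoding: gzip header so the scraper's HTTP library auto-inflates it.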
@lanodan @natalie i don't know enough about these things, but i would definitely like to be able to annoy them as much as possible without being a burden on others (like who hosts the stream/big file I use)
@natalie same for me, the first request I see in my logs is from 29/Nov/2023:07:08:27 +0000 and it's directly a request for a user profile. After that it seems to have steadily requested multiple user profiles
@natalie @Drand wow that's a good bit of reverse engineering
@arcade @natalie seconding this! Always a little scary the amount of info people can dig up so easily 😅
@natalie @Drand he’s not active on fedi, so someone should probably email him about this with these questions
@Clover @natalie @Drand or probably email the WHOIS abuse contact, which is arin-mit-security[@]mit[.]edu

@lena @Clover @natalie @Drand
We got in touch with him and are talking about it and monitoring the situation.

- TechHub Moderator

@natalie @Drand Amazing that it ended up being a literal Rand-o.
@natalie @Drand ip blocked lol

@julia @natalie

Do fedi servers have the ability to blackhole an IP? Instead of just block?

Literally just 302 to a special server whose entire reason for being is to hold the port open and sometimes give wait responses until the requester times out.

It's a great way of tying up someone's server.

You can also 302 to fake data, if you think the researcher is being a complete pig. Poison their data with pure garbage.
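A rough sketch of what that special server could look like (hypothetical, stdlib only): hold the socket open and drip out header bytes one at a time, so the client's request never completes:

```python
import itertools
import socket
import threading
import time

def start_tarpit(host="127.0.0.1", port=0, delay=5.0):
    """Accept connections and drip one plausible-looking header byte
    every `delay` seconds, keeping the client hanging until it gives up."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(16)

    def drip(conn):
        try:
            # endless stream of header bytes, never a complete response
            for b in itertools.cycle(b"HTTP/1.1 200 OK\r\nX-Please-Hold: yes\r\n"):
                conn.sendall(bytes([b]))
                time.sleep(delay)
        except OSError:
            pass  # client finally gave up
        finally:
            conn.close()

    def accept_loop():
        while True:
            conn, _ = srv.accept()
            threading.Thread(target=drip, args=(conn,), daemon=True).start()

    threading.Thread(target=accept_loop, daemon=True).start()
    return srv.getsockname()[1]  # the port actually bound

```

Then point the 302 (or a proxy_pass) for the scraper's IP at it. endlessh does the same trick for ssh.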

@doctormo @natalie what I did was block it with ufw on the proxy server I have running
@doctormo @natalie fedi can blackhole stuff if it has dns, but not really for bare ips

@julia @natalie @doctormo
We got in touch with him and are talking about it and monitoring the situation.

- TechHub Moderator

Out of moderator mode, some sort of blackhole technique might actually be useful, but I'd rather we figure out a way to come to agreements with the people who want to scrape for legitimate research before doing that to MIT servers.

@doctormo That's not a feature of a regular Mastodon install using the standard code base, but best practice is generally to run one's Mastodon server behind an nginx proxy and I'm fairly certain it would be relatively straightforward to identify source traffic by IP and route it to a dedicated server that returns that 302 and keeps the connection open...
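Sketched out, the nginx side could look something like this (the backend addresses are made up, and the map block would go in the http context):

```nginx
# hypothetical: choose a backend per client address
map $remote_addr $backend {
    default      "http://127.0.0.1:3000";  # the real Mastodon web process
    18.4.38.176  "http://127.0.0.1:9999";  # a tarpit that never answers
}

server {
    listen 443 ssl;
    server_name example.social;

    location / {
        proxy_set_header Host $host;
        proxy_pass $backend;
    }
}
```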
@natalie lol hammered my profile several times a second
easy block with ufw
@natalie 404s don't stop it either
i used to run an instance on my domain root and there's just a personal site there now, but it kept hammering that anyway
@natalie @Drand Take a look at this! On his CV:

https://static1.squarespace....

@izzie @natalie @ashten @Drand
We got in touch with him and are talking about it and monitoring the situation.

That being said, I hope he sees the fact that someone caught him and immediately spread an alarm across the network as a useful data point.

- TechHub Moderator

So, @Drand , are you gonna donate that grant money to the instance admins whose servers you hammered? Because we've been scraping together every last dollar we have to pay for the servers our friends use, and you stole both that server time and our data when we explicitly requested you don't in our robots.txt and #nobots tags. Or are you gonna pocket that cash? Cuz I see you receive around a million dollars in grants every year or two.


@ashten @natalie @Drand Was this scraping project run past an ethics review, seeing as it's being performed with MIT resources?

@ashten @natalie @Drand consider suing for damages, as well as publicly shaming the IP addresses and ASNs used for said attacks by submitting them to my #DROP blocklist
https://github.com/greyhat-academy/lists.d/blob/main/drop.ipv4.block.list.tsv

By creating an issue:
https://github.com/greyhat-academy/lists.d/issues/new


@ashten @natalie @Drand I’ll check the server log and ipblock…

annoying assholes

@ashten @Drand @natalie it appears i've sysadmined wrong and don't have the data on hand, only the last 40 requests over the 233656 total requests to the server. oops... will continue to monitor to see if that IP shows up

@Drand @ashten @natalie FOUND HIS ASS

the logs are very incomplete and very likely don't contain the first time he hit this instance

Dec 03 19:50:24 haproxy[87751]: 18.4.38.176:63119 {Python-urllib/3.9} “GET /users/sela HTTP/1.1”

and a lot of other requests like this later

(redacted some useless stuff from the log)

alright, around this date he scraped our instance, but for some reason didn’t scrape my account, but only my partners’.

@lena @natalie @ashten Very sorry about all this, see longer response here https://techhub.social/@Drand/111533675192775824
Dave Rand (@[email protected])

@[email protected] UPDATED Hi all, apologies for this, we looked into what was happening more and it turns out the issue was a link-unshortening script - we didn't scrape posts, only used the official API. We didn't realize the unshortening script was causing problems and have stopped it. In terms of why we were doing this data collection, we are doing research on how content moderation policies vary across servers, and how this can help inform the Fediverse more broadly about effective approaches to content moderation. You can get more of a sense of the kind of research we do here: https://docs.google.com/document/d/1k2D4zVqkSHB1M9wpXtAe3UzbeE0RPpD_E2UpaPf6Lds/edit?usp=sharing Sorry again about causing problems for folks! (And thanks to a couple of people for emailing me to let me know about this)

@Drand @lena @natalie @ashten Ah, the privilege of costing other people money, breaching their requested privacy and anonymity, and responding, "Whoops!"
@ashten @natalie @Drand Oh wow "Federated Moderation" (not a thing) and "Mastodon Network" (also not a thing) what a joke.
@lanodan @natalie @ashten @Drand Every time I hear "Mastodon network" my gorge rises. It really pisses me off for some reason.
@ashten @Drand @natalie
And looking in the logs I don't think I even want to block that IP because it's pretty much evidence against bullshit.

- A ton of requests to /users/lanodan, so just the profile. By ton I mean multiple times a second and multiple times an hour, the kind of thing that would be a banning offense
- 9 requests to /objects/ (ActivityPub post data for Pleroma), so pretty much nothing
- Only on 2023-11-28 did they grab /api/v1/timelines/public (MastodonAPI) with mastodonpy as User-Agent (everything else uses Python-urllib/3.9); without authentication on my instance you'd get only my public posts, and I post unlisted by default so it's only some replies (like this one)
- They also scraped parts of my blog, with URLs ending in </a> and %3C/a%3E. do you even HTML?

@Drand @ashten @natalie btw in terms of amount of requests:

$ grep -Rc 18.4.38.176 /var/log/syslog/NightmareMoon/local7_nginx/2023/1* | grep -v ':0$'
/var/log/syslog/NightmareMoon/local7_nginx/2023/11/27.log:41
/var/log/syslog/NightmareMoon/local7_nginx/2023/11/29.log:287
/var/log/syslog/NightmareMoon/local7_nginx/2023/11/04.log:41
/var/log/syslog/NightmareMoon/local7_nginx/2023/11/28.log:251
/var/log/syslog/NightmareMoon/local7_nginx/2023/11/30.log:57
/var/log/syslog/NightmareMoon/local7_nginx/2023/11/24.log:1
/var/log/syslog/NightmareMoon/local7_nginx/2023/12/01.log:338
/var/log/syslog/NightmareMoon/local7_nginx/2023/12/06.log:533
/var/log/syslog/NightmareMoon/local7_nginx/2023/12/05.log:502
/var/log/syslog/NightmareMoon/local7_nginx/2023/12/02.log:340
/var/log/syslog/NightmareMoon/local7_nginx/2023/12/03.log:194
/var/log/syslog/NightmareMoon/local7_nginx/2023/12/04.log:307
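(to total those per-day counts, a quick sketch over that file:count output, with demo numbers from a few of the lines above; pipe your real grep -c output in instead)

```python
# sum "path:count" lines as produced by grep -c; demo values inline
lines = [
    "2023/11/27.log:41",
    "2023/11/29.log:287",
    "2023/12/06.log:533",
]
total = sum(int(line.rsplit(":", 1)[1]) for line in lines)
print(total)
```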
@ashten @natalie @Drand So he's getting $210,000 for DOS-ing the fediverse? When I could have done it for free? (No, I couldn't. I don't have the skills)
@kinetix dunno if it's too late to be worth the bother now but might consider an IP block

@natalie @Drand
@apophis Thanks for pointing this out, btw. It's an interesting thread, and fortunately the 'issue' is mostly cleared up (and didn't cause any problems for us along the way)

@natalie you know, I was curious about whether I had any hits in the log so I checked and 1) yeah, most with the urllib user agent but also a few that are for, like, css and webfont resources with just a regular chrome on windows 10 user agent

is this someone's workstation that they're running a scraper script on lol

@halcy i saw that too but only on one domain out of three, was wondering the same

@natalie I have this person seemingly looking at (i didn't bother reordering, just zcat'd everything and looked through) /api/v1/instance with a python-requests UA on the 24th, then looking at /about/ with a browser a day later

there's some other UAs in there too, like mastodon.py's default one, which is frankly insult to injury for me

@natalie "proper" scraping starts on the 29th, with the urllib UA

I suspect they slid off mastopy because it tries to respect rate limits lol (it also may not have the API implemented for domain blocks yet because I've been a bit lazy)

edit: man, I should probably add a robots.txt parser in as well just to discourage this type of thing by default :|
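(the stdlib side of that is mostly there already; a sketch with urllib.robotparser, with the rules fed inline here — a real client would set_url() and read() the instance's /robots.txt instead)

```python
from urllib import robotparser

# parse a robots.txt that blocks crawlers from user profiles
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /users/",
])

# a well-behaved scraper checks before every fetch
print(rp.can_fetch("research-bot", "https://example.social/users/alice"))  # False
print(rp.can_fetch("research-bot", "https://example.social/about"))        # True
```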

@natalie overall this looks like someone trying various options while developing a scraper to me until eventually setting it loose