OK, I found a more aggressive scraping defense mechanism that has managed to catch over 9000 distinct IPs. Is there a way to semi-automatically analyze this, collect the relevant subnets and find out who they are assigned to, to see what the downsides of a subnet-wide ban would be?

#ShieldsUp

This is turning out to be an EXCELLENT collector for scraper IPs. But I really need to make sense of it somehow. I'm already at ~30K IPs in approx. 4.5 hours.
16 hours in, we're at ~125K IPs, so we're holding a rate of around 2 attempts per second. I'm still waiting for recommendations on tools that would let me wade through this huge collection of IPs to get statistics on who they belong to, whether there's an actual botnet in it (including residential addresses it has taken over) and/or which datacenters are involved. Any #recommendations? #askFedi #fediHelp #networking
I mean, I could cook up a script that starts from the first IP, runs a whois query to get its route, marks off all collected IPs that fall within that route, and then moves on to the next unmatched IP, but I can't believe nobody has done something like that already.
I ended up cooking up my own script. Of course, the issue with processing the WHOIS information of 175K IPs (and growing) is that queries to the WHOIS database have to be rate limited. I wrote a trivial Python script that does what I described in the previous post, limiting queries to IPs for which no range has been found yet, but apparently the inetnum ranges returned by whois are quite tight, so the reduction isn't that impressive (some ranges cover only 1 or 2 of the collected IPs).
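For anyone curious, the core of a script like the one described above might look roughly like this. It's a minimal sketch, not my actual script: it shells out to the system `whois` client (rate-limited with a plain sleep), parses the first `route`/`inetnum`/`CIDR` line it finds, and only queries WHOIS for IPs not already covered by a known range. Field names and formats vary between registries, so the parsing is best-effort; non-CIDR-aligned inetnum ranges are simplified to their first covering network.

```python
import ipaddress
import re
import subprocess
import time

def find_covering(ip, ranges):
    """Return the first known network containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    for net in ranges:
        if addr in net:
            return net
    return None

def whois_route(ip):
    """Ask the system `whois` client for the route/inetnum covering ip.

    Falls back to a /32 if nothing parseable comes back."""
    out = subprocess.run(["whois", ip], capture_output=True, text=True).stdout
    m = re.search(r"^(?:route|inetnum|CIDR):\s*(\S.*)$", out, re.M | re.I)
    if not m:
        return ipaddress.ip_network(f"{ip}/32")
    val = m.group(1).strip()
    if "-" in val:  # inetnum given as "a.b.c.d - e.f.g.h"
        lo, hi = (ipaddress.ip_address(x.strip()) for x in val.split("-", 1))
        # Simplification: keep only the first CIDR block covering the range.
        return next(iter(ipaddress.summarize_address_range(lo, hi)))
    return ipaddress.ip_network(val.split(",")[0].strip(), strict=False)

def group_ips(ips, lookup=whois_route, delay=1.0):
    """Map each IP to a covering range, calling lookup only when needed."""
    ranges = []
    groups = {}
    for ip in ips:
        net = find_covering(ip, ranges)
        if net is None:
            net = lookup(ip)
            ranges.append(net)
            time.sleep(delay)  # crude rate limit on WHOIS queries
        groups.setdefault(net, []).append(ip)
    return groups
```

The linear scan in `find_covering` is O(IPs × ranges); for 175K IPs you'd want to sort the ranges or use a prefix trie, but the dedup alone already cuts the number of WHOIS queries way down.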

@oblomov

Do you think you could adapt your script to filter out countries? I've been thinking about this for a while. I would like to exclude the US, Russia and China from getting my pages served, but IP blocks for these countries are fucking huge.

@77nn that's probably a bit too much, although in this case it seems to be mostly smaller countries, from the sampling I've done.