OK, I've found a more aggressive scraping defense mechanism that has managed to catch over 9000 distinct IPs. Is there a way to semi-automatically analyze this, collect the relevant subnets, and find out who they are assigned to, to see what the downsides of a subnet-wide ban would be?

#ShieldsUp

This is turning out to be an EXCELLENT collector for scraper IPs. But I really need to make sense of it somehow. I'm already at ~30K IPs in approx. 4.5 hours.
16 hours in, we're at ~125K IPs, so we're holding steady at around 2 attempts per second. I'm still waiting for recommendations on tools that would let me wade through this huge collection of IPs and get statistics on who they belong to, whether there's an actual botnet in it (including residential addresses it has taken over), and/or which datacenters are involved. Any #recommendations? #askFedi #fediHelp #networking
I mean, I could cook up a script that takes the first IP, runs a whois query to get its route, finds all collected IPs that match that route, and then moves on to the next uncovered IP, but I can't believe nobody has done something like this already.
I ended up cooking up my own script. The issue with processing the WHOIS information for 175K IPs (and growing) is, of course, that queries to the WHOIS database have to be rate-limited. I wrote a trivial Python script that does what I mentioned in the previous post, limiting queries to IPs for which no range has been found yet, but the inetnum ranges returned by whois are apparently quite tight, so the reduction isn't “impressive” (some ranges cover only 1 or 2 IPs).
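For the curious, the core of that approach can be sketched roughly like this. This is a minimal, hypothetical version, not the actual script; the `whois` invocation, the `route:`/`CIDR:` parsing, and the one-second delay are all assumptions:

```python
import ipaddress
import re
import subprocess
import time

# whois output formats vary by RIR; ARIN uses "CIDR:", RIPE uses "route:"
ROUTE_RE = re.compile(r"^(?:route|CIDR):\s*(\S+)", re.MULTILINE | re.IGNORECASE)

def whois_route(ip, delay=1.0):
    """Query whois for one IP and return its announced route as a network.
    Falls back to a /32 if no route line is found."""
    out = subprocess.run(["whois", ip], capture_output=True, text=True).stdout
    time.sleep(delay)  # crude rate limiting between queries
    m = ROUTE_RE.search(out)
    return (ipaddress.ip_network(m.group(1), strict=False)
            if m else ipaddress.ip_network(ip + "/32"))

def group_by_route(ips, lookup=whois_route):
    """Group IPs by covering range, doing one whois lookup per IP that
    is not already covered by a previously seen range."""
    ranges = {}  # network -> list of member IPs
    for ip_str in ips:
        ip = ipaddress.ip_address(ip_str)
        net = next((n for n in ranges if ip in n), None)
        if net is None:
            net = lookup(ip_str)
            ranges[net] = []
        ranges[net].append(ip_str)
    return ranges
```

Because `lookup` is a parameter, the grouping logic can be exercised without hitting the network at all, which also makes it easy to swap in a cached or bulk data source later.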
@oblomov surprising there's no download option for the whois db
@oblomov surely compression and serving a static file would be *less* work and load
@arichtman maybe there is, but I really have no idea how to find it. There's probably no such thing as a single, complete whois db, but some way to get the various assigned blocks would really help.
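In case it helps: the RIRs do each publish "delegated" statistics files listing their assigned blocks with country codes, as pipe-separated records of the form `registry|cc|type|start|count|date|status|...`. A rough sketch of turning the IPv4 records into networks, with the parsing details assumed from that published format:

```python
import ipaddress

def parse_delegated(lines):
    """Yield (country_code, network) for ipv4 records in an RIR
    delegated stats file. Header, version, and summary lines have
    fewer fields or a non-'ipv4' type field, so they are skipped."""
    for line in lines:
        if line.startswith("#") or "|" not in line:
            continue
        parts = line.strip().split("|")
        if len(parts) < 7 or parts[2] != "ipv4":
            continue
        cc, start, count = parts[1], parts[3], int(parts[4])
        first = ipaddress.ip_address(start)
        last = ipaddress.ip_address(int(first) + count - 1)
        # count is a number of addresses, not always a power of two,
        # so one record may expand to several CIDR networks
        for net in ipaddress.summarize_address_range(first, last):
            yield cc, net
```

Feeding the collected scraper IPs through a table built this way would avoid per-IP whois queries entirely, at the cost of getting allocation-level rather than route-level ranges.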

@oblomov found a project that downloads the TLDs from data.IANA.org - seems like there would be more

https://github.com/tigger04/tlds

GitHub - tigger04/tlds: Bash script to automatically download and maintain up-to-date Top-Level Domain (TLD) lists from IANA. Provides both original case and lowercase versions for easy integration.

@oblomov

Do you think you could adapt your script to filter out countries? I've been thinking about this for a while. I would like to exclude the US, Russia and China from getting my pages served, but the IP blocks for these countries are fucking huge.

@77nn that's probably a bit too much, although in this case it seems to be mostly smaller countries, from the sampling I've done.