statistics:
hits 1037549
after filter 13530
bot rate 98.70%
addrs filtered 6018
UAs filtered 225
paths filtered 1566
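
The bot rate in these stats blocks appears to be the share of hits thrown away by the filters. That is inferred from the numbers, not stated anywhere, so treat this as a rough sketch; `bot_rate` is a made-up name.

```python
def bot_rate(hits: int, after_filter: int) -> float:
    """Percentage of hits discarded by the filters (assumed definition)."""
    if hits <= 0:
        raise ValueError("hits must be positive")
    return 100.0 * (hits - after_filter) / hits


# Figures taken from the block above.
print(f"bot rate {bot_rate(1037549, 13530):.2f}%")
```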

something something dead internet theory

got another few valid hits on my bitflip experiment:
statistics:
hits 53731
after filter 19
bot rate 99.96%
addrs filtered 200
UAs filtered 8
paths filtered 86

new hits:
- one from an Android 9 device on Rogers (IPv6) using a Gmail WebView
- 4 from Google-owned IPs(!): three tracking pixels from Blogger domains, and one PageSpeed proxy request

I am slightly intrigued by the Google IPs - do they run a lot of gear without ECC memory?
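
For anyone wondering which names such an experiment would cover: a rough sketch of enumerating the one-bit-flip neighbours of a hostname label. This is not the actual tooling used here; `bitflip_variants` and `VALID` are hypothetical names, and it assumes a flip hits exactly one bit of one ASCII byte in the name.

```python
import string
from typing import List

# Characters allowed in a hostname label (letters, digits, hyphen).
VALID = set(string.ascii_lowercase + string.digits + "-")


def bitflip_variants(label: str) -> List[str]:
    """Return every distinct valid label reachable by flipping one bit."""
    label = label.lower()
    found = set()
    for i, ch in enumerate(label):
        for bit in range(8):
            # DNS is case-insensitive, so fold the flipped character to lowercase.
            flipped = chr(ord(ch) ^ (1 << bit)).lower()
            if flipped not in VALID:
                continue  # flip produced a character that can't appear in a label
            candidate = label[:i] + flipped + label[i + 1:]
            if candidate == label:
                continue  # case-only flip, resolves to the same name
            if candidate.startswith("-") or candidate.endswith("-"):
                continue  # labels may not begin or end with a hyphen
            found.add(candidate)
    return sorted(found)


# Example label only; swap in whatever domain is being studied.
for name in bitflip_variants("googleusercontent")[:5]:
    print(name + ".com")
```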

update since the 8th:

hits 110397
after filter 27
bot rate 99.98%
addrs filtered 382
UAs filtered 13
paths filtered 141

new hits:
- two hits from distinct AWS IPv4s ~2 seconds apart, to a Gmail asset URL
- two more hits to the exact same tracking pixel URL from before (same referrer as well), one from DigitalOcean and another from a residential .VN ISP
- one hit to the default Google profile picture (referrer accounts.google.com) from a possible proxy in .PK
- one hit to a placeholder image used in the Google Photos app (com.google.android.apps.photos in the UA) from a residential IPv6 in .VN
- one hit to a user's Google profile picture from a residential IPv4 in .IN (referrer speedtypingonline.com)

it's getting a little trickier to filter out all the weird noise: my regex rules are starting to get kinda cluttered, and I didn't build in any way to comment or document them. I think I will pick up another batch of 15 domains next paycheck, as there is still a surprising amount of activity even with my small sample set so far.
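
One possible way to get documented rules without much machinery: a plain-text rule list where `#` lines are comments that get attached to the next pattern. This is only a sketch under assumptions; the file format, the `load_rules` / `first_match` helpers, and the two example patterns are all made up and not taken from the real rule set.

```python
import re
from typing import List, Optional, Pattern, Tuple

# Hypothetical rule list: '#' lines document the pattern that follows them.
RULES_TEXT = r"""
# scanners probing for WordPress installs
^/wp-(admin|login|content)

# dotfile / secret harvesting
^/\.(env|git)
"""


def load_rules(text: str) -> List[Tuple[Pattern[str], str]]:
    """Parse rule text into (compiled regex, attached comment) pairs."""
    rules = []
    comment: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            comment = []  # a blank line detaches any pending comment
            continue
        if line.startswith("#"):
            comment.append(line.lstrip("# "))
            continue
        rules.append((re.compile(line), " ".join(comment)))
        comment = []
    return rules


def first_match(rules: List[Tuple[Pattern[str], str]], path: str) -> Optional[str]:
    """Return the comment of the first rule matching path, or None."""
    for pattern, note in rules:
        if pattern.search(path):
            return note
    return None


rules = load_rules(RULES_TEXT)
print(first_match(rules, "/wp-login.php"))
```

A side effect of keeping the comment with the compiled pattern is that each filtered hit can be logged together with the reason it was dropped. A pattern that itself starts with `#` would need escaping in this format.
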
next step will also be to start logging all DNS queries - it seems like 99.9% of the garbage traffic is hitting the base domain, while all the interesting stuff is hitting well-known subdomains. I can see this sort of analysis being a lot harder for non-CDN domains that don't have unique subdomains...
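
The base-domain vs. subdomain split could be tallied along these lines. A minimal sketch: `subdomain_of` and `tally` are hypothetical helpers, `example.test` stands in for a registered bitflip domain, and hostnames are assumed to arrive without a port.

```python
from collections import Counter
from typing import Iterable, Optional


def subdomain_of(host: str, base: str) -> Optional[str]:
    """Return the subdomain part of host under base.

    '' means the bare base domain, None means host is not under base at all.
    """
    host = host.lower().rstrip(".")
    base = base.lower().rstrip(".")
    if host == base:
        return ""
    if host.endswith("." + base):
        return host[: -len(base) - 1]
    return None


def tally(hosts: Iterable[str], base: str) -> Counter:
    """Count hits per subdomain, bucketing the bare domain as '(base)'."""
    counts: Counter = Counter()
    for host in hosts:
        sub = subdomain_of(host, base)
        if sub is None:
            counts["(unrelated)"] += 1
        else:
            counts[sub or "(base)"] += 1
    return counts


print(tally(["example.test", "lh3.example.test", "EXAMPLE.test."], "example.test"))
```
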
I'll probably try to cobble together a custom DNS server for this and run the nodes on my anycast routers; perhaps there will be some interesting geographical bias in where corrupt requests come from once more data is available. I am also wondering whether there are other services besides HTTP{,S} running on googleusercontent.com - does anybody know if that's a thing?
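
For the logging side of a custom DNS server, the core is just pulling the queried name out of each packet. A hedged sketch using only the standard library; `parse_question` is a made-up helper, it only looks at the first question, and it does no bounds checking beyond the header, so a truncated packet raises an exception.

```python
import struct
from typing import Tuple


def parse_question(packet: bytes) -> Tuple[str, int]:
    """Return (qname, qtype) for the first question in a raw DNS query."""
    if len(packet) < 12:
        raise ValueError("packet shorter than the 12-byte DNS header")
    (qdcount,) = struct.unpack("!H", packet[4:6])
    if qdcount < 1:
        raise ValueError("query carries no question")
    pos = 12  # the question section starts right after the header
    labels = []
    while True:
        length = packet[pos]
        pos += 1
        if length == 0:
            break  # zero-length label terminates the name
        if length & 0xC0:
            raise ValueError("unexpected compression pointer in question")
        labels.append(packet[pos:pos + length].decode("ascii", "replace"))
        pos += length
    qtype, _qclass = struct.unpack("!HH", packet[pos:pos + 4])
    return ".".join(labels), qtype


# Hand-built query for lh3.example.test, type A (1), class IN (1).
query = (
    struct.pack("!HHHHHH", 0x1234, 0x0100, 1, 0, 0, 0)
    + b"\x03lh3\x07example\x04test\x00"
    + struct.pack("!HH", 1, 1)
)
print(parse_question(query))
```

Logging the name exactly as received (before any case folding) would also keep flips that only change letter case visible.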

+11 day update:

hits 228081
after filter 51
bot rate 99.98%
addrs filtered 634
UAs filtered 14
paths filtered 208

notable or interesting hits:
- a hit from a Facebook crawler
- several hits for GCP block storage downloads from Nepal
- several PageSpeed hits for Horse Talk
- a few hits from a Chromecast dongle with a *lot* of flips in the URL, poor thing must be really suffering
- a number of hits for user profile pictures from what I think is PUBG Mobile? game=ShadowTrackerExtra, engine=UE4, version=4.18.1-0+++UE4+Release-4.18, platform=IOS, osver=26.3.1
- some unknown Unity app? UnityPlayer/2019.4.40f1, libcurl/7.80.0-DEV
- classic Opera with the Presto rendering engine, on a 32-bit Linux machine in Egypt!
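
To put a number on "a lot of flips", the requested string can be compared against the intended one bit by bit. A small sketch, assuming both strings have the same byte length (flips change bits, not lengths); `bit_distance` is a hypothetical helper.

```python
def bit_distance(seen: str, intended: str) -> int:
    """Count differing bits between two equal-length strings (Hamming distance)."""
    a = seen.encode("utf-8")
    b = intended.encode("utf-8")
    if len(a) != len(b):
        raise ValueError("strings must have the same byte length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# 'g' (0x67) and 'c' (0x63) differ in a single bit.
print(bit_distance("coogle", "google"))
```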

getting paid next week and will pick up another batch of domains, which should hopefully increase the hit rate. as originally expected, it mostly seems to be mobile devices, but there have been a few desktops and servers in the data so far.