Mastodawn

raspberry Jun 8

@slink thank you so much for this talk! super interesting to see a completely different approach to this subject.

Do you think some sort of .well-known endpoint or open standard for applications to expose their valid paths, methods and headers to a fronting WAF would work? Then again, I figure if you're going to modify an application to provide this, you might as well implement HTTP Signatures (I'd vaguely heard about these but didn't know you could use them this way).

Maybe web frameworks (for example Django, because I'm familiar with it) could provide an export that could be uploaded to a CDN/WAF to create a base ruleset? Do you have any thoughts on a format specification?

Thanks for your work on Vinyl Cache, cheers!

raspberry Jun 8

@slink oh, I'd also like to know your opinion/thoughts on the recent bot-scraper WAFs like Anubis https://anubis.techaro.lol/ or Iocaine https://iocaine.madhouse-project.org/ (I know these probably fall under "bad/negative" traffic WAF blocking, but I think some of their methods have some merit). Thanks again!

@theraspb So, regarding Anubis: Ideally, I would like to write an article similar to the one on Iocaine, including the "here's how to do the same in Vinyl Cache" part, but in this case that involves developing some JS to be run in the browser, and this triggers a defense reflex, because I really don't like JS. So I am not sure if I will get around to it, and will instead try to give a comparatively short response:

The central problem of all "crawler defense" techniques is to identify either ...

@theraspb ... legitimate users (to allow access) or crawlers (to block or otherwise "tame"). In the Iocaine article I have tried to explain why identifying crawlers is impossible in the general case, and that where detection does work, it is more a case of "the crawler is too dumb to properly disguise itself". Sure enough, if a crawler _wants_ to be identified as such, there are good ways (DNS, IP lists, signed requests), but it's generally not hard for a crawler to pretend to be a legitimate anonymous surfer with respect to its HTTP headers.
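For the "crawler wants to be identified" case, the DNS route is commonly done as a forward-confirmed reverse DNS check. The following is only a minimal sketch of that idea, not something taken from the Iocaine article: the function names and the `crawler.example` suffix are made up for illustration, and the resolver functions are passed in as parameters so the logic can be tried without network access.

```python
import socket
from typing import Callable, Iterable


def reverse_lookup(ip: str) -> str:
    """Return the PTR host name for an IP address (uses the system resolver)."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(host: str) -> Iterable[str]:
    """Return all addresses the host name resolves to (uses the system resolver)."""
    return {info[4][0] for info in socket.getaddrinfo(host, None)}


def is_declared_crawler(
    ip: str,
    allowed_suffixes: Iterable[str],
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], Iterable[str]] = forward_lookup,
) -> bool:
    """Forward-confirmed reverse DNS: PTR name must match an allowed
    domain suffix AND resolve back to the connecting IP."""
    try:
        host = reverse(ip).rstrip(".").lower()
    except OSError:  # covers socket.herror / socket.gaierror
        return False
    # Hypothetical policy: only hosts below one of the operator's domains count.
    if not any(host.endswith("." + suffix) for suffix in allowed_suffixes):
        return False
    try:
        return ip in forward(host)
    except OSError:
        return False
```

Note that this only recognises crawlers that cooperate; a crawler that wants to hide simply fails the check and looks like any other anonymous client.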

@theraspb So back to the first question: Can we maybe identify legitimate users better than crawlers? Ultimately, we cannot peek out of the user's screen and see if there's a person there, but Google and Apple are trying things in this direction with their attestation stuff, which essentially boils down to trusting secrets burnt into hardware, which in turn means users no longer have control over their devices. Scanning ID cards, faces and the like are similar attempts...

@theraspb which also relate to the "age verification" mess. IMHO, this all ultimately leads to a centralized internet with walls everywhere.
If we want to avoid that route and still identify legitimate users, one option is to require a login. Then the question is how hard it is to register an account and whether account/cookie sharing is limited. But this also excludes anonymous users.
Some intermediate approaches try to identify if the connecting user agent "is a browser". A very simple form is to ...

@theraspb simply check if the user agent supports cookies or cache validation. As many crawlers do support these, people resorted to checking if JavaScript works, and crawlers adapted.
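As an illustration of the cookie variant, here is a minimal sketch of a "does the client send cookies back" probe. It is an assumption-laden toy, not how any particular WAF or Vinyl Cache setup does it: `SECRET`, the cookie name `probe` and the `respond` function are all hypothetical, and a real deployment would at least need to cap the redirect loop for clients that never return the cookie.

```python
import hashlib
import hmac

SECRET = b"change-me"  # hypothetical server-side secret
COOKIE = "probe"       # hypothetical cookie name


def token_for(client_ip: str) -> str:
    """Derive a per-client token so a cookie cannot be shared across IPs."""
    return hmac.new(SECRET, client_ip.encode(), hashlib.sha256).hexdigest()


def parse_cookies(header: str) -> dict:
    """Very small parser for a Cookie request header ("a=1; b=2")."""
    pairs = (item.split("=", 1) for item in header.split(";") if "=" in item)
    return {name.strip(): value.strip() for name, value in pairs}


def respond(client_ip: str, path: str, cookie_header: str = ""):
    """Return (status, headers): 200 if the probe cookie came back,
    otherwise set the cookie and redirect to the same path."""
    expected = token_for(client_ip)
    presented = parse_cookies(cookie_header).get(COOKIE, "")
    if hmac.compare_digest(presented.encode(), expected.encode()):
        return 200, {}
    return 307, {
        "Location": path,
        "Set-Cookie": f"{COOKIE}={expected}; Path=/; HttpOnly",
    }
```

A client that stores and returns the cookie gets through on the second request; one that ignores cookies keeps being redirected. Any crawler built on an ordinary HTTP library with a cookie jar passes this trivially, which is the adaptation mentioned above.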

Anubis implements this idea combined with a proof of work: The client is tasked to find a hash collision by running JavaScript code, which incurs a relevant cost in terms of CPU time. If all goes well, crawlers will not invest that cost and stay out, but, IMHO, this model is clearly not sustainable:

@theraspb It requires users to enable JavaScript and burns CPU time (=energy), which is, I think, the wrong signal: The web should work without JS.
Also, the cost of proof of work goes down considerably if you use specialised hardware (GPUs, ASICs, see also bitcoin miners), so determined crawlers will pay less than well-meaning users.
My personal opinion is that the proof of work approach is still not good enough, and I have some ideas which I want to talk about later…
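
To make the cost argument concrete, here is a minimal hashcash-style proof-of-work sketch. It is written under assumptions and is not Anubis's actual code or protocol: the challenge string, the "leading zero bits of SHA-256" rule and all function names are chosen for illustration only.

```python
import hashlib
import itertools


def leading_zero_bits(digest: bytes) -> int:
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Server side: a single hash is enough to check a solution."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty_bits


def solve(challenge: str, difficulty_bits: int) -> int:
    """Client side: brute-force nonces until one satisfies the difficulty."""
    for nonce in itertools.count():
        if verify(challenge, nonce, difficulty_bits):
            return nonce


def expected_attempts(difficulty_bits: int) -> int:
    """On average, 2**difficulty_bits hashes are needed to find a solution."""
    return 2 ** difficulty_bits


nonce = solve("example-challenge", 12)
print(nonce, verify("example-challenge", nonce, 12), expected_attempts(12))
```

The asymmetry is the whole point: verifying costs one hash, solving costs about `2**difficulty_bits` hashes. But that count is the same for everyone, so whoever computes hashes fastest and cheapest pays the least per page, and that is the operator with GPUs or ASICs rather than the visitor on a phone.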