test request_builder {
    let request = RequestBuilder.new("GET", "")
        .user_agent("Mozilla/5.0")
        .build();

    if request.method() != "GET" { reject }
    if request.path() != "" { reject }
    if request.header("user-agent") != "Mozilla/5.0" { reject }

    accept
}

This'll work wonderfully well.
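For context, the builder exercised by that test follows the conventional fluent pattern. Here is a minimal Rust sketch of such a builder — all names and details hypothetical, not iocaine's actual implementation:

```rust
use std::collections::HashMap;

/// A minimal fluent request builder, mirroring the shape of the API
/// exercised by the test above. Hypothetical sketch only.
struct RequestBuilder {
    method: String,
    path: String,
    headers: HashMap<String, String>,
}

struct Request {
    method: String,
    path: String,
    headers: HashMap<String, String>,
}

impl RequestBuilder {
    fn new(method: &str, path: &str) -> Self {
        Self {
            method: method.into(),
            path: path.into(),
            headers: HashMap::new(),
        }
    }

    // Convenience wrapper for the common User-Agent header.
    fn user_agent(self, ua: &str) -> Self {
        self.header("user-agent", ua)
    }

    fn header(mut self, name: &str, value: &str) -> Self {
        // Header names are case-insensitive; store them lowercased.
        self.headers.insert(name.to_lowercase(), value.into());
        self
    }

    fn build(self) -> Request {
        Request {
            method: self.method,
            path: self.path,
            headers: self.headers,
        }
    }
}

impl Request {
    fn method(&self) -> &str { &self.method }
    fn path(&self) -> &str { &self.path }
    fn header(&self, name: &str) -> Option<&str> {
        self.headers.get(&name.to_lowercase()).map(String::as_str)
    }
}

fn main() {
    let request = RequestBuilder::new("GET", "/")
        .user_agent("Mozilla/5.0")
        .build();
    assert_eq!(request.method(), "GET");
    assert_eq!(request.path(), "/");
    assert_eq!(request.header("user-agent"), Some("Mozilla/5.0"));
    println!("ok");
}
```

Each method consumes and returns the builder, which is what lets the test chain the calls without intermediate variables.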

#iocaineDevLog

Right, so tests seem to be working wonderfully. Next step: enforcing tests at start & reload time, i.e., right after compiling, before running init().

What I'm unsure about is whether to make this mandatory, or to provide an opt-out.

#iocaineDevLog

Being a lazy hacker, the obvious answer here is: enforce it, add an opt-out if anyone asks for it.

#iocaineDevLog

Oh no. compiled.run_tests() outputs to stdout, and there's no way to make it go elsewhere at the moment. That's not good, because my stdout is supposed to contain JSON logs only.

So, I have two options:

  • Don't run tests automatically, let the operator run iocaine test whenever they want.
  • Do run tests automatically, but provide an opt-out, and modify my infra config to opt out.

On the one hand, I'm lazy. If I go with option 1, the feature is complete.

On the other hand, I really do love tests, and would like to run them every time the script changes.

    #iocaineDevLog

    Decided that I'm not running tests automatically yet. There are a few edge cases to figure out, apart from the whole deal with opting out of it to avoid sending non-log data to stdout stuff.
test detect_headless_browsers {
    let context = match tests.bootstrap() {
        Accept(v) -> v,
        _ -> reject,
    };
    let request = context.request.user_agent("Mozilla/5.0 HeadlessChrome").build();
    let outcome = detect.detect(context.config, request, context.metadata);

    if context.metadata.get("detected") != "headless-browser" { reject }

    accept
}

    Getting there. May add some more helpers, but this feels fine already.

    #iocaineDevLog

test detect_headless_browsers {
    let result = match tests.perform_detect("test-host", "test-path", "Mozilla/5.0 HeadlessChrome") {
        Accept(v) -> v,
        _ -> reject,
    };

    if result.metadata.get("detected") != "headless-browser" { reject }

    accept
}

    Even better.

    #iocaineDevLog

❯ cargo run -q -- -c tmp/config/tests.toml test
Test 1 / 1: pkg.nam_shub_of_enki.detect_headless_browsers... ok
Ran 1 tests, 1 succeeded, 0 failed

    Goosebumps!

    #iocaineDevLog

    Hrm. If I don't run init() before running tests, then my tests will need to run init(). But if my tests run init(), I would need to make it idempotent, or make sure it runs only once. That isn't exactly trivial to do.

    If I do run init() before tests, then I will not be able to "test" the init() method, but I won't have to worry about it running more than once, and my test cases will be considerably simpler, too.

    I think I'll just run init(). There's not much to test about that anyway, not anything that can't be tested after running it.
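The "make init() idempotent" alternative is worth a sketch, if only to show why having the host run it once is simpler. In Rust, the standard run-at-most-once guard is std::sync::Once — a hypothetical illustration, not iocaine's code:

```rust
use std::sync::Once;
use std::sync::atomic::{AtomicU32, Ordering};

// Guard ensuring the init body runs at most once per process.
static INIT: Once = Once::new();
// Counter used here only to demonstrate the run-once guarantee.
static INIT_RUNS: AtomicU32 = AtomicU32::new(0);

/// Idempotent wrapper: the expensive one-time setup runs at most once,
/// no matter how many tests call ensure_init(). Hypothetical sketch.
fn ensure_init() {
    INIT.call_once(|| {
        // ... expensive one-time setup would go here ...
        INIT_RUNS.fetch_add(1, Ordering::SeqCst);
    });
}

fn main() {
    ensure_init();
    ensure_init();
    ensure_init();
    // The body ran exactly once despite three calls.
    assert_eq!(INIT_RUNS.load(Ordering::SeqCst), 1);
    println!("init ran {} time(s)", INIT_RUNS.load(Ordering::SeqCst));
}
```

The catch is that every test entry point has to remember to call the wrapper, and the guard state lives for the whole process — exactly the kind of bookkeeping that disappears when the host runs init() once before the test suite.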

    #iocaineDevLog

❯ RUST_LOG=iocaine=trace cargo -q run -- -c tmp/config/tests.toml test
2025-06-21T15:32:04.539012Z DEBUG iocaine: loading configuration config_file="tmp/config/tests.toml"
2025-06-21T15:32:04.541005Z TRACE iocaine::means_of_production: compiling path="../nam-shub-of-enki/out"
2025-06-21T15:32:05.110072Z TRACE iocaine::means_of_production: compilation finished
2025-06-21T15:32:05.110120Z TRACE iocaine::means_of_production: running init
2025-06-21T15:32:05.165368Z TRACE iocaine::means_of_production: init finished
2025-06-21T15:32:05.165397Z INFO iocaine::means_of_production: Running tests path="../nam-shub-of-enki/out"
Test 1 / 2: pkg.nam_shub_of_enki.detect_headless_browsers... ok
Test 2 / 2: pkg.nam_shub_of_enki.detect_x_firefox_ai... ok
Ran 2 tests, 2 succeeded, 0 failed

    Nice.

    #iocaineDevLog

    With init() run by iocaine, and not needing to be run by the testsuite, the tests became a lot simpler too:

test detect_x_firefox_ai {
    let request = tests
        .make_request("test-host", "test-path")
        .header("x-firefox-ai", "1")
        .build();

    tests.assert_detect_metadata(request, "detected", "firefox-ai")
}

    Come to think of it... with another helper, I can simplify the last line to:

    tests.assert_detected(request, "firefox-ai")

    #iocaineDevLog

    I'm writing tests for nam-shub-of-enki, and am scouring my logs for real-world user agents. Some finds are... weird.

    Mozilla/45.0 (compatible; MSIE 6.0; Windows NT 5.1)

    Both from the future with that Mozilla/45.0, and from the past with MSIE and Windows NT 5.1!

    #iocaineDevLog

    Mozilla/4.61 [en] (OS/2; U)

    No, U.

    #iocaineDevLog

    Looking further, I'm seeing Mozilla/6.0, Mozilla/2.0, Mozilla/1.22, Mozilla/3.01Gold (Win95; I) and similar too.

    This was a useful exercise, as I found a couple of matches not caught by my filters yet.

    I mean...

    "Mozilla/2.0 (compatible; MSIE 3.0; Windows 3.1)"

    That's bold, man.
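The common thread in all of these fakes is the leading "Mozilla/X.Y" token: every modern browser reports Mozilla/5.0, so anything else is either genuinely ancient or fabricated. A rough std-only Rust sketch of the kind of version check such filters boil down to — hypothetical classification, the real filters use patterns and regexes:

```rust
/// Classify a User-Agent by its leading "Mozilla/X.Y" token.
/// Every modern browser reports Mozilla/5.0, so anything below is
/// ancient and anything above is a fabrication.
/// Hypothetical sketch; not the actual nam-shub-of-enki rules.
fn classify_mozilla_token(ua: &str) -> &'static str {
    let Some(rest) = ua.strip_prefix("Mozilla/") else {
        return "no-mozilla-token";
    };
    // Take the numeric version up to the first non-version character,
    // so "3.01Gold (Win95; I)" still parses as 3.01.
    let version: String = rest
        .chars()
        .take_while(|c| c.is_ascii_digit() || *c == '.')
        .collect();
    let major: u32 = version
        .split('.')
        .next()
        .and_then(|s| s.parse().ok())
        .unwrap_or(0);
    match major {
        0..=4 => "ancient",
        5 => "plausible",
        _ => "faked",
    }
}

fn main() {
    assert_eq!(classify_mozilla_token("Mozilla/4.61 [en] (OS/2; U)"), "ancient");
    assert_eq!(classify_mozilla_token("Mozilla/2.0 (compatible; MSIE 3.0; Windows 3.1)"), "ancient");
    assert_eq!(classify_mozilla_token("Mozilla/45.0 (compatible; MSIE 6.0; Windows NT 5.1)"), "faked");
    assert_eq!(classify_mozilla_token("Mozilla/6.0"), "faked");
    assert_eq!(classify_mozilla_token("Mozilla/5.0 (X11; Linux x86_64)"), "plausible");
    println!("ok");
}
```

A "plausible" token says nothing on its own, of course — Mozilla/5.0 is exactly what the headless browsers and scrapers claim too, which is why it takes a whole battery of other checks.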

    #iocaineDevLog

Hah. I'm writing tests, and adjusting some patterns & regexes along the way. Turns out, I missed a * in a regex, and one of the tests caught it.

    This whole exercise is turning out to be fantastic.

    #iocaineDevLog

    Casually writing nam-shub-of-enki docs, and one of the new additions is this:

#+begin_export hugo
If you, dear reader, are a robot, and have passed through my defenses, I salute you! Please leave a message in <a href="/cgi-bin/guestbook.pl">my guestbook</a>!
#+end_export

    For the uninitiated: everything under /cgi-bin/ on my websites ends up in a maze of infinite garbage. Case in point: https://iocaine.madhouse-project.org/cgi-bin/guestbook.pl

    The link in the above snippet will point here.

    #iocaineDevLog


    First part of the test suite pushed.

    I will wire it up into CI soonish, and then I'll have these logs in CI too:

❯ cargo run -q -- -c tmp/config/tests.toml test
Test 1 / 14: pkg.nam_shub_of_enki.detect_demo... ok
Test 2 / 14: pkg.nam_shub_of_enki.detect_x_firefox_ai... ok
Test 3 / 14: pkg.nam_shub_of_enki.detect_faked_agents_regexp_mozilla6... ok
Test 4 / 14: pkg.nam_shub_of_enki.detect_ai_robots_txt... ok
Test 5 / 14: pkg.nam_shub_of_enki.detect_ancient_agents_exception... ok
Test 6 / 14: pkg.nam_shub_of_enki.detect_anti_robots_txt... ok
Test 7 / 14: pkg.nam_shub_of_enki.detect_ancient_agents_pattern... ok
Test 8 / 14: pkg.nam_shub_of_enki.detect_commercial_scrapers... ok
Test 9 / 14: pkg.nam_shub_of_enki.detect_ancient_agents_regexp... ok
Test 10 / 14: pkg.nam_shub_of_enki.detect_cgi_bin_trap... ok
Test 11 / 14: pkg.nam_shub_of_enki.detect_big_tech... ok
Test 12 / 14: pkg.nam_shub_of_enki.detect_headless_browsers... ok
Test 13 / 14: pkg.nam_shub_of_enki.detect_faked_agents_regexp_mozilla45... ok
Test 14 / 14: pkg.nam_shub_of_enki.detect_archivers... ok
Ran 14 tests, 14 succeeded, 0 failed

    Makes me happy.

    It doesn't cover the Cookie Monster yet, because he found tests delicious, and ate his. I might let him have those as a treat.

    #iocaineDevLog

Cookie monster!

    Hrm. I'll need to rearrange the iocaine CI pipeline a bit. Whenever I build main, there's a ~20 minute gap where the latest binary packages don't exist, which makes it awkward to push things that might rely on them.

    Such as the nam-shub-of-enki CI workflow, because I do not want to compile iocaine there, nor do I want to add it to the nam-shub-of-enki flake as an input, just for the sake of the test suite.

    Although... I might end up doing that, so that nix building nam-shub-of-enki would run the test suite too.

    Anyway, that 20 minute gap is not ok, whether nam-shub-of-enki ends up using those binaries or not.

nam-shub-of-enki> Running phase: checkPhase
nam-shub-of-enki> 2025-06-21T20:40:37.293620Z ERROR iocaine::means_of_production::json: Error loading JSON file: No such file or directory (os error 2) file=""
nam-shub-of-enki> Test 1 / 14: pkg.nam_shub_of_enki.detect_anti_robots_txt... ok
nam-shub-of-enki> Test 2 / 14: pkg.nam_shub_of_enki.detect_ancient_agents_pattern... ok
nam-shub-of-enki> Test 3 / 14: pkg.nam_shub_of_enki.detect_ancient_agents_regexp... ok
nam-shub-of-enki> Test 4 / 14: pkg.nam_shub_of_enki.detect_demo... ok
nam-shub-of-enki> Test 5 / 14: pkg.nam_shub_of_enki.detect_ancient_agents_exception... ok
nam-shub-of-enki> Test 6 / 14: pkg.nam_shub_of_enki.detect_ai_robots_txt... ok
nam-shub-of-enki> Test 7 / 14: pkg.nam_shub_of_enki.detect_big_tech... ok
nam-shub-of-enki> Test 8 / 14: pkg.nam_shub_of_enki.detect_faked_agents_regexp_mozilla6... ok
nam-shub-of-enki> Test 9 / 14: pkg.nam_shub_of_enki.detect_faked_agents_regexp_mozilla45... ok
nam-shub-of-enki> Test 10 / 14: pkg.nam_shub_of_enki.detect_commercial_scrapers... ok
nam-shub-of-enki> Test 11 / 14: pkg.nam_shub_of_enki.detect_cgi_bin_trap... ok
nam-shub-of-enki> Test 12 / 14: pkg.nam_shub_of_enki.detect_archivers... ok
nam-shub-of-enki> Test 13 / 14: pkg.nam_shub_of_enki.detect_headless_browsers... ok
nam-shub-of-enki> Test 14 / 14: pkg.nam_shub_of_enki.detect_x_firefox_ai... ok
nam-shub-of-enki> Ran 14 tests, 14 succeeded, 0 failed

    Nice.

    Nevermind the error, that's just ai.robots.txt's robots.json missing, because I don't want to add that as a build dependency. I might end up adding a dummy one, though, for the tests.

    #iocaineDevLog

    Cookie monster!

    @algernon heh, wonder if those are made up by vuln-scanning botnets (they seem to love making up weird UAs and passing insane referers)
    @froztbyte Sounds plausible they would. They can fuck right off with the AI crawlers, though. :)
    @algernon tangential, but I often like to spoof inane user agents while poking around internal APIs at work, some instances of which even finding their way onto automated processes. Have yet to hear a word about that, though.

    @tevo For targeted research and scanning - sure, spoofing inane user agents is good.

    For large scale, internet-wide scanning and scraping websites? Nope, that sucks, a lot. (I'd argue that the entire practice of such scanning and scraping is bad to begin with, mind you...)

    Likewise for regular browsing.

    @algernon this actually looks like a plausible one…
    @mirabilos It would be, if it weren't for https. Such a user agent cannot connect to my sites, because it does not support modern TLS.

    @mirabilos it doesn't help that the same IP has been using randomized user agents within the same second against other hosts:

{
  "_msg": "handled request",
  "_stream": "{request.host=\"git.madhouse-project.org\",service=\"caddy\"}",
  "_stream_id": "00000000000000001b3e081ac456c39a34f52cd8ea25f36c",
  "_time": "2025-06-18T07:26:42.5005608Z",
  "request-id": "5620kHKF34na8anFgAJ3J",
  "request.host": "git.madhouse-project.org",
  "request.method": "GET",
  "request.referrer": "https://git.madhouse-project.org/index.php/Home/Uploadify/preview",
  "request.remote_ip": "134.122.207.53",
  "request.uri": "/index.php/Home/Uploadify/preview",
  "request.user_agent": "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.1; .NET CLR 1.1.4322; InfoPath.1; .NET CLR 2.0.50727)",
  "response.content_type": "text/html",
  "response.size": "273",
  "response.status": "200",
  "service": "caddy"
}
{
  "_msg": "handled request",
  "_stream": "{request.host=\"asylum.madhouse-project.org\",service=\"caddy\"}",
  "_stream_id": "0000000000000000fce513bf96ac7fa434d0e61a7656ce4c",
  "_time": "2025-06-18T07:26:42.640993535Z",
  "request-id": "cEHw6zNKoQUDcVs0yIbLu",
  "request.host": "asylum.madhouse-project.org",
  "request.method": "GET",
  "request.referrer": "https://asylum.madhouse-project.org/index.php/Home/Uploadify/preview",
  "request.remote_ip": "134.122.207.53",
  "request.uri": "/index.php/Home/Uploadify/preview",
  "request.user_agent": "Mozilla/4.61 [en] (OS/2; U)",
  "response.content_type": "text/html",
  "response.size": "1949",
  "response.status": "404",
  "service": "caddy"
}

    Though, judging by the access patterns, this is not an AI scraper. This is something trying to find vulnerabilities, or stuff to exploit.
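That tell — one IP presenting different user agents within the same second — can be flagged mechanically from logs like the ones above. A hypothetical Rust sketch of the heuristic (field names borrowed from the log entries, everything else made up for illustration):

```rust
use std::collections::HashMap;

/// One log line, reduced to the fields the heuristic needs.
struct LogEntry {
    remote_ip: String,
    user_agent: String,
    // Timestamp truncated to whole seconds; good enough for bucketing.
    unix_second: u64,
}

/// Flag IPs that present more than one distinct User-Agent within the
/// same one-second bucket. Hypothetical heuristic, not iocaine's logic.
fn flag_ua_randomizers(entries: &[LogEntry]) -> Vec<String> {
    // Bucket user agents by (ip, second).
    let mut seen: HashMap<(String, u64), Vec<&str>> = HashMap::new();
    for e in entries {
        seen.entry((e.remote_ip.clone(), e.unix_second))
            .or_default()
            .push(&e.user_agent);
    }
    // Keep buckets with more than one distinct agent.
    let mut flagged: Vec<String> = seen
        .into_iter()
        .filter(|(_, uas)| {
            let mut uas = uas.clone();
            uas.sort();
            uas.dedup();
            uas.len() > 1
        })
        .map(|((ip, _), _)| ip)
        .collect();
    flagged.sort();
    flagged.dedup();
    flagged
}

fn main() {
    let entries = vec![
        LogEntry { remote_ip: "134.122.207.53".into(),
                   user_agent: "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.1)".into(),
                   unix_second: 1750231602 },
        LogEntry { remote_ip: "134.122.207.53".into(),
                   user_agent: "Mozilla/4.61 [en] (OS/2; U)".into(),
                   unix_second: 1750231602 },
        LogEntry { remote_ip: "198.51.100.7".into(),
                   user_agent: "Mozilla/5.0 (X11; Linux x86_64)".into(),
                   unix_second: 1750231602 },
    ];
    let flagged = flag_ua_randomizers(&entries);
    assert_eq!(flagged, vec!["134.122.207.53".to_string()]);
    println!("flagged: {:?}", flagged);
}
```

In practice the requests land on different hosts, so this only works when the logs from all of them are aggregated in one place, as they are here.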

    @algernon ah. Mine support old TLS, too, so I need to allow them.
    @algernon I've seen browser extensions that randomize all the version numbers and OS identifiers in the user agent string. For example https://addons.mozilla.org/en-US/android/addon/random_user_agent/

    @zwol Well, users of that extension will probably find themselves in a maze of garbage then, if they visit my sites, unless the randomized UAs are modern user agent strings.

    I understand why they'd do that, trying to circumvent fingerprinting, but changing the user agent string and nothing else is not going to help much with that. It makes bot detection harder, however, so I'm not going to worry about them seeing garbage.

    In this case, it's not a human. Too many requests against too many hosts in a very short amount of time, hitting URLs that suggest it is looking to exploit vulnerabilities.

    @algernon Yeah I don't think it was all that carefully thought out.

    I have occasionally been tempted to write a similar extension except what it would do is completely suppress the UA header, with an option to put it back for specific sites. Or else minimize it to "User-Agent: Firefox" and nothing else (like Moz should've had the guts to do back in 2008 or so).

    @zwol tbh, I like that the user agent string is so... chaotic, because the UA randomizing bots end up with so much nonsense that it makes them filterable.

If the user agent was simply "Firefox", then there would be no variety, and I wouldn't be able to get rid of hundreds of thousands of fakes so easily.

    @algernon Definitely open an issue!
    @terts I can do better: I'll send a PR. Mostly done, just writing tests. 
    Allow running tests with custom output writer by algernon · Pull Request #185 · NLnetLabs/roto

    This implements a new run_tests_with_writer() method for Module (and a wrapper around it for Compiled), which takes an additional argument: an std::io::Write impl. This can be used to write the tes...

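The idea behind that run_tests_with_writer() addition — reporting through any std::io::Write instead of printing straight to stdout — can be sketched in isolation. A simplified, self-contained illustration of the pattern, not the actual roto API:

```rust
use std::io::Write;

/// A test-runner-shaped function that reports through any writer, so
/// the host application can keep stdout reserved for JSON logs.
/// Simplified sketch of the pattern, not the actual roto API; the
/// inline dummy tests stand in for compiled script tests.
fn run_tests_with_writer<W: Write>(out: &mut W) -> std::io::Result<(usize, usize)> {
    let tests: &[(&str, fn() -> bool)] = &[
        ("detect_headless_browsers", || true),
        ("detect_x_firefox_ai", || true),
    ];
    let mut ok = 0;
    for (i, (name, test)) in tests.iter().enumerate() {
        let passed = test();
        if passed { ok += 1; }
        writeln!(out, "Test {} / {}: {}... {}",
                 i + 1, tests.len(), name,
                 if passed { "ok" } else { "FAILED" })?;
    }
    writeln!(out, "Ran {} tests, {} succeeded, {} failed",
             tests.len(), ok, tests.len() - ok)?;
    Ok((ok, tests.len()))
}

fn main() -> std::io::Result<()> {
    // Capture the report in a buffer instead of writing to stdout.
    let mut buf: Vec<u8> = Vec::new();
    let (ok, total) = run_tests_with_writer(&mut buf)?;
    assert_eq!((ok, total), (2, 2));
    let report = String::from_utf8(buf).unwrap();
    assert!(report.contains("Ran 2 tests, 2 succeeded, 0 failed"));
    // stdout stays clean for structured logs.
    println!("{{\"msg\":\"tests finished\",\"ok\":{}}}", ok);
    Ok(())
}
```

With a writer parameter, the caller decides where the report goes: a buffer, a file, stderr, or a wrapper that re-emits each line as a structured log event — which is exactly what the stdout-must-be-JSON constraint from earlier needed.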