test request_builder {
    let request = RequestBuilder.new("GET", "")
        .user_agent("Mozilla/5.0")
        .build();

    if request.method() != "GET" { reject }
    if request.path() != "" { reject }
    if request.header("user-agent") != "Mozilla/5.0" { reject }

    accept
}

This'll work wonderfully well.
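For context, the builder exercised by that test follows the conventional fluent pattern. Here is a minimal Rust sketch of such a builder — all names and details hypothetical, not iocaine's actual implementation:

```rust
use std::collections::HashMap;

/// A minimal fluent request builder, mirroring the shape of the API
/// exercised by the test above. Hypothetical sketch only.
struct RequestBuilder {
    method: String,
    path: String,
    headers: HashMap<String, String>,
}

struct Request {
    method: String,
    path: String,
    headers: HashMap<String, String>,
}

impl RequestBuilder {
    fn new(method: &str, path: &str) -> Self {
        Self {
            method: method.into(),
            path: path.into(),
            headers: HashMap::new(),
        }
    }

    // Convenience wrapper for the common User-Agent header.
    fn user_agent(self, ua: &str) -> Self {
        self.header("user-agent", ua)
    }

    fn header(mut self, name: &str, value: &str) -> Self {
        // Header names are case-insensitive; store them lowercased.
        self.headers.insert(name.to_lowercase(), value.into());
        self
    }

    fn build(self) -> Request {
        Request {
            method: self.method,
            path: self.path,
            headers: self.headers,
        }
    }
}

impl Request {
    fn method(&self) -> &str { &self.method }
    fn path(&self) -> &str { &self.path }
    fn header(&self, name: &str) -> Option<&str> {
        self.headers.get(&name.to_lowercase()).map(String::as_str)
    }
}

fn main() {
    let request = RequestBuilder::new("GET", "/")
        .user_agent("Mozilla/5.0")
        .build();
    assert_eq!(request.method(), "GET");
    assert_eq!(request.path(), "/");
    assert_eq!(request.header("user-agent"), Some("Mozilla/5.0"));
    println!("ok");
}
```

Each method consumes and returns the builder, which is what lets the test chain the calls without intermediate variables.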

#iocaineDevLog

Right, so tests seem to be working wonderfully. Next step: enforcing tests at start & reload time, i.e., right after compiling, before running init().

What I'm unsure about is whether to make this mandatory, or to provide an opt-out.

#iocaineDevLog

Being a lazy hacker, the obvious answer here is: enforce it, add an opt-out if anyone asks for it.

#iocaineDevLog

Oh no. compiled.run_tests() outputs to stdout, and there's no way to make it go elsewhere at the moment. That's not good, because my stdout is supposed to contain JSON logs only.

So, I have two options:

  • Don't run tests automatically, let the operator run iocaine test whenever they want.
  • Do run tests automatically, but provide an opt-out, and modify my infra config to opt out.

On the one hand, I'm lazy. If I go with option 1, the feature is complete.

On the other hand, I really do love tests, and would like to run them every time the script changes.

    #iocaineDevLog

    Decided that I'm not running tests automatically yet. There are a few edge cases to figure out, apart from the whole deal with opting out of it to avoid sending non-log data to stdout stuff.
test detect_headless_browsers {
    let context = match tests.bootstrap() {
        Accept(v) -> v,
        _ -> reject,
    };
    let request = context.request.user_agent("Mozilla/5.0 HeadlessChrome").build();
    let outcome = detect.detect(context.config, request, context.metadata);

    if context.metadata.get("detected") != "headless-browser" { reject }

    accept
}

    Getting there. May add some more helpers, but this feels fine already.

    #iocaineDevLog

test detect_headless_browsers {
    let result = match tests.perform_detect("test-host", "test-path", "Mozilla/5.0 HeadlessChrome") {
        Accept(v) -> v,
        _ -> reject,
    };

    if result.metadata.get("detected") != "headless-browser" { reject }

    accept
}

    Even better.

    #iocaineDevLog

❯ cargo run -q -- -c tmp/config/tests.toml test
Test 1 / 1: pkg.nam_shub_of_enki.detect_headless_browsers... ok
Ran 1 tests, 1 succeeded, 0 failed

    Goosebumps!

    #iocaineDevLog

    Hrm. If I don't run init() before running tests, then my tests will need to run init(). But if my tests run init(), I would need to make it idempotent, or make sure it runs only once. That isn't exactly trivial to do.

    If I do run init() before tests, then I will not be able to "test" the init() method, but I won't have to worry about it running more than once, and my test cases will be considerably simpler, too.

    I think I'll just run init(). There's not much to test about that anyway, not anything that can't be tested after running it.
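The "make init() idempotent" alternative is worth a sketch, if only to show why having the host run it once is simpler. In Rust, the standard run-at-most-once guard is std::sync::Once — a hypothetical illustration, not iocaine's code:

```rust
use std::sync::Once;
use std::sync::atomic::{AtomicU32, Ordering};

// Guard ensuring the init body runs at most once per process.
static INIT: Once = Once::new();
// Counter used here only to demonstrate the run-once guarantee.
static INIT_RUNS: AtomicU32 = AtomicU32::new(0);

/// Idempotent wrapper: the expensive one-time setup runs at most once,
/// no matter how many tests call ensure_init(). Hypothetical sketch.
fn ensure_init() {
    INIT.call_once(|| {
        // ... expensive one-time setup would go here ...
        INIT_RUNS.fetch_add(1, Ordering::SeqCst);
    });
}

fn main() {
    ensure_init();
    ensure_init();
    ensure_init();
    // The body ran exactly once despite three calls.
    assert_eq!(INIT_RUNS.load(Ordering::SeqCst), 1);
    println!("init ran {} time(s)", INIT_RUNS.load(Ordering::SeqCst));
}
```

The catch is that every test entry point has to remember to call the wrapper, and the guard state lives for the whole process — exactly the kind of bookkeeping that disappears when the host runs init() once before the test suite.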

    #iocaineDevLog

❯ RUST_LOG=iocaine=trace cargo -q run -- -c tmp/config/tests.toml test
2025-06-21T15:32:04.539012Z DEBUG iocaine: loading configuration config_file="tmp/config/tests.toml"
2025-06-21T15:32:04.541005Z TRACE iocaine::means_of_production: compiling path="../nam-shub-of-enki/out"
2025-06-21T15:32:05.110072Z TRACE iocaine::means_of_production: compilation finished
2025-06-21T15:32:05.110120Z TRACE iocaine::means_of_production: running init
2025-06-21T15:32:05.165368Z TRACE iocaine::means_of_production: init finished
2025-06-21T15:32:05.165397Z INFO iocaine::means_of_production: Running tests path="../nam-shub-of-enki/out"
Test 1 / 2: pkg.nam_shub_of_enki.detect_headless_browsers... ok
Test 2 / 2: pkg.nam_shub_of_enki.detect_x_firefox_ai... ok
Ran 2 tests, 2 succeeded, 0 failed

    Nice.

    #iocaineDevLog

    With init() run by iocaine, and not needing to be run by the testsuite, the tests became a lot simpler too:

test detect_x_firefox_ai {
    let request = tests
        .make_request("test-host", "test-path")
        .header("x-firefox-ai", "1")
        .build();

    tests.assert_detect_metadata(request, "detected", "firefox-ai")
}

    Come to think of it... with another helper, I can simplify the last line to:

    tests.assert_detected(request, "firefox-ai")

    #iocaineDevLog

    I'm writing tests for nam-shub-of-enki, and am scouring my logs for real-world user agents. Some finds are... weird.

    Mozilla/45.0 (compatible; MSIE 6.0; Windows NT 5.1)

    Both from the future with that Mozilla/45.0, and from the past with MSIE and Windows NT 5.1!

    #iocaineDevLog

    Mozilla/4.61 [en] (OS/2; U)

    No, U.

    #iocaineDevLog

    Looking further, I'm seeing Mozilla/6.0, Mozilla/2.0, Mozilla/1.22, Mozilla/3.01Gold (Win95; I) and similar too.

    This was a useful exercise, as I found a couple of matches not caught by my filters yet.

    I mean...

    "Mozilla/2.0 (compatible; MSIE 3.0; Windows 3.1)"

    That's bold, man.
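The common thread in all of these fakes is the leading "Mozilla/X.Y" token: every modern browser reports Mozilla/5.0, so anything else is either genuinely ancient or fabricated. A rough std-only Rust sketch of the kind of version check such filters boil down to — hypothetical classification, the real filters use patterns and regexes:

```rust
/// Classify a User-Agent by its leading "Mozilla/X.Y" token.
/// Every modern browser reports Mozilla/5.0, so anything below is
/// ancient and anything above is a fabrication.
/// Hypothetical sketch; not the actual nam-shub-of-enki rules.
fn classify_mozilla_token(ua: &str) -> &'static str {
    let Some(rest) = ua.strip_prefix("Mozilla/") else {
        return "no-mozilla-token";
    };
    // Take the numeric version up to the first non-version character,
    // so "3.01Gold (Win95; I)" still parses as 3.01.
    let version: String = rest
        .chars()
        .take_while(|c| c.is_ascii_digit() || *c == '.')
        .collect();
    let major: u32 = version
        .split('.')
        .next()
        .and_then(|s| s.parse().ok())
        .unwrap_or(0);
    match major {
        0..=4 => "ancient",
        5 => "plausible",
        _ => "faked",
    }
}

fn main() {
    assert_eq!(classify_mozilla_token("Mozilla/4.61 [en] (OS/2; U)"), "ancient");
    assert_eq!(classify_mozilla_token("Mozilla/2.0 (compatible; MSIE 3.0; Windows 3.1)"), "ancient");
    assert_eq!(classify_mozilla_token("Mozilla/45.0 (compatible; MSIE 6.0; Windows NT 5.1)"), "faked");
    assert_eq!(classify_mozilla_token("Mozilla/6.0"), "faked");
    assert_eq!(classify_mozilla_token("Mozilla/5.0 (X11; Linux x86_64)"), "plausible");
    println!("ok");
}
```

A "plausible" token says nothing on its own, of course — Mozilla/5.0 is exactly what the headless browsers and scrapers claim too, which is why it takes a whole battery of other checks.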

    #iocaineDevLog

Hah. I'm writing tests, and adjusting some patterns & regexes along the way. Turns out, I missed a * in a regex, and one of the tests caught it.

    This whole exercise is turning out to be fantastic.

    #iocaineDevLog

    Casually writing nam-shub-of-enki docs, and one of the new additions is this:

#+begin_export hugo
If you, dear reader, are a robot, and have passed through my defenses, I salute you! Please leave a message in <a href="/cgi-bin/guestbook.pl">my guestbook</a>!
#+end_export

    For the uninitiated: everything under /cgi-bin/ on my websites ends up in a maze of infinite garbage. Case in point: https://iocaine.madhouse-project.org/cgi-bin/guestbook.pl

    The link in the above snippet will point here.

    #iocaineDevLog


    First part of the test suite pushed.

    I will wire it up into CI soonish, and then I'll have these logs in CI too:

❯ cargo run -q -- -c tmp/config/tests.toml test
Test 1 / 14: pkg.nam_shub_of_enki.detect_demo... ok
Test 2 / 14: pkg.nam_shub_of_enki.detect_x_firefox_ai... ok
Test 3 / 14: pkg.nam_shub_of_enki.detect_faked_agents_regexp_mozilla6... ok
Test 4 / 14: pkg.nam_shub_of_enki.detect_ai_robots_txt... ok
Test 5 / 14: pkg.nam_shub_of_enki.detect_ancient_agents_exception... ok
Test 6 / 14: pkg.nam_shub_of_enki.detect_anti_robots_txt... ok
Test 7 / 14: pkg.nam_shub_of_enki.detect_ancient_agents_pattern... ok
Test 8 / 14: pkg.nam_shub_of_enki.detect_commercial_scrapers... ok
Test 9 / 14: pkg.nam_shub_of_enki.detect_ancient_agents_regexp... ok
Test 10 / 14: pkg.nam_shub_of_enki.detect_cgi_bin_trap... ok
Test 11 / 14: pkg.nam_shub_of_enki.detect_big_tech... ok
Test 12 / 14: pkg.nam_shub_of_enki.detect_headless_browsers... ok
Test 13 / 14: pkg.nam_shub_of_enki.detect_faked_agents_regexp_mozilla45... ok
Test 14 / 14: pkg.nam_shub_of_enki.detect_archivers... ok
Ran 14 tests, 14 succeeded, 0 failed

    Makes me happy.

    It doesn't cover the Cookie Monster yet, because he found tests delicious, and ate his. I might let him have those as a treat.

    #iocaineDevLog

Cookie monster!

    Hrm. I'll need to rearrange the iocaine CI pipeline a bit. Whenever I build main, there's a ~20 minute gap where the latest binary packages don't exist, which makes it awkward to push things that might rely on them.

    Such as the nam-shub-of-enki CI workflow, because I do not want to compile iocaine there, nor do I want to add it to the nam-shub-of-enki flake as an input, just for the sake of the test suite.

    Although... I might end up doing that, so that nix building nam-shub-of-enki would run the test suite too.

    Anyway, that 20 minute gap is not ok, whether nam-shub-of-enki ends up using those binaries or not.

nam-shub-of-enki> Running phase: checkPhase
nam-shub-of-enki> 2025-06-21T20:40:37.293620Z ERROR iocaine::means_of_production::json: Error loading JSON file: No such file or directory (os error 2) file=""
nam-shub-of-enki> Test 1 / 14: pkg.nam_shub_of_enki.detect_anti_robots_txt... ok
nam-shub-of-enki> Test 2 / 14: pkg.nam_shub_of_enki.detect_ancient_agents_pattern... ok
nam-shub-of-enki> Test 3 / 14: pkg.nam_shub_of_enki.detect_ancient_agents_regexp... ok
nam-shub-of-enki> Test 4 / 14: pkg.nam_shub_of_enki.detect_demo... ok
nam-shub-of-enki> Test 5 / 14: pkg.nam_shub_of_enki.detect_ancient_agents_exception... ok
nam-shub-of-enki> Test 6 / 14: pkg.nam_shub_of_enki.detect_ai_robots_txt... ok
nam-shub-of-enki> Test 7 / 14: pkg.nam_shub_of_enki.detect_big_tech... ok
nam-shub-of-enki> Test 8 / 14: pkg.nam_shub_of_enki.detect_faked_agents_regexp_mozilla6... ok
nam-shub-of-enki> Test 9 / 14: pkg.nam_shub_of_enki.detect_faked_agents_regexp_mozilla45... ok
nam-shub-of-enki> Test 10 / 14: pkg.nam_shub_of_enki.detect_commercial_scrapers... ok
nam-shub-of-enki> Test 11 / 14: pkg.nam_shub_of_enki.detect_cgi_bin_trap... ok
nam-shub-of-enki> Test 12 / 14: pkg.nam_shub_of_enki.detect_archivers... ok
nam-shub-of-enki> Test 13 / 14: pkg.nam_shub_of_enki.detect_headless_browsers... ok
nam-shub-of-enki> Test 14 / 14: pkg.nam_shub_of_enki.detect_x_firefox_ai... ok
nam-shub-of-enki> Ran 14 tests, 14 succeeded, 0 failed

    Nice.

    Nevermind the error, that's just ai.robots.txt's robots.json missing, because I don't want to add that as a build dependency. I might end up adding a dummy one, though, for the tests.

    #iocaineDevLog

    Cookie monster!

    @algernon heh, wonder if those are made up by vuln-scanning botnets (they seem to love making up weird UAs and passing insane referers)
    @froztbyte Sounds plausible they would. They can fuck right off with the AI crawlers, though. :)
    @algernon tangential, but I often like to spoof inane user agents while poking around internal APIs at work, some instances of which even finding their way onto automated processes. Have yet to hear a word about that, though.

    @tevo For targeted research and scanning - sure, spoofing inane user agents is good.

    For large scale, internet-wide scanning and scraping websites? Nope, that sucks, a lot. (I'd argue that the entire practice of such scanning and scraping is bad to begin with, mind you...)

    Likewise for regular browsing.

    @algernon this actually looks like a plausible one…
    @mirabilos It would be, if it weren't for https. Such a user agent cannot connect to my sites, because it does not support modern TLS.

    @mirabilos it doesn't help that the same IP has been using randomized user agents within the same second against other hosts:

{
  "_msg": "handled request",
  "_stream": "{request.host=\"git.madhouse-project.org\",service=\"caddy\"}",
  "_stream_id": "00000000000000001b3e081ac456c39a34f52cd8ea25f36c",
  "_time": "2025-06-18T07:26:42.5005608Z",
  "request-id": "5620kHKF34na8anFgAJ3J",
  "request.host": "git.madhouse-project.org",
  "request.method": "GET",
  "request.referrer": "https://git.madhouse-project.org/index.php/Home/Uploadify/preview",
  "request.remote_ip": "134.122.207.53",
  "request.uri": "/index.php/Home/Uploadify/preview",
  "request.user_agent": "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.1; .NET CLR 1.1.4322; InfoPath.1; .NET CLR 2.0.50727)",
  "response.content_type": "text/html",
  "response.size": "273",
  "response.status": "200",
  "service": "caddy"
}
{
  "_msg": "handled request",
  "_stream": "{request.host=\"asylum.madhouse-project.org\",service=\"caddy\"}",
  "_stream_id": "0000000000000000fce513bf96ac7fa434d0e61a7656ce4c",
  "_time": "2025-06-18T07:26:42.640993535Z",
  "request-id": "cEHw6zNKoQUDcVs0yIbLu",
  "request.host": "asylum.madhouse-project.org",
  "request.method": "GET",
  "request.referrer": "https://asylum.madhouse-project.org/index.php/Home/Uploadify/preview",
  "request.remote_ip": "134.122.207.53",
  "request.uri": "/index.php/Home/Uploadify/preview",
  "request.user_agent": "Mozilla/4.61 [en] (OS/2; U)",
  "response.content_type": "text/html",
  "response.size": "1949",
  "response.status": "404",
  "service": "caddy"
}

    Though, judging by the access patterns, this is not an AI scraper. This is something trying to find vulnerabilities, or stuff to exploit.
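That tell — one IP presenting different user agents within the same second — can be flagged mechanically from logs like the ones above. A hypothetical Rust sketch of the heuristic (field names borrowed from the log entries, everything else made up for illustration):

```rust
use std::collections::HashMap;

/// One log line, reduced to the fields the heuristic needs.
struct LogEntry {
    remote_ip: String,
    user_agent: String,
    // Timestamp truncated to whole seconds; good enough for bucketing.
    unix_second: u64,
}

/// Flag IPs that present more than one distinct User-Agent within the
/// same one-second bucket. Hypothetical heuristic, not iocaine's logic.
fn flag_ua_randomizers(entries: &[LogEntry]) -> Vec<String> {
    // Bucket user agents by (ip, second).
    let mut seen: HashMap<(String, u64), Vec<&str>> = HashMap::new();
    for e in entries {
        seen.entry((e.remote_ip.clone(), e.unix_second))
            .or_default()
            .push(&e.user_agent);
    }
    // Keep buckets with more than one distinct agent.
    let mut flagged: Vec<String> = seen
        .into_iter()
        .filter(|(_, uas)| {
            let mut uas = uas.clone();
            uas.sort();
            uas.dedup();
            uas.len() > 1
        })
        .map(|((ip, _), _)| ip)
        .collect();
    flagged.sort();
    flagged.dedup();
    flagged
}

fn main() {
    let entries = vec![
        LogEntry { remote_ip: "134.122.207.53".into(),
                   user_agent: "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.1)".into(),
                   unix_second: 1750231602 },
        LogEntry { remote_ip: "134.122.207.53".into(),
                   user_agent: "Mozilla/4.61 [en] (OS/2; U)".into(),
                   unix_second: 1750231602 },
        LogEntry { remote_ip: "198.51.100.7".into(),
                   user_agent: "Mozilla/5.0 (X11; Linux x86_64)".into(),
                   unix_second: 1750231602 },
    ];
    let flagged = flag_ua_randomizers(&entries);
    assert_eq!(flagged, vec!["134.122.207.53".to_string()]);
    println!("flagged: {:?}", flagged);
}
```

In practice the requests land on different hosts, so this only works when the logs from all of them are aggregated in one place, as they are here.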

    @algernon ah. Mine support old TLS, too, so I need to allow them.
    @algernon I've seen browser extensions that randomize all the version numbers and OS identifiers in the user agent string. For example https://addons.mozilla.org/en-US/android/addon/random_user_agent/

    @zwol Well, users of that extension will probably find themselves in a maze of garbage then, if they visit my sites, unless the randomized UAs are modern user agent strings.

    I understand why they'd do that, trying to circumvent fingerprinting, but changing the user agent string and nothing else is not going to help much with that. It makes bot detection harder, however, so I'm not going to worry about them seeing garbage.

    In this case, it's not a human. Too many requests against too many hosts in a very short amount of time, hitting URLs that suggest it is looking to exploit vulnerabilities.

    @algernon Yeah I don't think it was all that carefully thought out.

    I have occasionally been tempted to write a similar extension except what it would do is completely suppress the UA header, with an option to put it back for specific sites. Or else minimize it to "User-Agent: Firefox" and nothing else (like Moz should've had the guts to do back in 2008 or so).

    @zwol tbh, I like that the user agent string is so... chaotic, because the UA randomizing bots end up with so much nonsense that it makes them filterable.

If the user agent was simply "Firefox", then there would be no variety, and I wouldn't be able to get rid of hundreds of thousands of fakes so easily.

    @algernon Definitely open an issue!
    @terts I can do better: I'll send a PR. Mostly done, just writing tests. 
    Allow running tests with custom output writer by algernon · Pull Request #185 · NLnetLabs/roto

    This implements a new run_tests_with_writer() method for Module (and a wrapper around it for Compiled), which takes an additional argument: an std::io::Write impl. This can be used to write the tes...

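The idea behind that run_tests_with_writer() addition — reporting through any std::io::Write instead of printing straight to stdout — can be sketched in isolation. A simplified, self-contained illustration of the pattern, not the actual roto API:

```rust
use std::io::Write;

/// A test-runner-shaped function that reports through any writer, so
/// the host application can keep stdout reserved for JSON logs.
/// Simplified sketch of the pattern, not the actual roto API; the
/// inline dummy tests stand in for compiled script tests.
fn run_tests_with_writer<W: Write>(out: &mut W) -> std::io::Result<(usize, usize)> {
    let tests: &[(&str, fn() -> bool)] = &[
        ("detect_headless_browsers", || true),
        ("detect_x_firefox_ai", || true),
    ];
    let mut ok = 0;
    for (i, (name, test)) in tests.iter().enumerate() {
        let passed = test();
        if passed { ok += 1; }
        writeln!(out, "Test {} / {}: {}... {}",
                 i + 1, tests.len(), name,
                 if passed { "ok" } else { "FAILED" })?;
    }
    writeln!(out, "Ran {} tests, {} succeeded, {} failed",
             tests.len(), ok, tests.len() - ok)?;
    Ok((ok, tests.len()))
}

fn main() -> std::io::Result<()> {
    // Capture the report in a buffer instead of writing to stdout.
    let mut buf: Vec<u8> = Vec::new();
    let (ok, total) = run_tests_with_writer(&mut buf)?;
    assert_eq!((ok, total), (2, 2));
    let report = String::from_utf8(buf).unwrap();
    assert!(report.contains("Ran 2 tests, 2 succeeded, 0 failed"));
    // stdout stays clean for structured logs.
    println!("{{\"msg\":\"tests finished\",\"ok\":{}}}", ok);
    Ok(())
}
```

With a writer parameter, the caller decides where the report goes: a buffer, a file, stderr, or a wrapper that re-emits each line as a structured log event — which is exactly what the stdout-must-be-JSON constraint from earlier needed.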