Today started out well: I got a bug report against iocaine that enabling the firewall results in send queues filling up. This is something I observed too, and was about to tackle today.

But the reporter made a suggestion: accept established & related connections, only drop new ones.

This allows requests in flight to be served normally, but new connections are dropped. This way, send queues do not fill up.
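In nftables terms, the suggestion looks roughly like this (a sketch only; the table, chain, and set names here are illustrative, not necessarily what iocaine generates):

```nft
# Sketch: accept anything belonging to an existing connection,
# and only drop *new* connections from blocked addresses.
table inet filter {
    set blocked {
        type ipv4_addr
        flags timeout
    }
    chain input {
        type filter hook input priority 0; policy accept;
        # Requests already in flight keep being served,
        # so send queues don't fill up.
        ct state established,related accept
        # Only brand-new connections from blocked IPs get dropped.
        ct state new ip saddr @blocked drop
    }
}
```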

The downside is that if a crawler makes multiple requests within the same connection, they'll all get served (poison). That's not a big problem, because the costliest part of the whole deal is the TLS handshake, and we're past that at this point. Crawlers typically don't make many requests over the same connection anyway.

I'll be deploying this on Eru shortly to see how it behaves. I have high hopes. If it works out, I'll scale Eru back.

New iocaine deployed, let's see how it performs!

Caddy's memory usage seems okay-ish so far; it's not using much more than before I enabled iocaine's firewalling. That's a good sign, because prior to the update, memory use shot up pretty rapidly.

It's... hm. Something's not quite right.
Meanwhile, I figured I could enable auto-merge, but ideally, I'd like to preserve my existing sets. Trying to export the current firewall so I can edit it to add auto-merge has been running for 15+ minutes.
At this point, I might just... flush it and start anew while the server is scaled up.
Yeah, ok, 30 minutes to export the sets is stupid. Let's flush this and re-create.

There we go! Request rate down from ~1k req/sec to ~150 req/sec within ~45 minutes. Caddy memory use is fine.

iocaine's CPU use is rather high, though. It's currently the dominating process on Eru. I suppose the firewalling isn't that cheap, and I'm still banning IPs at a high-ish rate.

Will have to profile this at some point. But for now, I'm glad it's working out well so far.

If all goes well, I will release #iocaine 3.4.0 on April 1st. A departure from my usual Friday releases, but... I can't not release on April 1st if I have the opportunity.

I think I need a graph for firewall blocks, to see if there's any correlation with CPU use. Because iocaine keeps hogging an entire core even though the request rate is down to ~150 req/sec; it's still using as much CPU as when the rate was around 800 req/sec.
@algernon err... I don't follow.. why would firewalling taking CPU lead to high CPU use of the iocaine process?

@petko That's the thing. It shouldn't. But when I turn iocaine's firewalling code on, CPU use goes up to 100%, seemingly regardless of the rate of incoming requests. It keeps spinning at high CPU even when the request rate is low.

Heck, with Caddy turned off, iocaine still uses ~10% CPU, even though it's supposed to be doing nothing at that point. I even kept Caddy off for 15 minutes, so iocaine could empty any queued firewall block requests, and it was still at it.

If I turn firewalling off, it's a lot quieter.

@algernon i wonder if collecting up all the addresses to be banned for a few seconds then adding them to the set with a single call to the equivalent of "nft add element" would perform any better than adding them one-per-call
@pho4cexa Marginally, perhaps. The high CPU use is probably not in the iocaine->nft parts, but somewhere else. Because it's spinning at ~10% CPU even with no traffic for 15 minutes, and nothing in its queue.

@algernon i don't have zero traffic, but i have close to it, about 0.3rps, and my iocaine 3.3.0 with firewall enabled uses about 2% average (where 100% = 4 cores at 100%).

when i watch it in htop it looks like it sits at 0% for most of the time, then hits maybe 6% at most for a moment before going back to 0

here's my stats dashboard, and posting it here will help me do a lil stress test to see what the cpu looks like with more requests

https://telemetry.orbital.rodeo


    @pho4cexa Hrm, then the problem is somewhere between 3.3.0 and main (I had to adjust how blocks work, because 3.3.0 fell over and crashed under high load¹). That's a good data point, thank you!

  • 1024+ open connections ↩︎

  • @algernon @pho4cexa If you attach gdb to it while it's spinning, take a backtrace, continue it, wait a few seconds, control-C, take another backtrace, repeat a few more times, what keeps showing up in the call chain?
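    That sampling loop is essentially the "poor man's profiler", and could look something like this (a sketch; the `pidof` lookup and iteration count are illustrative, and it needs permission to ptrace the process):

```sh
# Sample backtraces of the live process a few times, a few seconds apart,
# and eyeball which frames keep recurring across samples.
pid="$(pidof iocaine)"
for _ in 1 2 3 4 5; do
    gdb -p "$pid" -batch -ex 'set pagination off' -ex 'thread apply all bt'
    sleep 3
done
```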

    @zwol @pho4cexa

    #0  0x00007f0166d254e3 in recvmsg () from target:/nix/store/vr7ds8vwbl2fz7pr221d5y0f8n9a5wda-glibc-2.40-218/lib/libc.so.6
    #1  0x00007f0166e185f2 in mnl_socket_recvfrom () from target:/nix/store/pnlb8ndya1i91p1lsvl5ciwmcs52a3wg-libmnl-1.0.5/lib/libmnl.so.0
    #2  0x00007f0166f76f2c in nft_mnl_talk () from target:/nix/store/sapzih05dnr334d43zd9k0428j97hxay-nftables-1.1.5/lib/libnftables.so.1
    #3  0x00007f0166f7a535 in mnl_nft_setelem_get () from target:/nix/store/sapzih05dnr334d43zd9k0428j97hxay-nftables-1.1.5/lib/libnftables.so.1
    #4  0x00007f0166f82500 in netlink_list_setelems () from target:/nix/store/sapzih05dnr334d43zd9k0428j97hxay-nftables-1.1.5/lib/libnftables.so.1
    #5  0x00007f0166f548b3 in nft_cache_update () from target:/nix/store/sapzih05dnr334d43zd9k0428j97hxay-nftables-1.1.5/lib/libnftables.so.1
    #6  0x00007f0166f71fb2 in nft_evaluate () from target:/nix/store/sapzih05dnr334d43zd9k0428j97hxay-nftables-1.1.5/lib/libnftables.so.1
    #7  0x00007f0166f73395 in nft_run_cmd_from_buffer () from target:/nix/store/sapzih05dnr334d43zd9k0428j97hxay-nftables-1.1.5/lib/libnftables.so.1
    #8  0x0000560336529842 in std::sys::backtrace::__rust_begin_short_backtrace ()
    #9  0x00005603365277be in core::ops::function::FnOnce::call_once{{vtable.shim}} ()
    #10 0x0000560336d9d253 in std::sys::thread::unix::Thread::new::thread_start ()
    #11 0x00007f0166c9a97a in start_thread () from target:/nix/store/vr7ds8vwbl2fz7pr221d5y0f8n9a5wda-glibc-2.40-218/lib/libc.so.6
    #12 0x00007f0166d22af4 in clone () from target:/nix/store/vr7ds8vwbl2fz7pr221d5y0f8n9a5wda-glibc-2.40-218/lib/libc.so.6

    Looks like it's spinning around here. The other threads all show normal-ish activity (as in, they come up with different backtraces, as I'd expect them to). Thread 2 is always here.

    I should be able to reproduce this locally too, so I'll attack it with a combination of tokio-console and println debugging tomorrow.

    @algernon @pho4cexa That looks to me like it's just sent some sort of netlink message and is waiting for a reply — which means the code that's actually spinning is inside the kernel :(

    @zwol @pho4cexa Would that surface as the iocaine process using 100% CPU?

    ...actually... wait. This might be a PEBKAC.

    @zwol @pho4cexa NVM, fixed the PEBKAC (I had RUST_LOG=error,iocaine=trace set, so iocaine was logging all block requests to the journal), but it's still hogging a core.
    @algernon @pho4cexa I'm writing from my phone right now so I can't look at code but my first guess is, something about nftables IP set updates has quadratic cost (or worse, but that's unlikely)

    @zwol @pho4cexa nod

    That, or I'm doing it wrong (or both!). A recent observation: an incoming IP that should've been blocked still hadn't been added to the firewall 10 minutes later.

    I see the number of blocked IPs rise by ~4-5/sec, while it should rise by a lot more, judging by the incoming rate.

    I have a feeling I'm Doing It Wrong™. Feels like I have a single sender and a single receiver, and things spin until the sender is available. That'd explain what I'm seeing.

    I'll try shoving the send into a tokio task and a cloned sender, see if that helps. But sleep first!

    @zwol @pho4cexa Looking at my RAM graph... this feels like the right track: RAM keeps (very slowly) increasing, by a few hundred bytes/sec - roughly the growth I'd expect from the firewall backlog.
    @zwol @pho4cexa In this case, yep, batching would definitely help. I'll know more once repro'd at home.

    @zwol @pho4cexa Update on the IP address: it was sent for blocking at 00:06 UTC. It appeared in the nftables set at around 01:33 UTC - an almost 1.5-hour delay.

    If adding items is this slow, and the queue is so long, that explains the CPU usage.

    @algernon @pho4cexa weren't you going to bed?

    @zwol @pho4cexa I AM UNGOVERNABLE! NOT EVEN I CAN TELL ME WHAT TO DO! *throws a tantrum*

    (I got side tracked a little...)

    @algernon @pho4cexa I believe so, yes. A thread that's *running* (executing machine instructions) inside the kernel should be reported as R-state just like one that's executing instructions in user space. It's just unusual to actually see that for anything that isn't a kernel thread.