hi there rust peeps

does anyone have any clue why a tokio runtime would have a hard cap on the number of tasks at 32,768?

hitting this limit silently fails to spawn new tasks and existing tasks hit "device or resource busy" errors when they need to use IO

no ulimits have been hit as far as i can tell (83 open files, well under limit of 16,384, and all other limits are either unlimited or unrelated to this issue)
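(for anyone wanting to repeat that check: a Linux-only sketch that counts a process's open fds via /proc and reads the soft limit from /proc/self/limits — the /proc paths are the assumption here, swap self for a pid to inspect another process)

```rust
use std::fs;

fn main() -> std::io::Result<()> {
    // Linux-only: each entry in /proc/self/fd is one open file descriptor.
    let open = fs::read_dir("/proc/self/fd")?.count();

    // The soft limit is the 4th whitespace-separated field on the
    // "Max open files" line of /proc/self/limits.
    let limits = fs::read_to_string("/proc/self/limits")?;
    let soft = limits
        .lines()
        .find(|l| l.starts_with("Max open files"))
        .and_then(|l| l.split_whitespace().nth(3).map(str::to_owned));

    println!("open fds: {open}, soft limit: {soft:?}");
    Ok(())
}
```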

given the failure mode is completely silent i have no clue where to start debugging this

@niko seems like a funny, interesting magic number...
@k yeah i have no clue why it was picked, funny magic number

@niko  This smells like a signed 16-bit-shaped thing. And very generic IO errors smell like a C API error. Maybe it's hitting an implementation limit in one of the C-level primitives it uses, like epoll or something?

I dunno just a stab in the dark

@niko that's 1 + signed 16 bit max, right? probably an internal buffer indexed with that type then
@SRAZKVT well that's the problem i have zero clue where to start looking for limits like this
@niko debugger ? cargo rebuilds everything anyway, surely you can ask cargo to add debug info to deps too right ?
@SRAZKVT my config includes debug info yes but that still doesn't help because this is like finding a needle in a haystack with how big my codebase is

@niko

power of two?

32k is half of 64k.

Are they using a 16 bit counter?

@niko I looked into this and it seems the reason is that if all runtime threads are already busy, tokio will push tasks onto a queue, and apparently the size of that queue is bounded now.

not exactly a full solution, but I found an unstable tokio method of spawning a task that returns an error instead of panicking: https://docs.rs/tokio/latest/tokio/task/struct.Builder.html#method.spawn

(side note: goddamn, tokio is such a rust project. this API has been in there for years and there’s no sign of stabilisation.)

@follpvosten just checked runtime metrics, the global queue depth never goes above 0 according to https://docs.rs/tokio/1.49.0/tokio/runtime/struct.RuntimeMetrics.html#method.global_queue_depth

@niko oh, you’re spawning tasks from outside the runtime?
@follpvosten no it's all from inside the runtime, is there another metric for that

@follpvosten problem here is that the entire process kinda goes unresponsive, unless you get lucky enough to have your request spawn properly, in which case it just works

i was able to correlate it with the runtime metrics and found out task count flatlined just below 32768 every time this happened so it's a pretty strong indicator

@follpvosten anyways i found another metric for worker local queue depth https://docs.rs/tokio-metrics/latest/tokio_metrics/struct.RuntimeMetrics.html#structfield.total_local_queue_depth and that never goes above 2 so i don't think that's it either

total number of live tasks (https://docs.rs/tokio-metrics/latest/tokio_metrics/struct.RuntimeMetrics.html#structfield.live_tasks_count) is bounded at 32768, which is my main symptom

i have access to all metrics so i'm just poking around to see

@niko I think I’m gonna check the spawn source code, maybe that’ll give me some insight  πŸ€”
@niko I figured out where the number seems to come from at least: https://docs.rs/tokio/latest/src/tokio/runtime/task/list.rs.html#231

@follpvosten hmm i'm not entirely sure? num worker threads is 16, host system has 8 physical/16 logical cores and none of those numbers match nicely
@follpvosten unless there's something else going on up in the call stack i missed
@niko hmm yeah. MAX_SHARED_LIST_SIZE is 65k too, so it shouldn’t be below that…
@niko hmmm, no idea, I’ll look for it
@niko from the API docs it looks like there’s only ever runtime-scope metrics, no thread-local metrics :/
@niko Iβ€˜m getting more and more intrigued to learn about the X problem here πŸ˜…
@[email protected] not rust particularly but maybe its open files limit or something on your OS?

(In openbsd this is set in /etc/login.conf, I have no idea where linux would have hidden these settings nowadays)

@brettm edit didn't federate to your instance i assume

i checked open files at the time of this issue, 83 open files against a limit of 16,384

no other limits were near their cap/others were unlimited

@[email protected] oh no i did not see that. snac does not seem to bother with edits πŸ™‚
@niko Have you tried turning off io_uring if you use that?
@natty i do not use that
@niko are you perhaps watching filesystem changes or something? Is it Linux?
@natty this is linux, this is basically exclusively network IO, little to nil disk usage
@niko What do the spawned tasks do?
@natty mostly ship off data to other services and then wait for that to be processed
blocking CPU bound stuff is done on sync threads (spawned separately from the runtime) with queue communication between the two (it’s audio decoding); all of this is in a separate library, and i don’t know of any user at a larger scale than mine with any issue like this
@niko silly question but did you perhaps run out of ports?

@natty

There are 32k ephemeral port numbers, so… πŸ€”

Edit: Wait no, the default on Linux is 28232 ephemeral ports, i.e. [32768, 60999]. Also, there are probably at least a few other processes using at least a few ephemeral ports at any given time, so you wouldn't get all of them to yourself.
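(quick sanity check of that arithmetic — the default range comes from /proc/sys/net/ipv4/ip_local_port_range, and both endpoints are inclusive)

```rust
fn main() {
    // Default Linux ephemeral port range, inclusive on both ends
    // (see /proc/sys/net/ipv4/ip_local_port_range).
    let (lo, hi) = (32_768u32, 60_999u32);
    let ports = hi - lo + 1;
    assert_eq!(ports, 28_232);
    println!("{ports} ephemeral ports available by default");
}
```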

@niko looking at where tokio uses i16 internally could be a start
@nCrazed hmm true i’ll do that in the morning thanks for the idea
@niko i asked @iczero about this and he was able to spawn 512k tasks fine apparently. he suggested you might be limited by cgroups and not ulimit

@unnick @iczero yeah i'm unsurprised at this because i've spawned >100 million tasks before just fine

i'm not quite sure what would be limiting here with cgroups, got any ideas i can chase on this one?

@unnick @iczero the behaviour i observe as well doesn't quite match up to a user-level limit because it's just this process that's limited, anything else running on the whole system continues to function normally
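(one cheap thing to rule out on the cgroup front: the v2 pids controller caps threads/processes in a cgroup. tokio tasks aren't OS threads, so it probably isn't the culprit, but it's quick to check. a sketch, assuming cgroup v2 mounted at /sys/fs/cgroup:)

```rust
use std::fs;

fn main() {
    // /proc/self/cgroup looks like "0::/user.slice/..." on cgroup v2;
    // the part after the last ':' is this process's cgroup path.
    let rel = fs::read_to_string("/proc/self/cgroup")
        .ok()
        .and_then(|s| s.trim().rsplit(':').next().map(str::to_owned))
        .unwrap_or_else(|| "/".into());

    let path = format!("/sys/fs/cgroup{rel}/pids.max");
    match fs::read_to_string(&path) {
        // "max" means unlimited; a number is the thread/process cap.
        Ok(max) => println!("pids.max = {}", max.trim()),
        Err(e) => println!("couldn't read {path}: {e}"),
    }
}
```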

@niko
32,768 is 2^15, i.e. exactly one past the max for an i16 numeric (32,767): https://doc.rust-lang.org/reference/types/numeric.html

So I'm guessing a task counter/manager somewhere in tokio is i16?
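(the arithmetic behind that guess, as a quick stdlib-only check:)

```rust
fn main() {
    // 32,768 is exactly one past i16::MAX, and exactly 1 << 15.
    assert_eq!(i16::MAX, 32_767);
    assert_eq!(i16::MAX as i32 + 1, 32_768);
    assert_eq!(1i32 << 15, 32_768);
    // an i16 counter would wrap (or panic in debug builds) at this boundary:
    assert_eq!(i16::MAX.wrapping_add(1), i16::MIN);
}
```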
