hi there rust peeps

does anyone have any clue why a tokio runtime would have a hard cap on the number of tasks at 32,768?

once this limit is hit, new tasks silently fail to spawn, and existing tasks hit "device or resource busy" errors when they need to use IO

no ulimits have been hit as far as i can tell (83 open files, well under limit of 16,384, and all other limits are either unlimited or unrelated to this issue)

given the failure mode is completely silent i have no clue where to start debugging this
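
for double-checking the ulimit side from inside the affected process (rather than from a shell, which may run with different limits), here is a minimal sketch, assuming Linux and a readable `/proc`. The names `Limit`, `parse_limits`, `read_own_limits` and `open_fd_count` are made up for illustration, and it only covers per-process rlimits, so it may well not reveal where the 32,768 comes from.

```rust
use std::fs;
use std::io;

/// One row of /proc/self/limits. `None` means the kernel reported "unlimited".
/// (Hypothetical helper type, not part of tokio or std.)
#[derive(Debug, PartialEq)]
pub struct Limit {
    pub name: String,
    pub soft: Option<u64>,
    pub hard: Option<u64>,
}

/// True for tokens that can appear in the soft/hard limit columns.
fn is_value(token: &str) -> bool {
    token == "unlimited" || token.parse::<u64>().is_ok()
}

/// Parse text in the format of /proc/self/limits.
/// Rows without a value column (such as the header row) are skipped.
pub fn parse_limits(text: &str) -> Vec<Limit> {
    text.lines()
        .filter_map(|line| {
            let tokens: Vec<&str> = line.split_whitespace().collect();
            // The limit name is every token before the first value token.
            let idx = tokens.iter().position(|&t| is_value(t))?;
            let soft = tokens.get(idx)?;
            let hard = tokens.get(idx + 1)?;
            Some(Limit {
                name: tokens[..idx].join(" "),
                soft: soft.parse().ok(), // "unlimited" fails to parse and becomes None
                hard: hard.parse().ok(),
            })
        })
        .collect()
}

/// Read the limits of the current process (Linux only).
pub fn read_own_limits() -> io::Result<Vec<Limit>> {
    Ok(parse_limits(&fs::read_to_string("/proc/self/limits")?))
}

/// Count the file descriptors currently open in this process (Linux only).
/// The count includes the descriptor used to list the directory itself.
pub fn open_fd_count() -> io::Result<usize> {
    Ok(fs::read_dir("/proc/self/fd")?.count())
}
```

logging `read_own_limits()` and `open_fd_count()` from a task while the process is stuck would show whether any soft limit sits near the values you observe.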

@niko I looked into this and it seems the reason is that if all runtime threads are already busy, tokio will push tasks onto a queue, and apparently the size of that queue is bounded now.

not exactly a full solution, but I found an unstable tokio method of spawning a task that returns an error instead of panicking: https://docs.rs/tokio/latest/tokio/task/struct.Builder.html#method.spawn

(side note: goddamn, tokio is such a rust project. this API has been in there for years and there’s no sign of stabilisation.)

@follpvosten just checked runtime metrics, the global queue depth never goes above 0 according to https://docs.rs/tokio/1.49.0/tokio/runtime/struct.RuntimeMetrics.html#method.global_queue_depth

@niko oh, you’re spawning tasks from outside the runtime?
@follpvosten no it's all from inside the runtime, is there another metric for that

@follpvosten problem here is just that the entire process kinda goes unresponsive, unless you get lucky enough to have your request spawn properly, in which case it just works

i was able to correlate it with the runtime metrics and found out task count flatlined just below 32768 every time this happened so it's a pretty strong indicator

@follpvosten anyways i found another metric for worker local queue depth https://docs.rs/tokio-metrics/latest/tokio_metrics/struct.RuntimeMetrics.html#structfield.total_local_queue_depth and that never goes above 2 so i don't think that's it either

total number of live tasks (https://docs.rs/tokio-metrics/latest/tokio_metrics/struct.RuntimeMetrics.html#structfield.live_tasks_count) is bounded at 32768, which is my main symptom

i have access to all metrics so i'm just poking around to see

@niko I think I’m gonna check the spawn source code, maybe that’ll give me some insight 🤔
@niko I figured out where the number seems to come from at least: https://docs.rs/tokio/latest/src/tokio/runtime/task/list.rs.html#231

@follpvosten hmm i'm not entirely sure? num worker threads is 16, host system has 8 physical/16 logical cores and none of those numbers match nicely
@follpvosten unless there's something else going on up in the call stack i missed
@niko hmm yeah. MAX_SHARED_LIST_SIZE is 65k too, so it shouldn’t be below that…
@niko hmmm, no idea, I’ll look for it
@niko from the API docs it looks like there’s only ever runtime-scope metrics, no thread-local metrics :/
@niko I’m getting more and more intrigued to learn about the X problem here