Dear parallel programming gurus, I know there's a couple of you out there that might see this.

I have a project idea where I have a bunch of processes that need to run in lockstep. Each step is fairly lightweight.

Traditionally I'd just single-thread this, but I want it to run a bit faster. Would it make sense to run, say, a dozen threads, or does the communication overhead kill the whole idea?

Ideally the steps would run at tens of millions of Hz.

@sol_hsa Any communication or synchronisation will slow it down, and the point of diminishing returns will depend on how much of it there is and how much overhead that adds. For example (pulling some numbers out of my ass), you might not benefit from more than 4 threads even if there are more cores available, and running those 4 threads instead of one might only increase performance by 1.5x.

@maia Thing is, each of these "machines" has its own state, loading/saving of which is probably also an overhead.. keeping the state in its own thread might be a win.

What I'm worried about is that the workload per "step" is too light for there to be much benefit.

I do remember seeing the theoretical perf increase graphs, but they are based on a lot of assumptions too..
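(Those graphs are presumably some flavour of Amdahl's law, which assumes a fraction p of the work parallelizes perfectly and that synchronization is free: speedup(n) = 1 / ((1 - p) + p / n) on n threads, so with p = 0.75 the speedup never exceeds 4x however many cores you add. Real lockstep sync only pushes that curve down further.)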

@sol_hsa Yeah, it’s possible that the overhead is too great to benefit at all. To continue my example from before: if you were to add a 5th thread, that could decrease overall performance again, from 1.5x to 1.4x. If the overhead is too great, that could already happen at two threads compared to the single-threaded version. Hard to say for sure without benchmarking it.
@maia I guess I'll have to just make a test and see what happens.

@sol_hsa

> Ideally the steps would run at tens of millions of Hz.

Almost sounds emulator-related? I had been thinking a bit about running the different chips in an 8-bit home computer (e.g. 1..4 MHz clock) in separate threads, but the threads would need to sync up after each clock cycle (basically all chip-emus need to wait for the slowest chip-emu to complete its cycle), and syncing threads at such high frequency just doesn't sound right :D

@sol_hsa ...I never got past the "just doesn't feel right" phase and into the "let's try it out" phase though...
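As a rough illustration of that lockstep structure (hypothetical chip types, not any real emulator's API), the single-threaded version is just one tick loop in which every chip advances one cycle per iteration, and that loop body is what would have to become a cross-thread sync point millions of times per second:

```
#include <cstdint>

// Hypothetical chip emulations -- stand-ins, not a real emulator API.
struct CPU { void tick() { /* advance one clock cycle */ } };
struct VDP { void tick() { /* advance one clock cycle */ } };
struct PSG { void tick() { /* advance one clock cycle */ } };

int main() {
    CPU cpu; VDP vdp; PSG psg;
    // Single-threaded lockstep: after each iteration every chip has advanced
    // exactly one cycle, so cross-chip communication can happen between ticks.
    for (uint64_t cycle = 0; cycle < 1'000'000; ++cycle) {
        cpu.tick();
        vdp.tick();
        psg.tick();
    }
}
```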
@floooh @sol_hsa Yeah… I dunno. Syncing even 2 threads every hundred or so clock cycles sounds like a non-starter.
@slembcke @floooh @sol_hsa
I wonder how much overhead a thread barrier has, probably not (much) less than 100 cycles?

@Doomed_Daniel @slembcke @floooh @sol_hsa what do you mean by thread barrier there? I think if you wanted this you'd need to pin each thread to its own core, then spin to advance a counter, with each thread stepping forward by relaxed-loading that one counter. But it's probably not going to be a very good idea. Kinda like a multi-consumer ticket lock. At least that's the first thing that comes to mind.
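A rough sketch of that single-counter idea (hypothetical per-machine step work; thread pinning left out for brevity, e.g. pthread_setaffinity_np on Linux; here a coordinator on the main thread advances the counter and workers spin on it with relaxed loads):

```
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

int main() {
    constexpr int kThreads = 4;
    constexpr uint64_t kSteps = 1'000'000;

    std::atomic<uint64_t> tick{0};      // advanced by the coordinator
    std::atomic<int> done_this_tick{0}; // workers that finished the current tick

    std::vector<std::thread> workers;
    for (int t = 0; t < kThreads; ++t) {
        workers.emplace_back([&] {
            for (uint64_t step = 1; step <= kSteps; ++step) {
                // Spin with relaxed loads until the coordinator publishes this tick.
                while (tick.load(std::memory_order_relaxed) < step) { /* spin */ }
                std::atomic_thread_fence(std::memory_order_acquire);

                // ... one lightweight step of this machine's state would go here ...

                done_this_tick.fetch_add(1, std::memory_order_release);
            }
        });
    }

    // Coordinator: advance the shared counter once all workers report in.
    for (uint64_t step = 1; step <= kSteps; ++step) {
        tick.store(step, std::memory_order_release);
        while (done_this_tick.load(std::memory_order_acquire) < kThreads) { /* spin */ }
        done_this_tick.store(0, std::memory_order_relaxed);
    }
    for (auto& w : workers) w.join();
}
```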

@dotstdy @slembcke @floooh @sol_hsa
Some kind of sync primitive that blocks the threads waiting on it until all threads are ready. I've seen it under this exact name in the past, but I don't remember where anymore (what language or framework or whatever)

@dotstdy @slembcke @floooh @sol_hsa
but maybe this can be done with some kind of spin lock, so the overhead is mostly just waiting until everyone is ready

but still, ~100 cycles is not much
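For concreteness, the built-in kind would be something like C++20's std::barrier (or pthread_barrier_t); a minimal lockstep loop with it is short, though as the next reply points out, a blocking barrier will typically put waiting threads to sleep:

```
#include <barrier>
#include <cstdint>
#include <thread>
#include <vector>

// Minimal sketch: N "machines" stepping in lockstep via C++20 std::barrier.
// The per-step work is a hypothetical placeholder.
int main() {
    constexpr int kThreads = 4;
    constexpr uint64_t kSteps = 100'000;

    std::barrier<> sync_point(kThreads); // no thread proceeds until all have arrived

    std::vector<std::thread> machines;
    for (int i = 0; i < kThreads; ++i) {
        machines.emplace_back([&] {
            for (uint64_t step = 0; step < kSteps; ++step) {
                // ... one lightweight step of this machine's state ...
                sync_point.arrive_and_wait(); // may block (sleep) until everyone arrives
            }
        });
    }
    for (auto& m : machines) m.join();
}
```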

@Doomed_Daniel @slembcke @floooh @sol_hsa ah yeah, barrier is the term then. Using a built-in one is going to be super expensive, though. Not because of the synchronization itself per se, but because as soon as one thread hits the barrier it's going to go to sleep, and then every single iteration is going to involve context switching.
@dotstdy @Doomed_Daniel @floooh @sol_hsa You are basically left with spinlocks and atomics. Even those require synchronizing caches, right? Is 100 cycles pessimistic for that on a 202X CPU?
@slembcke @Doomed_Daniel @floooh @sol_hsa Probably optimistic, I think; SMT siblings lower, cross-domain sharing higher. But yeah, the whole thing ain't going to be fast, it's hard to imagine it being profitable unless you have really a lot of work per step. Just looking at memory latency alone, sharing between two cores on the same L3 in a 7950X is 15-20ns; crossing to the other CCX is ~70ns.

@sol_hsa I've had the same thought and same assumption that it won't get much speedup, but it's hard to say.

Thinking about what might minimize slow cache line ownership transfers, I have no idea if this is better than contending over one cache line, but maybe: each thread gets a 64-bit int on its own cache line. At each join point, each increments its own int to x, then spins reading the other n-1 ints until they're all >= x.
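A sketch of that scheme (hypothetical step work; 64 bytes assumed as the cache line size). Because the counters only ever increase, "all >= x" stays a valid join condition even if some threads race ahead to x+1:

```
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// One 64-bit counter per thread, each on its own cache line.
struct alignas(64) SlotCounter {
    std::atomic<uint64_t> value{0};
};

int main() {
    constexpr int kThreads = 4;
    constexpr uint64_t kSteps = 100'000;

    std::vector<SlotCounter> counters(kThreads);

    std::vector<std::thread> threads;
    for (int self = 0; self < kThreads; ++self) {
        threads.emplace_back([&, self] {
            for (uint64_t x = 1; x <= kSteps; ++x) {
                // ... one lightweight step of this thread's machine ...

                // Publish "I've finished step x" on my own cache line.
                counters[self].value.store(x, std::memory_order_release);

                // Spin reading the other n-1 counters until all are >= x.
                for (int other = 0; other < kThreads; ++other) {
                    if (other == self) continue;
                    while (counters[other].value.load(std::memory_order_acquire) < x) {
                        /* spin */
                    }
                }
            }
        });
    }
    for (auto& t : threads) t.join();
}
```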

@sol_hsa I don’t have any personal experience with parallel programming, but Ryan Fleury recently blogged about his approach, and it seems like a really elegant solution to me: https://www.rfleury.com/p/multi-core-by-default
@sol_hsa there's a pattern I've had good results with that sounds similar to the problem space you're describing. I had a thread pool with (cpu_count - 2) threads and a single atomic queue feeding it a heterogeneous set of tasks. Each generation of tasks is typically way more tasks than there are threads, so the threads can stay busy without significant synchronization. When the last task in a generation completes, it dumps the next generation into the work queue.
@sol_hsa however, if each generation is light on the number of tasks, you probably won't see much benefit from this pattern
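A rough sketch of that generation pattern (hypothetical task bodies; the shared queue is mutex-protected here for brevity where the original used an atomic queue, but the shape is the same: whichever worker finishes a generation's last task enqueues the next one):

```
#include <atomic>
#include <cstdio>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

int main() {
    const unsigned hw = std::thread::hardware_concurrency();
    const unsigned kWorkers = hw > 2 ? hw - 2 : 1;  // thread pool of (cpu_count - 2)
    constexpr int kGenerations = 100;
    constexpr int kTasksPerGen = 256;               // many more tasks than threads

    std::mutex queue_mutex;
    std::deque<std::function<void()>> queue;        // single shared work queue
    std::atomic<int> remaining{0};                  // unfinished tasks in this generation
    std::atomic<int> generations_left{kGenerations};
    std::atomic<bool> done{false};

    // Dump one generation of (hypothetical, heterogeneous) tasks into the queue.
    auto enqueue_generation = [&] {
        remaining.store(kTasksPerGen, std::memory_order_relaxed);
        std::lock_guard<std::mutex> lock(queue_mutex);
        for (int i = 0; i < kTasksPerGen; ++i)
            queue.push_back([] { /* lightweight work for this step */ });
    };

    enqueue_generation();

    std::vector<std::thread> pool;
    for (unsigned w = 0; w < kWorkers; ++w) {
        pool.emplace_back([&] {
            while (!done.load(std::memory_order_acquire)) {
                std::function<void()> task;
                {
                    std::lock_guard<std::mutex> lock(queue_mutex);
                    if (queue.empty()) continue;    // spin; a real pool would block here
                    task = std::move(queue.front());
                    queue.pop_front();
                }
                task();
                // Whoever finishes the generation's last task publishes the next one.
                if (remaining.fetch_sub(1, std::memory_order_acq_rel) == 1) {
                    if (generations_left.fetch_sub(1, std::memory_order_relaxed) == 1)
                        done.store(true, std::memory_order_release);
                    else
                        enqueue_generation();
                }
            }
        });
    }
    for (auto& t : pool) t.join();
    std::printf("ran %d generations of %d tasks\n", kGenerations, kTasksPerGen);
}
```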

@aeva yeah, that's useful for things like audio generation, and I've done that before... this one unfortunately has so much potential cross-communication at low granularity that it's not feasible.

The result was more or less what I expected (if not a bit better), but one has to challenge assumptions every now and then...