Yesterday I was thinking about serialization of fiber-based state machines and remembered Near (RIP) had written an article exploring the problems and potential solutions: https://saveweb.github.io/near.sh/articles/design/cooperative-serialization.html.

Anyway, I think I have another solution. Let's take the article's two-component example of a CPU and an APU. For deterministic replay, normally you only need to record external inputs, not internal inputs (i.e. cross-component interactions). But you can use per-component "ragged checkpoints" to resolve the alignment problem. During the checkpointing interval you're also recording internal inputs, so you can step forward the lagging component in isolation to catch up when restoring a checkpoint.
I'm not a distributed systems person, but I guess this is a _much_ simpler version of something like https://en.wikipedia.org/wiki/Chandy%E2%80%93Lamport_algorithm. E.g. this part is similar to what I was describing as recording internal inputs during the checkpointing interval: "If a process receives a marker after having recorded its local state, it records the state of the incoming channel from which the marker came as carrying all the messages received since it first recorded its local state."

@pervognsen
hmm but wouldn't that need all cross-component interactions to be expressed as explicit message passing, instead of method calls and/or shared state?
@wolf480pl That's just the distributed system algorithm. The way I normally do record-and-replay is to instrument component interface calls. I think if anything it should be easier for something like the whole-system emulation example, but I need to do a prototype implementation to prove it to myself.
@wolf480pl Here's how I see the serialization side:
1. Set a flag to announce that you want to checkpoint the system state.
2. The next time a component reaches a safe point, it snapshots its local state and starts recording interface calls while continuing execution.
3. When the last component reaches a safe point and takes a snapshot, the system checkpoint is finished. Each component contributes a local state snapshot and its local log of recorded interface calls.
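
A minimal sketch of steps 1 and 2 for a single component, in Python. All names here (`Apu`, `Cpu`, `RecordingPort`, `safe_point`) are hypothetical and only illustrate the idea of instrumenting interface calls; this is not code from Near's article or from any emulator. The point it tries to show is that the port always forwards the call, so the callee keeps answering normally, and the log only captures what the caller observed after its own snapshot.

```python
class Apu:
    """Toy callee: a small memory that can be read through an interface call."""
    def __init__(self):
        self.mem = {0x1234: 7}

    def read(self, addr):
        return self.mem.get(addr, 0)


class RecordingPort:
    """The CPU's view of the APU: forwards every call, logs it while recording."""
    def __init__(self, target):
        self.target = target
        self.log = None  # None means "not recording"

    def read(self, addr):
        value = self.target.read(addr)  # the call is always serviced
        if self.log is not None:
            self.log.append(("read", addr, value))
        return value


class Cpu:
    def __init__(self, apu_port):
        self.apu = apu_port
        self.acc = 0
        self.snapshot = None

    def safe_point(self, checkpoint_requested):
        # Step 2: snapshot local state once, then start recording interface calls.
        if checkpoint_requested and self.snapshot is None:
            self.snapshot = {"acc": self.acc}
            self.apu.log = []

    def step(self):
        self.acc += self.apu.read(0x1234)


apu = Apu()
cpu = Cpu(RecordingPort(apu))
cpu.step()                                  # before the request: nothing is logged
cpu.safe_point(checkpoint_requested=True)   # step 1 flag is set, CPU hits a safe point
cpu.step()                                  # execution continues, the call is logged
```

Step 3 is not shown here; it would only need to notice when every component has a non-empty `snapshot` and then stop recording.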
@wolf480pl On the deserialization side:
1. Every local component snapshot corresponds to a different point in the shared timeline; here I'm assuming a shared synchronous clock as in the whole-system emulation example.
2. The n-1 lagging components have to deterministically replay forward from their local snapshot using the locally recorded log of interface calls, to catch up to the newest snapshot. This log means they can replay independently from each other, without any synchronization.
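
The two sides together can be sketched as a toy round trip in Python. Everything in it is an assumption made for illustration: the `Cpu` and `Apu` classes, the fixed safe-point spacing in `SAFE_EVERY`, and the lockstep `run` loop standing in for the shared synchronous clock. It is a sketch of the ragged-checkpoint idea under those assumptions, not a prototype of a real emulator. Each component snapshots at its own next safe point, its inputs are recorded until the last snapshot is taken, and `restore` replays each lagging component in isolation from its own log.

```python
import copy

class Cpu:
    def __init__(self):
        self.acc = 1

    def step(self, tick, read):
        # `read` is the only interface call the CPU makes into the APU.
        self.acc = (self.acc * 31 + read(tick % 4)) % 65521
        # Every third tick the CPU also writes to APU memory.
        return [(self.acc % 4, self.acc)] if tick % 3 == 0 else []

class Apu:
    def __init__(self):
        self.mem = [0, 0, 0, 0]

    def step(self, tick, writes):
        for addr, value in writes:
            self.mem[addr] = value
        self.mem[tick % 4] += tick

SAFE_EVERY = {"cpu": 4, "apu": 7}  # hypothetical safe-point spacing, in ticks

def run(ticks, request_at=None):
    cpu, apu = Cpu(), Apu()
    parts = {"cpu": cpu, "apu": apu}
    snaps, logs = {}, {"cpu": [], "apu": []}
    for tick in range(ticks):
        if request_at is not None and tick >= request_at:
            for name, part in parts.items():
                if name not in snaps and tick % SAFE_EVERY[name] == 0:
                    snaps[name] = (tick, copy.deepcopy(part))
        # Record inputs only for components that have snapshotted, and only
        # until the last component has taken its snapshot.
        recording = set(snaps) if len(snaps) < len(parts) else set()

        def read(addr):
            value = apu.mem[addr]  # the APU always answers, snapshotted or not
            if "cpu" in recording:
                logs["cpu"].append(value)
            return value

        writes = cpu.step(tick, read)
        if "apu" in recording:
            logs["apu"].append(writes)
        apu.step(tick, writes)
    return cpu, apu, (snaps, logs)

def restore(checkpoint):
    snaps, logs = checkpoint
    end = max(tick for tick, _ in snaps.values())  # time of the newest snapshot
    start, cpu = snaps["cpu"]
    cpu = copy.deepcopy(cpu)
    reads = iter(logs["cpu"])
    for tick in range(start, end):
        cpu.step(tick, lambda addr: next(reads))  # replayed outputs are discarded
    start, apu = snaps["apu"]
    apu = copy.deepcopy(apu)
    for tick, writes in zip(range(start, end), logs["apu"]):
        apu.step(tick, writes)
    return end, cpu, apu

# Usage: checkpoint mid-run, restore, and compare against a straight run.
_, _, checkpoint = run(20, request_at=9)
end, cpu, apu = restore(checkpoint)
ref_cpu, ref_apu, _ = run(end)
assert (cpu.acc, apu.mem) == (ref_cpu.acc, ref_apu.mem)
```

In this run the CPU snapshots first and is the lagging component; it catches up using only its recorded read values and never touches the APU during replay. The design choice worth noting is that each log stores what the component received (read results for the CPU, incoming writes for the APU), which is what lets the replays run without any synchronization.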

@pervognsen
ok, so let's say APU is already snapshotted, and the CPU tries to read APU's memory, but the APU is behind.

So I guess it passes a message to the APU "gimme memory at address 0x1234".

The APU dutifully records that message, and doesn't reply, since it's already snapshotted.

The CPU needs to wait for the reply, but it will not get one before resume, so it needs to sit in a safe point...

which is what OP was trying to avoid

@wolf480pl The APU continues execution after taking its snapshot, so it behaves completely normally from everyone's point of view.