Yesterday I was thinking about serialization of fiber-based state machines and remembered Near (RIP) had written an article exploring the problems and potential solutions: https://saveweb.github.io/near.sh/articles/design/cooperative-serialization.html.

Anyway, I think I have another solution. Let's take the article's two-component example of a CPU and an APU. For deterministic replay, you normally only need to record external inputs, not internal inputs (i.e. cross-component interactions). But you can use per-component "ragged checkpoints" to resolve the alignment problem: during the checkpointing interval you also record internal inputs, so when restoring a checkpoint you can step the lagging component forward in isolation to catch up.
I'm not a distributed systems person, but I guess this is a _much_ simpler version of something like https://en.wikipedia.org/wiki/Chandy%E2%80%93Lamport_algorithm. E.g. this part is similar to what I was describing as recording internal inputs during the checkpointing interval: "If a process receives a marker after having recorded its local state, it records the state of the incoming channel from which the marker came as carrying all the messages received since it first recorded its local state."

@pervognsen
hmm but wouldn't that need all cross-component interactions to be expressed as explicit message passing, instead of method calls and/or shared state?
@wolf480pl That's just the distributed system algorithm. The way I normally do record-and-replay is to instrument component interface calls. I think if anything it should be easier for something like the whole-system emulation example, but I need to do a prototype implementation to prove it to myself.
@wolf480pl Here's how I see the serialization side:
1. Set a flag to announce that you want to checkpoint the system state.
2. The next time a component reaches a safe point, it snapshots its local state and starts recording interface calls while continuing execution.
3. When the last component reaches a safe point and takes a snapshot, the system checkpoint is finished. Each component contributes a local state snapshot and its local log of recorded interface calls.
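Roughly, in Rust-ish terms (all the names here are mine, just a sketch assuming a shared synchronous clock):

```rust
// Sketch of the serialization side: once a checkpoint is requested, each
// component snapshots at its next safe point and then keeps running live,
// logging incoming interface calls until the checkpoint completes.

#[derive(Clone, Default)]
struct Snapshot {
    time: u64,      // shared synchronous clock at the safe point
    state: Vec<u8>, // opaque local component state
}

#[derive(Default)]
struct Component {
    time: u64,
    state: Vec<u8>,
    // Snapshot plus a log of (time, input) interface calls taken since it.
    recording: Option<(Snapshot, Vec<(u64, u8)>)>,
}

impl Component {
    // Called at every safe point while a checkpoint request is pending.
    fn safe_point(&mut self, checkpoint_requested: bool) {
        if checkpoint_requested && self.recording.is_none() {
            let snap = Snapshot { time: self.time, state: self.state.clone() };
            self.recording = Some((snap, Vec::new())); // start logging
        }
    }

    // Interface call from another component; logged while recording,
    // but handled normally either way, so execution never stalls.
    fn interface_call(&mut self, input: u8) {
        if let Some((_, log)) = self.recording.as_mut() {
            log.push((self.time, input));
        }
        self.state.push(input);
        self.time += 1;
    }
}

fn main() {
    let mut apu = Component::default();
    apu.interface_call(1);  // before the checkpoint request: not logged
    apu.safe_point(true);   // snapshot taken at t=1
    apu.interface_call(2);  // after the snapshot: logged *and* applied live
    let (snap, log) = apu.recording.take().unwrap();
    assert_eq!(snap.time, 1);
    assert_eq!(log, vec![(1, 2)]);
    println!("snapshot at t={}, logged {} call(s)", snap.time, log.len());
}
```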
@wolf480pl On the deserialization side:
1. Every local component snapshot can correspond to a different point in the shared timeline; here I'm assuming a shared synchronous clock as in the whole-system emulation example.
2. The n-1 lagging components have to deterministically replay forward from their local snapshot using the locally recorded log of interface calls, to catch up to the newest snapshot. This log means they can replay independently from each other, without any synchronization.
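A minimal sketch of the catch-up replay (again, types and names are my invention):

```rust
// Sketch of the deserialization side: restore a lagging component from its
// local snapshot, then replay its locally recorded interface calls until it
// reaches the time of the newest snapshot. No cross-component sync needed.

#[derive(Clone)]
struct Snapshot { time: u64, state: Vec<u8> }

struct Component { time: u64, state: Vec<u8> }

impl Component {
    fn apply(&mut self, input: u8) {
        self.state.push(input); // deterministic local step
        self.time += 1;
    }
}

// Replay logged (time, input) calls until the component hits `target_time`.
fn catch_up(snap: &Snapshot, log: &[(u64, u8)], target_time: u64) -> Component {
    let mut c = Component { time: snap.time, state: snap.state.clone() };
    for &(t, input) in log {
        if c.time >= target_time { break; }
        debug_assert_eq!(t, c.time); // replay must be deterministic
        c.apply(input);
    }
    c
}

fn main() {
    // APU snapshotted at t=3; the last component snapshotted at t=5.
    let snap = Snapshot { time: 3, state: vec![10, 11, 12] };
    let log = [(3, 13), (4, 14)];
    let apu = catch_up(&snap, &log, 5);
    assert_eq!(apu.time, 5);
    assert_eq!(apu.state, vec![10, 11, 12, 13, 14]);
    println!("caught up to t={}", apu.time);
}
```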

@pervognsen
ok, so let's say APU is already snapshotted, and the CPU tries to read APU's memory, but the APU is behind.

So I guess it passes a message to the APU "gimme memory at address 0x1234".

The APU dutifully records that message, and doesn't reply, since it's already snapshotted.

The CPU needs to wait for the reply, but it will not get one before resume, so it needs to sit in a safe point...

which is what OP was trying to avoid

@wolf480pl The APU continues execution after taking its snapshot, so it behaves completely normally from everyone's point of view.
@pervognsen
I wonder how the overhead of record-and-replay on cross-component interactions compares to the overhead of CPS-transforming (continuation-passing style) the component implementations so that they don't keep anything on the stack.
@wolf480pl In the past I've done this with traits/templates so that you can basically instantiate each component with different modes, e.g. live/pass-through mode (the fast default mode of execution), record mode, and replay mode. So the pass-through mode doesn't have any non-essential overhead. Aside from true external inputs (e.g. gamepad input from the user), you only need to enter the record and replay mode during serialization and deserialization, respectively.
@wolf480pl At least the way I've traditionally done it, the transitions between these modes can only happen at safe points, so it should line up pretty well with the requirements of the checkpointing example.
@pervognsen
so in passthrough mode it's the overhead of 1 virtual call that you'd likely have anyway?
@wolf480pl I don't have any virtual calls at all. I mentioned the trait/template approach to emphasize that you don't need dynamic dispatch since you can do mode switching at safe points without any crazy OSR-like complications.
@pervognsen
ok I guess I don't understand what the trait/template approach is.
@wolf480pl As a simple example, suppose you're modelling a byte-oriented channel with read_byte() -> u8 and write_byte(u8) methods. That channel interface would be a trait. I just mean that your component is compile-time parameterized by a trait impl so you don't need virtual dispatch for something like that, because you can transition a component between modes at safe points. It's not that different from having a normal interpreter and a tracing interpreter, for example.
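Something like this shape (the trait and impls are hypothetical, just to show the monomorphized-modes idea):

```rust
// The component is generic over a Channel impl, so live/record/replay are
// separate compile-time instantiations: no virtual dispatch in any mode.

trait Channel {
    fn read_byte(&mut self) -> u8;
    fn write_byte(&mut self, b: u8);
}

// Live/pass-through mode: no instrumentation overhead.
struct Live { data: Vec<u8>, pos: usize }
impl Channel for Live {
    fn read_byte(&mut self) -> u8 { let b = self.data[self.pos]; self.pos += 1; b }
    fn write_byte(&mut self, b: u8) { self.data.push(b); }
}

// Record mode: wraps a live channel and logs every read.
struct Record { inner: Live, log: Vec<u8> }
impl Channel for Record {
    fn read_byte(&mut self) -> u8 { let b = self.inner.read_byte(); self.log.push(b); b }
    fn write_byte(&mut self, b: u8) { self.inner.write_byte(b); }
}

// Replay mode: answers reads from the log, ignoring the real channel.
struct Replay { log: Vec<u8>, pos: usize }
impl Channel for Replay {
    fn read_byte(&mut self) -> u8 { let b = self.log[self.pos]; self.pos += 1; b }
    fn write_byte(&mut self, _b: u8) {}
}

// Monomorphized per mode: Component<Live>, Component<Record>, Component<Replay>.
struct Component<C: Channel> { chan: C, sum: u64 }
impl<C: Channel> Component<C> {
    fn step(&mut self) { self.sum += self.chan.read_byte() as u64; }
}

fn main() {
    let live = Live { data: vec![1, 2], pos: 0 };
    let mut rec = Component { chan: Record { inner: live, log: vec![] }, sum: 0 };
    rec.step();
    rec.step();
    // Mode switch at a safe point: hand the log to a replay-mode instance.
    let mut rep = Component { chan: Replay { log: rec.chan.log, pos: 0 }, sum: 0 };
    rep.step();
    rep.step();
    assert_eq!(rec.sum, rep.sum); // replay reproduces the recorded run
    println!("recorded sum = {}, replayed sum = {}", rec.sum, rep.sum);
}
```

The mode switch itself is just constructing a new instantiation from the old one's state at a safe point.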
@wolf480pl Maybe a bad example, since a popular way to support different interpreter modes like that is by patching the opcode dispatch table to instrument instructions, which is dynamic dispatch, but there's no reason that's required. :)
@pervognsen
oh, so you compile a separate specialization of a component for each of the modes, and then at a safepoint you switch which one you run?
@wolf480pl Right, it's not anything very special. The only reason I mentioned it is that you brought up the point about overhead.
@pervognsen
sounds a little crazy, I wouldn't've done that unless I was desperate to get rid of that virtual call, but I guess if other people are doing it, it means it's not that crazy at all...
@wolf480pl @pervognsen more or less crazy than cross-modifying code? (https://github.com/backtrace-labs/dynamic_flag)
@pervognsen I got all excited because I thought you meant IO serialization until I started reading the article. >_<

Though this is a more interesting topic than I would have expected!