@mei You've got me :) I thought about this edge case, so here is the answer.
When the shell executes the a|b pipeline, the following sequence is run:
- pipe() is called to create a pair of connected descriptors, one to write to and one to read from
- fork() is called for both the a and b sides, and each child inherits the descriptors from the parent process. Please note, this is the moment when we lose control of the sequence: every child process is now managed by the process scheduler, hence we cannot guarantee the order of the following operations
- in the child which is going to be a, the write descriptor is duplicated with dup2() onto descriptor 1 (stdout), then execve() is called to actually run a
- in the child which is going to be b, the read descriptor is duplicated with dup2() onto descriptor 0 (stdin), then execve() is called to actually run b
Disclaimer: I am omitting some housekeeping here, giving the concept only.
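The effect of the steps above can be observed from the shell itself. This is only a small sketch, assuming bash (it relies on $BASHPID) with the default settings (no lastpipe); the variable names are made up for illustration. It shows that each side of a pipeline runs in its own forked process and that the right-hand side finds a pipe on descriptor 0.

```shell
#!/usr/bin/env bash
# Sketch only: assumes bash, because $BASHPID is a bash feature.
shell_pid=$BASHPID

# The left side prints its own PID into the pipe. The right side reads it,
# checks what its stdin is, and reports both PIDs.
report=$(
  { echo "$BASHPID"; } | {
    read -r left_pid
    if [ -p /dev/stdin ]; then stdin_kind=pipe; else stdin_kind=other; fi
    echo "$left_pid $BASHPID $stdin_kind"
  }
)

read -r left_pid right_pid stdin_kind <<EOF
$report
EOF

echo "shell=$shell_pid left=$left_pid right=$right_pid stdin=$stdin_kind"
```

The three PIDs differ, which is the fork() step; which of the two children gets to run first is up to the scheduler and cannot be seen from this output.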
If there are multiple |, pipe() is first called for each of them, then fork() is used to spawn all the child processes. Again, this is the moment we lose control of the sequence of operations. Each child process then performs dup2() to set up its stdin and stdout, and execve() follows to actually run the given stage of the pipeline.
I had hoped that having () - which means another fork-exec is run - would be enough to delay the execution of this stage, hence allowing cat to open the file before rm removes it, but the delay is not significant enough to ensure that. Certainly, adding sleep before rm would solve the problem, but then what should be the proper value of the delay? 1s? 0.1s? 0.01s?
However, this is solvable. One approach is very explicit:
raise_semaphore; (grep meow a & delay; rm a; lower_semaphore) | (wait_for_semaphore; tee a > /dev/null)
where the semaphore operations are some dirty magic using flock,
and delay is either a sleep or a condition checking whether a has been opened by grep.
The semaphore operations are here to ensure the order of commands; in particular, to make sure tee won't open the file until grep has opened it first and rm has removed it.
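One way the "dirty magic" could look is sketched below. It is only an illustration, not a tested recipe: it assumes Linux with flock(1) from util-linux, and the lock file, descriptor numbers 3 and 9 and the sample data are all invented for the example. It also swaps the delay for something deterministic: the left side opens a itself with exec before calling rm, so no sleep is needed.

```shell
#!/usr/bin/env bash
# Sketch only: assumes Linux and flock(1) from util-linux.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf 'meow 1\nwoof\nmeow 2\n' > a      # hypothetical sample data
lock=$workdir/lock                       # hypothetical lock file

# "raise": take an exclusive lock in the parent, before the pipeline forks.
exec 9> "$lock"
flock 9

{
  exec 3< a      # open the input first (this replaces the delay)
  rm a           # unlink the name; descriptor 3 still refers to the old data
  flock -u 9     # "lower": release the lock so the writer may proceed
  grep meow <&3
} | {
  # "wait": flock opens the lock file anew, so it blocks until the lock
  # taken through descriptor 9 has been released, and only then runs tee.
  flock "$lock" tee a > /dev/null
}

exec 9>&-
result=$(cat a)
echo "$result"
cd / && rm -rf "$workdir"
```

The waiting side has to open the lock file on its own rather than reuse descriptor 9, because a lock taken with flock belongs to the open file description, which both children inherit and share with the parent.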
However, there is also an implicit solution which, at least in theory, should work.
{ grep root | { rm a && tee a > /dev/null ; }; } < a
Why?
We already know the pattern.
First, the file descriptors are prepared, then fork() happens, then dup2() renumbers the descriptors in each child process, then execve() is called.
So, if cmd < a is used, where cmd is the whole { ... } group, the shell opens the file a in the parent process, before any stage of the pipeline is forked. The file is open, and its descriptor is available for further use.
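The property this relies on is that a descriptor opened before the name is removed stays readable afterwards. A minimal sketch, assuming a POSIX shell and a scratch directory; the sample lines and descriptor number 3 are invented for the example:

```shell
#!/bin/sh
# Sketch only: shows that rm removes the name, not the already-open file.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf 'root:x:0:0:root:/root:/bin/bash\nnobody:x:65534:65534::/:/sbin/nologin\n' > a

exec 3< a        # open a for reading on descriptor 3, in this shell
rm a             # the directory entry is gone...

if [ -e a ]; then name_state=present; else name_state=gone; fi

matched=$(grep root <&3)   # ...but the data is still reachable via descriptor 3
exec 3<&-

echo "name: $name_state, data read after rm: $matched"
cd / && rm -rf "$workdir"
```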
Before or after that, it does not matter, pipe() is called to prepare the descriptors for handling all the |. This is the first stage.
The 2nd stage is to call all the forks. This is the moment we lose control over the order of execution. But the descriptor to read from a is already open. Whatever happens next is not relevant, as long as we remove the file a from the directory before tee opens it, which is ensured by calling rm and tee sequentially. I am using && instead of ; for better testability, as I can then run the following command:
cp /etc/passwd a; cnt=0; while { grep root | { rm a && tee a > control-file ; }; } < a; do (( ++cnt )); done ; echo $cnt
It will run until we break it, or until a race condition is detected.
After running the command for a minute and intentionally breaking it by removing the a file from another session, the result is:
22059
and the control-file, which is overwritten at each iteration of the loop, contains:
root:x:0:0:root:/root:/bin/bash
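For anyone who wants to repeat the check without touching /etc/passwd or interrupting the loop by hand, a bounded variant could look like the sketch below. The scratch directory, the two sample lines and the limit of 200 iterations are arbitrary choices for illustration, not part of the original test.

```shell
#!/bin/sh
# Sketch only: the same loop as above, but bounded and run on sample data.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf 'root:x:0:0:root:/root:/bin/bash\ndaemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin\n' > a

cnt=0
limit=200        # arbitrary upper bound, so the loop ends on its own

# The loop stops early if rm or tee fails, or if a cannot be opened.
while [ "$cnt" -lt "$limit" ] &&
      { grep root | { rm a && tee a > control-file; }; } < a
do
  cnt=$((cnt + 1))
done

last=$(cat control-file)
echo "$cnt"
echo "$last"
cd / && rm -rf "$workdir"
```

If the reasoning holds, the counter reaches the limit and the control file still holds the root line; a lower count would mean the loop ended early.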
@domi