Mastodawn

int*domi;*domi=0 Aug 24, 2024

One Bash fallacy I used to believe, for whatever reason, is that you can’t do substitutions on the same variable you’re assigning to. This led me to use extra variables to hold intermediate values, which in itself contributed to code clutter. This is, fortunately, not needed!

Remember: while you can’t do cat a | grep meow > a (cat abuse for demonstrative purposes only), nothing stops you from doing a=${a/nyaa/meow}!
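A quick sketch of the variable case, assuming Bash (the ${var/pattern/replacement} expansion is not plain POSIX sh). The variable names and strings are only illustrative:

```shell
# The right-hand side is fully expanded before the assignment happens,
# so reading and writing the same variable in one statement is safe.
a="nyaa nyaa"
a=${a/nyaa/meow}     # single slash: replace the first match only
first=$a
a=${a//nyaa/meow}    # double slash: replace every remaining match
echo "$first"        # prints: meow nyaa
echo "$a"            # prints: meow meow
```

The file case differs because > a truncates the file when the redirection is set up, independently of whatever is reading it.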

Using this, one can create expression chains of sorts, which only operate on one variable:

delim=$'\01'
newline=$'\02'
ctrl=$'\03'

# (...)

tr="${1//$delim}"          # remove 0x01
tr="${tr//$newline}"       # remove 0x02
tr="${tr//$ctrl}"          # remove 0x03
tr="${tr//$'\n'/$newline}" # \n -> 0x02

(excerpt from notORM) #ScriptSaturday
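For experimenting with the chain outside of notORM, it could be wrapped roughly like this. This is only a sketch: encode_field is a hypothetical name and the sample input is made up, neither comes from http.sh.

```shell
delim=$'\01'
newline=$'\02'
ctrl=$'\03'
nl=$'\n'

# Hypothetical wrapper (not part of notORM): strips the three reserved
# bytes from its argument, then encodes real newlines as 0x02.
encode_field() {
    local tr=$1
    tr=${tr//$delim}          # remove 0x01
    tr=${tr//$newline}        # remove 0x02
    tr=${tr//$ctrl}           # remove 0x03
    tr=${tr//$nl/$newline}    # \n -> 0x02
    printf '%s' "$tr"
}

# Made-up input: a stray 0x03 byte and one real newline.
encoded=$(encode_field $'me\03ow\nnyaa')
```

Every step reads and reassigns the same local variable, so no intermediate names are needed.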

http.sh/src/notORM.sh at cd0fe42879a9012dbe5caab5c9a7624def8ef5a5

http.sh - A webserver/web framework written entirely in Bash. Fully configurable, with SSL support, vhosts and many other features.

the sakamoto git server

Agnieszka R. Turczyńska Aug 24, 2024

@domi
cat a | grep meow | ( rm a; tee a > /dev/null)
will work(*). But if I see that in production code, I'll scream :)

(*) on Linux.

mei | 絶望動物 Aug 24, 2024

@agturcz @domi wait, isn't that a race condition?

int*domi;*domi=0 Aug 24, 2024

@mei @agturcz iirc it is

a less dirty solution would be `grep meow <<< "$(cat a)" > a`, but that introduces a trailing newline + i'm not sure whether this is a race condition too or not
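One way to sidestep the ordering question entirely is to split the read and the write into two separate statements. A minimal sketch, assuming the filtered output fits in memory and contains no NUL bytes (the file contents and the variable name filtered are made up):

```shell
cd "$(mktemp -d)" || exit 1
printf 'meow 1\nwoof\nmeow 2\n' > a

# The command substitution only returns after grep has exited,
# so a is no longer being read when the next line truncates it.
filtered=$(grep meow a)
printf '%s\n' "$filtered" > a
```

Note that $(...) strips trailing newlines and printf adds exactly one back, so an empty result still leaves a single empty line in a.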

Agnieszka R. Turczyńska Aug 24, 2024

@domi It is still operating on the same file, understood as a name tied to the starting inode.
Also, an edge case: a is bigger than your available memory+swap.
@mei

mei | 絶望動物 Aug 24, 2024

@agturcz @domi how do you explain this behavior, then?

Agnieszka R. Turczyńska Aug 24, 2024

@mei You've got me :) I thought about this edge case, so here is the answer.

When the shell executes the a|b pipeline, the following sequence is run:
- pipe() is called to create a pair of connected descriptors, one to write to and one to read from
- fork() is called for the a and b sides, each inheriting the descriptors from the parent process. Please note that this is the moment we lose control of the sequence: every child process is managed by the process scheduler, so we cannot guarantee the order of the following operations
- in the child that is going to be a, the write descriptor is cloned with dup2() as descriptor 1 (stdout), then execve() is called to actually run a
- in the child that is going to be b, the read descriptor is cloned with dup2() as descriptor 0 (stdin), then execve() is called to actually run b
Disclaimer: I am omitting some housekeeping here, giving the concept only.
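One observable consequence of the fork() step can be sketched as follows, assuming Bash with default options (lastpipe disabled). The variable names are illustrative:

```shell
seen="parent"

# The right-hand side of the pipe runs in a forked child, so its
# assignment to seen never reaches the parent shell.
printf 'hello\n' | { read -r line; seen="child saw: $line"; }

echo "$seen"    # prints: parent
```

The child's copy of seen does change, but that copy disappears when the child exits.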

If there are multiple |, first pipe() is called for all of them, then fork() is used to spawn all the child processes (and again, this is the moment we lose control of the sequence of operations), then each child process performs dup2() to properly set up stdin and stdout, and then execve() follows to actually run the given stage of the pipeline.

I had hoped that having () (which means another fork-exec is being run) would be enough to delay the execution of this stage, allowing cat to open the file before rm removes it, but the delay is not significant enough to ensure it. Certainly, adding sleep before rm would solve the problem, but then what should be the proper value of delay? 1s? 0.1s? 0.01s?

However, this is solvable. One approach is very explicit:

raise_semaphor; (grep meow a & delay; rm a; lower_semaphor) | (wait_for_semaphor; tee a > /dev/null)

where the semaphore operations are some dirty magic using flock,
and delay is either sleep or a condition checking whether a has been opened by grep.

The semaphore operations are here to ensure the order of commands; in particular, to make sure tee won't open the file until grep has opened it and rm has removed it.

However, there is an implicit solution which, at least in theory, should work.

{ grep root | { rm a && tee a > /dev/null ; }; } < a

Why?

We already know the pattern.
First, file descriptors are prepared, then fork() happens, then dup2() renumbers descriptors for each child process, then execve() is called.

So, if cmd < a is used, the shell opens the file a in the parent process. The file is open, and its descriptor is available for further use.
Before or after that (it does not matter), pipe() is called to prepare the descriptors for handling all the |. But this is all still the first stage.
The second stage is to call all the forks. This is the moment we lose control over the order of execution. But the descriptor to read from a is already open. Whatever happens next is not relevant, as long as we remove the file a from the directory before opening it for writing, which is ensured by calling rm and tee sequentially. I am using && instead of ; for better testability, as I can then run the following command:

cp /etc/passwd a; cnt=0; while { grep root | { rm a && tee a > control-file ; }; } < a; do (( ++cnt )); done ; echo $cnt

It will run until we break it, or until a race condition is detected.

After running the command for a minute and intentionally breaking it by removing the a file from another session, the result is:
22059
and the control-file, which is overwritten at each iteration of the loop, contains:

root:x:0:0:root:/root:/bin/bash
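The property the implicit solution relies on, that a descriptor opened before rm keeps the old contents readable, can be checked in isolation. A small sketch, assuming a Unix-like system where open files can be unlinked (descriptor number 3 and the file contents are arbitrary choices):

```shell
cd "$(mktemp -d)" || exit 1
printf 'old contents\n' > a

exec 3< a                      # open a for reading on descriptor 3 in this shell
rm a                           # unlink the name; the open inode stays alive
printf 'new contents\n' > a    # creates a brand-new file under the same name

old=$(cat <&3)                 # still reads the original data
exec 3<&-                      # close descriptor 3
new=$(cat a)

echo "$old / $new"             # prints: old contents / new contents
```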

@domi

mei | 絶望動物 Aug 24, 2024

@agturcz @domi

> raise_semaphor; (grep meow a & delay; rm a; lower_semaphor) | (wait_for_semaphor; tee a > /dev/null)

these semaphores seem to only serve to obfuscate the issue. the behavior is equivalent to

grep meow a | (delay; rm a; tee a > /dev/null)

perhaps with the syntactic difference that this mythical delay would be easier to implement with the way you wrote it

but if you implement delay as sleep, it is still broken. you can never, ever fix a race condition by adding a sleep. you'll just make it less likely to happen – it'll take an unfortunately timed hiccup for it to happen, but nothing says that's impossible. there is no upper bound on how bad such a hiccup can be. think "user hibernates the machine at an unfortunate moment and the processes get resumed at different times, creating a multi-second gap"

> Certainly, adding sleep before rm would solve the problem

no!!!

> { rm a && grep root >a; } < a

oh, there we go. my mental model does say that this should indeed reliably work. neat.
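A self-contained way to try that form, sketched with a made-up file and the pattern meow instead of root, assuming a Unix-like system where an open file can be unlinked:

```shell
cd "$(mktemp -d)" || exit 1
printf 'meow\nwoof\nmeow again\n' > a

# The shell opens a for reading before anything inside the braces runs.
# rm then unlinks the name, and > a creates a fresh file, while grep
# keeps reading the old inode through the inherited stdin.
{ rm a && grep meow > a; } < a

result=$(cat a)
```

Keep in mind that this is not atomic: if grep is interrupted after rm has run, the original contents are gone.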

Agnieszka R. Turczyńska

@mei @domi

> these semaphores seem to only serve to obfuscate the issue.
Correct.

> but if you implement delay as sleep, it is still broken
Absolutely correct.