Mastodawn

Jann Horn Jun 1

To actually access it, you first need to attach to the target process using ptrace(…), the system’s debugging API.

You shouldn't need to do that.

lseek(…) to a specific offset, and then read(…) or write(…) as you would with a normal file

(or you can use pread/pwrite)
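A minimal sketch of the pread route, assuming Linux and Python 3. The names `read_process_memory` and `demo` are made up for illustration. It reads the process's own memory through /proc/self/mem without any ptrace attach; reading another process works the same way, provided the kernel's access check on open() passes.

```python
import ctypes
import os


def read_process_memory(pid, address, length):
    """Read `length` bytes at virtual address `address` from /proc/<pid>/mem."""
    # open() is subject to the kernel's ptrace-style access check,
    # but no PTRACE_ATTACH is required.
    fd = os.open(f"/proc/{pid}/mem", os.O_RDONLY)
    try:
        # pread() does the lseek() + read() in a single call.
        return os.pread(fd, length, address)
    finally:
        os.close(fd)


def demo():
    # Place a known byte string in our own address space, then read it back
    # through the /proc interface instead of through the pointer.
    buf = ctypes.create_string_buffer(b"hello via /proc/self/mem")
    return read_process_memory("self", ctypes.addressof(buf), len(buf.value))
```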

the ptrace(…) function also supports a better-known PTRACE_PEEKDATA and PTRACE_POKEDATA interface, but come on

fwiw, a third alternative is the pair of process_vm_readv/process_vm_writev syscalls, which also don't require attaching with ptrace()
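A rough sketch of calling process_vm_readv from Python through ctypes, assuming Linux and a libc that exports the wrapper (glibc does). `IoVec` and `vm_read` are illustrative names, not part of any library. The demo target is the calling process itself; some sandboxes block this syscall, in which case the call raises OSError.

```python
import ctypes
import os


class IoVec(ctypes.Structure):
    # Mirrors struct iovec from <sys/uio.h>.
    _fields_ = [("iov_base", ctypes.c_void_p), ("iov_len", ctypes.c_size_t)]


def vm_read(pid, address, length):
    """Copy `length` bytes from `address` in process `pid` into a local buffer."""
    libc = ctypes.CDLL(None, use_errno=True)
    fn = libc.process_vm_readv
    fn.restype = ctypes.c_ssize_t
    fn.argtypes = [
        ctypes.c_int,
        ctypes.POINTER(IoVec), ctypes.c_ulong,  # local iovec array + count
        ctypes.POINTER(IoVec), ctypes.c_ulong,  # remote iovec array + count
        ctypes.c_ulong,                         # flags, must be 0
    ]
    local_buf = ctypes.create_string_buffer(length)
    local = IoVec(ctypes.addressof(local_buf), length)
    remote = IoVec(address, length)
    n = fn(pid, ctypes.byref(local), 1, ctypes.byref(remote), 1, 0)
    if n < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return local_buf.raw[:n]
```

Usage would look like `vm_read(os.getpid(), ctypes.addressof(some_buffer), 16)`; no file descriptor and no attach step is involved.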

Some readers might find it amusing that back then, Linux 2.2 allowed you to call mmap(…) on the resulting /proc//mem file descriptor

ooh, I didn't know that bit of history, interesting

You can peek at process metadata via a pseudo-filesystem called /proc, but the data doesn’t seem particularly interesting. It’s the stuff you see in the output of ps or top:

Fun story: /proc/$pid/environ and /proc/$pid/cmdline work very similarly to /proc/$pid/mem, except that they are read-only and restrict the view of memory to a subrange of the virtual address space. (and cmdline also does some munging of the data iirc.)
So even when you run "ps aux" or such, behind the scenes this is reading from the memory of the listed processes, almost like /proc/$pid/mem would.
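For illustration, a small sketch of what a ps-like tool does with these files, assuming Linux and Python 3. The helper names are made up. Both files are NUL-separated byte strings that the kernel reads out of the target's memory.

```python
def read_nul_separated(path):
    """Return the NUL-separated fields of a /proc file as a list of bytes."""
    with open(path, "rb") as f:
        raw = f.read()
    # Drop the single trailing terminator so it does not yield an empty field.
    if raw.endswith(b"\0"):
        raw = raw[:-1]
    return raw.split(b"\0") if raw else []


def cmdline(pid="self"):
    # argv of the process, as currently stored in its memory.
    return read_nul_separated(f"/proc/{pid}/cmdline")


def environ(pid="self"):
    # The environment block the process was started with.
    return read_nul_separated(f"/proc/{pid}/environ")
```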

Christian Brauner 🦊🐺 Jun 1

But then, not everything is a file descriptor! Some parts of the OS use separate namespaces; a good example are process identifiers (PIDs)

Linux does have "pidfds" nowadays, which let you represent a reference to a process with an FD, thanks to @brauner

Jann Horn Jun 1

@lcamtuf also, fun fact about /proc/$pid/mem: because it is intended for stuff like debuggers that want to install breakpoints into executable code, it uses FOLL_FORCE internally to allow you to bypass write permission checks (unless you're accessing a shared memory range). so you can write over executable read-only code, or read-only data, with this API
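A hedged sketch of the forced-write behavior, assuming Linux, Python 3, and a kernel that has not been configured to restrict forced writes through /proc/$pid/mem. The function name is made up. It maps a private anonymous page as read-only, writes into it via /proc/self/mem, and reads the result back through the mapping.

```python
import ctypes
import mmap
import os


def write_through_readonly_mapping(payload):
    """Write `payload` into a PROT_READ page via /proc/self/mem; return what the page holds."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mmap.restype = ctypes.c_void_p
    libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                          ctypes.c_int, ctypes.c_int, ctypes.c_long]
    libc.munmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t]

    size = mmap.PAGESIZE
    # Private, anonymous, read-only: a direct store to this page would fault.
    addr = libc.mmap(None, size, mmap.PROT_READ,
                     mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS, -1, 0)
    if addr is None or addr == ctypes.c_void_p(-1).value:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    try:
        fd = os.open("/proc/self/mem", os.O_RDWR)
        try:
            # The kernel performs this write on our behalf, ignoring PROT_READ.
            os.pwrite(fd, payload, addr)
        finally:
            os.close(fd)
        # Read the bytes back through the (still read-only) mapping.
        return ctypes.string_at(addr, len(payload))
    finally:
        libc.munmap(addr, size)
```

On a typical kernel, `write_through_readonly_mapping(b"patched")` returns the payload even though the page was never writable from the process's point of view.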

@jann @lcamtuf @brauner Nitpick: pidfds do not refer to processes, but to PIDs. Their name is very deliberate, it's double indirection.

You can witness this for example by the fact that `pidfd_send_signal` only works on processes in your own PID namespace or its child namespaces. Other types of file descriptors refer to resources directly, e.g. you can use an fd to write to a file that is not visible to you in your current mount namespace.

Christian Brauner 🦊🐺 Jun 1

@muvlon @jann @lcamtuf It's trivial to remove the restriction to only allow signaling pidfds in descendant pid namespaces. That's really just a limitation we put in place to temporarily retain the same behavior as for PIDs (which architecturally can't do this). pidfds can easily signal upwards if we wanted to allow that.

Jann Horn 6d ago

@muvlon @brauner technically they don't refer to PIDs, they refer to "struct pid". That means even when a PID is reused by a new process, a pidfd pointing to the old process does not start referring to that new process.

A "struct pid" is essentially a weak reference on a thread, a process, a process group, and a session, except that the tasks it points to can change on non-main-thread execve().

Christian Brauner 🦊🐺 6d ago

@jann @muvlon "the task it points to can change on non-main-thread execve()" which is the real crime...

Christian Brauner 🦊🐺 6d ago

@jann @muvlon also this has very fun consequences for pidfd polling...

(1) During a multi-threaded exec by a subthread, i.e., non-thread-group leader thread, all other threads in the thread-group including the thread-group leader are killed and the struct pid of the thread-group leader will be taken over by the subthread that called exec. IOW, two tasks change their TIDs.

Christian Brauner 🦊🐺 6d ago

@jann @muvlon (2) A premature thread-group leader exit means that the thread-group leader exited before all of the other subthreads in the thread-group have exited.

Christian Brauner 🦊🐺 6d ago

@jann @muvlon Both cases lead to inconsistencies for pidfd polling with PIDFD_THREAD. Any caller that holds a PIDFD_THREAD pidfd to the current thread-group leader may or may not see an exit notification on the file descriptor, depending on when poll is performed. If the poll is performed before the exec of the subthread has concluded, an exit notification is generated for the old thread-group leader.

Christian Brauner 🦊🐺 6d ago

@jann @muvlon If the poll is performed after the exec of the subthread has concluded, no exit notification is generated for the old thread-group leader.

The correct behavior is to simply not generate an exit notification on the struct pid for a subthread exec, because the struct pid is taken over by the subthread and thus remains alive.

Christian Brauner 🦊🐺 6d ago

@jann @muvlon
But this is difficult to handle because a thread-group leader may exit prematurely, as mentioned in (2). In that case an exit notification is reliably generated, but the subthreads may continue to run for an indeterminate amount of time and thus may also exec at some point.

Christian Brauner 🦊🐺 6d ago

@jann @muvlon
After some rework I did, no exit notification will be generated for a PIDFD_THREAD pidfd for a thread-group leader until all subthreads have been reaped. If a subthread execs before that, no exit notification will be generated until that task exits, or it creates subthreads and repeats the cycle.

This means an exit notification indicates the ability for the parent to reap the child.

Christian Brauner 🦊🐺 6d ago

@jann @muvlon Note that I think all of these cases could be handled a lot more elegantly if we finally enable pidfs integration with fsnotify() - which gets extended with custom event spaces. This would be pidfd monitoring on steroids.

Christian Brauner 🦊🐺 6d ago

@jann @muvlon Because poll() is just super limited in terms of how fine-grained it can communicate events and because it cannot actually send metadata as part of an event.

Jann Horn 6d ago

@brauner @muvlon another option might be David Howells' O_NOTIFICATION_PIPE, which is currently only used by the keys subsystem, idk if that's better or worse than fsnotify

Christian Brauner 🦊🐺 6d ago

@jann @muvlon Funny you say that, as I looked at that just a few weeks ago. So the problem with O_NOTIFICATION_PIPE is that it's inherently fixed length, it has no support for event coalescing (if we ever needed that), and it doesn't allow recovering events easily. Plus I liked the idea of having the ability to use a uniform monitoring interface for general fs events and pidfd events. Plus with Amir and Jan it has very active and responsive maintainers with long-term plans.

@brauner @jann Wow, being confidently wrong on the internet rocks! I won a free pidfd deep dive straight from the horse's mouth :D

(No really, this is amazing, thanks for sharing)