Trying to answer a question no sane person ever had to ask: How Hard Is It To Open a File?

This one is about the great POSIX idea of a filesystem, and why you could not play your games or open Chrome for a few days.

https://blog.sebastianwick.net/posts/how-hard-is-it-to-open-a-file/

How Hard Is It To Open a File?

It’s a question I had to ask myself multiple times over the last few months. Depending on the context the answer can be:

- very simple, just call the standard library function
- extremely hard, don’t trust anything

If you are an app developer, you’re lucky and it’s almost always the first answer. If you develop something with a security boundary which involves files in any way, the correct answer is very likely the second one.

swick's blog
@swick Didn’t understand 80% of that, but thank you :)
@jimmac appreciate the enthusiasm :p
@swick that was a really nice blog post, thank you!

@swick ocaps
ocaps

*bangs table, stamps feet*

OCAPS
OCAPS

@federicomena object capabilities are a really good concept and fd passing is just that, but this kind of path traversal is something a service is doing internally and maybe even after having received an fd. so I don't reaaaly think that this is about ocaps.
@swick Would Glib address any of that?

@swick For securely using path based APIs like the old mount syscall, passing "/proc/self/fd/NNN" paths is an option. While they look like symlinks from user space, they're treated specially by the kernel and will resolve to whatever the file descriptor points to race free.

I've used that to do bind mounts between possibly hostile locations after resolving the paths similar to how you describe.
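A minimal sketch of the magic-link behaviour described above, assuming Linux with /proc mounted. The helper names `magic_path` and `still_resolves_after_rename` are made up for illustration. It pins a directory with an O_PATH descriptor, renames the directory, and checks that the /proc/self/fd path still leads to the same inode.

```python
import os
import tempfile


def magic_path(fd: int) -> str:
    """Return the /proc/self/fd path for an open descriptor (Linux only)."""
    return f"/proc/self/fd/{fd}"


def still_resolves_after_rename() -> bool:
    """Check that the magic link follows the inode, not the original name."""
    with tempfile.TemporaryDirectory() as tmp:
        old = os.path.join(tmp, "old")
        os.mkdir(old)
        # O_PATH pins the directory without opening it for reading.
        fd = os.open(old, os.O_PATH | os.O_DIRECTORY)
        try:
            # The name the descriptor was opened under goes away.
            os.rename(old, os.path.join(tmp, "new"))
            via_fd = os.fstat(fd)
            # A path-based call that is handed the magic link instead of a name.
            via_magic = os.stat(magic_path(fd))
            return (via_fd.st_dev, via_fd.st_ino) == (
                via_magic.st_dev,
                via_magic.st_ino,
            )
        finally:
            os.close(fd)
```

The same string can be handed to any path-only API, which is what makes it useful for older interfaces such as mount.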

@jamesh Yup, the magic symlink works for mount. Maybe it wasn't the best example...
@swick well written, and awesome work!
@swick "2a01:4f8:c012:2e79::" as listed in DNS for blog.sebastianwick.net does not answer to me on HTTPS.
@swick That was a terrific explanation of a lot of subtle things, thank you!

@swick

I've long thought that there's a hole that needs filling, that does what the original #Unix namei does but allows application mode code to supply everything necessary as (opaque) open descriptors: the root directory, the working directory, and the security credentials.

Frustratingly, Unix openat(), Windows NT's NtCreateFile(), and #Hurd's dir_lookup() all come close but all miss a final piece of the puzzle in different ways. openat() misses, for example, a descriptor for the root directory and something like NT's process token handles for security processing. NT has odd ideas about current directories.

This way, server processes could simply make use of the kernel's own already existing logic to handle not traversing '..' over a changed root, following symbolic links, and checking security using client credentials.

There's so much reinvention of this wheel that would have been resolved decades ago if it had only been exposed as a system call.

#filesystems

@JdeBP @swick what's in your view wrong with dir_lookup?

@JdeBP @swick openat2() with the RESOLVE_IN_ROOT would seem to cover that?

Or RESOLVE_BENEATH if you just want to make sure the opened file is below dirfd without changing the meaning of absolute symlinks.
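As a rough illustration of the two flags mentioned here: Python has no openat2() wrapper in its standard library, so this sketch calls the raw syscall through ctypes. It assumes Linux 5.6 or newer on x86_64 or aarch64 (syscall number 437); the `openat2` wrapper and `demo` function are hypothetical helpers, not an existing API.

```python
import ctypes
import errno
import os
import tempfile

SYS_OPENAT2 = 437       # assumption: x86_64 / aarch64 Linux
RESOLVE_BENEATH = 0x08  # from <linux/openat2.h>
RESOLVE_IN_ROOT = 0x10  # from <linux/openat2.h>


class OpenHow(ctypes.Structure):
    """Mirror of struct open_how from <linux/openat2.h>."""

    _fields_ = [
        ("flags", ctypes.c_uint64),
        ("mode", ctypes.c_uint64),
        ("resolve", ctypes.c_uint64),
    ]


_libc = ctypes.CDLL(None, use_errno=True)
_libc.syscall.restype = ctypes.c_long


def openat2(dirfd: int, path: str, flags: int, resolve: int) -> int:
    """Open `path` relative to `dirfd` with the given RESOLVE_* restrictions."""
    how = OpenHow(flags | os.O_CLOEXEC, 0, resolve)
    fd = _libc.syscall(
        ctypes.c_long(SYS_OPENAT2),
        ctypes.c_int(dirfd),
        ctypes.c_char_p(os.fsencode(path)),
        ctypes.byref(how),
        ctypes.c_size_t(ctypes.sizeof(how)),
    )
    if fd < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err), path)
    return fd


def demo():
    """Return per-path outcomes, or None if openat2() is unavailable."""
    with tempfile.TemporaryDirectory() as tmp:
        jail = os.path.join(tmp, "jail")
        os.mkdir(jail)
        with open(os.path.join(jail, "ok.txt"), "w") as f:
            f.write("hi")
        os.symlink("/etc", os.path.join(jail, "escape"))
        dirfd = os.open(jail, os.O_RDONLY | os.O_DIRECTORY)
        try:
            results = {}
            for path in ("ok.txt", "../jail/ok.txt", "escape"):
                try:
                    os.close(openat2(dirfd, path, os.O_RDONLY, RESOLVE_BENEATH))
                    results[path] = "opened"
                except OSError as e:
                    if e.errno in (errno.ENOSYS, errno.EPERM):
                        return None  # old kernel or a seccomp filter
                    results[path] = errno.errorcode.get(e.errno, str(e.errno))
            return results
        finally:
            os.close(dirfd)
```

With RESOLVE_BENEATH, both the ".." path and the absolute symlink should be refused by the kernel rather than by string checks in user space. RESOLVE_IN_ROOT would instead treat `dirfd` as "/" for the duration of the lookup.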

@jamesh

Of course not. Passing the root directory and the working directory as file descriptors takes two descriptors, and openat2() only has one descriptor parameter.

The idea is that application mode code explicitly passes in all of the things that would normally be internally referenced from fields in the process structure, such as the root directory, the working directory, and the user credential set. And everything then just proceeds per the #Unix namei of old.

We've been frustratingly close to this for decades, and no-one has quite invented it.

With it, @swick's privileged server program opens the root directory, opens the working directory, opens/receives a credentials descriptor, and then just calls the syscall with the client-supplied paths. All of the TOCTOU problems with path normalization vanish. All of the multi-client parallel sete[ug]id and chdir synchronization problems vanish.

#filesystems

@swick A former, older colleague once told me that the big advantage of migrating from VMS to Unix was that it got very easy to open a file. I have never used VMS myself, though.
@swick great post! The sad part is that even after all those years chaseat() in systemd still gets relevant changes every few months, that non-trivially rearrange my PoV on file system interfacing. I.e. right now we are working on reinventing chaseat() around a new InodeRef structure that combines an fd *and* a path into one (together with some other fields) so that we don't lose the ability to write useful log messages (you really want a path for that) but can do the actual ops via fds...

@swick it's amazing how broken and insecure posix fs apis have been from day one and still are (e.g. there is no posix way to convert an O_PATH fd to a real one for regular files)...
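For context, the usual non-POSIX workaround on Linux is to re-open the descriptor through its /proc magic link. A small sketch, assuming /proc is mounted; `reopen` and `demo` are illustrative names, not library functions.

```python
import os
import tempfile


def reopen(path_fd: int, flags: int = os.O_RDONLY) -> int:
    """Turn an O_PATH descriptor into a usable one (Linux-specific)."""
    # Opening the magic link re-opens the inode the descriptor refers to,
    # so the original path is never resolved a second time.
    return os.open(f"/proc/self/fd/{path_fd}", flags | os.O_CLOEXEC)


def demo() -> bytes:
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"hello")
        tmp.flush()
        # An O_PATH descriptor only names the file; read() on it fails (EBADF).
        path_fd = os.open(tmp.name, os.O_PATH)
        try:
            real_fd = reopen(path_fd)
            try:
                return os.read(real_fd, 5)
            finally:
                os.close(real_fd)
        finally:
            os.close(path_fd)
```

The re-open goes through a normal permission check, so an O_PATH descriptor to a file the process may not read stays unreadable.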

And really sad that even modern programming language standard libraries always focus on the posix fs api, mostly ignoring the new stuff, -- rather than focussing on the newer stuff and then trying to retrofit the old stuff to work like the new stuff wherever possible.

@swick and it's really shameful that supposedly security minded programming language communities (rust...) don't grok that, and happily work with the guaranteed insecure traditional posix stuff instead of doing things better. I am pretty sure posix fs shenanigans are a bigger attack surface these days to gain privs than frickin memory unsafety, and focussing solely on memory stuff ignoring the fs stuff is just bad security engineering.

@pid_eins @swick This!!!
It's not just essential for security, but also dramatically increases robustness of the resulting application - I ran into the latter just last week (debugging data loss for a non-security-relevant app).

Even though I know about all of this, I still use the POSIX-like interfaces a lot because they're default in many languages and readily available and "it's not security relevant anyway". Until it is. Better defaults would be so nice!

I wouldn't say "happily".

Rust standard library folks are very well aware of the ideal of doing fd-based operations whenever possible, and we'd love to. Linux is doing great work on adding ways to do everything one might want to do using fds. However, we can't force people to run on exclusively modern Linux, as opposed to old Linux or other OSes. And it's much more challenging to design *portable* interfaces around fds without accepting capability limitations or lowest-common-denominator.

We could probably make an extremely capable interface, if we stuck most of it in `std::os::linux`, and had some of it fail if run on older Linux.
@josh @swick @pid_eins As far as I know, the cap-std crate (https://github.com/bytecodealliance/cap-std) does what is explained in the blog post, using an API that is close to the standard library. We use it a lot in bootc and related projects.
GitHub - bytecodealliance/cap-std: Capability-oriented version of the Rust standard library
@josh @swick yeah, i don't buy into that race to the bottom thinking. Designing stuff with the shittiest model in mind instead of the best is just awful engineering. Always figure out where you want to be, i.e. go for the summit, and then fill in the gaps/degrade gracefully where you have to on worse systems. But that's really not what rust is doing there. It's letting itself be held hostage by the worst system, and lets that heavily leak into its APIs...

@pid_eins Rust's std::fs and std::io were an MVP for the 1.0 release, and unfortunately mostly stayed like that.

Back then Rust still had to prove that the memory safety and data-race guarantees could even work, and that it could be a stable and backwards-compatible language, while having even bigger unfinished gaps in the language and libstd.

Now std::fs sticks out as the weak point, but Rust couldn't take moonshots in every aspect all at once. It would never ship.

@kornel @pid_eins I wonder whether it's still possible to fix that in Rust's standard library. One quick and likely silly idea is to change std::path::Path and std::path::PathBuf to also optionally include a file descriptor and prepopulate it on the first file-system-related syscall. That might resolve most of those TOCTOU issues. It would likely break other stuff horribly, so for now that's just a silly idea without much research behind it.

@weiznich Unfortunately the `Path` API has a no-alloc conversion from `&str`, so there's no room for a new field. Adding fd there would require stuffing it into the path, which seems hacky and could backfire.

The Path isn't good anyway. Can't even store \0 for C APIs nor UCS-2 for Windows. Useless for browser FileSystem APIs (WASM).

@kornel I think there are plans to change the various Range types in a similarly incompatible way with the next edition. It might be possible to do something similar here as well; at least that would give "us" the ability to change the internal layout in an incompatible way. Yes, this would cause a lot of churn, as old migrated code would then use something like std::edition_2024::path::Path instead of std::path::Path and likely get a deprecation warning for the old items at some point, but it doesn't seem impossible to change it. Especially given that the various std::fs functions all take generic arguments that require AsRef<Path> (not sure if we would get away with using the new path there, although I expect it should be fine as long as all currently existing variants are accepted there).
Those are obviously all unfinished quick ideas for what could be tried. Any of that would need a bunch of research first into whether it is wanted, and then into what exactly is feasible and what is not.
@pid_eins FWIW, I have a change for glnx_chase which adds a strategic callback for every path segment that gets resolved, so we can, for example, build the path without adding more complexity to glnx_chase itself. I also hinted at a new cross-platform API in GLib/Gio where we would want to have an opaque handle, which for posix would contain the fd, but it could also contain the path as well. So yeah, I agree that it's the right design, but I think it's something we should do on top of glnx_chase.
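To make the per-segment callback idea concrete, here is a heavily simplified sketch. It is not glnx_chase or chaseat(): the name `chase`, its signature, and the decision to refuse symlinks and ".." outright (instead of resolving them safely) are all assumptions made to keep the example short.

```python
import os
import stat
import tempfile
from typing import Callable, Optional


def chase(
    root_fd: int,
    path: str,
    on_segment: Optional[Callable[[str, int], None]] = None,
) -> int:
    """Resolve `path` below `root_fd` one component at a time.

    Returns an O_PATH descriptor for the final component. Leading and
    repeated slashes are ignored, so "/a/b" is treated as relative to root_fd.
    """
    cur = os.dup(root_fd)
    try:
        for name in path.split("/"):
            if name in ("", "."):
                continue
            if name == "..":
                raise PermissionError(f"'..' is not allowed in {path!r}")
            # O_NOFOLLOW with O_PATH yields the symlink itself, never its target.
            nxt = os.open(name, os.O_PATH | os.O_NOFOLLOW, dir_fd=cur)
            if stat.S_ISLNK(os.fstat(nxt).st_mode):
                os.close(nxt)
                raise PermissionError(f"symlink at segment {name!r}")
            os.close(cur)
            cur = nxt
            if on_segment is not None:
                # The caller can log, or rebuild a path for error messages.
                on_segment(name, cur)
        return cur
    except BaseException:
        os.close(cur)
        raise


def demo() -> str:
    with tempfile.TemporaryDirectory() as tmp:
        os.makedirs(os.path.join(tmp, "a", "b"))
        seen = []
        root = os.open(tmp, os.O_PATH | os.O_DIRECTORY)
        try:
            fd = chase(root, "a/b", lambda name, _fd: seen.append(name))
            os.close(fd)
        finally:
            os.close(root)
        return "/".join(seen)
```

Here the callback only collects names, which is roughly the "build the path for log messages while doing the real work via fds" split discussed in this thread.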

@swick Insightful post. Snapd has had a paranoid approach to fd security but even with that we did commit a few CVEs over the years.

I agree that with current kernel APIs it would be far easier to do the right thing. The openat2 and the new mount system calls are way better if you can depend on them.

Of what's still missing, I wish the kernel had an openat flag that does an atomic chown, and a similar feature for mkdirat.

Best regards!

@zygoon file an issue: https://github.com/uapi-group/kernel-features. It does help sometimes :)
GitHub - uapi-group/kernel-features: A collection of ideas for new kernel features

@swick thank you for pointing that out. I just filed https://github.com/uapi-group/kernel-features/issues/56

Please feel free to correct me if I'm wrong, but creating directory path elements safely is always racy if you want to configure them in a particular way and you are not implicitly creating them from a root/root process.

Race-free mkdirat + chown · Issue #56 · uapi-group/kernel-features

Any code that has a similar sequence of mkdirat and chown is racy. Code like this may exist in a privileged helper that runs with capabilities as the user. Ideally o int ret, subdirfd; ret = mkdira...
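A sketch of the mkdirat-then-chown sequence under discussion, written in Python rather than the C of the linked issue. The helper name `mkdir_owned` is made up. Doing the chown through a descriptor narrows the problem, but the window between the two steps stays open, which is the gap a kernel-side atomic operation would close.

```python
import os


def mkdir_owned(parent_fd: int, name: str, uid: int, gid: int,
                mode: int = 0o700) -> int:
    """Create `name` under `parent_fd`, chown it, and return a descriptor."""
    # Step 1: mkdirat(). The directory is created with the caller's ownership.
    os.mkdir(name, mode, dir_fd=parent_fd)
    # Race window: anyone who can write to the parent may rename or replace
    # `name` between the mkdir above and the open below.
    # Step 2: pin whatever is at `name` now, refusing symlinks and non-directories.
    fd = os.open(
        name,
        os.O_RDONLY | os.O_DIRECTORY | os.O_NOFOLLOW,
        dir_fd=parent_fd,
    )
    try:
        # Step 3: change ownership through the descriptor, never through a path.
        os.fchown(fd, uid, gid)
    except BaseException:
        os.close(fd)
        raise
    return fd
```

The returned descriptor refers to a real directory and not a symlink, but nothing here proves it is the directory created in step 1.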
@swick This was a very thought provoking blog post, particularly as I am currently developing an app that handles files. I think this problem should be mostly nonexistent in my app, as files are provided from other trusted applications, but I suppose it would be beneficial to protect against those trusted apps anyway just in case they get hacked. Therefore I will think about it as I am writing my code. Thanks!
@jessica If you don't have multiple processes with different capabilities you really don't have to care. But really, this should all be handled by the stdlib so you never even have to figure out if you have to care.
@swick as the world's biggest openat() propagandist (self-proclaimed) [1] i approve this message (super happy about the work, thanks!)
path.join Considered Harmful, or openat() All The Things - Home of Val Packett
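For readers who have not seen the openat() style, a small sketch of the difference from joining path strings: the directory is opened once, and children are then opened relative to that descriptor. `read_child` is a hypothetical helper.

```python
import os


def read_child(dir_fd: int, name: str) -> bytes:
    """Read a direct child of an already-open directory, openat()-style."""
    if "/" in name or name in ("", ".", ".."):
        raise ValueError(f"not a plain file name: {name!r}")
    # dir_fd makes this an openat() call; O_NOFOLLOW refuses a symlink child.
    fd = os.open(name, os.O_RDONLY | os.O_NOFOLLOW, dir_fd=dir_fd)
    with os.fdopen(fd, "rb") as f:
        return f.read()
```

Because the lookup starts from the descriptor, renaming or replacing any parent directory afterwards does not change which directory is searched, which a re-joined path string cannot guarantee.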

@swick I just landed a refresh to chaseat() in systemd so it takes separate root and directory file descriptors as part of the InodeRef prep work. The only annoying thing about that is you can't easily check if a directory file descriptor is a child of another directory file descriptor. You have to walk upwards to check which is just horribly slow.
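The upward walk mentioned here could look roughly like the following. This is an illustrative sketch, not systemd's code: `is_beneath` is a made-up name, it compares (st_dev, st_ino) pairs, it treats a directory as beneath itself, and it is only meaningful if nothing moves the directories during the walk.

```python
import os


def _ident(fd: int):
    st = os.fstat(fd)
    return (st.st_dev, st.st_ino)


def is_beneath(child_fd: int, ancestor_fd: int) -> bool:
    """Walk ".." from child_fd until ancestor_fd or the root is reached."""
    target = _ident(ancestor_fd)
    cur = os.dup(child_fd)
    try:
        while True:
            here = _ident(cur)
            if here == target:
                return True
            # One openat() plus one fstat() per level: this is the slow part.
            parent = os.open("..", os.O_PATH | os.O_DIRECTORY, dir_fd=cur)
            os.close(cur)
            cur = parent
            if _ident(cur) == here:
                # ".." of the root directory is the root itself.
                return False
    finally:
        os.close(cur)
```

The cost grows with directory depth, and a negative answer always requires walking all the way to the root.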
@daandemeyer saw it already :) interesting for sure and I'll keep an eye on it
@swick I propose to you the opposite question: how hard is it to not open a file? Magic links, special mounts, stale filesystems...
@swick great write up and thank you for your work!