https://stackoverflow.com/questions/33051108/how-to-get-around-the-linux-too-many-arguments-limit/33278482

> I have to pass 256Kb of text as an argument to the "aws sqs"

what, uhhh, what

> MAX_ARG_STRLEN is defined as 32 times the page size in linux/include/uapi/linux/binfmts.h:
> The default page size is 4 KB so you cannot pass arguments longer than 128 KB.
> I modified linux/include/uapi/linux/binfmts.h to #define MAX_ARG_STRLEN (PAGE_SIZE * 64), recompiled my kernel and now your code produces
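For reference, the arithmetic behind that limit, as a minimal sketch (assuming the common 4 KiB page size):

```python
# MAX_ARG_STRLEN is PAGE_SIZE * 32 in linux/include/uapi/linux/binfmts.h,
# and it bounds each *individual* argv/envp string, not the whole list.
PAGE_SIZE = 4096                      # assumption: the usual 4 KiB page size
MAX_ARG_STRLEN = PAGE_SIZE * 32       # 131072 bytes = 128 KiB

assert MAX_ARG_STRLEN == 128 * 1024
assert 256 * 1024 > MAX_ARG_STRLEN    # so a 256 KB argv[1] cannot fit
```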

casually patching the kernel to send a quarter megabyte as a *single* argument oh my god i'm laughing hard
@navi well in the early Rust for Linux days we hit this limit with passing kconfig options to rustc. Fun times
@kloenk as a single argument? this isn't the whole argument list, this is *just* argv[1]
@navi ah oh. Then we hit the other limit. Many many arguments. Way too many
@navi the kernel existed somewhere near the limit, and then I broke it by just adding O=build to my make flags (I like a separate build dir for the kernel)

@kloenk @navi Back when 128 kB was the limit for argv+envp, Google was hitting it too because they passed all the configuration for their whole software stack on the command line as --long-option=value switches.

Their solution? Compress the command line. So every binary started by ungzipping argv[1] and parsing it to get the configuration.

The person explaining this to me saw my horrified face, and said with the perfect Hide The Pain Harold smile: "a series of individually completely rational and reasonable decisions led to this." and I have been thinking a lot about it since.

@ska @kloenk @navi nah what the fuck 😭😭😭
@kitten @ska @navi if google does it we all are fine :p
@kloenk @kitten @navi tbh that was 11 years ago and I have no idea if they're still doing it. I suspect some Googlers were behind the push for Linux to drop the limit, and the whole tech staff breathed a collective sigh of relief when it happened.

@ska @kloenk @kitten @navi specifically, the arguments were compressed with gzip and then base64 encoded so that they could be passed with reasonable levels of escaping through ssh. This started about 20 years ago, and the kernels at the time were already modified to allow longer command lines.

Frankly, it's just another example of scaling: there's always a bottleneck, sometimes in surprising places.

The line about individual choices is perfect, too: there are always trade-offs to be made.
./.
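A hypothetical Python sketch of the scheme described above (nothing here is Google's actual code; the flag names and newline separator are made up for illustration):

```python
import base64
import gzip

# Made-up configuration switches standing in for a real stack's flags.
options = [f"--flag{i}=value{i}" for i in range(2000)]

# Sender side: join, gzip, base64. The base64 alphabet is plain ASCII,
# so the blob survives shell quoting and ssh without further escaping.
plain = "\n".join(options).encode()
blob = base64.b64encode(gzip.compress(plain, mtime=0)).decode("ascii")

# Receiver side: every binary starts by undoing the two layers.
recovered = gzip.decompress(base64.b64decode(blob)).decode().split("\n")
assert recovered == options
assert len(blob) < len(plain)   # compression outweighs base64's 4/3 overhead
```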

@ska @kloenk @kitten @navi The whole thing also illustrates neatly why "Google does X" or "Amazon does Y" or, especially, "the SRE book says Z" is useless. You aren't Google, and even Google isn't that particular segment of Google at that particular point in time (and infrastructure and history and scaling and pressures).

Make your own choices, and then revisit them when they no longer work for you

(And definitely tell the rest of us, so we can listen in mute horror... :))

@gabe @kloenk @kitten @navi Precisely, I disagree that it's just another example of scaling. Usually, when hitting a limit, Google *reconsiders its approach* and develops a new solution that scales better. Here, the opposite happened: a hack was added to the original solution in order to accommodate the limit. At some point the limit was going to be hit again, even with compressed arguments.

Maybe it was just a way to buy time while waiting for the real solution, i.e. Linux dropping the limit. But it's definitely not the example I think about when mentioning "Google scale" 😅

@ska @kloenk @kitten @navi Google has some odd traditions it still can't quite let go, "put it all on the command line!" is one of them. It does invent new stuff to fix issues until the next step function of scale, yes, but (even at Google) there's often just a bigger box.
@ska @navi @kitten @kloenk All to avoid a long-needed refactoring.
@ska @kloenk this broke me laughing what the fuck
@navi @ska same, can’t boost this enough
@ska @kloenk @navi What the hell did I just read oh my god, that is TERRIFYING. Yet that is also so ingenious that I don't know how to feel about it

-James
@thecatcollective @navi @kloenk "Brilliant and cursed" applies to way too much software, and I want the exact opposite of that - I want things that work dumbly, simply, elegantly, and that can be understood by mere mortals.
@ska @thecatcollective @navi isn’t an OS technically not writable in standard C, since the C spec defines some required things as UB? So yes, that would be nice, but I sometimes fear we might need new abstractions for some of those types of software we have
@kloenk Excuse the beginner question, but: If operating systems are not to be written in C due to the C spec defining some stuff as UB, how do kernels get away with it?

As far as I know, something being defined as UB means it may work, it may not work, it may do unintended stuff, but then how do they get around this?

-James
@thecatcollective I don’t remember anymore. I think it was something like the C spec not defining some forms of casting or something. The spec is only a document that says "please do it like this". But if all compilers just decide to do it the same way even when the spec does not define it that way, it works out in practice

@kloenk @thecatcollective even if only the compiler used for the specific OS kernel does it in a way the source of that kernel expects, it's fine(-ish) - no "all compilers agree" needed

UB really just gives compilers the freedom to do whatever the fuck they want

And while I don't have anything specific to say, from doing kernel dev myself, I'm pretty sure there is some UB I depend on (and lots where I could probably just write better code, but that's a different topic xD)

@thecatcollective @kloenk for the longest time the linux kernel was only buildable with gcc because they relied on undefined behaviour that is defined in gcc (plus obviously gcc extensions to the c language)

it is possible to write an OS with only standard, defined, C though -- but what you do is pull all the ultra low level logic that can't be done in a defined manner (very few things are like that actually) and write that in platform specific assembly

often said logic needs to be assembly anyway so it's all good
@navi @thecatcollective there are still some configs that don’t build with clang (still didn’t get around to sending an issue report). Enabling the CONFIG_MATOM option results (in my case at least) in clang exiting with some weird "too many registers" error

@thecatcollective @kloenk operating systems commonly use features that are provided by the specific compiler(s) that they're developed with, but that are not part of the C language standard. OSes also commonly have small pieces of low level functionality implemented directly in assembly language for their target platforms.

See, for example, https://maskray.me/blog/2024-05-12-exploring-gnu-extensions-in-linux-kernel

@jpab @kloenk We will take a look at this, thank you!

-James

@kloenk @thecatcollective @navi I don't think the permissiveness of C has anything to do with the beatitude (in the NetHack sense) of a piece of code. C is underspecified, yes, because it's old and used for a lot of various things including kernels and drivers and stuff where it's essentially used as a glorified assembly language.
OS coding in C is generally very pedestrian, not anything brilliant at all, and not especially cursed either; it just... is.

No, I am referring to high-brained solutions to problems you would never have had if your design wasn't made by and FOR high-brained programmers.

@kloenk @navi @thecatcollective @ska You essentially require that unless you start implementing a minimal runtime in microcode like some Lisp Machines and Java Machines did.

Regardless of the language, hardware-specific details will have to be handled as compiler intrinsics (or assembly or machine code) if the hardware isn't literally made to just run the language with no further setup.
@ska @navi And I guess null bytes in gzipped form must have been funny to handle

@lanodan @navi I don't think that's necessarily a problem. argv[1] doesn't have to be a string, it's a character array. Null is used as a separator when the kernel puts the whole argv on the stack, yes, but argv[1] is still just a pointer and if you know you're expecting a blob and have a way to know where the blob ends, it should work, I think.

Or they could have been base64-encoding the gzip for all I know, it's probably still smaller than the uncompressed argv.

(Edit: typo)

@ska @lanodan @navi Nope, it's a string. execve will stop processing it at the first null byte.

The execve syscall essentially acts as a scatter-gather (in this case, gather) operation, running over the argv and environ pointer arrays in user memory and performing a string-copy-from-user operation for each one to build the object that will be prepopulated into the new process image.
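This is easy to check: raw gzip output routinely contains NUL bytes (with mtime=0, the four-byte MTIME field in the gzip header alone is all zeros), so it would be truncated at the first NUL if passed raw in argv, while base64 output is NUL-free. A small sketch:

```python
import base64
import gzip

# A raw gzip blob contains NUL bytes; execve() copies each argv string
# only up to its first NUL, so the blob would be silently truncated.
raw = gzip.compress(b"--some-flag=value " * 100, mtime=0)
assert b"\x00" in raw

# base64's alphabet contains no NUL, so the encoded blob passes intact.
encoded = base64.b64encode(raw)
assert b"\x00" not in encoded
```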

@ska @lanodan @navi And the since-abolished 128k limit was a very good thing because it put a bound on the burden the kernel could be asked to do on behalf of userspace in one uninterruptible go, and on spraying attacks you could do to suid binaries.

It was probably removed because someone doing the utterly stupid thing Google did here demanded it.

@dalias @ska @lanodan @navi

One of those someones was Rob Pike in 2004.

https://interviews.slashdot.org/story/04/10/18/1153211/rob-pike-responds

And the underlying kernel change happened in 2007.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b6a2fea39318e43fee84fa7b0b90d68bed92d2ba

Interestingly, modern #FreeBSD has a sysctl() limit (kern.ps_arg_cache_limit) on how large processes can resize their argument block and environment string block and still have it show up in the ps command.

I just raised mine to 768 bytes, coincidentally. I could have got away with just 512, I think.

#Linux


@JdeBP @dalias @ska @lanodan @navi Pike's diplomatic non-answer of "Comparing patents to nuclear weapons is a bit extreme" gave me a good chuckle, lol

@dalias @lanodan @navi Oh? It's a shame, then, that there isn't an execae() primitive that takes a char array as argv and a char array as envp and splits them following null bytes.

Because I have to do that all the time in execline and I hate to do it just to follow the API when the kernel is going to do the exact same right afterwards.

@ska @lanodan @navi Having them packed one-after-another in a single array is an implementation detail of the ELF entry point. It's not a programming interface, so it makes sense that there's no syscall to do that. Even if there were, the kernel side would have to do validation and building the pointer arrays, so it really wouldn't help make anything more efficient.
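For illustration, the split such a hypothetical execae() (or execline itself) has to perform is trivial; a sketch, with a made-up helper name:

```python
# Split one NUL-packed block (the one-after-another layout described
# above) into the usual list of argument strings. Hypothetical helper,
# mirroring what execline does in C before calling execve().
def split_packed(block: bytes) -> list[bytes]:
    parts = block.split(b"\x00")
    # Each string is NUL-terminated, so a well-formed block ends with a
    # trailing NUL that yields one empty tail element; drop it.
    if parts and parts[-1] == b"":
        parts.pop()
    return parts

packed = b"prog\x00--verbose\x00input.txt\x00"
assert split_packed(packed) == [b"prog", b"--verbose", b"input.txt"]
```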
@ska @navi @lanodan @dalias Why not just pass the data in through stdin?
Laurent Bercot (@[email protected])

@[email protected] @[email protected] @[email protected] Sending the configuration to stdin is more difficult than storing it in a config file, because you have to have a process writing to the daemon's stdin. It's easier for the cluster manager to scp the config and give "-f configfile" to the daemon's command line. The point is that they didn't even want to scp a config file. The agent was just reading and running a command line and they didn't want to modify it. That, I think, is the more questionable design decision.

Treehouse Mastodon
@ska @lanodan @navi I mean, you don't need to go all the way to base64, COBS would suffice
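For the curious, a minimal COBS (Consistent Overhead Byte Stuffing) sketch: it removes NUL bytes with at most one byte of overhead per 254 bytes of payload, versus base64's one byte per three.

```python
def cobs_encode(data: bytes) -> bytes:
    # Replace each run of up to 254 non-zero bytes with a length prefix;
    # zeros are implied at block boundaries, so the output has no NULs.
    out = bytearray()
    i, n = 0, len(data)
    while True:
        j = i
        while j < n and j - i < 254 and data[j] != 0:
            j += 1
        out.append(j - i + 1)          # code byte: block length + 1
        out += data[i:j]
        if j >= n:
            break
        if data[j] == 0:
            j += 1                     # consume the zero we just encoded
            if j >= n:                 # trailing zero needs an empty block
                out.append(1)
                break
        i = j
    return bytes(out)

def cobs_decode(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        code = data[i]
        out += data[i + 1:i + code]
        i += code
        if code < 255 and i < len(data):
            out.append(0)              # a short block implies a zero
    return bytes(out)

blob = b"\x1f\x8b\x08\x00\x00\x00\x00\x00gzip-ish"
assert b"\x00" not in cobs_encode(blob)
assert cobs_decode(cobs_encode(blob)) == blob
```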

@lanodan @ska @navi

Even on Linux-based operating systems, one can get strnvisx() and strunvis(), which solves that problem, should one choose to have it in the first place.

https://libbsd.freedesktop.org/wiki/

#Linux #BSD #vis #unvis #CommandLines

@ska @kloenk @navi Narrator: they were not individually completely rational and reasonable decisions.
@dalias @ska @kloenk @navi individually locally rational decisions according to incentives may not be globally rational
@ska @kloenk @navi have had to work with googlers on build tooling in depth maybe 5 years ago and this explains some things about how they work lmao

@hipsterelectron @kloenk @navi The main insight I've acquired about how Googlers think is that they're used to working at Google scale, which is only relevant in FAANG companies, but when you're at Google everything is designed to make you forget that the outside world exists and is important; so your mind gets used to thinking *always larger*.

If you're designing software for a single machine, or a rack of servers, Googlers won't really understand you and you'll talk right past each other. If it doesn't scale to thousands of machines (or even millions), it has no value to them.

@ska @hipsterelectron @navi depends. I know some of the kernel devs at google (e.g. working on binder). They also have mobile “sized” software
@kloenk @hipsterelectron @navi Well the Android team is different, that's for sure. I haven't had any interactions with them.
@ska @kloenk @hipsterelectron @navi I worked at GOOG for 14 years. One of my projects was writing code for an 8051 with 2K of EEPROM and 128 bytes of RAM. #NotAllGooglers 🤣
@davidlsparks @kloenk @hipsterelectron @navi I envy you, grats for landing that project! My domain expertise was a bright red neon sign flashing "Put this guy on Borg or on some embedded stuff", but they chose to put me in Web search instead. 🤷 At least I had a very formative and educational year surrounded with incredible people.
@ska @kloenk @hipsterelectron @navi I hear you. I was fortunate to be hired as an embedded with low power expertise. "There are dozens of us." 🤣
@ska @kloenk @navi
I love one of the first rational decisions here: command-line arguments in scripts should be long-form to minimize reader confusion. Things go off the rails well before you hit 128kB of args though. You need to throw that in a config file or something, folks.

@c0dec0dec0de @kloenk @navi Actually, *that* particular decision made sense: when you have a huge software stack with configuration switches, you have to use long options because you just don't have enough characters for short options. And when you have a cluster manager running a command line on thousands of machines, you don't want to have to copy a config file, it's good to have the config on the command line.

The questionable decisions were upstream (is it good to have a whole software stack with configuration switches in every binary? hmmm) and downstream (what to do if we hit the command line limit), but *that one* was sound. 😅

@ska @c0dec0dec0de @kloenk

i would honestly take the configuration from stdin at that point, and it can even look similar to the bazillion flags in a script by using here-doc

wouldn't work if they need stdin for something else, but i kinda doubt that a program that has this many flags actually uses stdin directly

@navi @kloenk @c0dec0dec0de Sending the configuration to stdin is more difficult than storing it in a config file, because you have to have a process writing to the daemon's stdin. It's easier for the cluster manager to scp the config and give "-f configfile" to the daemon's command line.

The point is that they didn't even want to scp a config file. The agent was just reading and running a command line and they didn't want to modify it. That, I think, is the more questionable design decision.

@ska @navi @kloenk @c0dec0dec0de also Google's build system caches process executions by argv. It does also have checksummed file inputs, but it's more effort to provide a config file, since it needs to be paired with the command line every time, and their tool's API is not as good as Pants (horrible to use). So this is more difficult because their systems are not as good as they pretend, basically. That's in the case of build processes at least
@ska @navi @kloenk @c0dec0dec0de Putting the config in the command line *is* copying a config file everywhere, they just already implemented a protocol to do that.
@wollman @navi @kloenk @c0dec0dec0de Yup. The thing is, the way files are managed on Google servers is... peculiar, like everything they do, because everything needs to scale so immensely; I don't remember if servers even had local storage space. Can't give many more details without risking 1. being inaccurate and 2. going into NDA-protected territory, but it's likely that "accessing a file in the filesystem before having initialized the part of the software stack that does files the Google way" wasn't a trivial task at all. Whereas even they could not, despite their best efforts, make the command line more complicated than it is, so it was a good bootstrapping medium.
@ska @wollman @navi @kloenk @c0dec0dec0de before borg (circa Pentium II), it was common for web index servers to have a HDD failure, but keep running, because once you've loaded everything into memory, you don't need a disk.
@trouble @wollman @navi @kloenk @c0dec0dec0de even after Borg, IIRC, all the index servers had incredible amounts of RAM because serving from cache is so much faster than anything else.
@ska @navi @kloenk @c0dec0dec0de
"The agent" here would be the one writing to the daemon's stdin, no?