@Drwave
This was a while ago, so forgive my foggy memory (I wrote a chapter of a popular Linux book that described this at the time):
I believe Linux started with a 2 GiB / 2 GiB split and a direct map of all physical memory into the kernel’s half of the VA space. This had a few advantages:
- You can tell user and kernel addresses apart just by looking at the top bit.
- You can always pin a page and then turn its physical address into a kernel VA that is valid in any context, not just in threads associated with the current process.
By the early 2000s, machines with more than 2 GiB of physical memory were affordable for places running Linux, so the direct map had to go away. I think this work started with PAE. PAE meant you could have more physical memory than virtual address space, which made things much worse: 32-bit kernels with PAE couldn’t keep a complete direct map at all and had to create mappings into the kernel’s address space on demand (what Linux calls highmem, accessed via kmap). Gradually this became necessary for everyone, though I think PAE was a compile-time option for the kernel for a while.
User processes grew their memory requirements faster than the kernel did, so there was a configuration option to choose where the split was. I’m not sure when Linux went to a 3 GiB userspace, but by the time you have enough physical memory for this to make sense you’re already past the point where a complete direct map is feasible, so moving the split has no downsides. I think it was also possible to arrange the split the other way, which was useful for file servers and similar machines where you wanted a lot of memory for the buffer cache and very little for userspace.
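That knob still exists on 32-bit x86 as the VMSPLIT Kconfig choice, which picks where the kernel’s half begins. Paraphrased from memory of the Kconfig, so treat the exact values as a sketch:

```
choice "Memory split"     # arch/x86/Kconfig, paraphrased
  VMSPLIT_3G  ->  PAGE_OFFSET = 0xC0000000   # 3 GiB user / 1 GiB kernel (default)
  VMSPLIT_2G  ->  PAGE_OFFSET = 0x80000000   # 2 GiB user / 2 GiB kernel
  VMSPLIT_1G  ->  PAGE_OFFSET = 0x40000000   # 1 GiB user / 3 GiB kernel
endchoice
```

The last option is the “split the other way” case: a big kernel half for the buffer cache at the cost of a small userspace.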
Red Hat went further. They had kernel builds with a 4 GiB / 4 GiB split: userspace got almost the full 32-bit address space, and the syscall code switched to an entirely different page table on every transition (the Meltdown mitigation later required a similar mechanism; sensible ISAs have separate page-table base registers for user and kernel space and make this easy, but sadly x86 is not one of them). This was slow, because every system call that took a pointer required the kernel to look up the translation in the userspace page tables, map the relevant pages into its own address space, and then copy. Even with caching, this was painful. Oh, and the trick of self-mapped page tables can’t be used in this configuration (and was covered by a Microsoft patent at the time).
64-bit kernels are much nicer for running 32-bit userspace because they can give the userspace process the entire low 4 GiB and still map it into the kernel’s address space for use by top-half threads. This leads to the slightly odd situation that some 32-bit userspace programs will run successfully only on 64-bit kernels.