ELF Parser Thread (since this is what I'm doing tonight -- my partner is out of town picking up a fancy chrome lamp she found [pic related] in Seattle, and I am a glutton for solo punishment). Let's take a look at one of the genuinely better file formats out there!
ELF is pretty old, which is good and bad. Good because it means a lot of stuff is standardized by implementation, and most of the implementation is open source and easy to read (seriously, https://raw.githubusercontent.com/novafacing/elf/main/specifications/glibc/elf.h is just a list of numbers). Bad because...a lot of stuff is standardized by implementation, and the kernel is full of comments like https://github.com/torvalds/linux/blob/480e035fc4c714fb5536e64ab9db04fedc89e910/arch/powerpc/include/asm/elf.h#L68 which is, indeed, not mentioned at all in the spec.

Unlike the first time I (incompletely) implemented ELF handling by reading readelf.c and linux/arch/*/include/asm/elf.h, this time I'm reading the specs. All of the specs (at least for all the architectures I care about; I'll let friendly PR providers add their own stuff later but I'm not gonna).

It turns out Wikipedia (https://en.wikipedia.org/wiki/Executable_and_Linkable_Format#Specifications), the uclibc page (https://uclibc.org/specs.html), and especially the kernel page (https://refspecs.linuxfoundation.org/) refer to out-of-date documentation that doesn't match the specification as implemented. Luckily, for nearly all of these specifications that don't match, it's just because the spec has been updated and the reference hasn't. This is most noticeable with the most actively developed architectures: x86_64, RISC-V, and ARM. So it's just a matter of tracking down these new specs, which I've done along with all the old, not-likely-to-change specs, and gathered them all here: https://github.com/novafacing/elf/tree/main/specifications.

Working from up-to-date specs clears up a lot of things compared to working purely from Linux and glibc code. For one, the code generally doesn't include enough comments to tell you why something is the way it is, but the ARM documentation will spend several paragraphs doing so. Genuinely helpful!

Ok, with the "how am I getting my information" question answered (a combination of up-to-date docs and, yeah, glibc+linux code because it is not precisely gospel but is so close to it that we may as well sing it), the first decision in writing an implementation of the spec is how to handle the object file. Computers are pretty fast now, and I don't intend this to be used in an operating system or anything that's stupendously performance sensitive. I tried, well, see for yourself:

pub struct ElfHeaderVersion {
    pub version: u64,
}

pub trait ElfHeaderVersionKind {
    fn version(&self) -> ElfHeaderVersion;
}

pub struct Elf32HeaderVersion(Elf32Half);

impl ElfHeaderVersionKind for Elf32HeaderVersion { /* ...boring... */ }

Yucky!!!! Not good. Obviously the goal is:

  • Parse independent of bitwidth and endianness (the two variables that affect how the file is read, excluding platform and architecture specific extensions)
  • Write independent of bitwidth and endianness
  • Use independent of bitwidth and endianness. This one is tricky, because if we want to emulate this parsed ELF eventually, we will need to be able to know the actual value, not the independent value. Imagine the entrypoint is set to e_version and e_version is treated as code. Not really possible if we don't preserve the information. Luckily, this problem reduces to writing independently. Before using anything, we "write it back" to its original form.

If that's all we need, why not just parse everything into Elf64 with some metadata so we can convert it back later? I'm going to do this, because otherwise it's a mess, and if I regret the choice later, I'll just change the implementation because it's Rust and we can do that.
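
A minimal sketch of the "widen to 64 bits, keep metadata, write it back" idea. The names here (`WideAddr`, `Class`, `Encoding`, `take`) are hypothetical and made up for illustration, not taken from the actual crate; only the from/to byte conversions are standard library calls.

```rust
/// ELF class (bit width), as recorded in e_ident[EI_CLASS].
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Class {
    Elf32,
    Elf64,
}

/// ELF data encoding (byte order), as recorded in e_ident[EI_DATA].
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Encoding {
    Little,
    Big,
}

/// Hypothetical widened address: always held as a u64, plus the metadata
/// needed to reproduce the original on-disk bytes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct WideAddr {
    pub value: u64,
    pub class: Class,
    pub encoding: Encoding,
}

/// Copy the first N bytes of `bytes` into an array, or None if too short.
fn take<const N: usize>(bytes: &[u8]) -> Option<[u8; N]> {
    let mut out = [0u8; N];
    out.copy_from_slice(bytes.get(..N)?);
    Some(out)
}

impl WideAddr {
    /// Read an address from the front of `bytes` using the given class and
    /// byte order, widening 32-bit values to u64.
    pub fn parse(bytes: &[u8], class: Class, encoding: Encoding) -> Option<Self> {
        let value = match (class, encoding) {
            (Class::Elf32, Encoding::Little) => u64::from(u32::from_le_bytes(take(bytes)?)),
            (Class::Elf32, Encoding::Big) => u64::from(u32::from_be_bytes(take(bytes)?)),
            (Class::Elf64, Encoding::Little) => u64::from_le_bytes(take(bytes)?),
            (Class::Elf64, Encoding::Big) => u64::from_be_bytes(take(bytes)?),
        };
        Some(Self { value, class, encoding })
    }

    /// "Write it back": emit the value in its original width and byte order.
    /// The `as u32` cast is lossless for values that came from `parse`.
    pub fn write(&self) -> Vec<u8> {
        match (self.class, self.encoding) {
            (Class::Elf32, Encoding::Little) => (self.value as u32).to_le_bytes().to_vec(),
            (Class::Elf32, Encoding::Big) => (self.value as u32).to_be_bytes().to_vec(),
            (Class::Elf64, Encoding::Little) => self.value.to_le_bytes().to_vec(),
            (Class::Elf64, Encoding::Big) => self.value.to_be_bytes().to_vec(),
        }
    }
}
```

The cost of this shape is that every value drags its class and byte order around at runtime, and every read and write has to match on them.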

By the way, I'm aware of the goblin (https://docs.rs/goblin/latest/goblin/) project. It's great and I use it in the fuzzer I maintain at work, but I need more and, more importantly, I need to do it myself for no reason other than I want to. Thanks goblin, for solving this problem so this can be a side project instead of a paid project, which would make it way less fun.

A surprising thing you'll notice after reading the various specs and code is that there is way less architecture-specific stuff than you would expect. There's a lot, but it's not like every vendor defines an extension twice the size of the original spec like with PDF.

It's mostly relocation-specific stuff and flags.

Ok, unfortunately I have nerd sniped myself. As soon as I pasted the snippet above I realized there is actually a way more legit way to do this because, as I mentioned earlier, ELF is a pretty good format. The only abstraction is over bitwidth and byte ordering, and we know them up front (and if we don't, we can guess pretty accurately). So... behold:

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
/// An address in an ELF file. Represented as 32 bits for class 32 and 64 bits for class 64.
pub struct ElfAddress<const EC: u8, const ED: u8>(pub u64);

A happy medium: now we can do the entire decode from the top with the right shape. This has one unfortunate side effect -- if we implement a best-effort mode and it turns out at any point that we guessed wrong, we'll need to start from the top. Luckily, even HUGE ELFs are under a few hundred MB, and this is a factor of 2, not n, so I think it's fine.
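
Here is a rough sketch of what decoding "from the top with the right shape" could look like with that type. The `decode` method, the `SIZE` constant, and the `entry_point` helper are my assumptions for illustration, not the crate's actual API. The numeric constants and the `e_entry` offset (24 in both classes) come from the ELF spec.

```rust
// Values for e_ident[EI_CLASS] and e_ident[EI_DATA], per the ELF spec.
pub const ELFCLASS32: u8 = 1;
pub const ELFCLASS64: u8 = 2;
pub const ELFDATA2LSB: u8 = 1;
pub const ELFDATA2MSB: u8 = 2;
const EI_CLASS: usize = 4;
const EI_DATA: usize = 5;

/// An address in an ELF file, tagged at the type level with class and data encoding.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct ElfAddress<const EC: u8, const ED: u8>(pub u64);

impl<const EC: u8, const ED: u8> ElfAddress<EC, ED> {
    /// On-disk size in bytes (hypothetical helper): 4 for class 32, otherwise 8.
    pub const SIZE: usize = if EC == ELFCLASS32 { 4 } else { 8 };

    /// Decode from the front of `bytes`; None if the input is too short.
    pub fn decode(bytes: &[u8]) -> Option<Self> {
        let raw = bytes.get(..Self::SIZE)?;
        let mut buf = [0u8; 8];
        if ED == ELFDATA2MSB {
            // Big-endian: the value occupies the low-order (trailing) bytes.
            buf[8 - Self::SIZE..].copy_from_slice(raw);
            Some(Self(u64::from_be_bytes(buf)))
        } else {
            // Little-endian: the value occupies the leading bytes.
            buf[..Self::SIZE].copy_from_slice(raw);
            Some(Self(u64::from_le_bytes(buf)))
        }
    }
}

/// Hypothetical dispatcher: read class and encoding from e_ident once, then
/// pick the matching monomorphized decoder for e_entry.
pub fn entry_point(file: &[u8]) -> Option<u64> {
    const E_ENTRY: usize = 24; // e_ident(16) + e_type(2) + e_machine(2) + e_version(4)
    if !file.starts_with(b"\x7fELF") {
        return None;
    }
    let field = file.get(E_ENTRY..)?;
    match (*file.get(EI_CLASS)?, *file.get(EI_DATA)?) {
        (ELFCLASS32, ELFDATA2LSB) => ElfAddress::<ELFCLASS32, ELFDATA2LSB>::decode(field).map(|a| a.0),
        (ELFCLASS32, ELFDATA2MSB) => ElfAddress::<ELFCLASS32, ELFDATA2MSB>::decode(field).map(|a| a.0),
        (ELFCLASS64, ELFDATA2LSB) => ElfAddress::<ELFCLASS64, ELFDATA2LSB>::decode(field).map(|a| a.0),
        (ELFCLASS64, ELFDATA2MSB) => ElfAddress::<ELFCLASS64, ELFDATA2MSB>::decode(field).map(|a| a.0),
        _ => None, // A best-effort mode would guess here instead of giving up.
    }
}
```

The runtime match happens exactly once, at the top. Everything below it is compiled per (class, encoding) pair, which is also why a wrong guess means restarting the decode rather than patching it up midway.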

Huge elves? Why not, might be good for SEO.