Mastodawn

David Chisnall (*Now with 50% more sarcasm!*)Apr 25

When we originally created #CHERIoT, we added an instruction to make accessing globals uniform with respect to the rest of RISC-V. We fairly quickly realised it wasn’t a great design, but the toolchain changes required to eliminate it took a while.

We’ve finally done it and I spent a bit of time this week writing up our journey. Removing this instruction is something we always planned to do before rebasing on the upcoming RV32YE base.

#CHERI #RISCV

Removing the AUICGP instruction

Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away – Antoine de Saint-Exupéry

CHERIoT Platform

CHERIoT Platform Apr 24

Removing the AUICGP instruction

Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away

– Antoine de Saint-Exupéry

In the last week or so, we’ve started landing patches in our compiler to remove the auicgp instruction. This is something that we’d hoped to do for a while because we want to remove the instruction entirely in CHERIoT v2 (to be based on the upcoming RV32YE standard). This post will explain why we had it, why we wanted to get rid of it, and how we’ve removed the need for it.

What is auicgp?

In CHERIoT, every global is either accessed relative to the program counter (PC) for code or immutable globals, or the global pointer (GP) for read-write globals. On RISC-V, PC-relative addressing is done via the auipc instruction, which takes a 20-bit immediate, left shifts it by 12, and adds it to the PC. This is used paired with a load, store, or add (if you just want to materialise the address as a pointer) to fill in the remaining bits.

For symmetry with base RISC-V (more detail on this later), we introduced a similar instruction that added a shifted immediate to the global pointer: add upper immediate to (capability) global pointer (auicgp). Originally, because we’re based on RV32E, we used one of the bits in the destination-register field to differentiate this from auipcc instruction, so it fitted into some spare encoding space.

In the initial public release, we changed this because RISC-V promises never to use that bit of encoding space for standard extensions and we hoped to standardise it. Since then, the auicgp instruction has been the least popular bit of the CHERIoT ISA, for a variety of reasons.

First, it’s an enormous instruction. With a register target and a 20-bit immediate, it uses a full major opcode. That’s 1/128 of the total encoding space. In practice, it’s worse than that, because the RISC-V encoding requires 32-bit instructions to start 11 (01, 10, and 00 are reserved for 16-bit encodings), it’s actually 1/32 of the space available for 32-bit instructions. That’s a lot to pay, though it might be worth it if the instruction were used a lot.

Unfortunately, auicgp is very rarely used. Our ABI points the global pointer to the middle of the globals region, which means that we can use the entire 4 KiB range of the 12-bit immediate in loads, stores, and adds. On RISC-V, the normal model is for the compiler to emit a long instruction sequence for operations like global accesses and then the linker to relax this by deleting instructions. We followed this pattern for auicgp, with the linker deleting it if the immediate is 0. As a result, almost all of the auicgp instructions were deleted. Our original linker version didn’t implement support for relaxations (upstream LLD didn’t support them for RISC-V when we started working on CHERIoT!) and so we weren’t initially sure how feasible deleting them would be, but we’ve had linker relaxations working for a few years now and they work very well.

The RTOS test suite contains a few auicgps, but only because we’ve intentionally written tests to make sure that large globals work. In most firmware images, there are no instances of the instruction that survive relaxation. So we’re using 1/32 of the encoding space for 32-bit instructions for an instruction that, to a first approximation, never gets used.

If that isn’t bad enough, it’s also annoying for simple pipelines to implement. RISC-V is designed so that the source and destination registers are always in the same place in instructions. Once you move to long pipelines, have register rename, or extensions with multiple register files (such as floating-point or vector extensions) then this is of little or no benefit. But CHERIoT targets the kind of microcontroller design that RISC-V overfitted for. Simple in-order pipelines find it useful to do register fetch early. The auicgp instruction is the only one that has fetches from the general-purpose register file but does not have the registers in the correct location. This adds extra multiplexing that, on a short in-order pipeline such as CHERIoT Ibex, impacts the critical-path length.

So, unfortunately, there is no situation in which this instruction is a good idea, except that it made it easy to create dense code via largely orthogonal code paths. To understand why we incorporated it, you need to understand how RISC-V normally generates pointers to globals.

Why did we have auicgp?

Let’s start with the simple case in RISC-V, of a static binary (how you’d normally compile embedded firmware) accessing a global. A simple sequence loading an int from a global called x would look like this:

	lui     a0, %hi(x)
	lw      a0, %lo(x)(a0)

This has two relocations. The load upper immediate instruction will materialise the top 20 bits of the address via the hi relocation, then the load will materialise the low 12 bits in its immediate from the lo relocation. This is very efficient, but doesn’t work on a CHERI system because you can’t just make up an integer address and use it for loads and stores.

That’s not a big problem, because it’s the same issue that position-independent code needs to handle, and RISC-V does this with the same number of instructions. When you’re compiling position-independent code, RISC-V will instead generate the following sequence:

.Lpcrel_hi0:
	auipc   a0, %pcrel_hi(x)
	lw      a0, %pcrel_lo(.Lpcrel_hi0)(a0)

Again, this materialises the address in two instructions, including folding half of the address-materialisation into the load. The difference is that the high bits are not created as an absolute address, they are created by adding a value to the current PC value. The %pcrel_lo relocation looks a bit odd because it refers to the address of the auipc instruction, not the address of x. This is because of limitations in ELF relocations. The linker needs to know both the address of the auipc instruction (which doesn’t have to be next to the lw, and might not be if the lw is in a loop) and of the target, but ELF relocations can’t express both of these. Fortunately, the linker can work around this by looking up the relocations that apply to the address of the auipc instruction to find the target.

On CHERIoT, accessing read-only globals works in exactly the same way. They are stored within the bounds of the program counter for the current compartment (or library) and so we can compute a displacement and load. Read-write globals are instead stored within a region reachable via the global pointer. These capabilities don’t overlap, and they have distinct permissions.

This is where we encounter the first problem. Most of the time, the compiler knows whether a global is read-only or not and so knows whether it needs to emit a sequence relative to the program counter or the global pointer. This is not universally true. In particular, C inherited a notion of ‘common linkage’ from Fortran and this makes it possible for different compilation units to disagree on whether something is read-only and leaves it for the linker to figure it out.

For read-only globals, the sequence looks like this (note: the ct. prefix is the official vendor prefix for CHERIoT instructions):

.LBB0_1:
	ct.auipcc    a0, %cheriot_compartment_hi(x)
	ct.clw       a0, %cheriot_compartment_lo_i(.LBB0_1)(a0)

But for read-write globals, it looks like this:

.LBB0_1:
	ct.auicgp   a0, %cheriot_compartment_hi(x)
	ct.clw      a0, %cheriot_compartment_lo_i(.LBB0_1)(a0)

This hopefully makes it clear how the auicgp instruction is useful. These two sequences look identical and so the linker can rewrite one to the other by simply rewriting the opcode of the first instruction. If the compiler tries to do PC-relative addressing for a read-write global, the compiler can turn the auipcc into an auicgp, or vice versa.

For read-write globals, the linker can also do relaxation. In the (common, almost universal) case that the address of x is within the displacement from the global pointer of a single instruction, the second sequence becomes this:

	ct.clw	a0, {displacement of x from the global pointer}(c)

Note that this is now fewer instructions than the baseline RISC-V case. We hit this in a lot of cases, because a lot of global accesses are simple reads and writes of scalar values. That’s not universally true. If the address of the global escapes the analyses that the compiler uses to prove that a dereference is in-bounds (which are currently fairly naïve, but will improve over time), then we must materialise a full capability.

This is why it’s important to design a CHERI ABI along with a threat model. Our threat model requires that any compartment be able to enforce a list of memory-safety properties against untrusted code, but treats things like memory safety for stack or global variables within a compartment as defence-in-depth properties. Code within a compartment is allowed to access all globals for that compartment and so the threat model says it’s fine to make the compiler responsible for enforcing those bounds. But when a pointer to a global is passed to another compartment (possibly indirectly via another compilation unit, maybe in a different language, in the same compartment) then the compiler taking the address of that global must be able to apply bounds and permissions.

If we are taking the address of x, the compiler will generate a sequence like this:

.LBB0_1:
	ct.auicgp       a0, %cheriot_compartment_hi(x)
	ct.cincoffset   a0, a0, %cheriot_compartment_lo_i(.LBB0_1)
	ct.csetbounds   a0, a0, %cheriot_compartment_size(x)

Note that now the second part of the address isn’t folded into the load instruction. This doesn’t hurt code size as much as you might think, because now we’re likely to do offset-zero loads and stores on the result and RISC-V lets us express these as 16-bit instructions.

The new instruction is applying the bounds. As before, the linker will normally relax away the first instruction, so we end up with this sequence for taking the address of a global:

	ct.cincoffset   a0, a0, {offset of x}(gp)
	ct.csetbounds   a0, a0, {size of x}(x)

Again, this is the same number of instructions as baseline RISC-V, but now with bounds applied. For comparison, here is the RISC-V version:

	auipc   a0, %pcrel_hi(x)
	addi    a0, a0, %pcrel_lo(.Lpcrel_hi0)

There is one annoying case though: what happens if the bounds are too large for an immediate? Here, we need an extra instruction (and register) to hold the immediate. This means that our worst case (large bounds, address taken or not provably in bounds) is five instructions (if it takes two instructions to materialise the immediate for the bounds), much worse than our best case of one.

The approach taken on the ABI for big CHERI is to use an indirection layer, still referred to as a global offset table (GOT), even though it doesn’t contain offsets. This mirrors the baseline RISC-V GOT model, where the same sequence would be:

        auipc   a0, %got_pcrel_hi(x)
        lw      a0, %pcrel_lo(.Lpcrel_hi0)(a0)

At two instructions, this looks cheap, but note that the second is a load, which is loading the address from the GOT. This means that, although it’s only eight bytes of instructions, it’s also one pointer’s worth of data (eight more bytes for CHERIoT, sixteen for big CHERI). The second instruction is accessing memory (even on CHERIoT systems with local SRAM, memory-access instructions are typically slower than ALU instructions).

This is worse than our common case, but is often better than our worst case.

The new model

In the new approach, any global that isn’t reachable via a short displacement from the global pointer (i.e. things that would need auicgp) is turned into a GOT-relative access.

This also serves as a building block for other optimisations. The three-instruction sequence is smaller than the GOT-relative access if a global is accessed only once, but is larger if the same global has its address taken three or more times. The linker has full visibility into the number of accesses to a global and so can make this decision.

Similarly, globals that are too big to have their bounds applied in a single instruction will be able to fall back to the GOT mechanism. These are rare but they do occur for things like frame buffers. These are close to break even for a single use. For example, imagine if x is an array of 1028 ints. Taking the address is currently this (long!) sequence:

	lui             a0, 1
	addi            a0, a0, 16
.LBB0_1:
	ct.auicgp       a1, %cheriot_compartment_hi(x)
	ct.cincoffset   a1, a1, %cheriot_compartment_lo_i(.LBB0_1)
	ct.csetbounds   a0, a1, a0

Assuming that the auicgp is relaxed away, this becomes:

	lui             a0, 1
	addi            a0, a0, 16
	ct.cincoffset   a1, a1, {offset of x}(gp)
	ct.csetbounds   a0, a1, a0

This is four instructions, but both the lui and addi will, in this case, be the 16-bit variants, so this is 12 instructions: smaller than the GOT-relative sequence plus the GOT entry. This sequence also requires an additional register, and removing that requirement can improve code generation in the rest of the function.

To support this, we’ve introduced a new ct.auipcc.data pseudo-instruction that emits padding nop that is normally relaxed away, but provides space for the linker to emit the longest sequence. The sequence the compiler will emit looks like this:

.LBB4_1
	ct.auipcc.data  s0, %cheriot_compartment_data_hi(x)
	ct.cincoffset   s0, s0, %cheriot_compartment_lo_i(.LBB4_1)
	ct.csetbounds   s0, s0, %cheriot_compartment_size(x)
	ct.cincoffset   a0, s0, a0

The linker can then relax this to any of the sequences described above, almost always to something much shorter.

The extra padding is needed only in one specific case, but emitting it in all cases that may use the global pointer simplifies the linker logic. If the compiler statically knows that the access is in bounds, it will emit the short two-instruction sequence of auipcc followed by a load or store. This is two 32-bit instructions (8 bytes total). If the global is large, this must be transformed into a GOT load followed by a load or store, which is three instructions, two of which might be 16-bit, but in the worst case all three will need the full range of the 32-bit variants. The linker always removes the nop but may use the space that it reserved for the longer sequence, in this one corner case.

A final note

CHERIoT benefits enormously from having co-designed the ISA, ABI, and programmer model. The cheriot-audit tool makes it easy to reason about the security of a firmware image. This tool is possible because we defined our ABI so that everything that a compartment accesses outside of its code and globals region is explicit (in the programmer model).

Several of these optimisations are possible only because we took a holistic view to system design. If, for example, we allowed globals from other compartments to be transparently accessed then this would have required a GOT approach from the start and would also have made reasoning about communication between compartments much harder. Instead, we have an explicit notion of a pre-shared object, which must come via a compartment’s import table and so is exposed to auditing. By making the import explicit, we can also go beyond core C features and make the permissions explicit, so we can see by reading the source code, and confirm with cheriot-audit that exactly what permissions a compartment has on a shared object.

Decades of prior work have shown us that security policies divorced from the source code are hard to reason about and often get out of sync with the implementation. When they get out of sync, people notice when components have too few permissions (they stop working) but not when they have too many (until an attacker notices).

For CHERIoT, we had an explicit design goal that you should be able to reason about the security of your code by reading the source and do coarser-grained auditing of other people’s code with additional tooling. Whether a function is in your compartment or elsewhere is explicit in the source. Whether a global is uniquely owned by you or a pre-shared object is explicit in the source. The ABI we co-designed with the programmer model to enable this and it lets us emit very short instruction sequences in the common case, while falling back to larger ones to handle corner cases.

#CHERIoT

Removing the AUICGP instruction

Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away – Antoine de Saint-Exupéry

CHERIoT Platform

Michael Engel Mar 26

CHERI Blossoms in Cambridge!

#cheri #cheriot #capabilities #Cambridge

David Chisnall (*Now with 50% more sarcasm!*)Mar 21

We’ll be talking more about the progress on the CHERIoT port of Rust at CHERI Blossoms next week, but here’s a teaser:

The embedded graphics crate rendering an image on Sonata. This currently using a (memory-safe) C function to draw pixels (that can go away with a little bit more work) but the current compiler is able to build this crate and run it in a CHERIoT compartment.

#Rust #CHERI #CHERIoT #CHERIBlossoms

David Chisnall (*Now with 50% more sarcasm!*)Mar 12

Me? Trolling the other microcontroller vendors? Surely not! Maybe if they had bothered to do something about the most common source of vulnerabilities at some point in the last few decades, the could have been on the other side of the sign...

#CHERI #CHERIoT #SeriouslyFolksItsBeen40Years

David Chisnall (*Now with 50% more sarcasm!*)Mar 9

Yay, we won a Best in Show Award at #EmbeddedWorld ! In the Microcontrollers, Microprocessors, and IP category for our ICENI #CHERIoT chips!

#CHERI

David Chisnall (*Now with 50% more sarcasm!*)Mar 4

The PQCA has published a blog by me about using their ML-KEM and ML-DSA libraries on CHERIoT

#PQC #CHERIoT #CHERI

mlkem-native and mldsa-native in CHERIoT – Post-Quantum Cryptography Alliance

jona Mar 4

Already really looking forward to visiting the #CHERI / #CHERIoT stand at Embedded World next week - especially now that they've got their first silicon!

CHERIoT Platform Mar 4

First CHERIoT Silicon!

Most CHERIoT work to date has been done on software or FPGA simulations. We have several such implementations: The executable model built from our formal ISA specification, the MPact simulator from Google, Microsoft’s CHERIoT SAFE FPGA target for the Arty A7, and of course lowRISC’s beautiful Sonata FPGA board, which is designed to simulate CHERIoT systems. These were always intended to be developing and prototyping systems, so I’m delighted to announce that SCI Semiconductor has the first silicon CHERIoT implementation.

[ Conflict disclaimer: I am a co-founder of SCI Semiconductor. ]

The dev board pictured above contains one of the first batch of ICENI chips to come back from the fab. This is a complete CHERIoT system, with all of the core CHERI properties (spatial memory safety, no pointer injection, and so on) along with all of the CHERIoT extensions that provide deterministic use-after-free protection, auditable control over interrupt state, and everything that we need for an aggressively compartmentalised RTOS.

This chip uses the CHERIoT Ibex core, running at up to 250 MHz, and includes a few feature that accelerate temporal safety, improve interrupt determinism, and so on. These build on top of all of the benefits of any CHERIoT implementation: deterministic mitigation of memory safety bugs from simple buffer overflows up to use-after-free, fine-grained compartmentalisation, and a programming model co-designed with both the ISA and the software stack to provide a tiny TCB. Anything that works on CHERIoT SAFE or Sonata should be very easy to port to ICENI for production use. Anything that runs on the software simulators should just work.

We’ll be showing the chips at Embedded World (Stand 4A - 131) next week and at CHERI Blossoms a couple of weeks later. From tomorrow, one will also be on display in the CHERI 15th anniversary exhibit in the Cambridge Computer Laboratory.

Aside: The Iceni tribe were one of the pre-Roman tribes in Britain and are famous for their chariots (though more due to this statue than historical fact). I am only partially to blame for the bad puns in the naming.

#CHERIoT

First CHERIoT Silicon!

Welcome to the CHERIoT Platform, a hardware-software co-design project that provides game-changing security for embedded devices.

CHERIoT Platform

David Chisnall (*Now with 50% more sarcasm!*)Mar 4

First Post! Uh, I mean, First CHERIoT Silicon!

We have our first chips back! It is very exciting! Spatial and temporal memory safety, fine-grained compartmentalisation, and also a load of other big chips on a board, so you can play 'Where's ICENI?' on the board picture!

#CHERI #CHERIoT

First CHERIoT Silicon!

Welcome to the CHERIoT Platform, a hardware-software co-design project that provides game-changing security for embedded devices.

CHERIoT Platform