A couple of talks I've given recently about #CHERI have had people ask about performance overheads. That's a difficult question to answer, so it probably benefits from a longer answer:
First, measuring the performance overhead of hardware features is hard. A small tweak to a prefetcher, for example, may cause a 15% speedup on some workloads but a 10% slowdown on others. I saw this with some Arm performance data on MTE, where one benchmark got measurably faster with MTE enabled. It turned out that enabling MTE disabled a specific prefetcher, and that prefetcher happened to hurt this particular benchmark (while helping most others, so it was typically a net win).
This is especially complicated for hardware because building an SoC involves a lot of design decisions that trade performance, area, and power in different ways. Designers will optimise performance within the other constraints for the workloads they expect customers to care about. If you want a completely fair measure of how much feature X costs (or helps) performance, you need two equally competent teams building implementations with and without it, on the same budget.
If you do that (which, to be clear, is infeasible), you still have the problem of measurement. For example, AVX probably makes things faster (wider vectors, yay!), but moving between SSE and AVX state can make things slower, and turning on AVX can cause thermal throttling to kick in earlier and so make things slower overall. Even with a feature designed solely for performance, determining the degree to which it makes things faster (or whether it does at all) is hard.
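To make the SSE/AVX point concrete, here's a hedged C sketch (the function and its structure are mine, not from any particular codebase; whether the transition penalty actually fires depends on the microarchitecture, and modern compilers normally insert the vzeroupper for you):

```c
#include <immintrin.h>
#include <stddef.h>

// Scale an array by k using 256-bit AVX, with a scalar tail loop.
void scale(float *dst, const float *src, float k, size_t n) {
    __m256 vk = _mm256_set1_ps(k);
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(src + i);           // AVX: dirties the upper 128 bits
        _mm256_storeu_ps(dst + i, _mm256_mul_ps(v, vk));
    }
    // On some Intel cores, executing legacy SSE encodings while the upper
    // lanes are dirty triggers an expensive state transition; clearing them
    // first avoids it. Compilers usually emit this automatically.
    _mm256_zeroupper();
    for (; i < n; ++i)
        dst[i] = src[i] * k;                           // scalar tail
}
```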
So, where does this leave us? We can talk about the unavoidable costs of CHERI. Capability checks must happen on every load, store, and jump, but that's a handful of fairly simple ALU operations in the load-store units, trivial in comparison to the rest of the memory-access logic. Pointers get bigger, and that's a real concern for performance, but you typically don't see a gradual decline from this: you see a cliff when workloads suddenly stop fitting in a layer of the cache hierarchy or in the TLB. An SoC design can size some of these structures differently to mitigate this, and you may be able to use larger cache lines rather than more sets or ways. There's a lot of performance tuning to be done here in a production SoC, and it's not clear what the real impact would be. Beyond that, you have a bit of area overhead at the bottom of the memory hierarchy for storing tags. But that's basically everything.
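For the curious, the per-access check is roughly this. This is a minimal sketch in C of what the load-store unit does, not any real implementation; real CHERI hardware compresses the bounds rather than storing full 64-bit base and top fields, and the permission encoding here is illustrative:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Illustrative permission bits; real encodings differ per architecture.
#define PERM_LOAD  (1u << 0)
#define PERM_STORE (1u << 1)

// A capability as the load-store unit sees it (uncompressed, for clarity).
typedef struct {
    bool     tag;     // 1-bit validity tag, stored out of band
    uint32_t perms;   // permission bits
    uint64_t base;    // lower bound
    uint64_t top;     // upper bound (one past the last valid byte)
    uint64_t address; // where the capability currently points
} capability;

// The check performed on every load/store: a tag test, a permission test,
// and two bounds comparisons. All simple ALU work.
static bool check_access(capability cap, uint32_t required_perm, size_t width) {
    return cap.tag
        && (cap.perms & required_perm) == required_perm
        && cap.address >= cap.base
        && cap.address + width <= cap.top; // real hardware avoids the overflow here
}
```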
What about the flip side? How much does CHERI improve performance? If you're doing mitigations against transient-execution vulnerabilities, such as speculative taint tracking (STT), CHERI can improve things. In a conventional STT implementation, an instruction that adds two integers to compute an address and then performs a load can't proceed until both inputs are untainted. On a CHERI system, only the capability operand needs to be untainted; the offset can be an arbitrary speculated value, because the capability's bounds confine the access. Similarly, knowing that something is a pointer, and what its bounds are, enables better prefetching. There are even some fun things like connecting register writeback to the branch predictor (useful for shorter pipelines): because you know which values in the register file are executable pointers, you can make a very good guess about which address is going to be a jump target. And that's ignoring the performance gains from simply disabling a load of weaker mitigations that people are shipping today.
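The STT point is easier to see as a sketch. This is illustrative C modelling the issue rule, not a real microarchitecture, and the names are mine:

```c
#include <stdbool.h>

// Each in-flight value carries a taint bit: set if it was produced under
// unresolved speculation and so must not reach a side channel.
typedef struct { bool tainted; } operand;

// Conventional STT: a load whose address is base + offset can't go ahead
// until *every* input of the address computation is untainted.
static bool may_issue_load_stt(operand base, operand offset) {
    return !base.tainted && !offset.tainted;
}

// With CHERI, the bounds travel with the capability, so only the capability
// operand needs to be clean; the integer offset can be any speculated value
// because the hardware bounds check confines where the access can land.
static bool may_issue_load_cheri(operand cap, operand offset) {
    (void)offset; // deliberately ignored
    return !cap.tainted;
}
```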
That's all at the small scale, though. Being able to share object graphs between mutually distrusting components can eliminate some large defensive copies. Last time I looked in detail, Apple's XPC framework for process-based compartmentalisation made seven copies of each object sent between processes. On a CHERI system, that would be either zero or one copies, depending on your threat model, and would involve a lot less TLB pressure.
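As a sketch of what replaces those copies: in pure-capability CHERI C (compiled with CHERI Clang), you can derive a bounded, read-only view of an object and hand that across the boundary instead. share_readonly here is a hypothetical helper, and the permission mask is illustrative; real code would build it from the platform's CHERI_PERM_* constants (e.g. from CheriBSD's headers):

```c
#include <stddef.h>

// Placeholder for "permit load"; the real bit position is per-architecture.
#define ILLUSTRATIVE_LOAD_ONLY_MASK (1u << 2)

// Hypothetical helper: derive a capability a mutually-distrusting
// compartment can use to read the object in place, rather than copying it.
static const void *share_readonly(const void *obj, size_t size) {
    // Narrow the bounds so the recipient can reach this object and nothing else.
    const void *view = __builtin_cheri_bounds_set(obj, size);
    // Strip every permission not in the mask (notably store), yielding a
    // read-only capability to pass across the compartment boundary.
    return __builtin_cheri_perms_and(view, ILLUSTRATIVE_LOAD_ONLY_MASK);
}
```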
The kinds of overheads we see are hard to measure because they're well into the noise. The performance improvements we see from being able to actually build the systems programmers want to construct, without fighting hardware designed with totally different goals in mind, are much easier to measure. They tend to be complexity-class improvements: turning O(n) things into O(1) things. Or, sometimes, just a 2-3x speedup.