At #sosp: Yuhong Zhong presents our work on using CXL to share NICs and other PCIe devices across multiple hosts for better utilization.

NICs are usually way overprovisioned on cloud servers. 27% of network bandwidth in a rack isn't even allocated because the host runs out of CPU or memory first. The allocated bandwidth is also seriously underutilized, with racks typically having <20% utilization even in the 99.99th percentile.

If we could share NICs among multiple hosts, we'd get much better utilization by, e.g., using only one NIC for 4 hosts. And, by pooling resources, we can make extra NICs available within the rack to tolerate failures, without a huge utilization penalty.

The CXL interconnect gives us a way to share resources between servers that is relatively cheap (compared to PCIe switches) and high performance. For some workloads, it may already be deployed in order to pool memory.

Our work shows that we can also use it to share NICs, SSDs, and other PCIe resources. We do this by creating a message channel in shared CXL memory that allows a driver on one machine to send a message to the machine that hosts the PCIe device.
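
To make the message-channel idea concrete, here is a minimal sketch of a single-producer, single-consumer ring buffer laid out in one shared byte region. It is an illustration only, not the design from the paper: the `Channel` class, `SLOT_SIZE`, `NUM_SLOTS`, and the header layout are all hypothetical, and a plain `bytearray` stands in for shared CXL memory. A real implementation would likely also need memory barriers and explicit cache management, since memory shared across hosts may not be kept coherent by hardware.

```python
import struct

SLOT_SIZE = 64                   # hypothetical fixed slot size in bytes
NUM_SLOTS = 8                    # hypothetical ring capacity
HEADER = struct.Struct("<QQ")    # head (consumer index), tail (producer index)
REGION_SIZE = HEADER.size + NUM_SLOTS * SLOT_SIZE


class Channel:
    """One-directional ring buffer over a shared byte region (illustrative)."""

    def __init__(self, buf):
        self.buf = buf  # stands in for a region of shared CXL memory

    def _slot_offset(self, index):
        return HEADER.size + (index % NUM_SLOTS) * SLOT_SIZE

    def send(self, payload: bytes) -> bool:
        """Producer side: copy the payload into the next slot, then publish it."""
        if len(payload) > SLOT_SIZE - 2:
            raise ValueError("payload too large for one slot")
        head, tail = HEADER.unpack_from(self.buf, 0)
        if tail - head == NUM_SLOTS:
            return False  # ring is full; caller retries later
        off = self._slot_offset(tail)
        struct.pack_into("<H", self.buf, off, len(payload))   # 2-byte length prefix
        self.buf[off + 2:off + 2 + len(payload)] = payload
        # Advance the tail only after the payload is written, so the
        # consumer never observes a half-written slot.
        struct.pack_into("<Q", self.buf, 8, tail + 1)
        return True

    def recv(self):
        """Consumer side: return the next message, or None if the ring is empty."""
        head, tail = HEADER.unpack_from(self.buf, 0)
        if head == tail:
            return None
        off = self._slot_offset(head)
        (length,) = struct.unpack_from("<H", self.buf, off)
        payload = bytes(self.buf[off + 2:off + 2 + length])
        struct.pack_into("<Q", self.buf, 0, head + 1)  # free the slot
        return payload


# Both "hosts" map the same region: the driver side sends, the
# machine that owns the PCIe device receives.
shared = bytearray(REGION_SIZE)
driver_side = Channel(shared)
device_side = Channel(shared)
driver_side.send(b"tx: packet descriptor")
print(device_side.recv())
```

In this sketch only the producer writes the tail and only the consumer writes the head, so no lock is needed as long as there is exactly one of each per channel; a two-way link would use one such ring in each direction.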

Thanks to a lot of optimization on Yuhong's part, the network stack can share network cards between hosts with only a 4-7 microsecond latency overhead. This lets us increase NIC utilization by 2x, and handle NIC failover with a short interruption time.

For more information, the full paper is available at https://drkp.net/papers/oasis-sosp25.pdf

and the source code at https://bitbucket.org/yuhong_zhong/oasis

@dan But what happens if you don't feed Miso in 4-7 microseconds from the time of request? 😺
@dan Is Slide 2 a picture of your cats sharing the food bowls?
@jawnsy _that_ system is not underutilized!
@dan In this paper, we present Cat eXpress Link (CXL), a system for sending Dan a mild shock when the food bowls are likely to be empty in the next hour. In recent feline trials, we found this to improve cat satisfaction by 47% ± 5% (p-value 0.005), n=4. We measured satisfaction scores using cat-attached microphones and machine learning to measure the ratio of meowing noises to satisfied purring events (SPEs).