Back in December I got some Mellanox ConnectX-6 Dx cards, and now I finally got around to playing with them. I got them because I was interested in two features:

  • True integrated switch between all VM virtual functions and outside links, with communication between VMs, VLAN filtering rules, LACP-bonding the links and so on
  • Hardware offloading for TLS-encryption

I have now benchmarked the TLS offload, and unfortunately I'm underwhelmed. A 🧵

#networking #mellanox #tls #homelab

I hoped that TLS offloading would increase throughput, or at least keep throughput the same while reducing CPU load. But when you look at the throughput graph, the TLS offloading (nginx+hw) is completely useless for small transfers. I'd have needed log charts to show this properly: 8.7 MBytes/s total for 500 clients repeatedly requesting a 10 kByte file. The regular userspace-only nginx can do 323 MBytes/s for the same load. Even with 100 kByte requests it is still useless (83 MBytes/s).

It only becomes useful somewhere in the region between 1 and 10 MBytes of file size.

While offloading TLS to the kernel (kTLS) has some setup cost, it pays off from shortly after 100 kBytes. Offloading it further to the network card, however, seems to be much slower. Since the CPU is nearly idle during this time, it looks like setting up the hardware offload is implemented inefficiently somewhere.

Another problem I found is that the TLS hardware offload only seems to support TLS 1.2 ciphers. The datasheet from Mellanox/Nvidia claims "AES-GCM 128/256-bit key" and doesn't give more details.

It worked with the TLS 1.2 cipher ECDHE-RSA-AES256-GCM-SHA384. But as soon as I switched to TLS 1.3 and tried, for example, TLS_AES_256_GCM_SHA384, the kernel didn't use the hardware offload anymore. I'm not a crypto expert, but I'd say that encrypting the actual data, once the TLS session has been set up, should be the same for both. So it could be a kernel issue.
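In case someone wants to reproduce the working combination, pinning nginx to it looks roughly like this (a sketch, not my literal config):

ssl_protocols TLSv1.2;
ssl_ciphers ECDHE-RSA-AES256-GCM-SHA384;

Changing ssl_protocols to TLSv1.3 is what made the hardware offload stop being used.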

Setting up the hardware offload was totally easy and painless:

ssl_conf_command Options KTLS; in nginx.conf, and then for both NICs of the bond:

ethtool -K <nic> tls-hw-tx-offload on
ethtool -K <nic> tls-hw-rx-offload on

An easy way to verify is looking at /proc/net/tls_stat.
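Sessions that are actually offloaded to the NIC show up there in the Device counters instead of the Sw ones (the exact counter names are from memory, so treat them as an assumption):

cat /proc/net/tls_stat
TlsTxSw      ...   <- TX sessions encrypted in software by kTLS
TlsTxDevice  ...   <- TX sessions offloaded to the NIC

If TlsTxDevice stays at 0 while TlsTxSw climbs, nginx is using kTLS but the card isn't doing the crypto.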

A website with some helpful info is https://delthas.fr/blog/2023/kernel-tls/ , although the claim there that only AES128 works seems to be outdated, as I got AES256 to work without problems as long as I stayed on TLS 1.2.


As the server I used an EPYC 7232P (8-core, Zen 2) with kernel 6.7.9 (elrepo ml), the stock in-kernel mlx5 driver, nginx 1.22.1, and otherwise a stock Rocky Linux 9.3.

The client is a much faster Ryzen 7950X3D (16-core, Zen 4). I set it up like this to really benchmark the server side and not the client.

Both were equipped with a ConnectX-6 Dx card and connected with 2 LACP-bonded 25 GBit/s links, xmit hash layer3+4, so a theoretical bandwidth of 50 GBit/s. I did not use any TLS offloading on the client since it is much faster than the server and never really got warm during the tests.
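For completeness, the bond is equivalent to roughly the following iproute2 commands, with placeholder interface names (just a sketch of the equivalent configuration):

ip link add bond0 type bond mode 802.3ad miimon 100 xmit_hash_policy layer3+4
ip link set <nic1> down && ip link set <nic1> master bond0
ip link set <nic2> down && ip link set <nic2> master bond0
ip link set bond0 up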

I used static files on disk on the server, with sendfile on, TLS session caching off, and socket reuse on.
On the client I used wrk with --header "Connection: Close" to simulate many users each downloading a single file, rather than a few users downloading a lot.
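The relevant bits boil down to roughly this sketch (the file name, wrk thread count and test duration are illustrative; "socket reuse" here means the reuseport listen option):

in the nginx server block:
listen 443 ssl reuseport;
sendfile on;
ssl_session_cache off;
ssl_conf_command Options KTLS;

on the client, e.g. for the 10 kByte case:
wrk -t 16 -c 500 -d 60s --header "Connection: Close" https://server/test-10k.bin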

So I guess the way they implemented the TLS offload, it only makes sense when you have long-running connections that transfer many megabytes or gigabytes of data. So your regular webserver or fedi instance being hit by many clients downloading a picture doesn't profit from it.

But maybe I made some mistake in benchmarking, or their closed-source drivers are much better?

Let's hope the integrated switch function is more promising. It seems I'll have to do some more reading on it, since it is based on Open vSwitch, which I haven't used before.

After experimenting with the (underwhelming) performance of the HTTPS offload of my Mellanox/Nvidia ConnectX-6 Dx smart NIC last week, I had one more idea for how the crypto offload could be useful to me: encrypting NFS on a fileserver.

My test was one client doing random reads over NFS with fio. Since the test with the many HTTPS clients last week probably performed badly because setting up each TLS session is slow, I now did a test with just one client and one TCP connection.
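The fio invocation was along these lines (a sketch; block size, file size and queue depth here are illustrative, not my exact parameters):

fio --name=randread --directory=/mnt/nfs --rw=randread --bs=1M --size=10G --ioengine=libaio --iodepth=16 --direct=1 --numjobs=1 --runtime=60 --time_based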

As you can see in the chart, offloading NFS encryption with the new RPC-with-TLS mode works and is actually the fastest NFS option on this server hardware, but it won't even saturate a 10G link and is far slower than the unencrypted variant.

As the card is also able to do IPsec crypto offloading, I also tried running the unencrypted NFS through an IPsec tunnel for authentication and encryption, but its performance is totally useless.
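If you want to try that yourself: with strongSwan the knob to get the NIC involved is hw_offload in the child config, roughly like this sketch (the proposal is just an example, and whether crypto or full packet offload is available depends on kernel and driver):

children {
    nfs {
        esp_proposals = aes256gcm16
        hw_offload = crypto
    }
}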

So unfortunately I'm still underwhelmed with the performance of the crypto offload.

#networking #nfs #mellanox

Getting the NFS RPC-with-TLS encryption offloaded was a bit more tricky:

Either the kernel kTLS doesn't currently support offloading TLS 1.3 at all, or at least the Mellanox driver doesn't support it yet. The code in the mlx5 kernel driver makes it clear that it currently only supports TLS 1.2.

But NFS RPC-with-TLS is very new code, so it is only designed to work with TLS 1.3.

I had to make some dirty hacks to tlshd (the userspace daemon that initiates the TLS connection before handing it off to the kernel) to get it to work, by forcing TLS 1.2 and a matching cipher on both the client and the server.

So this is probably not something you want to run in prod until the mlx5 driver gains TLS 1.3 offloading.
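For reference, the plain RPC-with-TLS setup without any hacks is just running tlshd (from ktls-utils) on both ends and mounting with the new xprtsec option, roughly like this (paths and export names are placeholders, and I'm assuming your distro ships a tlshd service unit):

systemctl enable --now tlshd
mount -t nfs -o vers=4.2,xprtsec=tls server:/export /mnt/nfs

That gets you software kTLS; the hardware offload on top is where the hacks above come in.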

Since I now had just one TCP connection, the LACP bonding of my two links became useless, as xmit_hash_policy=layer3+4 results in all packets being sent over the same link. So 25 GBit/s, or roughly 3.1 GBytes/s, was the theoretical limit.

I could see several nfsd kernel threads being used and spread over different cores, so the NFS part profits from multiple cores. But the kTLS part probably doesn't, because all the data is stuffed into one TCP connection in the end. Maybe there is some path for future optimization? NFS RPC-with-TLS is very new code, so I have some hope that its speed will improve in the future.

My guess is that the reason for the limited speed with the crypto offloading of my ConnectX-6 Dx is that the IC on the network card runs into its limit in encryption speed for a single connection or TLS flow. Combine that with the time-consuming setup of each connection/TLS flow, and the usefulness of the whole idea gets smaller and smaller.

Or is this something they improved with the next gen ConnectX-7? I haven't seen Mellanox/Nvidia post any figures about crypto offload speed...

Does anybody know of some figures or has tested it? Maybe @manawyrm ?

@electronic_eel Haven't worked with the HW crypto acceleration yet, sorry :(

I'm mostly pushing customers' packets into VMs, the application layer is their problem 😺

@manawyrm wouldn't the customers be able to use the crypto offloading via their mlx5 virtual function NICs? Or do you not expose those to the customers to be more flexible regarding which hw you use?

But thanks for your reply anyway.

@electronic_eel They would be able to use them through VFs, but we can't use those as we (sadly) have many different NICs across the fleet and also because the network setup isn't just plain Ethernet.