Ephemeral ECDH key generation and shared secret calculation now use the FPGA accelerator, and SSH session creation now feels about as fast as logging into a regular PC.
I can probably extend the same accelerator block (with some minimal tweaks) to also support the public key side of signing, but for now crypto_sign() is still done entirely in software, and only the two crypto_scalarmult() calls in SSH session creation are accelerated.
Still, that's a massive improvement in responsiveness: it cut about 400 ms of latency off session creation.
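
For anyone wondering what those two scalar multiplications actually are, here is a minimal pure-Python sketch of an ephemeral X25519 exchange, following the Montgomery ladder from RFC 7748. It assumes crypto_scalarmult() is the usual NaCl-style Curve25519 function. The names scalarmult(), keypair() and shared_secret() are made up for illustration and are not the firmware's code. It is not constant-time, so treat it as a reference model only.

```python
import os

P = 2**255 - 19          # field prime for Curve25519
A24 = 121665             # (486662 - 2) / 4, per RFC 7748
BASE = (9).to_bytes(32, "little")  # u-coordinate of the base point


def _clamp(k: bytes) -> int:
    """Clamp a 32-byte scalar as X25519 requires and return it as an int."""
    b = bytearray(k)
    b[0] &= 248
    b[31] &= 127
    b[31] |= 64
    return int.from_bytes(b, "little")


def scalarmult(n: bytes, p: bytes) -> bytes:
    """Illustrative stand-in for crypto_scalarmult(q, n, p). NOT constant-time."""
    k = _clamp(n)
    x1 = int.from_bytes(p, "little") & ((1 << 255) - 1)  # mask the top bit
    x2, z2, x3, z3 = 1, 0, x1, 1
    swap = 0
    for t in reversed(range(255)):
        kt = (k >> t) & 1
        swap ^= kt
        if swap:  # conditional swap (a branch here; hardware would use a mux)
            x2, x3 = x3, x2
            z2, z3 = z3, z2
        swap = kt
        a = (x2 + z2) % P
        aa = a * a % P
        b = (x2 - z2) % P
        bb = b * b % P
        e = (aa - bb) % P
        c = (x3 + z3) % P
        d = (x3 - z3) % P
        da = d * a % P
        cb = c * b % P
        x3 = pow((da + cb) % P, 2, P)
        z3 = x1 * pow((da - cb) % P, 2, P) % P
        x2 = aa * bb % P
        z2 = e * (aa + A24 * e) % P
    if swap:
        x2, x3 = x3, x2
        z2, z3 = z3, z2
    # Final inversion via Fermat's little theorem, then encode little-endian.
    return (x2 * pow(z2, P - 2, P) % P).to_bytes(32, "little")


def keypair():
    """Scalar multiplication #1: ephemeral public key = secret * base point."""
    sk = os.urandom(32)
    return sk, scalarmult(sk, BASE)


def shared_secret(sk: bytes, peer_pk: bytes) -> bytes:
    """Scalar multiplication #2: shared secret = our secret * peer's public key."""
    return scalarmult(sk, peer_pk)
```

Each side of the handshake calls keypair() once and shared_secret() once, and both ends arrive at the same 32 bytes. Those two calls are the ones that dominate session setup time when done in software.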