@arutaz Let’s assume it runs on air-cooled DGX #H100 systems with 8 NVIDIA H100s each, which deliver 25.6 petaFLOPS at 10.2 kW.
Due to its mixture-of-experts architecture activating only two of its 16 experts, #GPT-4 inference supposedly needs only 560 teraFLOPs per token generated in its forward pass.
So we’re at 25.6*10^15 FLOPS / 560*10^12 FLOPs per token ≈ 45.7 tokens per second per system.
Dividing by the 10.2 kW power draw gives you 4.48 tokens per kW·s, or roughly 16,128 tokens per kWh.
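The whole estimate can be checked in a few lines; every input below is one of the assumed or rumored figures from this thread, not a measured value:

```python
# Back-of-envelope check of the thread's numbers.
dgx_flops = 25.6e15       # assumed DGX H100 throughput, FLOPS
dgx_power_kw = 10.2       # assumed air-cooled DGX H100 power draw, kW
flops_per_token = 560e12  # rumored GPT-4 forward-pass cost per token, FLOPs

tokens_per_s = dgx_flops / flops_per_token    # ~45.7 tokens/s per system
tokens_per_kws = tokens_per_s / dgx_power_kw  # ~4.48 tokens per kW·s
tokens_per_kwh = tokens_per_kws * 3600        # ~16,134 tokens per kWh

print(f"{tokens_per_s:.1f} tok/s, {tokens_per_kws:.2f} tok/kW·s, "
      f"{tokens_per_kwh:.0f} tok/kWh")
```

Note the exact result is ~16,134 tokens per kWh; the 16,128 figure comes from rounding to 4.48 tok/kW·s before multiplying by 3600.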