Obviously when I teach #GPGPU I point out that the main point is performance, and “how to measure” is a topic we address. The primary metric is, of course, kernel runtime. I then introduce the effective bandwidth metric (bytes read + bytes written, divided by the time taken by the kernel), which is a good way both to compare similar kernels and to discuss hardware limits and how close we are to them (I don't always discuss the roofline model though, maybe I should).
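
For concreteness, here is a minimal sketch of the effective bandwidth computation. The kernel (a float32 vector add), the runtime and the peak bandwidth are all made-up illustration values, not measurements from any real device, and the function name is hypothetical.

```python
def effective_bandwidth_gbs(bytes_read, bytes_written, runtime_s):
    """Effective bandwidth in GB/s (1 GB = 1e9 bytes):
    (bytes read + bytes written) / kernel runtime."""
    if runtime_s <= 0:
        raise ValueError("runtime must be positive")
    return (bytes_read + bytes_written) / runtime_s / 1e9


# Hypothetical vector add c[i] = a[i] + b[i] on float32 data:
# two arrays are read, one array is written.
n = 1 << 24                # number of elements (assumed)
bytes_read = 2 * n * 4     # a and b, 4 bytes per element
bytes_written = n * 4      # c
runtime_s = 1.0e-3         # assumed kernel runtime, not a measurement

bw = effective_bandwidth_gbs(bytes_read, bytes_written, runtime_s)

# Assumed theoretical peak of the device, only to show how the
# "how close are we to the hardware limit" ratio is obtained.
peak_gbs = 400.0
fraction_of_peak = bw / peak_gbs

print(f"effective bandwidth: {bw:.1f} GB/s "
      f"({100 * fraction_of_peak:.0f}% of assumed peak)")
```

In practice the runtime would come from the profiling facilities of whatever API is in use, rather than being typed in by hand.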

When we start to look into more advanced things, we hit the “snag” that sometimes a kernel using a more efficient technique may be _less_ effective at using a particular resource, e.g. by having a _lower_ effective bandwidth. I do this on purpose, to show how kernel runtime remains the ultimate “tell” on how good a kernel is compared to another (regardless of the additional information the effective bandwidth may give us).

At this point I introduce a metric for which I don't actually know if there is a name: the number of elements processed per second, which is just the number of elements divided by the kernel runtime.

I call this effective throughput, but sometimes I get the nagging feeling that this may not be the correct term?
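
A minimal sketch of how the two metrics can disagree, using two hypothetical kernels that process the same elements. All names and numbers here are invented for illustration: kernel "B" is assumed to move half as many bytes per element and to finish sooner, so it scores _lower_ on effective bandwidth but higher on elements per second.

```python
def effective_throughput(num_elements, runtime_s):
    """Elements processed per second: number of elements / kernel runtime."""
    return num_elements / runtime_s


def effective_bandwidth(bytes_moved, runtime_s):
    """Bytes read + written per second."""
    return bytes_moved / runtime_s


n = 10_000_000  # elements processed by both kernels (assumed)

# Made-up figures: B uses a more efficient technique, so it touches
# fewer bytes per element and has a shorter runtime.
kernels = {
    "A": {"bytes_per_element": 16, "runtime_s": 1.0e-3},
    "B": {"bytes_per_element": 8, "runtime_s": 0.8e-3},
}

results = {}
for name, k in kernels.items():
    bytes_moved = n * k["bytes_per_element"]
    results[name] = {
        "gb_per_s": effective_bandwidth(bytes_moved, k["runtime_s"]) / 1e9,
        "gelem_per_s": effective_throughput(n, k["runtime_s"]) / 1e9,
    }

for name, r in results.items():
    print(f"kernel {name}: {r['gb_per_s']:.1f} GB/s, "
          f"{r['gelem_per_s']:.2f} Gelem/s")
```

With these assumed numbers, the elements-per-second figure ranks the kernels the same way the runtime does, while the effective bandwidth ranks them the other way round.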

@giuseppebilotta This reminds me of MLUPS, "Million Lattice Updates per Second".
Throughput seems like an OK term to me. What's clear is that they understand what you mean by it.
@hattom in GPUSPH we use MIPPS (millions of iteration-particles per second). Other SPH codes use GigaPIPS (billions of particle interactions per second). I should probably go back to read our own papers to check if we used the term throughput (esp. effective throughput) 😁