When I teach #GPGPU I point out that the main point is performance, and “how to measure it” is a topic we address. The primary metric is, obviously, kernel runtime. I then introduce the effective bandwidth metric (bytes read + bytes written, divided by the kernel runtime), which is a good way to compare similar kernels and to discuss hardware limits and how close we are to them (I don't always discuss the roofline model though; maybe I should).
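To make the effective bandwidth definition concrete, here is a minimal sketch in Python. The function name, the vector-add example, and the runtime figure are all made up for illustration; in practice the runtime would come from whatever profiling or event timing your GPU API provides.

```python
def effective_bandwidth(bytes_read, bytes_written, runtime_s):
    """Effective bandwidth in bytes per second:
    (bytes read + bytes written) / kernel runtime."""
    return (bytes_read + bytes_written) / runtime_s


# Hypothetical example: vector add c = a + b on n 4-byte floats.
n = 1 << 24
bytes_read = 2 * n * 4      # the kernel reads a[i] and b[i]
bytes_written = n * 4       # the kernel writes c[i]
runtime_s = 1.0e-3          # made-up kernel runtime, in seconds

bw = effective_bandwidth(bytes_read, bytes_written, runtime_s)
print(f"effective bandwidth: {bw / 1e9:.1f} GB/s")
```

Comparing the result against the device's peak memory bandwidth gives a rough idea of how close the kernel is to the hardware limit.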

When we start looking into more advanced things, we hit the “snag” that sometimes a kernel using a more efficient technique may be _less_ effective at using a particular resource, e.g. by having a _lower_ effective bandwidth. I do this on purpose to show how kernel runtime remains the ultimate “tell” on how good a kernel is compared to another (regardless of the additional information the effective bandwidth may give us).

At this point I introduce a metric that I don't actually know the name of: the number of elements processed per second, which is just the number of elements divided by the kernel runtime.
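A small sketch of how the two metrics can disagree, in Python. The two "kernels" and all their numbers are hypothetical, chosen only so that the faster one moves fewer bytes (as might happen when data gets reused on-chip); nothing here is measured.

```python
def elements_per_second(num_elements, runtime_s):
    """Number of elements processed, divided by the kernel runtime."""
    return num_elements / runtime_s


def effective_bandwidth(total_bytes, runtime_s):
    """(Bytes read + bytes written) divided by the kernel runtime."""
    return total_bytes / runtime_s


n = 1_000_000  # both kernels process the same n elements

# Hypothetical figures: total bytes moved and runtime for each kernel.
kernels = {
    "naive": {"bytes": 40_000_000, "runtime_s": 2.0e-4},
    "tuned": {"bytes": 12_000_000, "runtime_s": 1.0e-4},
}

for name, k in kernels.items():
    bw = effective_bandwidth(k["bytes"], k["runtime_s"])
    eps = elements_per_second(n, k["runtime_s"])
    print(f"{name}: {bw / 1e9:.0f} GB/s, {eps / 1e9:.0f} Gelements/s")
```

With these made-up numbers the tuned kernel has the lower effective bandwidth but the higher elements-per-second figure, which is the one that agrees with the runtime ranking.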

I call this effective throughput, but sometimes I get the nagging feeling that this may not be the correct term?

@giuseppebilotta I think that measure is normally just called "throughput". I teach it as the counterpart to "latency" in my systems course. I am also happy I am not the only one uncomfortable with using memory bandwidth or FLOP/s as the main measure of performance.