Obviously, when I teach #GPGPU I point out that the main point is performance, and “how to measure it” is a topic we address. The primary metric is kernel runtime. I then introduce the effective bandwidth metric (bytes read plus bytes written, divided by the time taken by the kernel), which is a good way to compare similar kernels AND to discuss hardware limits and how close we are to them (I don't always discuss the roofline model, though; maybe I should).
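As a concrete example, the computation might look like this; the element counts, array layout, and runtime below are made-up illustrative numbers, not measurements:

```python
# Effective bandwidth = (bytes read + bytes written) / kernel runtime.
# All numbers here are hypothetical, for illustration only.

N = 1 << 24                     # 16M float (4-byte) elements
bytes_read = 2 * N * 4          # say the kernel reads two input arrays
bytes_written = N * 4           # and writes one output array
kernel_time_s = 2.5e-3          # measured kernel runtime, in seconds

effective_bw = (bytes_read + bytes_written) / kernel_time_s
print(f"effective bandwidth: {effective_bw / 1e9:.1f} GB/s")
# compare this figure against the device's peak memory bandwidth
```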
When we start to look into more advanced things, we hit the “snag” that a kernel using a more efficient technique may sometimes be _less_ effective at using a particular resource, e.g. by having a _lower_ effective bandwidth. I do this on purpose, to show how kernel runtime remains the ultimate “tell” of how good a kernel is compared to another (regardless of the additional information the effective bandwidth may give us).
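A toy numerical illustration of that snag (all byte counts and timings invented): a naive two-pass kernel moves more data and posts the higher effective bandwidth, yet a fused one-pass kernel wins on runtime:

```python
# Two hypothetical kernels doing the same job on N elements.
# Kernel A: naive two-pass version, moves more bytes.
# Kernel B: fused one-pass version, moves fewer bytes.
# All numbers are made up for illustration.

N = 1 << 24

bytes_a, time_a = 3 * N * 4, 2.5e-3   # A: reads 2 arrays, writes 1
bytes_b, time_b = 2 * N * 4, 2.0e-3   # B: reads 1 array, writes 1

bw_a = bytes_a / time_a
bw_b = bytes_b / time_b

print(f"A: {bw_a / 1e9:.1f} GB/s in {time_a * 1e3:.1f} ms")
print(f"B: {bw_b / 1e9:.1f} GB/s in {time_b * 1e3:.1f} ms")
# B has the *lower* effective bandwidth but the *shorter* runtime:
# by the metric that ultimately matters, B is the better kernel.
```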
At this point I introduce a metric for which I don't actually know if there is a name: the number of elements processed per second, which is just the number of elements divided by the kernel runtime.
I call this effective throughput, but sometimes I get the nagging feeling that this may not be the correct term?
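For what it's worth, whatever the name, the metric itself is trivial to compute (numbers below are invented for illustration):

```python
# Elements processed per second: N / kernel runtime.
# N and the runtime are hypothetical.

N = 1 << 24              # elements processed by the kernel
kernel_time_s = 2.0e-3   # measured kernel runtime, in seconds

elements_per_second = N / kernel_time_s
print(f"{elements_per_second / 1e9:.2f} Gelem/s")
```

Unlike effective bandwidth, this stays comparable across kernels even when they move different amounts of data per element.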