An interesting paper from @emeryberger et al. showing that, in contrast to prior work, energy use across programming languages is (in my words) essentially a proxy for how long a program takes to execute, and that other factors don't meaningfully affect energy usage. https://arxiv.org/abs/2410.05460
It's Not Easy Being Green: On the Energy Efficiency of Programming Languages

Does the choice of programming language affect energy consumption? Previous highly visible studies have established associations between certain programming languages and energy consumption. A causal misinterpretation of this work has led academics and industry leaders to use or support certain languages based on their claimed impact on energy consumption. This paper tackles this causal question directly: it develops a detailed causal model capturing the complex relationship between programming language choice and energy consumption. This model identifies and incorporates several critical but previously overlooked factors that affect energy usage. These factors, such as distinguishing programming languages from their implementations, the impact of the application implementations themselves, the number of active cores, and memory activity, can significantly skew energy consumption measurements if not accounted for. We show -- via empirical experiments, improved methodology, and careful examination of anomalies -- that when these factors are controlled for, notable discrepancies in prior work vanish. Our analysis suggests that the choice of programming language implementation has no significant impact on energy consumption beyond execution time.


@ltratt @emeryberger

It's a shame they don't dig a bit more into the parallelism. The paper says:

Further, RAPL samples include all cores, even if the program under test only uses a single core. If a benchmark is single-threaded or generally uses fewer cores than available, idle cores will be included in the energy consumption measurement. Therefore, using a varying level of parallelism across benchmark implementations can result in unfair comparison, as idle cores will add some constant energy consumption to each sample.

Which makes me scream 'yes but!'. Modern SoCs can independently adjust the clock speed of individual cores and typically mix cores with different power / performance tradeoffs. Voltage scaling and leakage current mean that it's far more power efficient if you can run the same workload in 1s on two cores clocked down to 800 MHz than on one core clocked at 1.6 GHz.

But there are confounding factors here. A few years ago, we found that turning off CPU affinity entirely in the FreeBSD scheduler made some workloads much faster. The workloads were bounded by the performance of a single-threaded component, but pinning that component to a single core made the core hot, at which point the CPU throttled its clock speed and made it slower. Letting the thread move around unpredictably spread the heat across the die, which allowed the heat sink to work better.

I'm quite surprised by Fig 11d. I wonder how this varies across systems: actively reading DRAM consumes a lot more power than simply refreshing it (the paper says refresh accounts for 40%, though I think this varies a bit across DRAM types), but perhaps the base load is so high and the read rates so low that this doesn't make a difference. Or maybe the cache miss rates are all very low?

The highest numbers are around 90 M LLC misses per second. I think Intel chips do 128 B burst reads, so that's around 10 GB/s, which is around 2.5% of my laptop's peak memory bandwidth. Desktop / server RAM can sustain higher read rates. The difference between 0% and 2.5% of the maximum read rate may not be very big.

The paper notes: "On this benchmark, Boost's library performs significantly worse than PCRE. This outlier alone accounts for the entire reported gap between C and C++."

That's pretty embarrassing for Boost. I'd expect that a C++ RE implementation could build an efficient state machine at compile time and feed that through the compiler for further specialisation, whereas PCRE has to do it dynamically.