I guess sometimes the kernel can't fit all the requested counters into the limited number the hardware provides, and then it either multiplexes them (each one only counts part of the time, giving reduced data) or just... returns 0?
And it kinda seems like whatever is using the counters is some other process, since I still get intermittent issues even when I use a lock to limit my code to one counter at a time.
And you can actually read the time enabled vs. the time running for a counter, and yeah, they don't match, so this is a problem.
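For anyone following along, here's a rough sketch of how those two times can be used. It assumes a single (non-group) event opened with `read_format = PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING`, in which case `read()` on the fd returns three u64 values: the count, the time enabled, and the time running. The helper name `scale_reading` is made up for illustration, and the scaled number is only an estimate, not a real measurement.

```python
import struct

# Assumed layout for a single (non-group) event read with both
# TOTAL_TIME_ENABLED and TOTAL_TIME_RUNNING requested:
# value, time_enabled, time_running, each a native-endian u64.
READ_LAYOUT = struct.Struct("=QQQ")


def scale_reading(buf):
    """Return (estimated_count, fraction_of_time_running).

    estimated_count is None when the event never ran on the hardware,
    since a raw value of 0 means "no data" in that case, not "zero events".
    """
    value, time_enabled, time_running = READ_LAYOUT.unpack(buf)
    if time_enabled == 0 or time_running == 0:
        return None, 0.0
    fraction = time_running / time_enabled
    # Extrapolate the count to the whole enabled window.
    return value * time_enabled / time_running, fraction


# Example: the counter was only on the hardware half the time.
example = struct.pack("=QQQ", 1000, 200, 100)
print(scale_reading(example))
```

If the fraction comes back below 1.0 the counter was multiplexed, and if the time running is 0 the counter never got scheduled at all, which might be one source of the 0 readings.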
@dgl I can do that, and it does help, yeah, but this is for a library intended for use in unit tests, and that puts a burden on users that I don't want to impose.
To get a better mental model, I may need to dig into the Linux kernel source code to check whether it's doing the easy thing or the clever thing, or maybe do some experiments. Specifically: I am asking for a hardware counter that's limited to one thread. CPU cores have a limited number (4, for example) of counters.
The easy thing for the kernel to do is reserve the counter on all cores, since the thread might migrate. Which means other threads now have fewer hardware counters available.
The smart thing is to take into account core affinity and only ask for counters on relevant cores. This is harder since you need to update when affinity changes. And... doesn't explain where the 0 readings I was getting were coming from. So maybe my mental model is wrong. Or maybe the library I'm using has a bug.
(There's also the ability to ask for CPU counters for a specific _core_, which combined with pinning the thread to that core should work... except my perf_event_open() call gets a kernel error when I try to do that. So that's another thing I need to look into.)
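As a note to self, here's my reading of how the perf_event_open() man page describes the `pid` and `cpu` arguments, written out as a tiny lookup. `describe_target` is a made-up helper for illustration, not part of any library, and it only covers the plain cases (no cgroup mode). If I was passing `pid == -1` with a specific core, the privilege requirement might explain the error, but that's a guess.

```python
def describe_target(pid, cpu):
    """Summarize what perf_event_open(attr, pid, cpu, ...) would measure.

    Based on my reading of the man page; a sketch, not an authority.
    """
    if pid == -1 and cpu == -1:
        return "invalid: the kernel rejects this combination"
    if pid == -1:
        # System-wide on one CPU is a privileged operation.
        return (f"every thread on CPU {cpu}; needs CAP_PERFMON or "
                "CAP_SYS_ADMIN, or perf_event_paranoid < 1")
    who = "calling thread" if pid == 0 else f"thread {pid}"
    if cpu == -1:
        return f"{who} on any CPU"
    return f"{who}, only while it runs on CPU {cpu}"


for args in [(0, -1), (0, 2), (-1, 2), (-1, -1)]:
    print(args, "->", describe_target(*args))
```

So the per-thread, per-core case I want would be `pid == 0` with `cpu >= 0`, which shouldn't need extra privileges beyond what the per-thread case already needs, assuming I've read it right.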