Any ideas why perf_event_open() would often return 0 CPU instructions retired when I pin a thread to a particular core?
perf_event_open() always get zero when specifying CPU ID

I am reading Perf performance counters using this example: https://gist.github.com/teqdruid/2473071 However, instead of: fd[0] = perf_event_open(&attr[0], getpid(), -1, -1, 0); I only want to ...


I guess sometimes the kernel can't fit all the requested counters into the limited number the hardware provides, and then it either time-multiplexes them (giving reduced data) or just... returns 0?

And it kinda seems like whatever else is using the counters is some other process, since I still get intermittent issues even when I use a lock to limit my code to one counter at a time.

And you can actually compare time_enabled vs time_running, and yeah, this is a problem.

@itamarst IIRC, since Sandy Bridge (i.e. a long time ago) you can have at least 4 hardware performance counters active at a time per core. I know it's 6 on newer Zens. But yeah, if you try to enable more counters than that, the software driver (perf in this case) has to time-multiplex the hardware counters and rely on statistical sampling to estimate the count. Not sure if that's the actual reason for what you're seeing, though.

@pervognsen @itamarst Intel had eight per physical core, halved to four per logical core when booting with HT enabled.

AMD always has six per logical core, in contrast.

@pervognsen I eventually discovered that setting _both_ a CPU core restriction _and_ a process id restriction made sched_setaffinity() no longer break the results from perf_event_open(). So now I'm fiddling with APIs, then docs, then a release: https://codeberg.org/itamarst/bigo

(I guess also figure out if I want to fight a name squatter or pick another crate name.)

UPDATE: Nope, still broken, I was being fooled by my fallback logic which switches to time-based measures.

@itamarst @pervognsen I'd suggest using .observe_self instead of .observe_pid (monitoring another task needs elevated privileges; I'm surprised it worked at all for you. .observe_self, i.e. passing pid=0 to perf_event_open, is specifically there to support the self-monitoring use case, which needs fewer privileges)
@amonakov @pervognsen I want to observe a single thread, not the whole process (there are other threads I don't want to observe). Also it's not clear to me you actually _ought_ to need more privileges, given it's a thread within the same process. The man page might be wrong in this edge case? Would have to dig into the Linux kernel source code to check, though.
@itamarst @amonakov @pervognsen you can specify the TID. IIRC, the problem with specifying the core is you can get access to counters that don't clearly map to a process/uid.

@pkhuong I'm only counting retired CPU instructions. As mentioned, the reason I need to specify the CPU core is that otherwise using sched_setaffinity() to limit to that core results in readings of 0 instead of the real numbers... At some point I should verify whether that happens with a different high-level wrapper, I guess.

UPDATE: Nope, still broken when using both, I was being fooled by my fallback logic which switches to time-based measures.

@itamarst @pervognsen perf_event_open with pid=0 is limited to the current thread, and if the event attribute .inherit is not set, the configured event is not propagated to new tasks (threads/processes)
@itamarst I wonder whether, if you used cpusets to isolate a core (https://documentation.ubuntu.com/real-time/latest/how-to/isolate-workload-cpusets/) and only ran your workload on it, that would help. Although it's a lot of setup for what should be simple.

@dgl I can do that, and it does help, yeah, but this is for a library intended for use in unit tests, and that puts a burden on users that I don't want to impose.

To get a better mental model, I may need to dig into the Linux kernel source to check whether it's doing the easy thing or the clever thing, or maybe run some experiments. Specifically: I am asking for a hardware counter limited to one thread, and CPU cores only have a limited number of counters (4, for example).

The easy thing for the kernel to do is add the counter to all cores, since the thread might migrate. Which means other threads now have fewer hardware counters available.

The smart thing is to take core affinity into account and only claim counters on the relevant cores. That's harder, since you need to update when affinity changes. And... it doesn't explain where the 0 readings I was getting were coming from. So maybe my mental model is wrong. Or maybe the library I'm using has a bug.

(There's also the ability to ask for CPU counters for a specific _core_, which combined with pinning the thread to that core should work... except my perf_event_open() gets a kernel error when I try that. So that's another thing I need to look into.)

Apparently if you limit to _both_ a thread id (pid) _and_ a CPU id, then perf_event_open() will happily give you results.

UPDATE: Nope, still broken, I was being fooled by my fallback logic which switches to time-based measures.