ERROR: pthread_setaffinity_np for core 1 failed with code 22 Marking core 1 offline #744
Comments
Are you running pcm in a restricted cgroup?
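One way to answer that question from inside the process is to compare the number of CPUs the thread may run on with the number the kernel reports as online. The following is a minimal sketch, assuming Linux with glibc; the helper names (`allowedCpuCount`, `onlineCpuCount`, `isCpuRestricted`) are hypothetical and not part of PCM, and some glibc versions or container runtimes may report the online count differently.

```cpp
#include <cassert>
#include <sched.h>    // sched_getaffinity, cpu_set_t, CPU_COUNT (Linux/glibc)
#include <unistd.h>   // sysconf

// Number of logical CPUs the calling thread is permitted to run on,
// or -1 if the affinity mask cannot be read.
int allowedCpuCount() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) return -1;
    const int n = CPU_COUNT(&mask);
    assert(n >= 1);  // a running thread always has at least one permitted CPU
    return n;
}

// Number of logical CPUs the C library reports as online.
long onlineCpuCount() {
    return sysconf(_SC_NPROCESSORS_ONLN);
}

// True when the thread's affinity mask is narrower than the online CPU set,
// which is what a restrictive cpuset cgroup (or taskset) produces.
bool isCpuRestricted() {
    const int allowed = allowedCpuCount();
    return allowed > 0 && allowed < onlineCpuCount();
}
```

If `isCpuRestricted()` returns true, a mismatch like "2 online logical cores" on a 128-core machine would be consistent with a restricted cpuset rather than with cores that are actually offline.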
Hijacking this thread: I'm trying to add pcm to an online monitoring system, but in a restricted cgroup. Can pcm not be run restricted to just some CPUs?
pcm-pcie needs all CPUs (as in the example above). Other pcm utilities typically don't need all CPUs, but pcm won't be able to show per-CPU stats for the excluded CPUs.
Thanks for the response. My use case is using pcm as a library, intermittently getting all the counter states (for all cores, not just the ones pcm's cgroup is limited to) to compute QPI metrics. I can set the cgroup's cpuset to all cores to initialize the PCM instance, but I would ideally restrict it before calling getCounterStates. Does getCounterStates need all CPUs if the initialization had all CPUs?
It might or might not work. This scenario is not tested.
OK, I've tested moving it to a restricted cgroup after initialization. This gives errors in some metrics, but QPI utilization seems to work fine in a restricted cgroup. When I restrict it before initialization, however, I get an exception thrown in discoverSystemTopology (line 1082). Do we need the topologyDomainMap to get QPI metrics across all sockets and links?
There seem to be a couple of places where we pin to core 0: TemporalThreadAffinity aff0(0); Could this instead pin to an available core within the cgroup?
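As an illustration of "pin to an available core", the sketch below picks a reference CPU from the calling thread's current affinity mask instead of assuming core 0. It assumes Linux with glibc; `chooseReferenceCpu` is a hypothetical helper, not a PCM function, and how its result would be passed to TemporalThreadAffinity is left open here.

```cpp
#include <cassert>
#include <sched.h>  // sched_getaffinity, cpu_set_t, CPU_ISSET (Linux/glibc)

// Return `preferred` if the calling thread is allowed to run on it,
// otherwise the lowest-numbered allowed CPU, or -1 if the mask cannot be read.
int chooseReferenceCpu(int preferred) {
    assert(preferred >= 0 && preferred < CPU_SETSIZE);
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) return -1;
    if (CPU_ISSET(preferred, &mask)) return preferred;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &mask)) return cpu;
    }
    return -1;
}
```

Calling `chooseReferenceCpu(0)` keeps the existing behaviour on an unrestricted system and falls back to a permitted core when core 0 is outside the cgroup.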
Let me see...
The other one is within "readCPUMicrocodeLevel".
Could you try changing 0 to socketRefCore[0]?
I tried hardcoding it to 2, which I know is in the cluster of the cgroup. I added some try/catch blocks around the rest, but unfortunately I can't get any QPI measurements :/ At least it doesn't crash.
Basically I'm getting output similar to the following: ERROR: pthread_setaffinity_np for core 0 failed with code 22. But I expect 3 links and 2 sockets, so I guess the topology marking cores offline affects the number of QPI ports?
Yes, PCM assumes that on single-socket systems UPI links don't need to be detected, because UPI links exist only to connect two or more sockets...
Ah, of course, thanks for pointing that out. Is setting the thread affinity necessary to detect that there are multiple sockets?
Yes. You need at least one core on the other socket to be in the cgroup.
OK, I fully see the problem. When pcm is initialized in a restricted cgroup, the try block that populates Entry and fills the socketIdMap fails to add the cores from the other socket, because it cannot set affinity on cores outside the cgroup. It falls into the catch block ("Marking core offline") instead, which makes the system topology inaccurate.
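The failure mode described above can be reproduced outside PCM. On Linux, error code 22 is EINVAL, which the affinity calls return when the requested mask contains no CPU permitted by the cpuset. The sketch below is not PCM's code: it uses `sched_setaffinity` on the calling thread rather than `pthread_setaffinity_np`, and `cpusRejectedWithEinval` is a hypothetical name. It tries each CPU in turn and collects the ones that are rejected, which are the ones a topology scan of this kind would end up treating as offline.

```cpp
#include <cassert>
#include <cerrno>
#include <sched.h>  // sched_getaffinity, sched_setaffinity (Linux/glibc)
#include <vector>

// Try to pin the calling thread to each CPU in [0, cpuCount) and return the
// CPUs for which the kernel answers EINVAL. The original mask is restored.
std::vector<int> cpusRejectedWithEinval(int cpuCount) {
    assert(cpuCount >= 0);
    std::vector<int> rejected;
    cpu_set_t saved;
    CPU_ZERO(&saved);
    if (sched_getaffinity(0, sizeof(saved), &saved) != 0) return rejected;
    for (int cpu = 0; cpu < cpuCount && cpu < CPU_SETSIZE; ++cpu) {
        cpu_set_t one;
        CPU_ZERO(&one);
        CPU_SET(cpu, &one);
        if (sched_setaffinity(0, sizeof(one), &one) != 0 && errno == EINVAL) {
            rejected.push_back(cpu);  // outside the cpuset, or not present/online
        }
    }
    sched_setaffinity(0, sizeof(saved), &saved);  // best-effort restore
    return rejected;
}
```

If every core of the second socket shows up in the returned list, no socket ID can be recorded for that socket, which matches the single-socket view reported in this thread.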
Aha, and the reason it needs the thread affinity RAII there is so that getting the apic_id will work? That uses pcm_cpuid, which calls the cpuid instruction, which returns "apic of the current logical processor"?
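To make the dependency on affinity concrete: CPUID reports the APIC ID of whichever logical processor happens to execute it, so the thread has to be moved to the target core first. The sketch below is an illustration under assumptions, not PCM's implementation. It uses the GCC/Clang `<cpuid.h>` helper `__get_cpuid` and CPUID leaf 1 (initial APIC ID in EBX[31:24]), while PCM goes through its own pcm_cpuid wrapper and may read a different leaf. `currentApicId` and `apicIdOnCpu` are hypothetical names, and on non-x86 builds the functions simply return -1.

```cpp
#include <cassert>
#include <sched.h>  // sched_getaffinity, sched_setaffinity (Linux/glibc)
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>  // __get_cpuid (GCC/Clang, x86 only)
#endif

// Initial APIC ID of the logical processor executing this call, or -1 if
// CPUID leaf 1 is unavailable (for example on a non-x86 build).
int currentApicId() {
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        return static_cast<int>(ebx >> 24);  // CPUID.1:EBX[31:24]
    }
#endif
    return -1;
}

// Temporarily pin the calling thread to `cpu`, read the APIC ID there, then
// restore the previous mask. Returns -1 if pinning fails (for example with
// EINVAL when `cpu` lies outside the cgroup's cpuset).
int apicIdOnCpu(int cpu) {
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t saved;
    CPU_ZERO(&saved);
    if (sched_getaffinity(0, sizeof(saved), &saved) != 0) return -1;
    cpu_set_t one;
    CPU_ZERO(&one);
    CPU_SET(cpu, &one);
    if (sched_setaffinity(0, sizeof(one), &one) != 0) return -1;
    const int id = currentApicId();
    sched_setaffinity(0, sizeof(saved), &saved);  // best-effort restore
    return id;
}
```

The save, pin, read, restore sequence is the same shape as a temporary-affinity RAII guard, written out by hand to keep the example short.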
Correct.
Thanks for the help with this. In cases where cores are inactive, I wonder if you could just read /proc/cpuinfo to get the topology instead of running cpuid on each core?
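A rough sketch of that idea follows, assuming the x86 Linux layout of /proc/cpuinfo, where each logical CPU has a numeric "processor" line followed by a "physical id" line naming its socket. All function names here are hypothetical, and none of this is existing PCM code. The parser takes a stream so it can be exercised on sample text, since other architectures and some container setups present this file differently.

```cpp
#include <cassert>
#include <fstream>
#include <istream>
#include <map>
#include <set>
#include <sstream>
#include <string>

// Parse /proc/cpuinfo-style text into: logical processor -> "physical id".
std::map<int, int> parseSocketMap(std::istream& in) {
    std::map<int, int> cpuToSocket;
    std::string line;
    int currentCpu = -1;
    while (std::getline(in, line)) {
        const std::string::size_type colon = line.find(':');
        if (colon == std::string::npos) continue;    // blank separator lines
        std::string key = line.substr(0, colon);
        key.erase(key.find_last_not_of(" \t") + 1);  // drop padding before ':'
        const std::string value = line.substr(colon + 1);
        if (key == "processor") {
            currentCpu = std::stoi(value);
        } else if (key == "physical id" && currentCpu >= 0) {
            cpuToSocket[currentCpu] = std::stoi(value);
        }
    }
    return cpuToSocket;
}

// Number of distinct sockets seen in the map.
int socketCount(const std::map<int, int>& cpuToSocket) {
    std::set<int> sockets;
    for (const auto& entry : cpuToSocket) sockets.insert(entry.second);
    return static_cast<int>(sockets.size());
}

// Convenience wrapper for the real file; returns an empty map if unreadable.
std::map<int, int> readSocketMapFromProc() {
    std::ifstream in("/proc/cpuinfo");
    return parseSocketMap(in);
}

// Usage on a two-socket sample, one logical CPU per socket.
void exampleUsage() {
    std::istringstream sample(
        "processor\t: 0\nphysical id\t: 0\n\n"
        "processor\t: 1\nphysical id\t: 1\n");
    const std::map<int, int> topo = parseSocketMap(sample);
    assert(topo.size() == 2);
    assert(socketCount(topo) == 2);
}
```

Because this reads text instead of pinning threads, it would not hit the EINVAL path for cores outside the cgroup. Whether the resulting topology is sufficient for PCM's QPI/UPI detection is exactly the open experiment mentioned in this thread.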
Yes, one can experiment with this and see how far we can go to support such a config.
There are 128 cores in my server, but pcm-pcie said "the number of online logical cores: 2". It cannot detect the other 126 cores as online. But when I run lscpu on my server, all 128 cores are online. How can I fix this problem?