My c++ is not great (so it is hard for me to tell what is going on) and I'm used to OpenMP where my understanding has always been that you tend to get a single thread per processor (or per hyper-thread) -- not sure if that is guaranteed with the way your code is laid out? Perhaps it really is a NUMA issue as others suggest. I will note that one other variation I had (as it looks like you are already splitting across threads) is that the chunk sizes were actually smaller than the # of threads which meant a faster thread would take more chunks rather than waiting on the slowest thread. Good luck!