Are the details really helpful for performance work? MOSI, MOESI, MERSI, MESIF e...

vardump · on Nov 21, 2019

> Of course you might end up analyzing assembly level perf traces in a hot path for some game console with a less known CPU architecture

Or if you work in finance, mining, oil industry, medical, biosciences and countless other fields where you need to get good performance out of your hardware. Yes, even if you use GPUs, they're no magic bullet, they also have their architectural bottlenecks, strengths and weaknesses.

Or if you care about power consumption.

There are a lot of reasons to optimize hot paths and inner loops. CPU single core performance isn't improving much anymore, and we need to make better use of what we have.

fulafel · on Nov 21, 2019

I was trying to say that even in those cases you almost never benefit from knowing the details of cache coherency.

vardump · on Nov 21, 2019

That's a very bold statement.

You could just as well for example say that front end Javascript developer almost never needs to understand event callbacks or how DOM works.

If you write multithreaded high performance code, yeah, you do need to know about cache coherency at varying levels of detail. Sometimes rough rules of thumb work, not too often you need to understand all those annoying performance destroying details that leak through cache abstraction.

fulafel · on Nov 21, 2019

DOM and event callbacks are basic bread and butter for JS developers and in web apps it's essential to understand how DOM works and how to minimize DOM changes when doing performance work. Can you come up with an example where a oil or finance sector developer needs to understand MESIF?

Or are we having a terminology confusion and you use the coherency term for software visible performance characteristics of caches generally? I do agree that understanding cache effects and their intersection with multiprocessing is generally important in perf work. As is understanding the architected memory model, which tells you what you can and can't rely on semantically.

vardump · on Nov 21, 2019

> Can you come up with an example where a oil or finance sector developer needs to understand MESIF?

Yes. When they're writing high performance multithreaded analysis software and they're deciding which cache lines to write to and which only to read from. Those lines you only read from can be in S (shared) or F (forward) state.

And why is this important? Performance characteristics of a line entering in M (modified) state are pretty bad, if the line is shared between multiple cores.

Perhaps you'll also want to do all reads at once, knowing the line will more likely remain in F state for short periods, instead of bouncing S/F line state between CPU cores?

NUMA comes also to play, making this mistake even more costly. You really want to keep inter-core (especially NUMA socket!) communication to the minimum.

Of course, you could say you don't strictly need to understand MESI (or MESIF), but it really helps understanding why you do things certain way and reasoning about multi-core performance. The thing is, you can say same "you don't need to know this" about a lot of "low level details" in software trade.

Just like you need to understand DOM as a front-end developer to minimize DOM changes, even if you don't access it directly.

In cache coherency case it's analogously about reducing unnecessary multicast and broadcast messages sent between cores.

fulafel · on Nov 21, 2019

Thank you for writing a scenario and how it relates to coherency mechanisms.

So now we get to thinking about whether this gives the developer an advantage over just knowing "Dirtying cache lines across different cores/threads is slow". I don't think I would conclude so here.

But yeah I like reading details about microarchitectural details and other computer architecture topics, and am symphatetic to the point of view that knowing the "why" is nice. Just like I find it interesting to read about how DOM APIs are implemented in browsers and why they are hard to make faster...

vardump · on Nov 21, 2019

> So now we get to thinking about whether this gives the developer an advantage over just knowing "Dirtying cache lines across different cores/threads is slow". I don't think I would conclude so here.

The hidden thing behind all this is that even if the data is just read-shared, it can still generate traffic between cores and sockets.

Since these communication links are a shared resource [0], doing things wrong hurts performance in unrelated code and cores. Just because of storm of cache coherency packets is being sent between cores.

So yeah, you really do want to minimize this to maximize performance and scalability across the whole system!

[0]: In Intel's case, this shared resource is ring bus inside CPU socket and QPI between CPU sockets.

fulafel · on Nov 21, 2019

True, high validation traffic of read-shared data can indeed have effects in some cases, especially on multi socket systems, and it can sometimes be beneficial to have private copies of even read-only data for different threads if the data is small.

m0zg · on Nov 21, 2019

>> Dirtying cache lines across different cores/threads is slow

Very succinct and correct. This is particularly pernicious with atomics, which people use for lock-free stuff such as queues and work stealing thread pools. If atomics used by different threads/cores share a cache line and you mutate them a lot, perf gets instantly fucked, sometimes worse than if you used a mutex in the first place. And if you don't benchmark, you aren't going to notice.