If you don't know some basics, you can get in trouble.
But if you do and if you get particularly clever, you can get in trouble on the next generation of whatever you're designing for.
TANSTAAFL.
With some of that C code shown, I'd try a "static inline" on the declaration, and see how well the C compiler and its code generator dealt with it. (If I can get out of dealing with it and cede the work to the compiler, all the better...)
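As a sketch of that suggestion (the helper here is hypothetical, just something small and hot enough to be worth inlining):

```c
#include <stdint.h>

/* "static inline" on a small, hot helper lets the compiler drop the
 * body into each call site instead of emitting a call, and "static"
 * keeps it from colliding with other translation units. */
static inline uint32_t clamp_u32(uint32_t v, uint32_t lo, uint32_t hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}
```

Whether the code generator actually inlines it is still its call; the keyword is only a hint, which is exactly the "cede the work to the compiler" point.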
And the caching-locality discussions are the same ones that arise with virtual memory out in main memory; if you bounce around all over the place in the address space (e.g., setting up the wrong stride on an array, or partitioning your data differently from how you access it), you'll incur page faults.
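A minimal sketch of the stride point, assuming C's row-major layout (the array dimensions are arbitrary):

```c
#include <stddef.h>

#define ROWS 512
#define COLS 512

static double grid[ROWS][COLS];

/* C arrays are row-major: walking the last index in the inner loop
 * gives unit stride, so consecutive accesses stay within the same
 * cache lines and pages. */
double sum_unit_stride(void)
{
    double total = 0.0;
    for (size_t r = 0; r < ROWS; r++)
        for (size_t c = 0; c < COLS; c++)
            total += grid[r][c];
    return total;
}

/* Swapping the loops jumps COLS * sizeof(double) bytes per access:
 * same answer, but on a large enough array each access can land on
 * a different page -- the "wrong stride" case described above. */
double sum_column_stride(void)
{
    double total = 0.0;
    for (size_t c = 0; c < COLS; c++)
        for (size_t r = 0; r < ROWS; r++)
            total += grid[r][c];
    return total;
}
```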
And if you get particularly good with your packing of your data, you can sometimes trigger word tearing when you're accessing adjacent data within the same granule from different processes or different threads.
Caches and cache designs, too, are individually funky. There are processors that got faster by going to smaller and fewer caches with better (lower) latency. So if you fell out of cache on the new design you'd assume you'd do worse, but with the lower latency on the path out to memory, you actually did better pulling from main memory than you had pulling from the second-level cache on the previous generation.
Actually, for just about anything non-JITted (short of faking it by hinting the compiler), I see no reason why basic locality optimizations would change across architectures. A sufficiently smart hardware prefetcher deciding what to load into the caches could take care of such a thing, but that can't really be relied on (access patterns are too hard to predict), and it would likely benefit from such hints as well.
This is all taking into account that there is indeed too much of a "good" thing. More extreme methods of optimization can definitely shoot you in the foot, and not just while you're writing them.
The word-tearing case had unrelated data values packed within the same granule of cache storage, and that derailed the running environment in a very subtle way. With just the right (wrong) timing, you very occasionally saw slightly different values in the adjacent variable within the granule when apparently-unrelated threads were spun up; the accesses got tangled and torn. No shared references. Just sharing that granule.