If you don't know some basics, you can get in trouble.
But if you do and if you get particularly clever, you can get in trouble on the next generation of whatever you're designing for.
TANSTAAFL.
With some of that C code shown, I'd try a "static inline" on the declaration, and see how well the C compiler and its code generator dealt with it. (If I can get out of dealing with it and cede the work to the compiler, all the better...)
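As a sketch of that suggestion (the helper here is hypothetical, just something small and hot enough to be worth inlining):

```c
#include <stdint.h>

/* "static inline" on a small, hot helper lets the compiler drop the
 * body into each call site instead of emitting a call, and "static"
 * keeps it from colliding with other translation units. */
static inline uint32_t clamp_u32(uint32_t v, uint32_t lo, uint32_t hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}
```

Whether the code generator actually inlines it is still its call; the keyword is only a hint, which is exactly the "cede the work to the compiler" point.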
And the caching-locality discussions are the same ones that arise with virtual memory out in main memory; if you bounce around all over the place in the address space (e.g., setting up the wrong stride on an array, or partitioning your data differently from how you access it), you'll incur page faults.
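A minimal sketch of the stride point, assuming C's row-major layout (the array dimensions are arbitrary):

```c
#include <stddef.h>

#define ROWS 512
#define COLS 512

static double grid[ROWS][COLS];

/* C arrays are row-major: walking the last index in the inner loop
 * gives unit stride, so consecutive accesses stay within the same
 * cache lines and pages. */
double sum_unit_stride(void)
{
    double total = 0.0;
    for (size_t r = 0; r < ROWS; r++)
        for (size_t c = 0; c < COLS; c++)
            total += grid[r][c];
    return total;
}

/* Swapping the loops jumps COLS * sizeof(double) bytes per access:
 * same answer, but on a large enough array each access can land on
 * a different page -- the "wrong stride" case described above. */
double sum_column_stride(void)
{
    double total = 0.0;
    for (size_t c = 0; c < COLS; c++)
        for (size_t r = 0; r < ROWS; r++)
            total += grid[r][c];
    return total;
}
```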
And if you get particularly good with your packing of your data, you can sometimes trigger word tearing when you're accessing adjacent data within the same granule from different processes or different threads.
Caches and cache designs, too, are individually funky. There are processors that got faster by going to smaller and fewer caches with better (lower) latency. So if you fell out of cache on the new design you'd assume you'd do worse, but with the lower latency on the path out to memory, you actually did better pulling from main memory than you had pulling from the second-level cache on the previous generation.
Actually, for just about anything non-JITted (short of faking it by hinting the compiler), I see no reason why basic locality optimizations would change across architectures. A sufficiently smart hardware prefetcher deciding what to load into the caches could take care of such a thing, but that can't really be relied on (access patterns are too hard to predict), and it would likely benefit from such hints as well.
This is all taking into account that there is indeed too much of a "good" thing. More extreme methods of optimization can definitely shoot you in the foot, and not just while you're writing them.
The word-tearing case had unrelated data values packed within the same granule of cache storage, and that derailed the running environment in a very subtle way. With just the right (wrong) timing, you very occasionally saw slightly different values in the adjacent variable within the granule when apparently-unrelated threads were spun up; the accesses got tangled and torn. No shared references. Just sharing that granule.