> I believe all SPARC implementations have been TSO (i.e. exactly the same as x86...

KMag · on Jan 10, 2021

> GP may have confused SPARC and Alpha

I seem to have confused which of the three SPARC memory models (TSO, PSO, and RMO) was the default and supported on all implementations. I'm well aware of the Alpha's memory model.

> e.g. if you load a pointer then read the value at that pointer you may get stale data

For anyone thinking that sounds horrifying: a single core doing both the writes and the reads won't observe anything out of the ordinary.

The designers of the Alpha thought, "Okay, we can simplify and speed up the cache coherency logic if the 'happens-before' ordering introduced by a memory fence between two writes makes no guarantees about two reads on another core, unless there's also a memory fence between those two reads." On Alpha that holds even when the address used by the second read comes from the first read. This sounds totally reasonable, particularly at a time when lockfree data structures weren't as popular. Reads and writes done while holding a mutex work just as they do on any other processor: the memory fence involved in acquiring a mutex guarantees that any reads made while holding the mutex will see all writes made before the previous release of that mutex.
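To make the mutex case concrete, here is a minimal sketch in C with POSIX threads. The names `lock`, `payload`, `ready`, `produce`, and `consume` are hypothetical, chosen only for illustration. Note that neither function contains an explicit fence: the lock and unlock calls supply whatever ordering the hardware needs, so this code is equally correct on Alpha.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical shared state, protected by a single mutex. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int payload;   /* data written by the producer */
static bool ready;    /* flag set once payload has been written */

/* Producer: both writes happen while holding the mutex. The unlock is a
 * release operation, so a later lock of the same mutex sees both writes. */
static void *produce(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);
    payload = 42;
    ready = true;
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Consumer: reads under the same mutex. If it observes ready == true, the
 * unlock/lock pairing guarantees it also observes payload == 42, because
 * the mutex implementation issues whatever fences the hardware requires.
 * Returns -1 if the producer hasn't run yet. */
static int consume(void) {
    int value = -1;
    pthread_mutex_lock(&lock);
    if (ready)
        value = payload;
    pthread_mutex_unlock(&lock);
    return value;
}
```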

The difficulty with the Alpha memory model comes mostly in writing lockfree data structures, where safety comes from modifying some data structure while no other thread can see it, then atomically changing a pointer (usually with a load-locked/store-conditional or compare-and-swap) to make it visible to other threads. On most architectures, you only need a memory fence right before (or as part of) the write that changes the pointer, and that happens-before guarantee will be seen by all ordinary reads on other cores. On Alpha, you also need a memory fence between reading that pointer and chasing it. In C, loading the pointer and chasing it happen in a single expression such as *p, so you need to load the global pointer into a local variable, execute a memory fence, and then dereference the local copy (or else use inline assembly; C11 atomics came out way, way after the DEC Alpha AXP).
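A rough sketch of that lockfree pattern in C11, assuming a push-only linked list. `struct node`, `head`, `push`, and `read_newest` are hypothetical names, not from any particular library. The reader follows the recipe described above: copy the shared pointer into a local, fence, then dereference. I've used a relaxed load plus `atomic_thread_fence(memory_order_acquire)` to stand in for the pre-C11 fence.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical node type for a push-only lock-free linked list. */
struct node {
    int value;
    struct node *next;
};

/* Shared head pointer that other threads read. */
static _Atomic(struct node *) head = NULL;

/* Writer: fill in the node while no other thread can see it, then make it
 * visible with a compare-and-swap. Release ordering on success plays the
 * role of "a memory fence right before the write that changes the pointer". */
static int push(int value) {
    struct node *n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    n->value = value;
    n->next = atomic_load_explicit(&head, memory_order_relaxed);
    /* On failure the CAS stores the current head into n->next; just retry. */
    while (!atomic_compare_exchange_weak_explicit(
               &head, &n->next, n,
               memory_order_release, memory_order_relaxed))
        ;
    return 0;
}

/* Reader: copy the shared pointer into a local, fence, and only then
 * dereference it. At the hardware level, most architectures order the
 * dependent load on their own; on Alpha this fence has to become an `mb`
 * instruction, and it is what prevents reading a stale local->value.
 * Returns -1 if the list is empty. */
static int read_newest(int *out) {
    struct node *local = atomic_load_explicit(&head, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);
    if (local == NULL)
        return -1;
    *out = local->value;
    return 0;
}
```

A single `memory_order_acquire` load of `head` would do the same job. `memory_order_consume` was intended for exactly this pointer-chasing case, but mainstream compilers currently just treat it as acquire.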

Lockfree data structures became much more popular well after the Alpha was released, so it's understandable that the Alpha designers were willing to make optimizations that don't affect code that correctly uses mutexes to protect global mutable state.