> The SPARC memory model is also more lax than x86, so x86 hides more concurrenc...

KMag · on Jan 9, 2021

> Am I correct that more concurrency bugs would be hidden in a less lax architecture?

No, a more strict (less lax) memory model gives the processor less freedom to re-order memory operations, meaning that missing memory fences (explicit ordering) have potentially less effect vs. more lax memory models.

The SPARC memory model makes fewer guarantees (less strict), allowing the processor more freedom in ordering memory operations, potentially getting more performance. (There's a reason aarch64 went with a more lax memory model than x86, despite being designed decades later.) The downside is that bugs with missing memory fences are more likely to show up.

commandlinefan · on Jan 9, 2021

I’m having trouble following - can you post a code example?

Majromax · on Jan 9, 2021

This has been in the news again lately with Apple Silicon. The ARM architecture has a weaker memory model than x86 in that it does not normally provide "total store ordering". Under total store ordering, if thread A executes (from an initially zeroed heap):

    a = 1; b = 1;

then thread B can safely execute:

    if (b == 1) assert(a == 1); 
    if (a == 0) assert(b == 0);

x86 provides this guarantee, but ARM does not -- thread B might see the b=1 write before the a=1 write.
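For anyone who wants to try this, here is a minimal sketch of the first assertion in portable C++11. It assumes `b` is turned into a `std::atomic` flag that publishes the plain write to `a`; the function names `writer`, `reader`, and `run_once` are made up for illustration. With plain `int`s the program would have a data race as far as the C++ standard is concerned, whatever the hardware does.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int a = 0;               // ordinary data
std::atomic<int> b{0};   // flag that publishes the write to a

// Thread A: write the data, then set the flag with release ordering,
// so the write to a cannot become visible after the write to b.
void writer() {
    a = 1;
    b.store(1, std::memory_order_release);
}

// Thread B: wait for the flag with acquire ordering, then read the data.
int reader() {
    while (b.load(std::memory_order_acquire) != 1) {
        std::this_thread::yield();
    }
    return a;  // the release/acquire pair guarantees this reads 1
}

// Runs one writer and one reader and returns what the reader saw in a.
int run_once() {
    a = 0;
    b.store(0);
    int seen = -1;
    std::thread tb([&] { seen = reader(); });
    std::thread ta(writer);
    ta.join();
    tb.join();
    return seen;
}
```

If both memory orders are weakened to `std::memory_order_relaxed`, the language no longer promises that `reader` returns 1. On x86 the hardware would typically still deliver the stores in order (the compiler remains free to reorder them), while on ARM the reader may really observe `b == 1` with `a == 0`.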

Apple Silicon has a runtime toggle (https://old.reddit.com/r/hardware/comments/i0mido/apple_sili...) to provide for that behaviour, which greatly improves performance of translated x86 code (i.e. the translator does not need to insert precautionary memory barriers).

KMag · on Jan 10, 2021

> if (a == 0) assert(b == 0);

Even on x86, you can't make this assertion without any synchronization primitives (mutexes, etc.). Without synchronization, the a = 1; b = 1; can run between the (a == 0) and assert(b == 0).
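A rough sketch of one way to make the second check safe, assuming both variables live behind a single `std::mutex`. The `Shared` struct and the function names are hypothetical, not anything from the original example.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

struct Shared {
    std::mutex m;
    int a = 0;
    int b = 0;
};

// Thread A: both writes happen while holding the lock.
void write_both(Shared& s) {
    std::lock_guard<std::mutex> lock(s.m);
    s.a = 1;
    s.b = 1;
}

// Thread B: both reads happen while holding the same lock, so the writer
// cannot run between the test of a and the test of b.
// Returns true when "a == 0 implies b == 0" holds.
bool check_invariant(Shared& s) {
    std::lock_guard<std::mutex> lock(s.m);
    return s.a != 0 || s.b == 0;
}

// Races one writer against one checker and reports the checker's result.
bool run_check() {
    Shared s;
    bool ok = false;
    std::thread tb([&] { ok = check_invariant(s); });
    std::thread ta([&] { write_both(s); });
    ta.join();
    tb.join();
    return ok;
}
```

With the lock held, the checker sees either both values at 0 or both at 1, so the check passes regardless of which thread wins the race and regardless of the hardware memory model.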

Majromax · on Jan 11, 2021

Ah, of course you're right. I added that line as a bit of an afterthought and meant for it to be atomic, but of course it isn't. Unthinking parallelism is pitch black, and you are likely to be eaten by a grue.

qwertycrackers · on Jan 9, 2021

No, a strict concurrency model means things are more likely to be consistently sequenced; that is, you'll encounter fewer actual concurrency artifacts. If this model is made more lax, you might discover that you had UB which the previous model was not exploiting.

mhh__ · on Jan 9, 2021

You're more likely to luck into the behaviour you want with x86 than a more relaxed memory model. I think the Alpha takes the pip for the latter, but I was only something like 3 months old when the IP was sold to Intel let alone in widespread use, so I could be wrong.

KMag · on Jan 9, 2021

Yes, the DEC Alpha AXP was a beast of a chip family. The Alpha design team made nearly as few guarantees as possible in order to leave nearly as much room for optimization as possible. The Alpha's lax memory model provided the least-common denominator upon which the Java memory model is based. A stronger Java memory model would have forced a JVM on the Alpha to use a lot more atomic operations.

All processors (or at least all processors I'm aware of) will make a core observe its own writes in the order they appear in the machine code. That is, a core by itself will be unable to determine if the processor performs out-of-order memory operations. If the machine code says Core A makes writes A0 and then A1, it will always appear to Core A that A0 happened before A1. As far as I know, all processors also ensure that all cores will agree on a single globally consistent observed ordering of all atomic reads and atomic writes. (I can't imagine what atomic reads and writes would even mean if they didn't at least mean this.)

On top of the basic minimum guarantees, x86 and x86_64 (as well as some SPARC implementations, etc.) have a Total Store Ordering memory model: if Core A makes write A0 followed by A1, and Core B makes write B0 followed by B1, the two cores may disagree about whether A0 or B0 happened first, but they will always agree that A0 happened before A1 and B0 before B1, even if none of the writes are atomic.

In a more relaxed memory model like the SPARC specification or AArch64 specification (and I think RISC-V), if the machine code says Core A makes write A0 before A1, Core B might see A1, but not yet see A0, unless A0 was an atomic write. If Core B can see a given atomic write from Core A, it's also guaranteed that Core B can see all writes (atomic and non-atomic) that Core A thinks it made before that atomic write.

With the DEC Alpha, the hardware designers left themselves almost the maximum amount of flexibility that made any semantic sense: if Core B makes an atomic read, then that read (and any reads coming after it in machine code order) is guaranteed to see the latest atomic write from Core A, and all writes that came before that atomic write in machine code order. On the Alpha, you can think of it as all of the cores having unordered output buffers and unordered input buffers, where atomic writes flush the core's output buffer and atomic reads flush the input buffer. All other guarantees are off. (Note that even under this very lax memory model, as long as a mutex acquisition involves an atomic read and an atomic write, and a mutex release involves an atomic write, you'll still get correct behavior if you protect all reads and writes of shared mutable state with mutexes. A reader's atomic read in mutex acquisition guarantees that all reads while holding the mutex will see all writes made before another thread released the mutex.) This might be slightly wrong, but it's roughly what I remember of the Alpha memory model.
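The parenthetical about mutexes can be sketched in C++ as a toy spinlock: acquisition is an atomic read-modify-write with acquire ordering, release is an atomic write with release ordering, and the protected counter is a plain non-atomic variable. `SpinLock` and `count_with_lock` are hypothetical names, and this is only an illustration of the ordering argument, not how a production mutex is written.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

class SpinLock {
    std::atomic<bool> locked{false};
public:
    // Acquire: atomic exchange (a read plus a write). Once it succeeds,
    // everything written before the previous unlock() is visible here.
    void lock() {
        while (locked.exchange(true, std::memory_order_acquire)) {
            std::this_thread::yield();
        }
    }
    // Release: atomic write that publishes everything done under the lock.
    void unlock() {
        locked.store(false, std::memory_order_release);
    }
};

// Two threads each increment a plain counter `iters` times under the lock.
long count_with_lock(int iters) {
    SpinLock lock;
    long counter = 0;  // ordinary shared state, only touched under the lock
    auto work = [&] {
        for (int i = 0; i < iters; ++i) {
            lock.lock();
            ++counter;
            lock.unlock();
        }
    };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    return counter;
}
```

Because every access to `counter` sits between an acquire and a release on the same atomic, `count_with_lock(n)` returns `2 * n` even on a very relaxed memory model; the compiler is responsible for emitting whatever barriers the target hardware needs.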

The thing that confused some programmers with the Alpha is that with most memory models, if one thread makes a ton of atomic writes, and another thread makes a ton of non-atomic reads, the reading thread will still never see the writes in a different order than what the writer thought it wrote. There's no such guarantee on Alpha.

On a side note, the Alpha team was also pretty brutal about only allowing instructions that were easy for compilers to generate and showed a performance improvement in simulations on some meaningful benchmark. The first generation of the Alphas didn't even have single-byte loads or stores and relied on compilers to perform single-byte operations by bit manipulation on 32-bit and 64-bit loads and stores.
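To give a feel for what "single-byte operations by bit manipulation" means, here is a sketch in C++ of a byte store and byte load built only from 32-bit word operations. The helper names are invented and this is not the actual Alpha instruction sequence, just the general masking and shifting idea, assuming byte lane 0 is the least significant byte.

```cpp
#include <cassert>
#include <cstdint>

// Returns `word` with byte lane `index` (0 = least significant) replaced
// by `value`, using only word-sized masking and shifting.
std::uint32_t store_byte(std::uint32_t word, unsigned index, std::uint8_t value) {
    unsigned shift = (index & 3u) * 8u;
    std::uint32_t mask = std::uint32_t{0xFF} << shift;
    return (word & ~mask) | (std::uint32_t{value} << shift);
}

// Extracts byte lane `index` from `word`.
std::uint8_t load_byte(std::uint32_t word, unsigned index) {
    unsigned shift = (index & 3u) * 8u;
    return static_cast<std::uint8_t>((word >> shift) & 0xFFu);
}
```

For example, `store_byte(0x11223344, 1, 0xAB)` gives `0x1122AB44`. Note that a byte store done this way is a read-modify-write of the whole word, so two threads updating neighbouring bytes of the same word can overwrite each other's updates unless something else makes the sequence atomic.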

Many of the Alpha design people went on to the AMD K6 III (first AMD chips to give Intel a run for their money in the desktop gaming market), the P.A. Semi PWRficient (acqui-hired by Apple to start their A-series / Apple Silicon team), AMD Ryzen, etc.

When I bought my first computer in the fall of 1997, the fastest Intel desktop processors were 300 MHz PIIs. DEC Alphas at the time were running at 500 MHz, and had more instructions per clock, particularly in the floating point unit. The Cray T3E supercomputer used DEC Alphas for good reason.

masklinn · on Jan 9, 2021

> On top of the basic minimum guarantees, x86 and x86_64 (as well as some SPARC implementations, etc.) have a Total Store Ordering memory model: if Core A makes write A0 followed by A1, and Core B makes write B0 followed by B1, the two cores may disagree about whether A0 or B0 happened first, but they will always agree that A0 happened before A1 and B0 before B1, even if none of the writes are atomic. In a more relaxed memory model like the SPARC specification

AFAIK SPARC has always used TSO by default, and while v8 and v9 introduced relaxed memory modes (opt-in), these have been dropped from recent models e.g. M7 is back to essentially TSO-only. While it is backwards-compatible and supports the various instructions and fields, it ignores them and always runs in TSO.

jabl · on Jan 9, 2021

IIRC SunOS/Solaris always used TSO. Linux originally used RMO, but switched to TSO once chips that only supported TSO appeared on the market.