The paper was co-authored by an NVIDIA chap, although the prototyping was x86-based.
There is a lot of hardware out there with embarrassingly many cores that sit underutilized on embarrassingly serial graph problems. Speculation is one way to get all those cores humming, if you have a good heuristic for what is likely to be true and the hardware has versioned memory and hardware scheduling to make the speculation cheap enough.
Again though, it is an answer to how to speed up a graph of lots of extremely tiny operations. If every node on the graph were an operation on lots more data, you wouldn't need to solve a problem of underutilization in the first place.
Now I know how Babbage felt about people wanting to slice pineapples ;)
Intel's Hyper-Threading is simultaneous multithreading (SMT): the core keeps two hardware threads resident, so when one stalls on a main-memory access the other can keep the execution units busy. Intel chips typically have just two threads per core to ping-pong between.
The Cray XMT takes this much further, with a very long list of hardware thread contexts it can swap between, and hardware storage, readiness-monitoring and scheduling for them. That is excellent for graph problems with lots of main-memory stalls: there are plenty of graph problems where many tasks could make progress, but they are all waiting on main memory all the time.
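A rough software analogue of what the XMT does in hardware is to interleave many independent pointer chases so several cache misses are in flight at once, instead of paying each miss serially. A minimal C sketch (the `Node` layout and function names are invented for illustration, not from the paper):

```c
#include <stddef.h>

/* Toy linked node: each hop is a dependent load, i.e. a likely miss. */
typedef struct Node { struct Node *next; long value; } Node;

/* One chase at a time: every load waits for the previous one. */
long chase_one(const Node *head) {
    long sum = 0;
    for (const Node *n = head; n != NULL; n = n->next)
        sum += n->value;
    return sum;
}

/* Interleave k independent chases so several misses overlap -- the
 * software analogue of the XMT swapping in another ready thread
 * whenever one stalls. Consumes the heads array as it goes. */
long chase_many(const Node *heads[], int k) {
    long sum = 0;
    int live = k;
    while (live > 0) {
        live = 0;
        for (int i = 0; i < k; i++) {
            if (heads[i] != NULL) {
                sum += heads[i]->value;
                heads[i] = heads[i]->next;  /* independent miss streams */
                live++;
            }
        }
    }
    return sum;
}
```

Both return the same totals; the win is purely in how many memory requests are outstanding at once.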
The research I linked to takes it further and allows the tasks to be executed speculatively: basically a tree of hardware-transactional-memory transactions.
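The versioned-memory idea can be sketched in software as optimistic execution: do the work against a snapshot, then commit only if the version hasn't moved, otherwise throw the result away and retry. A toy single-cell version (all names invented here; it glosses over the memory-ordering subtleties that real hardware transactional memory handles for you):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* One versioned cell: committing writers bump `version`. */
typedef struct { _Atomic long version; long data; } VersionedCell;

/* Speculatively compute data + delta; commit only if no other writer
 * bumped the version in the meantime, else discard the result. */
bool try_add(VersionedCell *c, long delta) {
    long v = atomic_load(&c->version);
    long result = c->data + delta;   /* speculative work */
    long expected = v;
    if (!atomic_compare_exchange_strong(&c->version, &expected, v + 1))
        return false;                /* conflict: result is abandoned */
    c->data = result;                /* commit */
    return true;
}

/* The retry loop is where speculation's wasted cycles show up. */
void add_with_retry(VersionedCell *c, long delta) {
    while (!try_add(c, delta)) {}
}
```

The point of doing this in hardware is exactly that the snapshot, validation and rollback become cheap enough to speculate on tiny tasks.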
Everything you say about speculation burning cycles that can't then go to non-speculative computation is true. But there are problems with bottlenecks that force them to be essentially serial.
When cores are cheap, the problem doesn't fit in on-chip cache, and the buffers needed for SMT (or SMT plus speculation) are affordable, speculation is a good way to utilize cores that would otherwise sit idle, if that fits your power budget.
I'm not sure which part of all this explains why it is a good use of computing power. If you don't skip around in memory as much, you don't have to wait as much on memory latency. If you process data in bigger chunks, you don't have to skip around as much. On-chip cache doesn't matter as much here, since it can be worked around by making sure data will be prefetched.
Also, OOO cores already pre-load data into cache so it can be actually executed on, not just speculatively executed on. Speculative execution on modern CPUs runs ahead down the predicted side of a branch, which prefetches data further still.
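The "skipping around" point is easy to make concrete: these two loops compute the same sum, but the first visits data through an index indirection the hardware prefetcher can't predict, while the second walks memory in address order that the prefetcher streams in ahead of the loop (a toy sketch of my own, not from the paper):

```c
#include <stddef.h>

/* Skipping around: each data[] address comes from an index lookup,
 * so accesses land in an unpredictable order and latency dominates. */
long sum_indirect(const long *data, const size_t *order, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += data[order[i]];
    return s;
}

/* Same total, but visited in address order: the prefetcher sees the
 * sequential stream and hides the memory latency for you. */
long sum_sequential(const long *data, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += data[i];
    return s;
}
```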
What I'm saying is: if you traverse a graph with a tiny amount of data at each node, it is wildly inefficient. Mitigating that inefficiency with more hardware might be faster, but it is still an inefficient use of hardware, since it doesn't confront the real problem of skipping around in memory.
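One standard way to confront that problem is to change the layout rather than add hardware: flatten the graph into compressed-sparse-row (CSR) arrays so traversal streams through contiguous memory instead of chasing per-node heap pointers. A minimal sketch (struct and field names are my own):

```c
#include <stddef.h>

/* CSR layout: node i's neighbours are adj[row[i]] .. adj[row[i+1]-1],
 * stored back to back, so visiting them never leaves a flat array. */
typedef struct {
    size_t n;          /* node count */
    const size_t *row; /* n+1 offsets into adj */
    const size_t *adj; /* all neighbour lists, concatenated */
} CsrGraph;

/* Out-degree falls out of the offsets with no pointer chasing at all. */
size_t out_degree(const CsrGraph *g, size_t i) {
    return g->row[i + 1] - g->row[i];
}

/* Sum of all neighbour ids: a toy traversal that touches memory
 * strictly left to right, which the hardware prefetcher rewards. */
size_t neighbour_sum(const CsrGraph *g) {
    size_t total = 0;
    for (size_t e = 0; e < g->row[g->n]; e++)
        total += g->adj[e];
    return total;
}
```

The trade-off is that CSR is awkward to mutate, which is partly why the pointer-heavy representations these papers target persist.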
There is a lot of hardware out there with embarrassingly many cores that sit underutilized on embarrassingly serial graph problems. Speculation is one way to get all those cores humming, if you have a good heuristic for what is likely to be true and the hardware has versioned memory and hardware scheduling to make the speculation cheap enough.
The paper gave examples of such problems.