> run-list of tasks that it can quickly switch between each time one of them stalls on IO
Sounds like a less granular version of hyperthreading.
> versioning the memory so you can speculatively execute tasks
The more you speculatively execute, the less you actually execute. With more threads you don't have to try to squeeze out ever more granular parallelism at the cost of diminishing returns.
Are you asking me if the paper you linked to explains it?
You are saying "it's useful"; I'm saying it quickly becomes too wasteful. CPUs already do speculative execution in the hope of mitigating memory latency.
I'm saying that if you spend all your transistors and power doing speculative execution, it is better to find an architecture that lets you use concurrency at a much coarser level, so that you don't have to waste so much on speculation only to get diminishing returns for the transistors and power you put into it.
The paper was co-authored by an NVIDIA chap, although the prototyping was x86-based.
There is lots of hardware with embarrassingly many cores that sit underutilized on embarrassingly serial graph problems. Speculation is one way to get all those cores humming, if you have a good heuristic for what is likely to be true and the hardware has versioned memory and hardware scheduling to make the speculation cheap enough.
Again, though, it is an answer for how to speed up a graph of lots of extremely tiny operations. If every node in the graph were an operation on much more data, you wouldn't need to solve an underutilization problem in the first place.
Now I know how Babbage felt about people wanting to slice pineapples ;)
Coarse-grained multithreading (a relative of what Intel calls 'hyperthreading') is when a core swaps to a second task when one task stalls on a main memory access.
Intel chips typically have just two tasks they can ping-pong between.
The Cray XMT takes this much further, with a very long list of tasks it can swap between and hardware support for caching, readiness-monitoring and scheduling them. This is excellent for graph problems with lots of main-memory stalls. There are plenty of graph problems where many tasks could make progress, but all of them are waiting on main memory all the time.
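To make the scheduling idea concrete, here is a toy cycle-level model in C (my own sketch, not the XMT's actual microarchitecture; THREADS and MEM_LATENCY are made-up parameters):

    #include <stdio.h>

    #define THREADS 128      /* XMT-like: many thread contexts per core */
    #define MEM_LATENCY 100  /* cycles before a load's result comes back */
    #define CYCLES 100000

    int main(void) {
        long ready_at[THREADS] = {0};  /* cycle at which each task is ready */
        long issued = 0, idle = 0;

        for (long cycle = 0; cycle < CYCLES; cycle++) {
            int found = 0;
            for (int t = 0; t < THREADS; t++) {   /* pick any ready task */
                if (ready_at[t] <= cycle) {
                    issued++;                     /* one cycle of real work */
                    ready_at[t] = cycle + MEM_LATENCY;  /* then it stalls */
                    found = 1;
                    break;
                }
            }
            if (!found) idle++;                   /* nothing ready: a bubble */
        }
        printf("utilization %.1f%% (%ld idle cycles)\n",
               100.0 * issued / CYCLES, idle);
        return 0;
    }

Set THREADS to 2 (Intel-style) and utilization collapses to roughly compute/latency, a few percent; with 128 contexts the 100-cycle stalls are completely hidden.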
The research I linked to takes it further and allows the tasks to be executed speculatively. Basically a tree of hardware transactional memory.
Everything you say is true: cycles spent on speculation are cycles you can't spend on computation you know you need. But there are problems whose bottlenecks force them to be essentially serial.
When cores are cheap, the problem doesn't fit in on-chip cache, and the buffers needed for SMT (or SMT plus speculation) are affordable, speculation is a good way to utilize cores that would otherwise sit idle, if that fits your power budget.
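In software terms, the versioned-memory idea boils down to something like this (a minimal sketch with a per-word version counter; the actual Fractal/HTM interface is in hardware and differs):

    #include <stdio.h>

    #define WORDS 4

    static int value[WORDS];
    static int version[WORDS];

    typedef struct { int addr[WORDS]; int ver[WORDS]; int n; } readset_t;

    static int spec_read(readset_t *rs, int a) {
        rs->addr[rs->n] = a;
        rs->ver[rs->n] = version[a];     /* remember the version we read */
        rs->n++;
        return value[a];
    }

    static int try_commit(readset_t *rs, int a, int v) {
        for (int i = 0; i < rs->n; i++)
            if (version[rs->addr[i]] != rs->ver[i])
                return 0;                /* a version moved under us: abort */
        value[a] = v;
        version[a]++;                    /* publish a new version */
        return 1;
    }

    int main(void) {
        /* speculative task: x[1] = x[0] + 1, retried until it commits */
        for (;;) {
            readset_t rs = { .n = 0 };
            int v = spec_read(&rs, 0) + 1;
            if (try_commit(&rs, 1, v))
                break;
        }
        printf("x[1] = %d\n", value[1]);
        return 0;
    }

The point of doing this in hardware is exactly that the bookkeeping above (read-set tracking, validation, abort-and-retry) becomes cheap enough to apply to tiny tasks.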
I'm not sure what part of all this explains why it is a good use of computing power. If you don't skip around in memory as much, you don't have to wait as much on memory latency; if you process data in bigger chunks, you don't have to skip around as much. On-chip cache doesn't matter as much here, since it can be worked around by making sure data will be prefetched.
Also, OoO cores already pull data in ahead of time so it can be actually executed on, not speculatively executed on. Speculative execution on modern CPUs already runs far ahead of predicted branches, which acts as a prefetcher for data further out.
What I'm saying is: if you traverse a graph with tiny amounts of data at each node, it is wildly inefficient. Mitigating that inefficiency with more hardware might be faster, but it is still an inefficient use of hardware, since it doesn't confront the real problem of skipping around in memory.
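The gap is easy to demonstrate. Here is a rough C benchmark sketch (timings are machine-dependent; the shape of the gap is the point): the same N additions done by chasing pointers laid out in random order versus streaming a contiguous array the hardware prefetcher can follow.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 22)   /* 4M nodes */

    typedef struct node { struct node *next; long v; } node;

    static unsigned long rng = 88172645463325252UL;
    static unsigned long xrand(void) {            /* xorshift64 */
        rng ^= rng << 13; rng ^= rng >> 7; rng ^= rng << 17;
        return rng;
    }

    int main(void) {
        node *pool = malloc(N * sizeof *pool);
        long *perm = malloc(N * sizeof *perm);
        if (!pool || !perm) return 1;

        for (long i = 0; i < N; i++) { perm[i] = i; pool[i].v = 1; }
        for (long i = N - 1; i > 0; i--) {        /* Fisher-Yates shuffle */
            long j = (long)(xrand() % (unsigned long)(i + 1));
            long t = perm[i]; perm[i] = perm[j]; perm[j] = t;
        }
        for (long i = 0; i < N - 1; i++)          /* chain in random order */
            pool[perm[i]].next = &pool[perm[i + 1]];
        pool[perm[N - 1]].next = NULL;

        clock_t t0 = clock();
        long sum = 0;
        for (node *p = &pool[perm[0]]; p; p = p->next) sum += p->v;
        double chase = (double)(clock() - t0) / CLOCKS_PER_SEC;

        t0 = clock();
        long sum2 = 0;
        for (long i = 0; i < N; i++) sum2 += pool[i].v;   /* streaming */
        double stream = (double)(clock() - t0) / CLOCKS_PER_SEC;

        printf("chase %.3fs  stream %.3fs  (sums %ld %ld)\n",
               chase, stream, sum, sum2);
        return 0;
    }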
GPUs are massively multicore and there we see the opposite of what this article predicts or hopes would happen: grids win even more over graphs. The best graph algorithms for those chips try to turn the graph problem into a grid problem as soon as possible.
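Concretely, "turning the graph into a grid" usually means flattening it into arrays (e.g. CSR) and replacing pointer-chasing with whole-frontier sweeps that map to the regular, coalesced access patterns GPUs want. A toy level-synchronous BFS in C over a made-up 5-node graph, as a sketch of the idea:

    #include <stdio.h>

    int main(void) {
        /* CSR: the edges of node i are adj[row[i] .. row[i+1]-1] */
        int row[] = {0, 2, 4, 5, 6, 6};
        int adj[] = {1, 2, 2, 3, 4, 4};
        int n = 5;

        int level[5] = {0, -1, -1, -1, -1};   /* BFS from node 0 */
        for (int l = 0, changed = 1; changed; l++) {
            changed = 0;
            for (int u = 0; u < n; u++) {     /* one grid-like sweep: every */
                if (level[u] != l) continue;  /* vertex can be checked in parallel */
                for (int e = row[u]; e < row[u + 1]; e++)
                    if (level[adj[e]] < 0) { level[adj[e]] = l + 1; changed = 1; }
            }
        }
        for (int u = 0; u < n; u++) printf("node %d: level %d\n", u, level[u]);
        return 0;
    }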
What on earth is this article talking about? Sparsity and control flow are horrendous, and they are basically the two things a GPU wants nothing to do with. And graph algorithms generally run on the CPUs in supercomputers, not the GPUs, for precisely those reasons.
The only way any of this makes sense is if the article is talking specifically about graph processors (e.g. the Cray Eldorado, which, to be fair, is mentioned), but there are very few commercial graph processors, and even those are shrinking in importance. The number one parallel processor is the GPU, the only one with any chance of making it into the mainstream, and even that is far from mainstream. And the GPU hates control flow and sparsity.
if you have machines that do grids well (vectors, simd), and you have a problem that you can get into that form, then clearly you do that. it's actually kind of stunning how many things you wouldn't think could be phrased that way can be.
i'm extrapolating from the talk, but i think the point is that architectures that can tolerate dynamic control flow (mimd) are more performance-general than ones that can't. and while no, they aren't mainstream, I think the historical jury is still out.
since the cost of keeping an instruction pointer per unit is (largely) absorbed by hardware, it's pretty easy to argue that they should win in the limit. unfortunately there are other costs: the memory interface is vastly different, and caching becomes interesting and potentially prohibitive.
but is mimd dead forever because gpus can be used for some things and are clearly the cheapest ops you can buy today?
No, MIMD is not dead; see, for example, the Rex Neo or the Adapteva Epiphany, two of my favorite processors. However, when you say that something works well on "massively parallel processors", you have an obligation to specify that you mean MIMD processors and/or graph processors, and SPECIFICALLY NOT A GPU.
Because the programming principles espoused in this article are antithetical to what you should do on a GPU, which is definitely the de facto "massively parallel processor" that would first come to people's mind.
ok, so we're just arguing about terminology and expectations. maybe that reinforces Feo's point.
it is worth noting that the Eldorado started out as a general purpose computer, then a general purpose scientific computer, and only later as a "graph processor" in an XT frame, because the MTA on its own was market-dead. and that Feo has been a big supporter since at least stage 2.
maybe it is exactly this 'parallelism means GPU' which he's trying to argue against. I agree the overall arc of that point didn't really follow through in his talk, or in someone's summary of that talk.
'mpp' was certainly a label adopted by TMC's cm-1 and cm-2, which were pretty damn simd. it was also used by the cray t3e, which was markedly not. the mta had 128 threads per cpu; a large enough instantiation could have had a few thousand hardware threads. later unbuilt designs backed the thread state in dram and could have gotten pretty damn fat. arguably 'massive' should be squishily about concurrency width.
Here's an approach for massively multi-threaded CPUs that may be complementary: versioning the memory so you can speculatively execute tasks: http://people.csail.mit.edu/sanchez/papers/2017.fractal.isca...
Putting it together sounds like a dream.