Hacker News

User-level threads do not solve the problem of synchronization and data movement. With a “thread-per-core” model, by contrast, you eliminate most of the need to synchronize between CPU cores, and therefore eliminate the overhead of acquiring and releasing locks, letting the out-of-order CPU cores run at full speed. Furthermore, the “thread-per-core” model ensures that memory accesses are always CPU-local, which eliminates the need to move data between CPU caches and keeps you off the expensive CPU cache coherence protocol (which makes scaling to multiple cores more difficult).

That said, I am not claiming thread-per-core is _the solution_ either; I’m just saying that if you can partition data at the application level, you can make things run plenty fast. Of course, you’re also exposing yourself to other issues, like “hot shards”, where some CPUs get disproportionately more work than others. Still, as we scale to more and more cores, it seems inevitable that we must partition our systems at some level.
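To make the partitioning idea concrete, here is a minimal sketch (names like `num_shards` and `shard_for` are illustrative, not from any library): a stable hash pins every key to one shard, and if each shard is only ever touched by the thread that owns it, no locks are needed on the data path.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Illustrative sketch of application-level partitioning: each key maps
// to exactly one shard, and each shard is owned by exactly one thread,
// so shard-local data never needs cross-core synchronization.
constexpr std::size_t num_shards = 4;

std::size_t shard_for(const std::string& key) {
    // A deterministic hash pins a given key to the same shard (and
    // thus the same core) on every lookup.
    return std::hash<std::string>{}(key) % num_shards;
}
```

The "hot shard" problem mentioned above shows up exactly here: if the key distribution is skewed, one shard's owning thread ends up with most of the work while the others sit idle.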



It's pretty much common knowledge that if you want linear performance scaling on a multiprocessor, each workload has to be effectively independent of the others. Synchronization and communication are expensive: they only take away from per-core performance and never make a given core faster. So yes, sharding is a critical part of this "thread per core" architecture.


To be fair, concurrent writes do not scale, but a single writer with multiple readers works just fine, so you only need a hard partition of writers and can allow all threads (or the appropriate NUMA subset) to access any data.
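A minimal sketch of that single-writer/multiple-reader pattern (the `Counter` type here is made up for illustration): one designated thread publishes updates with release stores, and any thread may observe them with acquire loads, so no lock is needed because writes never race with each other.

```cpp
#include <atomic>
#include <cstdint>

// Illustrative single-writer / multiple-reader cell. Only one thread
// ever calls write(); any thread may call read(). Release/acquire
// ordering makes the writer's update visible to readers without a lock.
struct Counter {
    std::atomic<std::uint64_t> value{0};

    void write(std::uint64_t v) {            // writer thread only
        value.store(v, std::memory_order_release);
    }
    std::uint64_t read() const {             // any thread
        return value.load(std::memory_order_acquire);
    }
};
```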


Binding user-level threads to hardware cores without preemption is completely equivalent to thread-per-core + coroutines, except that user-level threads allow deep stacks (with all the benefits and issues that implies).

In fact, a well-designed generic executor can be completely oblivious to whether it is scheduling plain closures, async functions, or fibers. See, for example, Boost.Asio.
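A toy sketch of that obliviousness (this `Executor` is a made-up stand-in, not Boost.Asio's API): once every task is type-erased into a `std::function<void()>`, a plain closure, a coroutine's resume step, or a fiber-switch routine all look identical to the scheduler.

```cpp
#include <functional>
#include <queue>
#include <utility>

// Illustrative executor that only ever sees type-erased callables, so
// it cannot tell (and does not care) what kind of task it is running.
class Executor {
    std::queue<std::function<void()>> tasks_;
public:
    void post(std::function<void()> f) { tasks_.push(std::move(f)); }

    void run() {                  // drain the queue on the calling thread
        while (!tasks_.empty()) {
            auto f = std::move(tasks_.front());
            tasks_.pop();
            f();
        }
    }
};
```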



