
Slower memory may not actually mean lower effective memory bandwidth. The CUDA cores can be broken up into compute complexes with larger blocks of memory directly attached to the cores. These could be filled by read operations from the bulk system memory: you start executing, then page the next batch of data in while compute is working. LLMs don't do much random memory access, so you can sequence your accesses in blocks.
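To make the overlap idea concrete, here is a rough sketch of double buffering in Python. It is not how any particular chip or driver does it: `stream_blocks`, `load`, `compute`, and `pipelined_time` are hypothetical names, and a background thread stands in for whatever copy engine would move data from bulk memory into the local blocks. The timing helpers assume every block takes the same time to load and to compute.

```python
import queue
import threading


def stream_blocks(blocks, load, compute, depth=2):
    """Apply compute(load(b)) to each block, staging up to `depth` blocks ahead.

    `load` and `compute` are caller-supplied stand-ins (assumptions) for
    "copy from bulk memory" and "run the kernel on local memory".
    """
    staged = queue.Queue(maxsize=depth)  # bounded queue = fixed staging buffers
    done = object()                      # sentinel marking the end of the stream

    def loader():
        try:
            for b in blocks:
                staged.put(load(b))      # waits here while all buffers are full
        finally:
            staged.put(done)             # always unblock the consumer

    worker = threading.Thread(target=loader, daemon=True)
    worker.start()

    results = []
    while True:
        item = staged.get()              # waits here only if the loader is behind
        if item is done:
            break
        results.append(compute(item))
    worker.join()
    return results


def serial_time(n, t_load, t_compute):
    """Total time when every load finishes before its compute starts."""
    return n * (t_load + t_compute)


def pipelined_time(n, t_load, t_compute):
    """Total time with two buffers: load of block i+1 overlaps compute of block i."""
    if n == 0:
        return 0
    return t_load + (n - 1) * max(t_load, t_compute) + t_compute


if __name__ == "__main__":
    out = stream_blocks(range(4), load=lambda i: [i] * 3, compute=sum)
    print(out)
    print(serial_time(4, 1, 3), pipelined_time(4, 1, 3))
```

With these assumptions, once compute per block takes at least as long as the load, the load time is hidden for every block except the first. When the load is the slower step, the pipeline runs at the speed of the memory link, and that is where the link between the bulk memory and the cores matters.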

If these chips become popular I am sure you will see LLM architectures taking advantage of the parallelism.




> The CUDA cores can be broken up into compute complexes with larger blocks of memory directly attached to the cores.

Perhaps in theory, but on the GB10 the memory is all attached to the CPU die, and the GPU die reaches it over NVLink-C2C.



