| 1. | | Creating custom kernels for the AMD MI300 (huggingface.co) |
| 2 points by skidrow 6 months ago | past |
|
| 2. | | Implementing a Fast Tensor Core Matmul on the Ada Architecture (spatters.ca) |
| 4 points by skidrow 6 months ago | past |
|
| 3. | | Matrix Core Programming on AMD GPUs (salykova.github.io) |
| 116 points by skidrow 6 months ago | past | 5 comments |
|
| 4. | | Implementing a Fast Tensor Core Matmul on the Ada Architecture (spatters.ca) |
| 3 points by skidrow 6 months ago | past |
|
| 5. | | Matrix Core Programming on AMD GPUs (salykova.github.io) |
| 2 points by skidrow 6 months ago | past |
|
| 6. | | Creating custom kernels for the AMD MI300 (huggingface.co) |
| 1 point by skidrow 6 months ago | past |
|
| 7. | | Implementing a Fast Tensor Core Matmul on the Ada Architecture (spatters.ca) |
| 2 points by skidrow 6 months ago | past |
|
| 8. | | Matrix Core Programming on AMD CDNA3 and CDNA4 Architecture (salykova.github.io) |
| 24 points by skidrow 6 months ago | past | 3 comments |
|
| 9. | | Creating custom kernels for the AMD MI300 (huggingface.co) |
| 2 points by skidrow 6 months ago | past |
|
| 10. | | Implementing a Fast Tensor Core Matmul on the Ada Architecture (spatters.ca) |
| 2 points by skidrow 6 months ago | past |
|
| 11. | | Advanced Matrix Multiplication Optimization on Multi-Core Processors (2024) (salykova.github.io) |
| 85 points by skidrow 6 months ago | past | 3 comments |
|
| 12. | | Creating custom kernels for the AMD MI300 (huggingface.co) |
| 2 points by skidrow 6 months ago | past |
|
| 13. | | Introduction to Matrix Core Programming on AMD CDNA3 and CDNA4 Architecture (salykova.github.io) |
| 2 points by skidrow 6 months ago | past |
|
| 14. | | Creating custom kernels for the AMD MI300 (huggingface.co) |
| 2 points by skidrow 8 months ago | past |
|
| 15. | | Implementing a Fast Tensor Core Matmul on the Ada Architecture (spatters.ca) |
| 2 points by skidrow 8 months ago | past |
|
| 16. | | Creating custom kernels for the AMD MI300 (huggingface.co) |
| 1 point by skidrow 8 months ago | past |
|
| 17. | | Implementing a Fast Tensor Core Matmul on the Ada Architecture (spatters.ca) |
| 4 points by skidrow 8 months ago | past |
|
| 18. | | Implementing a Fast Tensor Core Matmul on the Ada Architecture (spatters.ca) |
| 2 points by skidrow 8 months ago | past | 1 comment |
|
| 19. | | Compiler Explorer: An Essential Kernel Playground for CUDA Developers (nvidia.com) |
| 2 points by skidrow 8 months ago | past |
|
| 20. | | Creating custom kernels for the AMD MI300 (huggingface.co) |
| 1 point by skidrow 8 months ago | past |
|
| 21. | | DeepSeek-R1 and FP8 Mixed-Precision Training (colfax-intl.com) |
| 2 points by skidrow 11 months ago | past |
|
| 22. | | How to Write a Fast Matrix Multiplication from Scratch with Tensor Cores (2024) (alexarmbr.github.io) |
| 147 points by skidrow 11 months ago | past | 17 comments |
|
| 23. | | DeepSeek-R1 and FP8 Mixed-Precision Training (colfax-intl.com) |
| 2 points by skidrow 11 months ago | past |
|
| 24. | | Implementing a Fast Tensor Core Matmul on the Ada Architecture (spatters.ca) |
| 1 point by skidrow 11 months ago | past |
|
| 25. | | How to Write a Fast Matrix Multiplication from Scratch with Tensor Cores (alexarmbr.github.io) |
| 2 points by skidrow 11 months ago | past |
|
| 26. | | Understanding Peak, Max-Achievable and Delivered FLOPs (amd.com) |
| 1 point by skidrow on April 1, 2025 | past |
|
| 27. | | DeepSeek-R1 and FP8 Mixed-Precision Training (colfax-intl.com) |
| 1 point by skidrow on April 1, 2025 | past |
|
| 28. | | Outperforming cuBLAS on H100: A Worklog (cudaforfun.substack.com) |
| 3 points by skidrow on April 1, 2025 | past |
|
| 29. | | Optimizing Matrix Multiplication on RDNA3 (seb-v.github.io) |
| 118 points by skidrow on March 25, 2025 | past | 26 comments |
|
| 30. | | Outperforming cuBLAS on H100: A Worklog (cudaforfun.substack.com) |
| 1 point by skidrow on March 25, 2025 | past |
|