
Can someone ELI5 why AMD is not in this game? Is it really so much harder to implement this in a non-platform-specific library?


CUDA is the best-supported solution, tends to get you access to the best performance, has a great profiler (it will literally tell you things like "your memory accesses don't seem to be coalesced properly" or "your kernel is ALU limited", along with a bunch of useful stats), even works on Windows, all of that.
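
The "memory bound vs ALU limited" verdict such a profiler gives can be illustrated with a toy roofline model. This sketch is purely illustrative (the `classify_kernel` helper and the peak throughput/bandwidth numbers are made up, not from any NVIDIA tool):

```python
def classify_kernel(flops, bytes_moved, peak_flops=20e12, peak_bandwidth=900e9):
    """Classify a kernel as compute- or memory-bound via arithmetic intensity."""
    intensity = flops / bytes_moved            # FLOPs per byte moved to/from memory
    ridge_point = peak_flops / peak_bandwidth  # intensity where the roofline bends
    return "compute-bound" if intensity > ridge_point else "memory-bound"

# A plain SAXPY (2 FLOPs per 12 bytes of traffic) is firmly memory-bound:
print(classify_kernel(flops=2e9, bytes_moved=12e9))   # memory-bound
# A large dense matmul reuses each loaded value many times, so it is compute-bound:
print(classify_kernel(flops=2e12, bytes_moved=24e9))  # compute-bound
```

Real profilers measure these quantities with hardware counters instead of taking them as arguments, but the classification logic is the same idea.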

OpenCL is (was?) the main open alternative to CUDA and was mainly backed by AMD and Apple. Apple got bored of it when they decided Metal was the future. AMD got bored of it when they developed ROCm and HIP (basically a partially complete compatibility layer with CUDA).

There's also stuff like DirectML, which only works on Windows, and various (Vulkan, DirectX, etc.) compute shaders, which are really more oriented at games.

There's also a bit of a performance aspect to it. Obviously GPGPU stuff is massively performance sensitive and CUDA gets you the best performance on the most widely supported platform.

AMD are also the main competitor in the space but all but totally dropped support for GPGPU in desktop cards, trying to instead focus on gaming for Radeon and compute in MI/CDNA. They seem to have realised their mistake a bit and are now introducing some support for RDNA2+ cards.


If I remember correctly, it's not just that AMD has poor support for their consumer cards: their ROCm code doesn't compile to a device-agnostic intermediate representation, so you have to recompile for each chip. New and old CUDA-compatible cards (like all GeForce cards) can run your already-shipped CUDA code, as long as it doesn't use new, unsupported features. So even if AMD had supported more cards, the development and user experience would be much worse anyway, since you'd have to find out whether your specific card is supported.
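
The distribution problem described above can be sketched roughly like this. CUDA fat binaries bundle machine code for some architectures plus PTX, a portable IR the driver can JIT-compile for newer chips; early ROCm effectively shipped only per-chip machine code. All names here (`load_kernel`, the dict layout) are hypothetical, for illustration only:

```python
def load_kernel(fat_binary, gpu_arch, can_jit_ir):
    """Pick runnable code for this GPU from a shipped binary, if any exists."""
    if gpu_arch in fat_binary["native"]:   # exact prebuilt match for this chip
        return f"native code for {gpu_arch}"
    if can_jit_ir and "ir" in fat_binary:  # CUDA-style fallback: JIT the portable IR
        return f"JIT-compiled from IR for {gpu_arch}"
    raise RuntimeError(f"{gpu_arch} unsupported: rebuild from source")

cuda_style = {"native": {"sm_70", "sm_80"}, "ir": "ptx"}
rocm_style = {"native": {"gfx906", "gfx908"}}  # no portable IR shipped

print(load_kernel(cuda_style, "sm_90", can_jit_ir=True))  # a newer GPU still works
try:
    load_kernel(rocm_style, "gfx1030", can_jit_ir=False)  # unsupported chip fails
except RuntimeError as e:
    print(e)
```

This is why shipping one binary that covers future NVIDIA GPUs is feasible, while a per-chip-compiled stack forces vendors (or distros) to enumerate every supported architecture up front.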


All RDNA2 GPUs and many Vega GPUs could use the same ISA (modulo bugs). Long ago, there was an assumption made that it was safer to treat each minor revision of the GPU hardware as having its own distinct ISA, so that if a hardware bug were found it could be addressed by the compiler without affecting the code generation for other hardware. In practice, this resulted in all but the flagship GPUs being ignored as libraries only ended up getting built for those GPUs in the official binary releases. And in source releases of the libraries, needlessly specific #ifdefs frequently broke compilation for all but the flagship ISAs.

There was an implicit assumption that just building for more ISAs was no big deal. That assumption was wrong, but the good news is that big improvements to compatibility can be made even for existing hardware just by more thoughtful handling of the GFX ISAs.

If you know what you're doing, it's possible to run ROCm on nearly all AMD GPUs. As I've been packaging the ROCm libraries for Debian, I've been enabling support for more hardware. Most GFX9 and GFX10 AMD GPUs should be supported in packages on Debian Experimental in the upcoming days. That said, it will need extensive testing on a wide variety of hardware before it's ready for general use. And we still have lots more libraries to package before all the apps that people care about will run on Debian.


True, it's better just to use OpenSYCL, which stores an intermediate device-agnostic form and compiles it as needed for a specific card.

I don't understand why SYCL isn't more widely used.


SYCL is still very early in its development, and I don’t see it really picking up until support is upstreamed to the LLVM project, at the very least. That said, I am a firm believer in the single-source philosophy. There just isn’t a tractable alternative.


“Apple got bored of it when they decided Metal was the future. AMD got bored of it when they developed ROCm and HIP”

Apple backed OpenCL because they needed an alternative after their divorce from Nvidia. No one was going to target an AMD-specific alternative when AMD had such a trivial market share, so it had to be an open standard. Initially this arrangement was highly productive, and OpenCL 1.x enjoyed terrific success. Vendors across compute markets piled support behind OpenCL and many even started actively participating in it. However, this very success is what precipitated the disastrous OpenCL 2.x series: it ended up far too revolutionary for many and far too conservative for others. What followed was Apple pulling out to pursue Metal, AMD shipping shoddy drivers, Nvidia all but ignoring it, and mobile chip vendors basically sticking to 1.2 and nothing more. The deadlock was eventually broken when OpenCL 3.0 walked back the changes of 2.x, but that happened in large part because the backers of 2.x had moved on to SYCL.

As for AMD, OpenCL was a tremendous boon when it was first introduced. At least initially, it gave them a fighting chance against CUDA. But it was never realistic for OpenCL to be a complete CUDA alternative. I mean, any standard that is basically “everything in CUDA and possibly more” is a standard no one could afford or bother to implement. ROCm and HIP are basically AMD adopting an API people are already familiar with, with software underneath that plays to the strengths of their hardware.

“AMD are also the main competitor in the space but all but totally dropped support for GPGPU in desktop cards, trying to instead focus on gaming for Radeon and compute in MI/CDNA.”

Keep in mind that AMD has been under intense pressure to deliver world-class HPC systems, and they managed to do so with ORNL Frontier. I don’t blame them for being selective with ROCm support, because most of those product lines were in flight before ROCm started development in earnest. That said, Nvidia is obviously the clear leader for hardware support, and therefore the safest option for desktop users.


> partially complete compatibility layer

So… partial compatibility layer?


Is your point that "partially complete" is a redundant phrasing?

In this case I still prefer my version. I feel that it puts greater emphasis on the fact that it can potentially be complete, given the massive value that could give to the project.

Also "I would have written you a shorter letter but I did not have the time" sentiment springs to mind.


Thanks


Short version: AMD's software incompetence. Very few hardware companies have the competence to properly support their HW. You see this problem again and again: HW companies designing HW that's great on paper but can't be used properly because it's not properly supported with SW. Nvidia understands this and has 10 times as many SW engineers as HW engineers. AMD doesn't. Intel might too.


I think you're totally right about this. Just to add: it's often possible to do things that are neat in hardware but create difficult problems in software. Virtually always when this happens, it turns out to be nearly impossible to actually create software that takes advantage of it. So it's massively important to have a closed-loop feedback system between the software and hardware teams, so that the hardware guys don't accidentally tie the software up in knots. This failure mode is common in companies that consider themselves hardware companies first.


Examples being the PS3's Cell architecture and HP/Intel's Itanium chips.


Strongly agree. There's a surprising cultural difference between the two. As a software engineer in a different hardware company, I can see where the fault lines are, and it takes continual management effort to make it work properly.

(I note that if we had an "open" GPU architecture in the same way that we have CPU architectures, things might be a lot better, but the openness of the IBM PC seems to be a historical accident that no company will allow again)


Chalking it up to “software incompetence” is a bit simplistic, to say the least. AMD was on the brink of bankruptcy not too long ago and their GPU division was struggling to even tread water. They didn’t have an alternative to CUDA because they couldn’t afford one and no one would have used it anyway, OpenCL stagnated because most vendors didn’t want to implement functionality that only the biggest players wanted, and their graphics division had to pivot from optimizing for gaming (where they could actually sell) to optimizing for compute as well.

Now that AMD has the capital, they are playing catch-up to Nvidia. But it’s going to take time for their software to improve. Hiring a boatload of programmers all at once isn’t going to solve that.


It's been a while since AMD was on the brink of bankruptcy; they've had enough time to do something about compute, and yet it's still not usable (see the George Hotz rant). OpenCL stagnated because 2.0 added mandatory features that Nvidia didn't support, so it never got adopted by the biggest player.


llama.cpp can be run with a speedup on AMD GPUs when compiled with `LLAMA_CLBLAST=1`, and there is also a HIPified fork [1] being worked on by a community contributor. The other week I was poking at how hard it would be to get an AMD card running w/ acceleration on Linux, and was pleasantly surprised; it wasn't too bad: https://mostlyobvious.org/?link=/Reference%2FSoftware%2FGene...

That being said, it's important to note that ROCm is Linux only. Not only that, but ROCm's GPU support has actually been decreasing over the past few years. The current list: https://rocm.docs.amd.com/en/latest/release/gpu_os_support.h... Previously (2022): https://docs.amd.com/bundle/Hardware_and_Software_Reference_...

The ELI5 is that a few years back, AMD split their graphics (RDNA) and compute (CDNA) architectures. Nvidia does this too, but notably (and this is something Nvidia definitely doesn't do, and a key to their success IMO) AMD also decided they would simply not support any CUDA-parity compute features on Windows or on their non-"compute" cards. In practice, this means that community/open-source developers will never tinker, port, or develop on AMD hardware, while on Nvidia you can start with the GTX/RTX card in your laptop and run the same code all the way up to an H100 or DGX.

llama.cpp is a super-high-profile project with almost 200 contributors now, but AFAIK no contributors from AMD. If AMD doesn't have the manpower, IMO they should simply be sending free hardware to top open-source project/library developers (and on the software side, their #1 priority should be making sure every single current GPU they sell is at least "enabled", if not "supported", in ROCm, on both Linux and Windows).

[1] https://github.com/SlyEcho/llama.cpp/tree/hipblas


I just tried this [1] and it still uses my CPU even though the prompt says otherwise.

[1] https://github.com/ggerganov/llama.cpp/issues/1433#issuecomm...


I saw there was already an answer in your issue, but if you plan on doing a lot of inferencing on your GPU, I'd highly recommend considering dual-booting into Linux. It turns out exllama merged ROCm support last week, and it's more than 2X faster than the CLBlast code. A 13b GPTQ model at full context clocks in at 15 t/s on my old Radeon VII. (Rumor has it that ROCm 5.6 may add Windows support, although it remains to be seen what exactly that entails.)


So it now uses the GPU after some help, but it is not that much faster on my Radeon VII than on my 5950X 16-core CPU :/


Short answer: they have somewhat competent hardware, but the software sucks. Or you can watch George Hotz rant about how the AMD driver sucks:

https://www.youtube.com/watch?v=Mr0rWJhv9jU


He got a tarball fix for the driver after his rant went viral. Still not looking good, IMHO.

https://geohot.github.io/blog/jekyll/update/2023/06/07/a-div...


It looks like there are some great engineers inside the company fighting the bureaucracy.


> So it fixed the main reported issue! Sadly, they asked me not to distribute it, and gave no more details on what the issue is.

I think they missed the thrust of the rant.


He goes from building an ML rack out of ATI graphics cards, to coding some Python, to recommending reading the Unabomber manifesto, from Marx to saying he owns a Rolls-Royce... lord, please have mercy!


I think there are several reasons.

Firstly, nVidia has been at it much longer. Because of this alone, tools on the nVidia side feel easier to set up and are more polished (at least that was my feeling when fiddling with ROCm about a year ago).

Second, and still related to #1: from the beginning, even consumer nVidia cards were able to run CUDA, and this made it so that hobbyists and prosumers/researchers on a budget bought nVidia cards, compounding nVidia's time/tooling advantage even further. I.e., a huge user base of not only gamers but people who use their cards to do things other than gaming and know that things work on these cards.

These are, IMHO, the main reasons why everyone targets CUDA and explain why frameworks like Tensorflow or Pytorch targeted it as a first class citizen.


If AMD was software competent the frameworks would support their drivers just as well. No one wants a monopoly.


Agreed. Sorry if I gave the impression of being pro-nVidia. I am not.

But the reality is that when Tensorflow and Pytorch came to be, there was no alternative. Now you need to jump through hoops to make them work with non-CUDA hardware.

Additionally, while drivers play a role, I think the main difference is in the compute libraries (CUDA vs ROCm).


I'm hopeful for SYCL [0] to become the cross-platform alternative, but there doesn't seem to be much uptake from projects like this one, so maybe my hope is misplaced. It's an official Khronos standard, and Intel seems to like it [1], but neither of those things is enough to change the situation.

Can someone who knows this space comment on the likelihood that SYCL will eventually be a good option? Cross-platform and cross-vendor compatibility would be really nice, and not depending on the proprietary de facto standard would also be a bonus, as long as the alternative works well enough.

[0] https://www.khronos.org/sycl/

[1] https://spec.oneapi.io/versions/latest/elements/sycl/source/...


> It's an official Khronos standard

I think that's the problem. Khronos isn't known for good UX, and being from Khronos is exactly the reason why I'm not even bothering to check it out. I want an alternative to CUDA, but I also want it to be as easy to use as CUDA.


Being from Khronos is also a reason why it might actually be usable in a decade's time, like Vulkan.

(Vulkan is from 2015 and is just recently starting to become usable.)


From a non-expert's standpoint, Vulkan feels quite unusable and complex.


From an expert's standpoint, it's still quite unusable and complex.


I'm no expert, but if I understand correctly, the main pull is the CUDA cores and the API to them.

They're supposed to be more optimized and more stable compared to AMD's. That's how it was before, anyway; not sure about today.


Isn't the main component for AI matrix multiplication? What makes it so hard to create a good alternative API for matrix multiplication?


It's a lot more complicated than just writing a matrix multiplication kernel, because there are all sorts of operations you need on top of matrix multiplication (non-linearities, various ways of manipulating the data), and this sort of effort is only really worthwhile if it's well optimized.
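
A minimal numpy sketch of the point above: even a single layer of a network is matmul *plus* a bias add and a non-linearity, and production stacks fuse these into one optimized kernel to avoid extra round trips to memory (the `linear_relu` helper here is just an illustration, not any framework's API):

```python
import numpy as np

def linear_relu(x, w, b):
    """One fused-style layer: matmul, then bias add, then ReLU non-linearity."""
    return np.maximum(x @ w + b, 0.0)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))   # batch of 4 input vectors
w = rng.standard_normal((8, 16))  # weight matrix
b = np.zeros(16)                  # bias vector

y = linear_relu(x, w, b)
print(y.shape)         # (4, 16)
print((y >= 0).all())  # True: ReLU clamps all negatives to zero
```

A fast matmul alone doesn't get you there; you need the whole menagerie of surrounding ops (and their backward passes) implemented and tuned, which is exactly what cuDNN/cuBLAS and friends provide.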

On top of that, AMD's compute stack is fairly immature: their OpenCL support is buggy, and ROCm compiles device-specific code, so it has very limited hardware support and is kind of unrealistic to distribute compiled binaries for. Then, on the optimization side, NVIDIA has many tools that provide detailed information on the GPU's behavior, making it much easier to identify bottlenecks and optimize. AMD is still working on these.

Finally, NVIDIA went out of its way to support ML applications. They provide a lot of their own tooling to make using them easier. AMD seems to have struggled on the "easier" part.


Well, I think there are two types, right? Tensor cores (which AFAIK AMD don't have), which are better for matrix ops, and CUDA cores, which are better for general parallel ops.

Maybe someone more clever than me can go into the specifics; I only understand the bare minimum of the low-level GPU details.

A nice high-level document:

[0] https://www.acecloudhosting.com/blog/cuda-cores-vs-tensor-co...


I think the API for matrix multiplication is just part of the issue. CUDA tooling has better ergonomics: it's easier to set up and treated as a first-class citizen in tools like Tensorflow and Pytorch.

So, while I can't talk about the hardware differences in detail, the developer experience is greatly in nVidia's favor, and AMD now has that moat to overcome to catch up.


There is NCCL, GPUDirect, NVLink, and so on and so forth. It is not just matmul on GPUs.


Planned economy, planned who does what.



