Ugh. I hope the upcoming discrete GPUs from Intel [1] and ImgTec [2] are competitive, because Nvidia desperately needs more competition. The Nvidia Linux driver situation is the single worst part of using Linux for me. I've lost count of the number of times Nvidia has rendered my machine unbootable for one reason or another.
There's always AMD/ATI with its completely open drivers and hardware.
AMD even redesigned the silicon to isolate HDCP from the rest of the core, so that the video encoders and other HDCP-facing hardware could be exposed to open drivers.
Also, AMD's open driver is written by a different team within AMD, so it's not exactly a 3rd-party driver.
Open source is great, but AMD is not a good example of how to do open source right. AMD's drivers are often not usable in released kernels until many months after hardware is released. For example, AMD is just now adding initial support for their new GPUs [1]. From what I have read, good support for AMD's previous generation is only now getting to distributions.
Intel on the other hand typically has their drivers in released kernels long before their hardware is released. Intel's new GPUs haven't been released yet, but they first started adding support in early 2019 [2]. I can't find it now, but I remember hearing a story about Intel removing drivers from the kernel tree for some hardware that they cancelled.
If you want to use a modern high-end GPU in Linux, your only real choice continues to be Nvidia. Their drivers are not open source, but at least they work at launch (except maybe with a 5.9 kernel?).
This is why Intel entering into the high-end GPU space is so exciting. Soon we may have a high-end option with good open source drivers while the part is still high-end.
This comment is very weird. For Linux users it is definitely better to buy an AMD discrete graphics card than an Nvidia one, and it has been for an eternity. Just google "torvalds Nvidia" as an example of how the community feels about Nvidia. Is it even possible to use Nvidia cards with Wayland?
Nvidia has a history of problems on Linux. I can still remember having to log out of my X session just to change the monitor layout, which is something you do a lot with a laptop and external displays. It took them several years to fix!
I've had Intel boards where I had trouble getting anything but the newest Fedora with a custom kernel parameter to even get the installer running. But most of the time Intel has very good upstream support, so it is normally a pretty safe bet.
With AMD there are sometimes delays getting full support for a new generation upstreamed (e.g. HDMI audio and such), so it is usually good to Google for Linux reviews first.
In general AMD has superb open source support for their cards, if not immediately at launch, and I recommend them for Linux users who want the least amount of hassle.
I've seen people recommending Nvidia for Linux on YouTube (and now here), and I honestly think it's a disservice to anyone interested in Linux.
Your comment is highly inaccurate, and in fact is basically straight-up Nvidia propaganda. The article you linked to is about AMD submitting code for unreleased hardware, which is exactly what you said they are not doing.
I have been using a Polaris-based card for years (the card was released in 2017, and the drivers were submitted in 2016). The kernel drivers were ready at launch, and key features like HDMI audio were all working out of the box. Now, you might need to install a modern kernel for everything to work, but that is the Linux way.
It is true that the initial performance was not great, maybe about 70% of the possible performance, but this was addressed by subsequent improvements to Mesa (the userspace portion).
AMD has even gone out of their way to support older, obsolete graphics cards that were released before AMD adopted the open source first strategy.
The difference is the timeline. Basic support landing ~1 month before launch is not acceptable. We barely have a stable kernel with this initial support, and that kernel won't have been picked up by any distributions. Further, for previous AMD hardware releases, this initial support has not been really usable from what I have read. Most of the posts I found on Reddit pointed to it taking several months after launch for the Navi drivers to stabilize. Even then, running a custom kernel in my experience is a lot more trouble than the proprietary Nvidia drivers.
Compare to Intel where you have initial basic support landing around 2 years before launch with incremental improvements following. This is the correct way to do open source drivers.
Unfortunately AMD totally missed the boat on ML. They need to dramatically increase their investment both in hardware and software. My impression in the past has also been that they lag in performance on the high end, and their OpenGL driver quality is not as good in terms of features and conformance either. Though I am not totally up to date on their latest stuff.
I think by this measure so did everyone else. Nvidia pushed CUDA and everyone took it, even though 8 or so years ago OpenCL was a reasonable competitor. But various orgs chose to make a proprietary technology a first-class citizen rather than putting the effort into an open alternative. A big win for Nvidia, and a big loss for everyone else not large enough to create their own ML frameworks and accelerators.
It's not really about CUDA or OpenCL. It's about the choice to support FP16 matrix multiplication at the assembly level of NVidia chips.
When you build a processor that literally has "do 4x4 matrix multiplication" as an assembly statement, you'll get far superior ML performance compared to everyone else. (Except Google's TPUs, which have larger matrix-multiplication arrays)
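The D = A*B + C shape of that operation can be sketched in numpy. This is a rough illustrative model, not NVidia's actual datapath: FP16 inputs with the multiply-accumulate promoted to FP32, which is how the tensor-core accumulate is commonly described.

```python
import numpy as np

def tensor_core_mma(a, b, c):
    """Model one 4x4 mixed-precision multiply-accumulate: D = A*B + C.

    A and B are stored as FP16; the multiply-add accumulates in FP32,
    mirroring commonly described tensor-core behaviour. Illustrative
    sketch only, not the hardware's actual datapath.
    """
    a16 = np.asarray(a, dtype=np.float16)
    b16 = np.asarray(b, dtype=np.float16)
    # Promote to FP32 before accumulating, as the wide accumulator does.
    return a16.astype(np.float32) @ b16.astype(np.float32) + np.asarray(c, dtype=np.float32)

d = tensor_core_mma(np.ones((4, 4)), np.ones((4, 4)), np.zeros((4, 4)))
# Each output element is a length-4 dot product of ones, i.e. 4.0.
```

The key point is that a chip exposing this whole function as one instruction amortizes all the instruction-issue overhead that a scalar or even SIMD loop would pay.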
I don't think any company, aside from NVidia, is working on FP16 Tensor cores. Not Intel, not AMD. Google has a systolic processor for tensor units, but that's cloud only.
I know ML is where the current money is at, but coprocessors in general seem to be applicable to far more problems than just machine learning and graphics.
True, a systolic processor can be built out of these components, but I doubt it'd be anything as fast as a proper matrix-multiplication unit that NVidia is putting out. And that's after the complications associated with FPGA programming.
----------
Xilinx's FPGAs are a combination of LUTs (4-bit or 6-bit look-up tables), Routing, and "DSP Slices" (prefabricated multipliers). The bulk of your math is still going to be performed in the DSP Slice.
Where FPGAs win is that their LUTs and Routing tables allow you to create custom high-speed glue-logic between these DSP slices. But when it comes to something like a regular Convolutional neural net, the routing is very straightforward.
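For intuition, the dataflow of such a systolic matrix-multiply array can be simulated in a few lines. This is a toy model of my own framing (the input skew of a real array is folded into the tick loop), but each PE does exactly the multiply-add a DSP slice provides:

```python
import numpy as np

def systolic_matmul(A, B):
    """Toy output-stationary systolic array computing A (n x k) @ B (k x m).

    PE (i, j) holds one accumulator. On tick t it takes A[i, t] from the
    left and B[t, j] from above, multiplies, and accumulates -- one DSP
    multiply-add per PE per tick. Real arrays skew the input streams so
    neighbours run one tick apart; here that timing is folded into t.
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    acc = np.zeros((n, m))          # one accumulator per processing element
    for t in range(k):              # one reduction step per clock tick
        for i in range(n):
            for j in range(m):
                acc[i, j] += A[i, t] * B[t, j]
    return acc
```

The routing between PEs is just nearest-neighbour forwarding, which is why a convolutional net maps onto it so cleanly.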
------
I'm not really a deep learning expert. I know that FP16 seems to be pushed by NVidia. This INT8 stuff seems odd to me (there's no way INT8 gets the same dynamic range as FP16), but I'm also not really sure if the dynamic range is needed or useful in deep learning applications.
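The range gap is easy to put numbers on with numpy (the reason INT8 works anyway is usually per-tensor scale factors chosen at quantization time, which is beyond this sketch):

```python
import numpy as np

fp16 = np.finfo(np.float16)
int8 = np.iinfo(np.int8)

# FP16 magnitudes run from the smallest normal (~6.1e-05, with
# subnormals below that) up to a max of 65504 -- a huge dynamic range.
# INT8 offers just 256 evenly spaced values from -128 to 127.
print("fp16 max:", float(fp16.max))
print("int8 range:", int8.min, "to", int8.max)
```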
I've done some systolic processors on Xilinx hardware. It's really dang hard, on their biggest, fanciest chips, to pump the DSPs at anything close to their advertised clock speeds. Maybe if I was smarter, I could do an asynchronous design that would somehow avoid clocking issues, but I suspect that doing async between mismatched clock regions would effectively halve the clock (and that might be optimistic).
The only consolation for me is that I/O over PCI ends up being my ultimate bottleneck. If AMD did a full-on integration with their infinity fabric, it could actually be a game changer. For me, in my weirdo niche.
I'm pretty sure I/O over PCIe is a substantial bottleneck in many applications (which is part of the reason for the 300GB/s POWER9 / OpenCAPI / NVLink / GPUs). It's a shame that NVLink is isolated to pretty much the DGX line of computers and POWER9 (the "big" ones too; I don't think the Talos II supports NVLink). But I guess an interconnect that fast will require dedicated, non-standard hardware.
----------
AMD's GPU Infinity Fabric seems to be a different infinity fabric from their CPU. CPU Infinity Fabric is ~50GB/s, but the GPU Infinity Fabric is GPU-to-GPU links of ~90GB/s.
Realistically, PCIe 4.0 and 5.0 will be the path forward for standard PCs. There will always be incredible application-specific solutions (such as NVLink / NVSwitch on the DGX), but the general computer user will wait for the open standards to support such speeds before adopting them.
At the moment, GPU-to-GPU communications probably will scale better than CPU-to-GPU comms. CPUs just aren't built for the bandwidth (outside of POWER9 / POWER10).
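A back-of-envelope comparison shows why the interconnect matters. The peak bandwidth figures below are round numbers from the discussion above, and the 10 GB payload is an arbitrary example, not a measured workload:

```python
# Rough transfer-time comparison for moving a 10 GB batch across the
# interconnects discussed above. Bandwidths are approximate peak figures.
links_gb_per_s = {
    "PCIe 3.0 x16": 16,
    "PCIe 4.0 x16": 32,
    "NVLink on POWER9 (aggregate)": 300,
}

payload_gb = 10  # hypothetical batch of training data
for name, bw in links_gb_per_s.items():
    print(f"{name}: {payload_gb / bw:.3f} s")
```

Roughly a 10x-20x gap per transfer, which compounds when the host has to feed the accelerator every iteration.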
On my ThinkPad I've just disabled the Nvidia GPU and purchased a USB HDMI dongle (pretty low profile, little more than a standard HDMI cable). I leave the Nvidia GPU off permanently because it's the only way I can have a laptop that supports multiple monitors and doesn't fucking chew through battery. I couldn't give a toss about graphics performance; this is a work machine, so I'm not playing games on it. Fuck Nvidia, and fuck Lenovo for making the Nvidia GPU necessary in order for the HDMI port to work.
You can buy thinkpads that just have integrated graphics. The dedicated card is a desirable feature for a lot of people - Lenovo didn't go out of their way to screw you, you just bought a product that has a feature you don't need. I'm not sure why you're so angry...
I didn't buy it. It's a work laptop that I didn't have a choice in. But to that comment, many Linux users assume that ThinkPads are a solid mobile work horse. I certainly would have said so. But I would happily switch to a MacBook Pro from 2016. I would say that having to restart X any time you want to switch between battery saving on-the-go, and multiple monitors when you're at your desk, is a hard fail.
I think it's fair to assume that most HN readers have heard of Nvidia's single existing competitor in the discrete GPU market. But many may not have heard that Intel and ImgTec are entering the market in the next year or two.
You're right that PowerVR drivers are likely to suck. However, you're wrong about Intel being at a process node disadvantage, because their GPUs are going to be fabricated by TSMC :-O
>because their GPUs are going to be fabricated by TSMC
Only one of their specific SKUs out of a dozen is fabbed by TSMC, and it was a very late addition to the lineup. As long as Intel operates its own leading-edge fab, Intel's incentive is and always will be to fully utilise its own fab capacity as much as possible in order to gain economies of scale.
Interesting to see new players. I hope they can compete on the high end; Nvidia seems a bit lost with its direction and power budgets right now. I would imagine Apple has the know-how to scale their mobile GPU chips in some ways too, but they just want efficient chips for their own devices.
Nice wording by Nvidia. Linux has an unstable, ever-changing internal API, and Nvidia needs to update their out-of-tree driver for the new release, and probably for every release.
Nvidia is trying to put the blame on Linux, but Intel and AMD don't have these issues because they open sourced and upstreamed their drivers (for AMD they mostly did), which is the way to work with open source.
Linux has always been fighting an uphill battle, but as it becomes more and more relevant it is becoming increasingly difficult to fight against it. Intel understands this, and AMD understands this.
The wording by Nvidia is telling for how they view Linux, I think. In fairness, it's probably very costly for them to go open source (a one-time cost at least).
The specific wording was "Linux Kernel 5.9+ is incompatible with current and previous NVIDIA Linux GPU drivers."
So it's not the driver that is incompatible, it is Linux. And that led to the grandparent comment asking if Linux people are "deliberate to fuck with nvidia".
If Nvidia had said "our driver is incompatible with the newest Linux kernel" then we probably wouldn't be having this thread.
Probably not. If there is a hole in the wall, you'd expect it to be fixed at some point, not claimed to be a feature of the wall.
nVidia probably saw this coming, but doesn't care enough to proactively work on it.
Most of those running ML applications don't have to update the kernel very promptly, and they only have to wait until mid-November, according to nVidia. These few months of delay couldn't possibly hurt nVidia to any meaningful degree.
It's a little frustrating that things which used to be kosher, like linking nvidia and nvidia_uvm, are all of a sudden not, because they got caught up in this crossfire.
Caught in the crossfire? NVIDIA has been skirting around the GPL for years with its Linux module, without ever contributing back. They've been hindering the adoption of Wayland for years due to their refusal to implement GBM in their driver. I guess they kinda deserve the flak, especially since AMD and Intel have shown the world time and time again that you can have a fully open GPU driver merged into Linux without any downsides.
So basically the kernel team repeatedly breaking the driver is how they want to bludgeon nVidia into handing over all their code and forcing them to support the kernel's own APIs?
Seems like petty behaviour to break users' hardware just because you disagree with driver licensing.
I wonder if you'd say the same if the team changing the API were Google's and the teams constantly having to scramble to unbreak their code were an open source project.
Half the benefit of having open source drivers is so that the kernel team can update the drivers themselves when they change the API. I don't think there would be a lot of complaints from open source developers if Google were to change one of their APIs while submitting high quality patches to update all the open source projects that use it.
And it was never kosher, and NVIDIA knew exactly that what they were doing was really controversial and risky (I was there). Management at NVIDIA needs a forcing function to open the drivers before anything changes. They depend on and profit from Linux (ML!), so they have no choice but to comply.
IMO proprietary modules have never been kosher. They rely on a particular legal interpretation of the GPL that assumes that...
1. "Programs" (as defined by the GPL) can be legally separate works and share the same address space (on a platform where this arrangement is highly unusual)
2. Separate GPL Programs hosted in the same address space can share linked symbols determined to not be "internal APIs" (in a world where the Supreme Court might run roughshod over this and just say all API implementation is copyright-infringing)
3. "Operating system kernel" and "GPU driver for that self-same kernel" can be considered, regardless of address space colocation, to be separate GPL Programs.
This interpretation is highly unusual but holds primarily because Linus Torvalds and every other major kernel contributor endorsed it. Whether or not this constitutes promissory estoppel, implied license, or something else is up to Nvidia legal to decide; but it's highly likely that nobody with standing to challenge what would otherwise be an obvious GPL license violation is actually in a position to do so. Hence, Nvidia has access to a market they shouldn't.
You are not the only one having no issues with 5.9 and Nvidia. I read somewhere (Phoronix?) that the issue is related to compute and not graphics. Anyone using the card strictly for graphics and not compute should be okay.
Nvidia refuses to work with upstream properly, so that's not surprising. Once they'll stop fooling around and will get behind Nouveau, things might improve. Though I'm fine using AMD for my gaming needs on Linux.
I understand the frustration to some degree. Part of me even agrees with it.
However, what I don't see is that nVidia should have any obligation (ethical, legal or otherwise) of any sort to provide a Linux driver at all. The demands of Linux developers and users alike seem particularly entitled.
Linux is the only mainstream kernel, free or proprietary, which doesn't have a sane and supported driver interface. And while I can see some of the justifications for it, I think after nearly 20 years they are starting to wear a little thin. The way other out of tree subsystems, like ZFS, have been treated is appalling. Deliberately using the GPL as a cudgel to beat other projects into submission is just plain nasty, and to treat other open source projects this way is spiteful and unnecessary.
> Linux is the only mainstream kernel, free or proprietary, which doesn't have a sane and supported driver interface.
I don't believe OpenBSD or FreeBSD has such an interface either, and I'd classify those as mainstream kernels.
> However, what I don't see is that nVidia should have any obligation (ethical, legal or otherwise) of any sort to provide a Linux driver at all.
Well, they don't. But they have a commercial need for Linux support: their recent revenue growth has been in datacenter usage of GPUs, and Linux support is necessary for that computing environment. To that end, IMHO NVidia is the entity that needs to align its development practices to support the existing Linux driver community rather than Linux needing to align its process to accommodate NVidia's wishes.
This isn't to say that Linux has been right in its treatment of other external entities: the treatment of ZFS, as you note, does seem to be driven more by spite than actual technical reasons. But in the case of NVidia, there is definitely a good deal of fault on NVidia's side that needs to be acknowledged.
My understanding is that FreeBSD freezes its kernel ABI for each major release and this enables third-party drivers to be written against it and supported for that major release. There are out-of-tree graphics drivers in the ports tree making use of this today. Both AMD and Nvidia drivers from what I can tell. Other non-graphics drivers as well.
A freeze for a couple of years seems like a reasonable compromise to me. It's not forever, but it's long enough for third-parties to develop, validate and support a product for. The closest Linux gets to this is vendor kernels which are both vendor-specific, and highly-custom and badly dated. This at least gets you something reasonably current and supported to work with.
No obligation, yes. Nvidia are messing things up for Nouveau and don't upstream their own driver. So Linux users don't have any obligation to use Nvidia either ;) I'd say good riddance, AMD and Intel are doing a good job.
Also, imagine Linux didn't enforce the GPL. Would Nvidia be more motivated to upstream their driver? Not in the least. They don't care about the GPL. What they want is leverage over the market, so they can charge more for industrial usage by forcing use of the blob on their terms (which an upstreamed driver would prevent, by giving anyone the ability to use the hardware on any terms). That's their whole motivation in this as far as I understand.
There is 1. the political aspect of creating an incentive for vendors to inline their drivers and 2. the technical aspect of not having to provide a stable ABI for kernel drivers.
About 2: I have no idea how difficult this is. Microsoft makes it look easy, but on the other hand, Apple is about to remove any 3rd party code out of their kernel.
About 1: I believe that this hurts more than it helps. Expecting nVidia to move their OS-agnostic codebase into the Linux kernel sounds like a massive effort with no real benefit for them. That seems unlikely to happen. With, for example, OpenZFS, this might be possible if Oracle relicenses their last open-source version. Still, I don't see the Android OEMs budging on their stance of releasing binary blobs alongside their SoCs.
Android is currently the largest install base of Linux kernels in the world, and the lack of a stable ABI means that none of these devices can ever run the latest and greatest Linux version during their lifetime. It is easy to shove all the blame onto Qualcomm, MediaTek, etc., but we also have to see it as a failure of Linux as a platform.
Not every participant of a platform will always play nice, and these kinds of issues could all be solved by providing a stable ABI.
> About 1: I believe that this hurts more than it helps. Expecting nVidia to move their OS-agnostic codebase into the Linux kernel sounds like a massive effort with no real benefit for them. That seems unlikely to happen.
Nouveau is already upstream. Nvidia should just stop messing it up and allow it to access GPU registers for proper reclocking. So there is no valid argument about difficulty at all. It's purely political and anti-competitive on Nvidia's part.
Sure, neither nVidia nor Linux have any obligation to support each other.
But given that nVidia makes billions every year selling hardware to companies to run CUDA on Linux I would argue that it's nVidia that's acting entitled rather than the Linux developers.
So basically the plan is to fuck over nVidia until they just give you all their code?
Imagine if that behaviour came from Google, Microsoft or others: constantly breaking Mozilla's or anyone else's code until they changed their software license.
I'd say the plan is to mostly ignore them on Linux until they become completely irrelevant or will become a good citizen and upstream their driver. I'm OK with Nvidia fading on Linux into irrelevance with their blob - good riddance. If they don't care about doing things on Linux properly, Linux users and developers don't need to care about Nvidia.
That's been the case for Linux users for quite a while already. Only those who are stuck with CUDA lock-in need to deal with this mess these days.
Nouveau developers expressed all this best:
> Moral of the story... just get an Intel or AMD board and move on with life. NVIDIA has no interest in supporting open-source, and so if you want to support open-source, pick a company that aligns with this.
> I'm OK with Nvidia fading on Linux into irrelevance with their blob - good riddance.
The vast majority of Linux use is on servers. When servers need GPUs it's usually for machine learning or other computation. Nvidia currently holds almost all of that market.
All dropping Nvidia support would accomplish is forcing everyone in that market to switch to Windows. Cloud revenue would drop significantly for Linux companies like RedHat, which would in turn mean they have less spare money to work on desktop distros. So as much as I despise Nvidia, until AMD/RTG get their shit together AND (yes, we need both) Intel's dGPUs come out (and don't suck), we desperately need Nvidia support.
I mentioned above that some are still stuck with CUDA, but for the most part I was talking about desktop Linux users who can for all practical purposes ignore Nvidia today.
But in general I agree, the sickening CUDA lock-in situation has to change. AMD and Intel are working on it so things are improving, though slowly.
I'm on Fedora and have been for 10 years. My Nvidia card was installed in 2012. It's always worked fine with the proprietary Nvidia driver up until I got kernel 5.9. My system has been hell since then. I had to switch back to Nouveau, and now I get crashes in Xwayland and Nouveau several times a day. This is bad enough I'm about to order an AMD card on Amazon and ban Nvidia from my life.
I know Nvidia is not very good for OSS, but when you’re doing a lot of ML work it’s just simply so much better to be able to use Nvidia’s toolkit, it’s just so much more mature and better supported.
It won't bother them for a very long time. Enterprise systems and installations rarely follow latest kernels and packages.
CentOS 8 still uses a 4.18-based kernel, for example, and will use it for a long time.
I manage such systems, and unless something explicitly needs 5.9, we won't upgrade them. If it's working, it's working. Everything is isolated from the outside network, so external threats are not very important. We're also an academic installation, so there's nothing sensitive inside.
Edit: This doesn't mean that we won't security-patch our systems. That's done pretty quickly and regularly. It's different.
[1] https://www.engadget.com/intel-xe-gpu-gamers-130000411.html
[2] Yes, ImgTec/PowerVR is getting back into the discrete GPU game! https://www.imgtec.com/blog/back-in-the-high-performance-gam...