
> The reality is even cutting edge games and consumer workloads don’t actually take full use of the PCIe bandwidth of the GPU or the bandwidth of its GDDR memory

Game dev here. For anyone reading this - it’s not because we’re lazy, it’s because _it’s really hard to do_.

One of the biggest differences between the current generation consoles and the current gen PCs is unified memory.




I live with a game dev myself, so I get it. Hell, it's hard even for PC developers who want to do things without leaning on abstraction layers or existing engines. Managing multiple discrete memory pools, asset swaps or calls between them, getting the respective subsystems to exchange data at just the right time so as not to impact other code and drag down performance - it's fucking hard in general.

A unified pool of memory makes that both easier and far more flexible, which frees up developer time and bandwidth to focus on other, more important tasks.


How much of that difficulty comes from the chosen game engine? I assume the engine is the primary factor in how resources are allocated.

Both lots and none at the same time. The engines definitely make decisions for you, but with Unreal (for example) you can modify the RDG (Render Dependency Graph) any way you see fit.

The problem is that when you need something on the GPU you have to go through RAM first (unless you have DMA, which is a more recent addition). That doesn’t just add latency; it also adds an extra step of cache invalidation, so you have to plan for that from the highest level of gameplay. If you need to prepare for a GPU memory miss _and_ a CPU memory miss as a worst case all the time, it’s very hard to make good use of the bandwidth in the best case.


A related question you need to follow that with is the cost of switching the whole studio to another engine that's technically better or, if you're proposing that each studio tailor-make its own engine, the cost of that engineering, presuming they have or can learn the expertise to surpass whatever they're using currently.

I'm not a game developer, but there would also seem to be a link between resource usage by the engine and whatever content the production side is making. For all the commentary about how brilliant the id Software engines are, if you examine the levels you pass through they're also very efficient with what they demand out of the engine - it's like an orchestra playing well together, not one instrument that means you can do anything.


I think much of the difficulty is just that, for example, the 1.8 TB/s of an RTX 5090 is a lot of bandwidth for a game to use. That's over 50,000 4k textures per second at 32bpp.

I agree with you in theory. A couple of points - that’s currently the most expensive and highest-performing card on the market. Most people on Steam are using an RTX 3060, which has more like 360GB/s. That’s a factor of 6. How do you design resource usage that scales across that wide a range? (We try to, fwiw).

That spec is also a throughput measured per second whereas our frame rates are much higher than 1/s. At 60hz, that’s now between 140 and 800 textures a frame. If you miss _one_ you don’t get that back.
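For a rough sense of where per-frame numbers like these come from, here is a minimal Python sketch. It assumes uncompressed 32bpp textures with no mip chain, decimal units (1 GB = 1e9 bytes), and two possible readings of "4k" (3840x2160 and 4096x4096). The function names are made up for illustration. The results land in the same ballpark as the range quoted above, not exactly on it.

```python
# Back-of-the-envelope texture budget per frame.
# Assumptions (not from the thread): uncompressed textures, no mipmaps,
# decimal units, and two readings of "4k".

def texture_bytes(width, height, bits_per_pixel=32):
    """Size in bytes of one uncompressed texture."""
    return width * height * bits_per_pixel // 8

def textures_per_frame(bandwidth_bytes_per_s, tex_bytes, fps=60):
    """How many whole-texture reads fit into one frame at the given bandwidth."""
    return bandwidth_bytes_per_s / tex_bytes / fps

GPUS = {
    "RTX 5090 (1.8 TB/s)": 1.8e12,   # figure quoted in the thread
    "RTX 3060 (360 GB/s)": 360e9,    # figure quoted in the thread
}
TEXTURES = {
    "3840x2160": texture_bytes(3840, 2160),
    "4096x4096": texture_bytes(4096, 4096),
}

for gpu_name, bandwidth in GPUS.items():
    for tex_name, size in TEXTURES.items():
        budget = textures_per_frame(bandwidth, size)
        print(f"{gpu_name}, {tex_name}: about {budget:.0f} textures per frame at 60 Hz")
```

Raising `fps` shrinks the budget proportionally, which is the point being made about frame rates above 60.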

A single main character in a game can be 2-5 regular textures, plus all of the extra mapping textures we have these days. Now do landscapes, environments, props, background videos, and it all adds up. 4k textures are pretty universally used. If you look at a tiny object up close we need a higher res texture to be able to show it neatly.

You also have memory pressure - raytracing makes heavy use of VRAM so you have to make the tradeoff of how much do you want to allocate to caching lighting, vs how much you want to keep textures and geo around.

Lastly, as you say, actually keeping up with 360GB/s from the CPU side is tough. If you require any transformation or CPU operations that’s just not going to happen. If you need to pull from disk, even on an NVMe drive reading synchronously, the max throughput is < 10% of that, and that assumes you are actually reading 360GB from disk. If you pause to do anything else, you’ll significantly slow that down. Players also generally don’t like it if we thrash their NVMe disks :)


All good points.

Absolutely an RTX 3060 is a more normal gamer GPU than the 5090, but you're also not playing in 4k without DLSS on a 3060. Drop to the most common resolution on Steam (1080p), and turn on DLSS and you've basically cancelled out that 6x factor in bandwidth. Even if the 3060 had more bandwidth, it doesn't have enough processing power for native 4k gaming in typical games. So 360 GB/s is still a lot of bandwidth for the resolution most 3060 gamers are using.


Playing at 1080p doesn’t reduce your texture size, for the most part. You still use those 4k textures because you’re only seeing a subset of the texture projected at a close distance. We’re still using 4k textures for terrain brushes to cover the 6km open worlds.

DLSS isn’t just a magic on switch for free perfect upscaling. If you rendered at 720p and DLSS’ed up to 1080p it’s still going to look pretty rubbish. It’s always surprising to me just how many people have 1080p monitors though, given we’ve had more than that for two generations of consoles.

And lastly - all the same points still apply about frame rate (which can be more than 60), memory bandwidth per frame, cache invalidation, etc. at 360GB/s as they do at 1.8TB/s.


> Playing at 1080p doesn’t reduce your texture size, for the most part. You still use those 4k textures because you’re only seeing a subset of the texture projected at a close distance.

That greatly reduces your GPU memory bandwidth though. Sampling a subset of the texture only transfers that subset. Reading from higher mip levels uses less bandwidth. If your textures are high enough resolution to appear sharp at both resolutions (at least one texel per pixel), you need 4x more bandwidth to sample your material textures at 4k screen resolution for the same scene.
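A small sketch of the arithmetic behind this, assuming square power-of-two textures, 4 bytes per texel, and no block compression (all assumptions, not details from the thread). Each mip level holds a quarter of the texels of the level below it, and a 4k frame has four times the pixels of a 1080p frame.

```python
# Illustrative only: sizes of successive mip levels and the pixel-count
# ratio between two screen resolutions. Names are hypothetical.

def mip_level_bytes(base_side, level, bytes_per_texel=4):
    """Bytes in one mip level of a square texture (level 0 is full size)."""
    side = max(1, base_side >> level)  # each level halves the side length
    return side * side * bytes_per_texel

def pixel_count_ratio(res_a, res_b):
    """How many times more pixels res_a has than res_b."""
    return (res_a[0] * res_a[1]) / (res_b[0] * res_b[1])

for level in range(4):
    print(f"mip {level}: {mip_level_bytes(4096, level)} bytes")

print("4k vs 1080p pixel ratio:", pixel_count_ratio((3840, 2160), (1920, 1080)))
```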

More importantly, material texture sampling is not most of your bandwidth to begin with. At 4k, most of your bandwidth is going to your full screen render passes. Especially with deferred rendering.

> DLSS isn’t just a magic on switch for free perfect up scaling. If you rendered at 720p and DLSS’ed up to 1080 it’s still going to look pretty rubbish.

I don't find this true at all. DLSS 4 Balanced looks excellent and renders at less than 720p for 1080p output.
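As a quick check on the resolution claim, here is a sketch that assumes the commonly cited per-axis scale factors for DLSS modes (Quality about 0.667, Balanced 0.58, Performance 0.5). Those factors are an assumption here, not something stated in the thread, and individual games can override them.

```python
# Assumed per-axis render scale for each DLSS mode (not from the thread).
DLSS_SCALE = {
    "quality": 2 / 3,
    "balanced": 0.58,
    "performance": 0.5,
}

def render_resolution(out_width, out_height, mode):
    """Internal render resolution for a given output resolution and mode."""
    scale = DLSS_SCALE[mode]
    return round(out_width * scale), round(out_height * scale)

for mode in DLSS_SCALE:
    w, h = render_resolution(1920, 1080, mode)
    print(f"{mode}: {w}x{h}")
```

Under these assumed factors, Balanced at 1080p output renders below 720 lines, and Quality renders at 720.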


That sounds like a lot, but: modern renderers do between 20 and 40 passes, many of them in screen space. And each screen space pass typically reads from at least two input images, sometimes 3 or 4 even with optimally packed inputs. At 60fps that can quickly get up to way over 2000 full screen buffer reads per second, and more for less-than-optimal access patterns in some algorithms. That also doesn't account for texture access during shading passes, which are somewhat random memory accesses.
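The arithmetic behind that estimate, as a sketch using only the ranges given above (20 to 40 passes, 2 to 4 input images per pass, 60fps). It treats every pass as a full screen pass, which is a simplification.

```python
def fullscreen_reads_per_second(passes, inputs_per_pass, fps=60):
    """Full screen buffer reads per second if every pass reads every input once."""
    return passes * inputs_per_pass * fps

low = fullscreen_reads_per_second(passes=20, inputs_per_pass=2)
high = fullscreen_reads_per_second(passes=40, inputs_per_pass=4)
print(f"between {low} and {high} full screen reads per second")
```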

Very true, but I'll point out that even those 2000 full screen reads per second at 4k are only 4% of the 5090's bandwidth. Sacrificing some of that speed for a unified memory architecture seems like a good trade.
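A sketch of where a figure like 4% comes from, assuming a 3840x2160 buffer at 4 bytes per pixel and the 1.8 TB/s figure quoted earlier in the thread. The buffer format is an assumption; an 8 byte per pixel HDR format would double the result.

```python
def bandwidth_share(reads_per_second, width=3840, height=2160,
                    bytes_per_pixel=4, bandwidth_bytes_per_s=1.8e12):
    """Fraction of memory bandwidth consumed by full screen buffer reads."""
    bytes_per_read = width * height * bytes_per_pixel
    return reads_per_second * bytes_per_read / bandwidth_bytes_per_s

for reads in (2000, 8000):
    print(f"{reads} reads/s: {bandwidth_share(reads):.1%} of 1.8 TB/s")
```

This counts reads only. Writes, texture sampling, geometry, and BVH traffic would all come on top.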

Plus, DLSS can greatly reduce the bandwidth requirements for 4K gaming.


I'm being very, very conservative with my estimates here. Based on the renderers I know, I could have easily tweaked the numbers to go up to 8000 full screen texture reads per second. That doesn't include texture or geometry or BVH reads or any memory writes. That is all in addition to those operations.

But do you think you'll reach 1.8 TB/s?

Quite likely, but the transfer throughput is required in bursts, not necessarily continuously.

Let me put it this way: what I care about is how quickly data arrives after a bunch of shader threads request it. Throughput is one way for hardware to reduce that time. The other way is to hide the latency (GPUs do a lot to keep themselves busy while waiting for memory), but those strategies can only do so much.

Lower memory throughput almost always leads to a longer runtime of GPU calls in practice, and thus lower update rates.


Empirically, these benchmarks are showing it doesn't make much difference once you reach this level of bandwidth: https://www.tomshardware.com/pc-components/gpus/early-rtx-50...

What? It's incredibly easy to take full use of memory bandwidth. For example, put proper volumetric smoke/fire/explosion sim in your game. But game developers don't do that because they are lazy.

No, we don’t do it because the tradeoff isn’t worth it. A GPU-based particle sim is very difficult to do well - it’s easy (but computationally expensive) to do a volumetric sim, but when you want that simulation to interact with world geometry correctly it comes with an explosion in complexity and performance cost.

I promise you we want our games to look as good as you want them to look.


How does interaction with world geometry come with an explosion in complexity and performance? Advection has almost the same cost regardless of whether some cells are solid. It's one extra line in your shader plus 1 bit per cell. Use JFA (jump flooding) to build the solid mask.
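To make the claim concrete, here is a minimal CPU sketch in Python of a semi-Lagrangian advection step with a per-cell solid mask. It is an illustration only: it is not shader code, it uses nearest-cell sampling instead of interpolation, the solid mask is passed in directly (no JFA), and every name in it is hypothetical. It covers the advection step alone.

```python
def advect(density, vel_x, vel_y, solid, dt=1.0):
    """One semi-Lagrangian advection step on a 2D grid (lists of rows).

    solid[y][x] is True for cells occupied by geometry.
    """
    height = len(density)
    width = len(density[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if solid[y][x]:
                continue  # mask check: solid cells hold no density
            # Trace backwards along the velocity to find the source cell,
            # clamping to the grid edges.
            src_x = min(width - 1, max(0, round(x - dt * vel_x[y][x])))
            src_y = min(height - 1, max(0, round(y - dt * vel_y[y][x])))
            # Never pull density out of a solid cell.
            out[y][x] = 0.0 if solid[src_y][src_x] else density[src_y][src_x]
    return out

# A 1x4 strip with density in the first cell and flow to the right.
density = [[1.0, 0.0, 0.0, 0.0]]
vel_x = [[1.0] * 4]
vel_y = [[0.0] * 4]
no_walls = [[False] * 4]
one_wall = [[False, True, False, False]]

open_result = advect(density, vel_x, vel_y, no_walls)
walled_result = advect(density, vel_x, vel_y, one_wall)
print(open_result)
print(walled_result)
```

With no walls the density moves one cell to the right; with a solid cell in the way it does not pass through.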


