Fedora 38 LLVM vs. Team Fortress 2

admax88qqq · on April 24, 2023

Unfortunately this is exactly the type of stuff that makes supporting commercial apps on linux a nightmare. Weird crashes due to weird linking of system libraries.

Common distros are very adamant about dynamic linking everything in order to support the use case of "core library has vulnerability, upgrade it in place without rebuilding consuming apps." Along with a desire to avoid "dll hell" and force a single canonical version of every library systemwide. This leads to these sorts of issues.

Windows gets around it by letting applications put the DLLs they care about beside the executable, and having it check there first by default.

mjg59 · on April 25, 2023

This is largely solved with approaches such as Flatpak or Snap, but graphics drivers are still an issue - they're expected to be supplied by the distribution, and components of them end up in-process in the application even if the rest of the application's runtime is shipped with the application. If there's an incompatibility between the application runtime and assumptions made by that driver code (as there appears to be in this case - TF2 ships its own malloc() implementation, but the graphics driver code ends up using it inconsistently and so blows up) then you're going to have problems.

I don't think there's anything about Windows that would fundamentally change things here. Windows apps aren't shipping their own graphics drivers, even if they're bundling everything else.

cesarb · on April 25, 2023

> TF2 ships its own malloc() implementation, but the graphics driver code ends up using it inconsistently and so blows up) then you're going to have problems. I don't think there's anything about Windows that would fundamentally change things here.

Yes, there is: on Windows, due to the way DLL linking works there, the graphics driver wouldn't use the malloc() implementation from TF2. The flat linking namespace in which you can globally replace the memory allocator for every dynamic library does not exist on Windows; if the graphics driver is linked to the memory allocator from the C library, it will get the memory allocator from that C library, not from some other DLL in the same process.

That's not to say Windows is free of dynamic linking problems. While on Linux it's mostly only NSS and the graphics driver (and only when explicitly requested), on Windows it's common for unrelated third party software to inject DLLs and threads all over every process on the system. And it's not uncommon for these injected DLLs to do things like hooking into system DLLs (by overwriting the entry point of exported functions, or even internal functions), leading to hard-to-diagnose crashes when things are not like they expected.

eklitzke · on April 25, 2023

Linux does not have a flat linking namespace (for example, see RTLD_NEXT in the dlsym man page). If it did, this wouldn't be a problem because everything would use the libc malloc or everything would use the tcmalloc implementation. It's just that glibc exports pretty much everything as a weak symbol, most programs and libraries that link against libc bind to whichever definition ld.so finds first in the object search order, and messing with linker settings is the last thing most developers want to think about.

I just read TFA so I haven't dived into all the details but my wild speculation is that the graphics driver is doing an aligned new and an unaligned delete. The alignment parameters in the new and delete give them different symbol names but you're free to implement them by dispatching to the unaligned version, so this wouldn't necessarily cause a problem (even in C++17 mode) if the malloc implementation doesn't actually handle them differently. However if tcmalloc was built for C++14 then it would only have the unaligned operators, and therefore the aligned new would resolve to the glibc implementation and the unaligned delete would resolve to the tcmalloc implementation, or something like that.

pjmlp · on April 25, 2023

Not only Windows: AIX is also COFF-based (it uses XCOFF) and has a similar approach to dynamic libraries, including export definitions and import files.

yrro · on April 25, 2023

Isn't this what protected linkage is for?

Dalewyn · on April 25, 2023

>I don't think there's anything about Windows that would fundamentally change things here. Windows apps aren't shipping their own graphics drivers, even if they're bundling everything else.

Over on Windows, GPU drivers are provided and distributed by the manufacturer, Microsoft themselves might also distribute them through Windows Update.

GPU manufacturers also work together with Microsoft and bigger game dev studios (read: studios with sufficient cash/influence) to make sure everything works well together. The drivers are also signed off by Microsoft, both figuratively and literally.

Linux has none of this. Drivers are provided primarily by volunteers (most of whom couldn't care less about proprietary code), packaged and distributed by each distro, and most game devs couldn't care less about issues concerning less than one percent of their customers.

pedrocr · on April 25, 2023

Getting consistent quality out of Linux drivers is usually much easier than Windows ones exactly because they're not supplied by a random hardware maker but are almost all upstream in the kernel. GPU drivers used to be a nightmare for this exact reason and are now finally all becoming properly upstream. Meanwhile the horror stories about Windows GPU drivers are all too common. Having to "clean up" old drivers. Having to stay on older versions because the new ones have a relevant broken feature, etc.

zamnos · on April 25, 2023

Let's be real, the two contenders for GPU performance (which gamers appreciate) are Nvidia, which is not in the kernel, and AMD, which has an in-kernel driver, plus their own proprietary driver, same as Nvidia. TF2 is old enough that it's runnable even on old Intel hardware, but let's not pretend that "almost all [drivers are] upstream in the kernel", so long as Nvidia and AMD both have proprietary, out-of-kernel-tree drivers, which means that the "windows GPU horror stories" are also Linux GPU driver horror stories. Having to "clean up" old drivers. Having to stay on older versions because the new ones have a relevant broken feature, or the best: staying on an older version because the newer one doesn't support your hardware.

I've spent way too long on Nvidia's legacy unix driver page[1], but only Nvidia cards have Cuda support, or the performance, so I'm stuck there.

[1] https://www.nvidia.com/en-us/drivers/unix/

nemetroid · on April 25, 2023

The proprietary AMD driver (AMDGPU-PRO) is only needed (or even an improvement over the open source version) in a few niche use cases. The majority of users are better off using the open source driver.

> These days our packaged drivers are mostly intended for:

> - customers using slower moving enterprise/LTS distros which do not automatically pick up the latest graphics drivers - we offer them both open source and proprietary/workstation options

> - customers using workstation apps who need the extra performance/certification from a workstation-oriented driver (although Marek has done a lot of great work over the last year to improve Mesa performance on workstation apps)

> The third target audience is customers looking for ready-to-go OpenCL, either for use with the packaged open/closed drivers or with the upstream-based stack in a recent distro.

https://www.phoronix.com/forums/forum/linux-graphics-x-org-d...

pedrocr · on April 25, 2023

GPU drivers are notorious for being out of tree and even there most GPU drivers over the years have been upstream. Even if you limit to just GPUs AMD are not the same as Nvidia. The in-kernel AMD driver is the primary one these days and even Nvidia has finally relented and is finally helping move at least the kernel driver upstream. That Nvidia's closed-source driver has been a mess for all these years is just a further example of how the Linux model is much superior to the Windows one.

gabcoh · on April 24, 2023

Can Linux not trivially do the same thing as windows with LD_PRELOAD? If so why is this more of an issue on Linux than Windows? Is it really less a technical challenge and more just a matter of Linux getting less support from upstream developers?

stabbles · on April 24, 2023

LD_PRELOAD is too global to be useful, it's hard to scope it to one process (and not child processes). macOS is better in the sense that it clears DYLD_* variables when the dynamic linker has done its work and the process starts. (Although that can also be painful when you want to run a shell script and set DYLD_* outside)

nly · on April 24, 2023

You can compile binaries with additional relative library paths in to them that will take priority over /usr/lib64

josephg · on April 24, 2023

How? Maybe this should be better documented & recommended. I suppose at some point you're just statically linking with more steps - though for a problem like this it might be worth it.

xioxox · on April 25, 2023

See the ELF rpath, which can be set by the linker. It can also be modified afterwards using patchelf.

yrro · on April 25, 2023

https://sourceware.org/binutils/docs-2.40/ld/Options.html#in...

ywei3410 · on April 25, 2023

The other comments have already covered the how, but I'd like to add that the mechanism is used extensively in Nix [1].

[1] https://nixos.org/

rkeene2 · on April 25, 2023

$ORIGIN has been pretty well-known and documented for a very long time

ungamedplayer · on April 25, 2023

You can set it in the environment for a single process.

bravetraveler · on April 24, 2023

I was thinking/wondering this myself. Not to reinvent the wheel - more toss an idea around, but a 'venv for LD_PRELOAD' sounds like it'd deal with this pretty handily

Not... in a way I'd use as a distribution/release maintainer. Probably as an administrator [of my LAN]

gabcoh · on April 24, 2023

Such things already exist. Eg. Appimage or even docker.

lnxg33k1 · on April 24, 2023

and even that has been managed to be split between snap appimage and flatpak :D

(sorry not meant to offend, long time linux day-to-day user here, but it was just ironic for me to point out fragmentation of fragmentation ^^)

bravetraveler · on April 24, 2023

Right, but I don't really want to get into a distribution model - the hack suits me fine :)

More an exercise in curiosity than anything

Flatpak (or Snap, ew) probably deals with it fine today, Steam's there

0x457 · on April 25, 2023

That's Nix with extra steps.

bravetraveler · on April 25, 2023

I specifically said I'm not really trying to solution this, lol. More toying with the LD_PRELOAD aspect than anything

Nix is neat, and I don't think I've used it enough to be too critical - but in some ways it feels like 'extra steps'

I wanted to make a 'reproducible' installation (ala kickstart, not strictly binary)... but it felt very much like distribution work; declaring dependencies and the like

0x457 · on April 25, 2023

Oh, nix is an extra mile. A lot could be improved, but that's what I'm using to deal with dependencies.

bravetraveler · on April 26, 2023

Gotcha, I don't feel so floundering now!

I plan to spend more time with it, I see a lot of merit

The amount of control is great, but the docs could use some work. For my simple goals (install Sway, Ansible, some other things) it was a broadsword when I need a butter knife

0x457 · on April 26, 2023

What sold me on nix is home-manager and flakes: I can easily bootstrap my environment anywhere nix is available.

never_inline · on April 25, 2023

There are tools which overwrite linked libraries, eg: chrpath.

LispSporks22 · on April 24, 2023

It can be done by setting rpath to origin, even post compilation using the patchelf tool. Works great with C shared libraries. Perhaps ABI issues with C++ shared libs introduce other problems.

JonChesterfield · on April 24, 2023

With the warning that rpath!=runpath, both are called rpath, and which you get depends on your linker and whether you also pass -Wl,--disable-new-dtags

Runpath is the default, and also the one that is non-transitive and overridden by environment variables.

admax88qqq · on April 25, 2023

Yes linux _can_, the machinery is there, but culturally the common distros do not. And the defaults do not. On windows I can literally drop a DLL next to an executable and it will pick it up. On linux I have to do a wrapper script to set LD_PRELOAD, or mutate the binary's rpath to get it to load.

It's not really a question of capability, but a question of culture and defaults that makes linux hard to support.

Debian for example goes through great pains (or used to at least) to unbundle shared libraries such as openssl from projects like chromium.

aidenn0 · on April 24, 2023

This sounds like it's an interaction with the GPU driver though, which could also happen on windows...

ho_schi · on April 24, 2023

Aehm. That is what a lot of closed-source applications do on Linux. And Valve does that, too.

The open-source ones are maintained in the packing system and kept lean.

doublepg23 · on April 24, 2023

The funny thing is that on Fedora in 2023 I don't feel like I'm missing out on most software.

nsajko · on April 25, 2023

Completely off base. If you want to distribute your application to users yourself (instead of letting the distro take care of that), then distribute all dependencies together with it.

marcthe12 · on April 25, 2023

There are a few dependencies that can not be easily vendored (at least it's not recommended). Mesa is probably the biggest example (and this case was caused by a Mesa dependency). You can technically vendor them or even statically link them, but then you might end up with limited hardware support. The only alternative is to set up a mini OpenGL distro.

q3k · on April 25, 2023

Which libGL.so should I be distributing alongside my application?

sebazzz · on April 25, 2023

> Unfortunately this is exactly the type of stuff that makes supporting commercial apps on linux a nightmare. Weird crashes due to weird linking of system libraries.

That is the true reason containers were born, isn't it? The kernel is perfect, the public interface of the kernel is perfect. Userspace is a mess. Let's fix it by adding a layer between and have a userspace per application.

eikenberry · on April 24, 2023

Isn't this exactly the use case for which flatpaks are designed? Isn't Redhat/Fedora in the process of adopting them as the primary way to support third party/proprietary graphical apps like Steam? Doesn't the current Steam flatpak avoid this issue?

TLDR; isn't this already addressed?

MichaelZuo · on April 25, 2023

It would appear so, I don't understand why this blog post is so popular.

olliej · on April 24, 2023

This is a predictable outcome of overriding the global operator new. It remains annoying that this was ever allowed, and is a constant source of pain for c++ standard library implementations.

DannyBee · on April 24, 2023

It actually should still work, since fedora38 includes the llvm15 versioned libs.

The only way to make this break is if something is loading random unversioned solibs or whatever the latest one it can find is, and expecting this to work forever.

If it actually used a versioned solib, it would get llvm 15 just like it did before.

This is the whole point of versioned solibs.

olliej · on April 25, 2023

Versioning does not solve the problem.

The aligned allocation operators have existed since llvm 8.x.

The problem is not that the aligned allocation APIs are new. The problem is that TCMalloc is only partially replacing the global allocation APIs, it's just taken until this year for that bug to be exposed.

What has happened is presumably some part of the OS has updated its target C++ version so is now using the aligned allocators, which exposes the gap in TCMalloc.

I'm not sure if the spec explicitly allows an aligned allocation to be fed into an unaligned operator delete, but it seems like implementations do, so that's probably why adopting aligned operator new wouldn't be seen as an ABI break.

DannyBee · on May 1, 2023

Thanks. This at least makes more sense than what I read, and explains what might actually have happened :)

phkahler · on April 24, 2023

It seems more like the app and driver are mixing their new/delete pairs. That seems like a bug to me. Maybe even an API design issue if it's supposed to happen.

mmh0000 · on April 24, 2023

I loved the premise of the article, though I really wish the author had gone into detail about how he discovered the root cause.

boutique · on April 25, 2023

Funnily enough, on Half-Life 1 engine-based games (i.e. the engine that came before HL2 - on which Team Fortress 2 runs; such as Counter-Strike 1.6), a different allocator problem exists -- glibc's malloc() just decides to fail miserably[0] on some setups.

[0] https://github.com/ValveSoftware/halflife/issues/3158

kimixa · on April 25, 2023

that's exactly the sort of error you get if something has written just out of bounds on a malloc'd chunk - it clobbers the allocator's internal state, which appears to be what that assert() is checking.

It's probably an allocation before the failing one that is being misused - so the backtrace pointing to openal doesn't necessarily mean it's openal's fault.

Running with valgrind or another heap memory checking tool will probably be helpful to track down that particular linked bug.

EDIT:

It looks like there's at least one out-of-bounds write when starting up Half-Life (on Arch Linux, so maybe slightly different library versions, and without the Counter-Strike mod loaded).

It looks like a valve bug - writing 2 bytes at index [30] of a malloc'd size of 31 goes one byte over, and it looks from the backtrace it's all valve's code and not deep in some library that might have been loaded in. Writing 2 bytes to a string is a bit odd, perhaps it's trying to null-terminate but somehow uses a wstring null? Or some attempt at SIMD that isn't correctly bound?

It doesn't seem to crash for me though, it might just be luck that nothing important is put 1 byte over, and it feels a bit unlikely something would be due to allocation and type alignment requirements, but it's perfectly valid for the malloc implementation to keep something important in that byte.

Or perhaps there's some other dynamics that change this - it looks like it's doing stuff with paths, so may change size (of the allocation or even the amount written) based on where the steam app is installed - stuff like your user name length changing that may be the difference between a crash. Or even another issue somewhere else I didn't see, or valgrind didn't catch.

Just goes to show how many games ship for years with "big" bugs :P

For reference:

  ==27467== Invalid write of size 2
  ==27467==    at 0x406526A: GetSteamContentPath() (pathmatch.cpp:523)
  ==27467==    by 0x4065927: pathmatch(char const*, char\*, bool, char*, unsigned int) [clone .part.1] (pathmatch.cpp:594)
  ==27467==    by 0x4066849: pathmatch (pathmatch.cpp:541)
  ==27467==    by 0x4066849: CWrap (pathmatch.cpp:685)
  ==27467==    by 0x4066849: __wrap___xstat (pathmatch.cpp:907)
  ==27467==    by 0x406294A: stat (stat.h:455)
  ==27467==    by 0x406294A: CFileSystem_Stdio::FS_stat(char const*, stat*) (FileSystem_Stdio.cpp:225)
  ==27467==    by 0x4060819: CBaseFileSystem::AddPackFiles(char const*) (BaseFileSystem.cpp:1325)
  ==27467==    by 0x4060AA4: CBaseFileSystem::AddSearchPathInternal(char const*, char const*, bool) (BaseFileSystem.cpp:254)
  ==27467==    by 0x4060B37: CBaseFileSystem::AddSearchPath(char const*, char const*) (BaseFileSystem.cpp:186)
  ==27467==    by 0x8049003: main (launcher.cpp:413)
  ==27467==  Address 0x45e5f4e is 30 bytes inside a block of size 31 alloc'd
  ==27467==    at 0x4041714: malloc (vg_replace_malloc.c:393)
  ==27467==    by 0x4357C4A: strdup (strdup.c:42)
  ==27467==    by 0x42F1A76: realpath_stk (canonicalize.c:410)
  ==27467==    by 0x42F1A76: realpath@@GLIBC_2.3 (canonicalize.c:432)
  ==27467==    by 0x406525B: GetSteamContentPath() (pathmatch.cpp:520)
  ==27467==    by 0x4065927: pathmatch(char const*, char\*, bool, char*, unsigned int) [clone .part.1] (pathmatch.cpp:594)
  ==27467==    by 0x4066849: pathmatch (pathmatch.cpp:541)
  ==27467==    by 0x4066849: CWrap (pathmatch.cpp:685)
  ==27467==    by 0x4066849: __wrap___xstat (pathmatch.cpp:907)
  ==27467==    by 0x406294A: stat (stat.h:455)
  ==27467==    by 0x406294A: CFileSystem_Stdio::FS_stat(char const*, stat*) (FileSystem_Stdio.cpp:225)
  ==27467==    by 0x4060819: CBaseFileSystem::AddPackFiles(char const*) (BaseFileSystem.cpp:1325)
  ==27467==    by 0x4060AA4: CBaseFileSystem::AddSearchPathInternal(char const*, char const*, bool) (BaseFileSystem.cpp:254)
  ==27467==    by 0x4060B37: CBaseFileSystem::AddSearchPath(char const*, char const*) (BaseFileSystem.cpp:186)
  ==27467==    by 0x8049003: main (launcher.cpp:413)

boutique · on April 26, 2023

Thanks a lot for the guidance/tip, I've learned something new. And you're absolutely right about the cause of the mentioned crash -- I've updated the Github issue with a bit of new info I've gathered.

Regarding the function, here it is: https://github.com/dreamstalker/rehlds/blob/master/rehlds/fi...

Interestingly, strdup gets compiled into:

  89 04 24           mov   [esp+101Ch+name], pszContentPath ; s
  E8 82 DC 00 00     call  strlen
  66 C7 04 03 2F 00  mov   word ptr [pszContentPath+eax], 2Fh ; '/'

Which is basically:

  *(_WORD *)&pPath[strlen(pPath)] = '/';

and would explain why Valgrind says it goes one byte over.

kimixa · on April 26, 2023

Yeah, looks like the Q_strcat(pszContentPath, "/"); is invalid, as glibc has only allocated exactly enough to fit the path in the buffer returned by realpath().

The compiler seems to completely inline the strcat and write the '/' and null as a single 2-byte word write, the null then being out of bounds of the malloc'd chunk and likely causing the error as it overwrites something important.
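As a minimal sketch of how the overflow described above could be avoided: append the separator into storage that is sized for it, rather than into the exact-fit buffer returned by realpath(). Both helper names here are hypothetical and are not taken from Valve's code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <string>

// Hypothetical helper (not Valve's code): returns `path` with one trailing
// '/', letting std::string grow its own buffer as needed.
std::string with_trailing_slash(const char* path) {
    std::string result(path);
    if (result.empty() || result.back() != '/') {
        result.push_back('/');
    }
    return result;
}

// C-style variant of the same idea. The caller releases the result with
// std::free(). Unlike the std::string version, it always appends a '/'.
char* dup_with_trailing_slash(const char* path) {
    const std::size_t len = std::strlen(path);
    // +1 for the '/', +1 for the terminating NUL: the two bytes that the
    // inlined strcat writes as a single 16-bit store.
    char* out = static_cast<char*>(std::malloc(len + 2));
    if (out == nullptr) {
        return nullptr;
    }
    std::memcpy(out, path, len);
    out[len] = '/';
    out[len + 1] = '\0';
    return out;
}
```

Either variant keeps the 2-byte write inside the allocation, because the buffer is sized for the separator and the terminator up front.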

Interestingly, the open group spec says that a null argument to realpath is "Implementation defined" [0]

And the linux (glibc) man pages say it allocates a buffer "Up to PATH_MAX" [1]

I guess "strlen(path)" is "Up to PATH_MAX", but the man page seems unclear - you could read that as implying the buffer is always allocated to PATH_MAX size, but that's not what seems to be happening, just effectively calling strdup() [2]. I have no idea how to feed back to the linux man pages, but might be worth clarifying there.

[0] https://pubs.opengroup.org/onlinepubs/009696799/functions/re...

[1] https://linux.die.net/man/3/realpath

[2] https://github.com/bminor/glibc/blob/0b9d2d4a76508fdcbd9f421...

amluto · on April 24, 2023

It should be straightforward to make a little LD_PRELOAD shim to implement the new operator new on top of old overloads and thus restore proper functioning.

It would be a gross kludge, though.

olliej · on April 24, 2023

I'm not sure that's sound. You can't just redirect an aligned new to the unaligned operator new as you may get unaligned result. It _sounds_ like what is happening is

    a = ::operator new(some size, some alignment)
    ...
    ::operator delete(a);

where delete is dropping the align_val_t parameter that would guarantee it hits the same allocator family. There are a variety of ways this can happen, and let's just take it as given that it is.

The problem is that if operator new(size_t, align_val_t) is called then the struct has an alignment annotation. That can lead to codegen that reasonably assumes alignment, even without any source level decisions that depend on alignment. The result of having some equivalent of (either at runtime or link time)

    void * operator new(size_t sz, align_val_t a) {
      if (operator new(size_t) has been overridden) return ::operator new(sz); 
      ...
    }

could be an "aligned" allocation returning an unaligned value, causing crashes later on.

jenadine · on April 24, 2023

That's not sound in general, but it is "probably" going to work for this specific case because the previous build was built with an allocator that did not support this alignment, meaning that they did not need extra alignment. This is pretty rare actually. And with previous C++ versions you already had to use a custom allocator anyway to make it work.

olliej · on April 24, 2023

While I do agree with you, and think it's probably worth seeing if detecting the override and falling back to unaligned allocation works, the problem is not that the code in TF is compiling assuming/requiring over aligned data.

The problem is that there is system code that they are calling that is making use of over-aligned allocation, and so could be generating code dependent on said alignment. The failure mode can very easily be

    someSystemLibrary.so`someFunction:
      a = ::operator new(size, alignment)
      ...
      i_dunno_dma_memcpy_or_something(a, somewhere else)
      ...
      ::operator delete(a)

With no interaction with TF code at all. Except TF has replaced operator delete so that fails due to the allocator mismatch. If you make ::operator new(size_t, align_val_t) redirect to ::operator new(size_t) if it detects an override then the aligned operation can fail. The above example is moderately difficult to induce so it's more likely that there's an explicit split where the system is doing one half of new/delete and TF is doing the other, but the important thing is that it implies the system code is built aware of alignment and it depends on the alignment even if TF does not.

viraptor · on April 24, 2023

If you don't mind wasting a bit of time, you could forward size+alignment to the allocator, return the aligned version and keep a record of aligned-to-allocation mapping. (For freeing later)

But as the other comment mentioned - it shouldn't be a problem for TF2 in the first place, since that's not the behaviour they're after.

olliej · on April 24, 2023

> If you don't mind wasting a bit of time, you could forward size+alignment to the allocator, return the aligned version and keep a record of aligned-to-allocation mapping. (For freeing later)

I'm unsure what you're proposing here - the only methods you know in the replacement allocator are operator new(size_t) and operator delete(void*). The two possible failure paths are:

    a = ::operator new(some size)
    ...
    ::operator delete(a, alignment)

and

    a = ::operator new(some size, some alignment)
    ...
    ::operator delete(a)

In the first case what you could do is say "if I did not allocate this pointer, optimistically forward it to operator delete(void*)". In the latter case you can identify that a different operator new(size_t) exists, but you have no idea how to make that allocator produce an aligned allocation. What I guess you could do is round the size up to a multiple of the specified alignment, and then just repeatedly allocate in the hope that you will eventually get a correctly aligned value out. But that would not be guaranteed.

viraptor · on April 24, 2023

> and then just repeatedly allocate in the hope that you will eventually get a correctly aligned value out

If you preload something that patches all the new/delete interfaces, you can do this without guesswork.

    new(size, alignment) ->
      res=alloc(size+alignment)
      res_aligned=res+...
      offsets[res_aligned] = res

    new(size) -> alloc(size)

    free(ptr) ->
      free(offsets[ptr] || ptr)
      offsets.del(ptr)
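For readers who want the pseudocode above spelled out, here is a rough C++ sketch of the over-allocate-and-remember approach. It uses free functions in a hypothetical `shim` namespace instead of the real global operators, and it backs everything with malloc/free purely for illustration. A real preload shim would also need a side table that does not itself allocate through operator new, which this sketch does not attempt.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <mutex>
#include <unordered_map>

namespace shim {  // hypothetical names, for illustration only

// Maps each aligned pointer handed out to the raw pointer malloc returned.
inline std::unordered_map<void*, void*>& offsets() {
    static std::unordered_map<void*, void*> table;
    return table;
}

inline std::mutex& offsets_mutex() {
    static std::mutex m;
    return m;
}

// new(size, alignment): over-allocate, round up, remember the raw pointer.
// `alignment` is assumed to be a power of two.
inline void* alloc_aligned(std::size_t size, std::size_t alignment) {
    void* raw = std::malloc(size + alignment);
    if (raw == nullptr) {
        return nullptr;
    }
    const auto addr = reinterpret_cast<std::uintptr_t>(raw);
    const auto mask = static_cast<std::uintptr_t>(alignment) - 1;
    void* aligned = reinterpret_cast<void*>((addr + mask) & ~mask);
    std::lock_guard<std::mutex> lock(offsets_mutex());
    offsets()[aligned] = raw;
    return aligned;
}

// new(size): plain allocation, nothing to record.
inline void* alloc_plain(std::size_t size) {
    return std::malloc(size);
}

// free(ptr): release the recorded raw pointer if there is one, else ptr.
inline void release(void* ptr) {
    if (ptr == nullptr) {
        return;
    }
    void* raw = ptr;
    {
        std::lock_guard<std::mutex> lock(offsets_mutex());
        auto it = offsets().find(ptr);
        if (it != offsets().end()) {
            raw = it->second;
            offsets().erase(it);
        }
    }
    std::free(raw);
}

}  // namespace shim
```

The cost is one map lookup under a lock on every free, which is the "wasting a bit of time" part.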

olliej · on April 24, 2023

Haha, you've missed the issue. The question is what does the system do when someone overrides the builtin allocator functions, but does not override all of them.

You are absolutely correct that as a developer you can have your process override the allocator functions, and that is in fact what TF has done. The problem is that they have not overridden all of the allocation functions, and so they're crashing due to mismatching allocators being used. TF2 can "easily" fix this crash by implementing the aligned new, new[], delete, and delete[] operators in their custom allocator, or by simply removing their custom allocator's override of the global new & delete operators and using a common base class to get their faster allocator.

The question we're talking about is "how does the standard library respond to this scenario in a way that maximizes correctness?".
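To illustrate the first fix mentioned above (implementing the aligned operators alongside the plain ones), here is a hedged sketch of a replacement set in which every new form and every delete form ends up in the same allocator. `my_alloc` and `my_free` are hypothetical stand-ins backed by malloc/aligned_alloc, not tcmalloc's API. The sketch assumes C++17 and skips the new-handler loop a production implementation would have.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>

// Hypothetical allocator entry points. Both paths return memory that
// std::free() can release, so any delete form can free any new form.
static void* my_alloc(std::size_t size, std::size_t alignment) {
    if (size == 0) {
        size = 1;
    }
    void* p = nullptr;
    if (alignment <= alignof(std::max_align_t)) {
        p = std::malloc(size);
    } else {
        // aligned_alloc expects the size to be a multiple of the alignment.
        const std::size_t rounded = (size + alignment - 1) / alignment * alignment;
        p = std::aligned_alloc(alignment, rounded);
    }
    if (p == nullptr) {
        throw std::bad_alloc();
    }
    return p;
}

static void my_free(void* p) noexcept { std::free(p); }

// Plain forms.
void* operator new(std::size_t n) { return my_alloc(n, alignof(std::max_align_t)); }
void* operator new[](std::size_t n) { return my_alloc(n, alignof(std::max_align_t)); }
void operator delete(void* p) noexcept { my_free(p); }
void operator delete[](void* p) noexcept { my_free(p); }

// Aligned forms (C++17). These are the ones a C++14-era override lacks.
void* operator new(std::size_t n, std::align_val_t a) {
    return my_alloc(n, static_cast<std::size_t>(a));
}
void* operator new[](std::size_t n, std::align_val_t a) {
    return my_alloc(n, static_cast<std::size_t>(a));
}
void operator delete(void* p, std::align_val_t) noexcept { my_free(p); }
void operator delete[](void* p, std::align_val_t) noexcept { my_free(p); }

// Sized forms, so the compiler's choice of delete never escapes the set.
void operator delete(void* p, std::size_t) noexcept { my_free(p); }
void operator delete[](void* p, std::size_t) noexcept { my_free(p); }
void operator delete(void* p, std::size_t, std::align_val_t) noexcept { my_free(p); }
void operator delete[](void* p, std::size_t, std::align_val_t) noexcept { my_free(p); }
```

The nothrow overloads are left to the standard library here, since their default implementations forward to the throwing forms defined above.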

viraptor · on April 24, 2023

I was going for "ignore the issue, let's just re-patch all alloc/free pointers, built-in or external, new or old" which I think would still work. (As long as anticheat doesn't freak out) It wouldn't suffer from inconsistencies, because you'd control all the calls again. Or is there something missing in this approach?

olliej · on April 25, 2023

You can't repatch all the calls. The OS/standard library provides a set of global operator new and delete implementations, and for largely historical reasons they are _required_ to allow processes to override them with their own implementations.

Now when a program does decide that they're going to override the global operator new and delete functions the standard library is required to default to them instead. So generally the standard library exposes them as weak symbols, and the OS and stdlib links to them by symbol name. That way on program launch the program's version of the operator new/delete symbols are what win. So that's how the OS and standard library are able to interact with the program despite it overriding what is ostensibly the system allocator.

So in principle the OS could simply make sure that the user provided operator new, delete, etc are always directed to the system allocator routines. The problem is that when compiling user code there's no obligation to call the user provided new, delete, etc through a symbol, and in general won't. Instead the calls will generally be compiled down to PC relative loads and branches as those are significantly faster. The net result is that while the OS _could_ force the symbols to always resolve to the system functions, things would break due to the user code still using the user specified functions, but those functions then would not be compatible and the result would be sadness. Hence the user defined operator implementations have to win.

The problem is what happens when not all of the operators are overloaded. This historically hasn't been a problem: there's the plain and [] variants, which can be overloaded independently, and the nothrow variants of each, which have in practice not been an issue because the way those are implemented by default is essentially

    try { return ::operator new(size); } catch(...) { return nullptr; }

So it just calls directly into the operator new that people override.

The problem with operator new(size_t, align_val_t) is that depending on your compiler flags you will get different versions of ::operator new being called, and because of the alignment requirements the aligned operator new can't just forward to the default new implementation. So its introduction is the first time that a program's failure to implement the full suite of operators results in an actual runtime error rather than a minor inefficiency.

The reality is that the minimally correct solution is for all programs and libraries that overload the global allocation operators to override all of them. The better solution is for these programs and libraries to stop overloading the global allocators.

Many years ago (talking >a decade at this point) when webkit first adopted a non-default allocator it overloaded the global operators. Perhaps unsurprisingly this caused issues, and now webkit (and presumably blink) do the correct thing: there's a standard base class (FastAllocated or something) that defines operator new, delete, and the [] variants, and using that as a base class results in the non-default allocator.
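A minimal sketch of that base-class pattern, for illustration: the class name follows the "FastAllocated or something" recollection above and is not WebKit's actual code. The malloc/free calls are placeholders for whatever custom allocator a project uses, and the counter exists only to make the routing observable.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical base class: types that derive from it get class-scoped
// operator new/delete, while the global operators stay untouched.
class FastAllocated {
public:
    static void* operator new(std::size_t size) { return allocate(size); }
    static void* operator new[](std::size_t size) { return allocate(size); }
    static void operator delete(void* p) noexcept { deallocate(p); }
    static void operator delete[](void* p) noexcept { deallocate(p); }

    // Number of live blocks obtained through this class (demo only).
    static std::size_t& live_count() {
        static std::size_t n = 0;
        return n;
    }

private:
    static void* allocate(std::size_t size) {
        void* p = std::malloc(size);  // placeholder for the custom allocator
        if (p == nullptr) {
            throw std::bad_alloc();
        }
        ++live_count();
        return p;
    }
    static void deallocate(void* p) noexcept {
        if (p == nullptr) {
            return;
        }
        --live_count();
        std::free(p);  // placeholder for the custom allocator's free
    }
};

// Opting in is just a matter of inheritance.
struct Particle : FastAllocated {
    float x = 0.0f;
    float y = 0.0f;
    float z = 0.0f;
};
```

Because nothing global is replaced, code in other libraries in the same process keeps pairing its own new and delete with the system allocator.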

nemetroid · on April 25, 2023

> Haha, you've missed the issue.

That's not very nice. The root comment said nothing about making the system handle this automatically, it just described an idea for a potential fix to be applied to this particular case:

> It should be straightforward to make a little LD_PRELOAD shim to implement the new operator new on top of old overloads and thus restore proper functioning.
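
For reference, the core of such a shim might look like the sketch below. It assumes the host program overrides only the unaligned operators, and it over-allocates through them while stashing the original pointer just before the aligned block. Nothing here is taken from an existing shim, and only the throwing forms are shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// Aligned new built on top of whatever operator new(size_t) is in effect.
void* operator new(std::size_t size, std::align_val_t al) {
    const std::size_t align = static_cast<std::size_t>(al);
    assert(align != 0 && (align & (align - 1)) == 0);  // power of two
    // Room for the payload, worst-case padding, and one stashed pointer.
    // (Overflow of this sum is not checked in this sketch.)
    void* raw = ::operator new(size + align + sizeof(void*));
    std::uintptr_t start = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void*);
    std::uintptr_t aligned = (start + align - 1) & ~(static_cast<std::uintptr_t>(align) - 1);
    // Remember the pointer the underlying allocator actually returned.
    reinterpret_cast<void**>(aligned)[-1] = raw;
    return reinterpret_cast<void*>(aligned);
}

// Aligned delete recovers the stashed pointer and hands it to plain delete.
void operator delete(void* p, std::align_val_t) noexcept {
    if (p) ::operator delete(static_cast<void**>(p)[-1]);
}
void operator delete(void* p, std::size_t, std::align_val_t al) noexcept {
    ::operator delete(p, al);
}

// Array forms forward to the scalar ones.
void* operator new[](std::size_t size, std::align_val_t al) {
    return ::operator new(size, al);
}
void operator delete[](void* p, std::align_val_t al) noexcept {
    ::operator delete(p, al);
}
void operator delete[](void* p, std::size_t, std::align_val_t al) noexcept {
    ::operator delete(p, al);
}
```

This only holds together if every pointer handed out by the aligned new is later released through one of the aligned deletes. If some code frees such a pointer with plain delete or free(), the allocator receives an address it never returned.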

olliej · on April 27, 2023

> That's not very nice.

:(

It was not intended as a dismissive or derisive laugh at the author, but a laugh at the absurdity of the issue itself. Think "haha, you'd think that the reason is X, but technology is involved, and so everything is terrible" vs "haha you're dumb", which sure as heck was not my intended message.

nneonneo · on April 24, 2023

The latter suggestion assumes that there’s enough entropy in the allocation process to make this work. But that’s not guaranteed! Suppose that your allocator doesn’t pad allocations (e.g. because it uses a bitmap), and that it only guarantees 0x10 alignment. If the top of the heap happens to be unaligned with respect to your desired alignment (e.g. address ends in 0x10 when you want 0x20 alignment), you might wind up just repeatedly allocating unaligned blocks off the top of the heap forever.

This is not an easy problem to solve, unfortunately. On macOS I believe they solve this problem using the two-level namespace: symbol references include the library name, so “operator new(size_t)” from libstdc++ is distinct from “operator new(size_t)” from libtcmalloc.

Symbol versioning also seems like it should solve the problem: have the new interfaces explicitly declared with a newer ABI version (e.g. @@LIBCXX_17) and link only to those new versions from code that expects them. Of course, symbol versioning comes with its own set of nasty drawbacks, but in this case it seems like a solution that might work?

olliej · on April 24, 2023

> The latter suggestion assumes that there’s enough entropy in the allocation process to make this work. But that’s not guaranteed!

Oh absolutely, there's no guarantee it's ever aligned: the allocator could wrap an aligned allocator but include a pointer sized prefix (a la array allocations) so you would be _guaranteed_ to never be more than pointer size aligned :D

As you say versioning and namespacing is super problematic, but I'm not sure they'd even work here.

At its core the problem is that some code is compiled with the knowledge that it has aligned allocations, so it can assume alignment, and some parts are not. There are a bunch of options that ensure that the allocator is consistent, but they devolve to either ignoring the new+delete overrides, or having the aligned allocators detect the override and forward to unaligned allocators while hoping nothing depended on correct alignment.

amluto · on April 24, 2023

See my comment above. tcmalloc implements the C API as well, including aligned_alloc().

olliej · on April 25, 2023

It doesn't matter what C APIs the allocator you're using provides: if it (or you) want to override the global new and delete operators, you need to override all of them. The system implementation can't just assume that the overriding implementation happens to override and/or be compatible with C's implementation.

Libcxx (the example here) uses posix_memalign for its aligned allocation - which tcmalloc could _also_ have overridden but doesn't. Again the problem is only some of the allocation routines being overridden.

Asooka · on April 24, 2023

The C interface for aligned memory allocation is aligned_alloc(). The returned pointers are always freed with free(). So what is probably happening is that aligned new calls aligned_alloc(), and then aligned delete simply calls the regular delete, expecting to end up in free(), which by design should work with both kinds of pointers.
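
The C-level contract described here can be checked with a small sketch, assuming a C++17 standard library that provides `std::aligned_alloc`. The helper name `aligned_block_roundtrip` is made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Allocates with aligned_alloc(), checks the alignment, and releases the
// block with plain free(), which is the only deallocation function the
// C interface offers for it.
bool aligned_block_roundtrip(std::size_t alignment, std::size_t size) {
    // Alignment must be a power of two; C11 also wanted size to be a multiple of it.
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    void* p = std::aligned_alloc(alignment, size);
    if (!p) return false;
    const bool ok = reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
    std::free(p);
    return ok;
}
```

This works because malloc, aligned_alloc and free all belong to one allocator. The trouble discussed in this thread starts when the allocating and the freeing call end up in different allocators.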

I think the problem here is partly with the implementation of aligned new/delete. Since one is free to override only the old versions, the ones supplied by the standard library should make sure not to fall back to functions that may be partially overridden.

amluto · on April 24, 2023

As pure speculation, one could forward to aligned_alloc and still free with ::delete. I haven’t tested this, nor have I looked at the code.

zokier · on April 24, 2023

LD_PRELOAD would probably run afoul of VAC though?

Polycryptus · on April 24, 2023

Steam on Linux already uses LD_PRELOAD under-the-hood to load the overlay. Valve signs the overlay SO files, so they could be making an exception for Valve-signed-preloads in VAC, but it's also possible that VAC does something else to check for suspicious libraries loaded in.

Karliss · on April 24, 2023

Whole graphics drivers using LLVM in the backend has caused countless issues. The way I look at it one of the main problems is that graphic API libraries shouldn't leak symbols from implementation details like them using LLVM. They should expose only the graphics API and nothing more.

vchuravy · on April 24, 2023

Don't ask me about GNU_UNIQUE...

Due to some wonderful C++ features the dynamic linker is forced to unify symbols across shared libraries, even if those symbols have different versions.

This utterly breaks loading multiple libLLVM's except if you build the copy you care about with -no-gnu-unique (or whatever the flag was called)

I have seen wonderful things like the initializers of an already loaded libLLVM being rerun when a new one is loaded.

planede · on April 25, 2023

The presumed wonderful C++ feature is spelled __attribute__((weak)) in GNU C.

IceWreck · on April 24, 2023

Isn't this the reason why people recommend using the flatpak version of Steam ?

ChocolateGod · on April 24, 2023

Yes, especially on Fedora.

This isn't something Fedora is doing wrong, unfortunately some games build against older libraries or are built against Debian/Ubuntu and the Flatpak runtimes generally have better compatibility.

DannyBee · on April 24, 2023

Fedora 38 includes the LLVM15 libs to maintain backwards compatibility.

Why is this automatically using a new, incompatible solib, instead of a versioned solib?

AnssiH · on April 24, 2023

The LLVM dependency is in the HW-specific driver solib which is loaded by the OpenGL library, which has the same soname as before.

DannyBee · on April 25, 2023

Okay, then why does it fail when it should then still use llvm15?

The author states this is an llvm16 issue, but unless the driver was built against llvm16, it should still be loading llvm15.

If it was built against llvm16 (or loads llvm16), and doesn't work, that's not a failure of anything other than QA testing.

AndyKelley · on April 24, 2023

related: this talk that I made 2 years ago about an experiment to ship static executables on linux that could do graphics: https://www.youtube.com/watch?v=pq1XqP4-qOo

stryan · on April 24, 2023

Valve does this for a couple of their games, see a similar issue with Dota 2[0].

[0] https://github.com/ValveSoftware/Dota-2/issues/2285

exabrial · on April 24, 2023

I thought TF2 was pretty much 100% hacked... like no legit non-hackers playing except at LAN parties.

chrisdalke · on April 24, 2023

It's still got a small but vibrant community in the community-run servers (not accessed through matchmaking). These are typically hand-moderated.

themoonisachees · on April 25, 2023

I played for a few hours just yesterday and a few bots joined but people are proactive at kicking them. Community servers are also thriving.

pnpnp · on April 24, 2023

There are bots, but they’re easily avoidable in community servers.

exabrial · on April 25, 2023

holy crap, I didn't expect downvotes for asking such a simple question. I actually used to enjoy tf2, but the hacking became unbearable.

sosodev · on April 24, 2023

It's unfortunate but the Steam experience on Linux seems to be progressively getting worse (outside of Steam Deck ofc). The Steam client is often borderline unusable for Linux users. You can find many issue threads on GitHub reporting client freezes and crashes.

It seems like a big part of the issues is a lack of maintenance. TF2 would actually run better on Linux via Proton but VAC isn't enabled so you can't join the vast majority of servers.

Valve also has existing Source engine tooling that allows Linux ports to drop OpenGL entirely (dxvk-native as used by Portal 2 and L4D2) but they haven't added it to TF2... :(

zamalek · on April 24, 2023

> You can find many issue threads on GitHub reporting client freezes and crashes.

The fact that these are happening does not necessarily mean the client is getting worse. For example, it could mean that more people are installing Steam for Linux. There is no baseline to say it's getting worse, because nobody opens an issue saying "all working here."

In my experience, the only issue I have on Wayland is this: https://github.com/ValveSoftware/steam-for-linux/issues/7245 (workaround: disable animated avatars) (edit: all AMD machine)

> outside of Steam Deck ofc

There is nothing special about the Steam Deck. It's just another Linux machine.

> TF2

I don't play any Source games, but I could see TF2 having issues because it's in maintenance mode. If it is bjorked that has nothing to do with Steam.

mariusor · on April 24, 2023

> There is nothing special about the Steam Deck. It's just another Linux machine.

That's not true. It's a read-only linux on a fixed hardware platform, which is a vaaastly different beast than the myriad of hardware/software combinations that exist out there in the wild.

zamalek · on April 24, 2023

> It's a read-only linux on a fixed hardware platform

I have heard that argument about macOS a lot, and this is nothing like that. There isn't some "special sauce er... Source" that they apply to their platform. It's just GPL Linux. They may have avoided bad decisions like relying on NVIDIA for Linux gaming, but that's hardly the level of ownership that you see with other vertical integrations. If I use an AMD CPU (or Intel, which would be arguably better) and AMD GPU, there is no reason why my PC couldn't be just as "first-party" as the Steam Deck.

Wine/Proton ultimately access the GPU through DRM, that remains the same for Valve hardware or custom-built hardware. Both Steam and Wine/Proton currently render via X11 (via XWayland if necessary), on both my PC and the Steam Deck.

I feel like there is a gap of understanding how a HAL works here.

mlyle · on April 24, 2023

> It's just GPL Linux.

"Just" GPL Linux encompasses myriad library versions, kernel versions, driver versions and varied hardware.

> I feel like there is a gap of understanding how a HAL works here.

Just because you have a HAL doesn't mean that you don't get different behavior and crashes with different numbers of CPUs/concurrency or other hardware beneath. Modern GPUs are also pretty complicated beasts, and assuming that's fully abstracted is a mistake.

And this all leaves aside the myriad of other problems you can have with the ensemble of software running on the machine that interacts with the game (directly or indirectly).

Being able to test and make one restricted platform work well is a far different beast than covering the huge mass of variation users create on their own machines.

admax88qqq · on April 24, 2023

I feel like there is a gap in understanding of how commercial software deployment goes.

When you have a platform like Steam Deck, it's the platform that gets tested by QA, and the platform that most of your devs are building for every day.

mariusor · on April 24, 2023

Sure, a linux machine is made out of just a CPU and a GPU. Even if that would be the case, what about the software combinations that can exist and that the SteamDeck simplifies?

In the gamedev world I heard a lot of people not wanting to support linux because they never know which glibc version to support, which mesa version to support, which hardware GPUs to support, which graphical API to support, etc.

Cutting down that matrix (and I just mentioned the most egregious examples) to only one element is invaluable in ensuring your users have a bug free experience.

sosodev · on April 24, 2023

True, I don't have enough data to really make that claim. I can say that my own hardware hasn't changed in ~4 years and I've been using Steam for Linux since I built this machine. It's only within the last year or so that I started having major issues with the client.

> there is nothing special about the steam deck

How is first party support for the hardware and software stack "nothing special"?

> If it is bjorked that has nothing to do with Steam.

Maybe it's not directly related to the rest of my comment but it's related to the OP. I also think it's indicative of Valve's issues with Linux.

zamalek · on April 24, 2023

> How is first party support for the hardware and software stack "nothing special"?

Because the vast majority of that stack (the kernel, GPU driver, window manager, and so forth) has nothing to do with Valve. They might contribute drivers to the kernel (I'm not sure if they actually do - I would expect AMD to be doing that), but otherwise it's an Arch-based distro with the same Steam client and Proton runtime that everyone else is using.

sosodev · on April 24, 2023

Yes, the Steam Deck is using a fairly standard stack and the default client but you're missing the point. Valve directly tests the Steam Deck and prioritizes bug fixes for it. When users report issues with other setups it often takes months for the identified bug to be fixed if it ever is.

Arnavion · on April 24, 2023

I run Steam in a Docker container of Ubuntu 22.04 for reasons like this. Also my actual system isn't polluted with 32-bit libs, Steam can't rm-rf my home directory and games can't steal files from my home directory (homedir inside the container is a separate directory on the host), and access to X and dbus is restricted (dbus socket not forwarded, X socket is from a nested Xephyr instance) so nothing can be stolen from there either.

Edit: More details in https://hackertimes.com/item?id=34634854

sosodev · on April 24, 2023

Is there a guide for this? I'd really like to isolate Steam from the rest of my system

donio · on April 25, 2023

flatpak is probably the easiest way to do this.

ho_schi · on April 24, 2023

I can’t see how native Linux support is getting worse. Linux users are good at bug reporting. Maybe some developers should care more about compatibility. And yes, especially the heterogeneous setups used by some makes support difficult.

I’m worried that Valve puts too many resources into Proton (a derivative of WINE) instead of tooling for native ports. Yes, Proton is needed to provide initial compatibility. But Proton is another layer of complexity (more bugs, integration, system resources) which requires more programming. I started playing CS again after it was ported natively in 2014; it runs well and all issues with WINE were gone.

If Proton becomes too “good” we end up in a situation with a high maintenance burden for Valve. Game developers will rely on it and Valve has all the constant work. Instead game developers should treat Linux as a first-class platform for AAA titles, for which they need appropriate APIs, compatibility and tooling, just as Valve itself supports Linux as a first-class platform from HL2 to CSGO. The target shall be official support from the very first day.

Anyway. Looks like Valve has chosen a special implementation for TF2? What I miss here is a link to a bug report. Ideally opened months ago :)

pjmlp · on April 24, 2023

Game studios already know Linux distributions quite well on the server, and most AAA games on Android are basically only using the NDK, meaning ISO C and C++, OpenGL ES, Vulkan, OpenSL.

Besides that, PlayStation OS is based on FreeBSD. Even if the 3D API is different, it is just yet another backend.

They don't port them, because the QA and support aren't worth the sales, that is about it.

danbolt · on April 24, 2023

I think Valve has a financial incentive to keep Proton compatibility in a positive state, as it increases sales of the Steam Deck and encourages players to remain in their ecosystem. Or, I think it's more likely than the majority of AAA game developers having a financial incentive to maintain Linux versions of their products.

Entinel · on April 24, 2023

Devil's advocate, I use Steam on Fedora and have had 0 issues. Very rarely freezes or crashes. It's probably the most stable application I use daily.

2OEH8eoCRo0 · on April 24, 2023

I use Steam on Fedora as well and I notice a lot of jank with the Steam client (Nvidia 1080ti). Dropdown menus popping through windows, sound may or may not work for videos, freezing, etc. It's usable but it's not very pleasant.

Entinel · on April 24, 2023

Just for GPU comparison I'm on an AMD RX card so it could be an Nvidia issue which is known to be jank on Linux.

_piif · on April 25, 2023

The client seems to be very susceptible to I/O stalls and even randomly just locks up for a few seconds by itself every once in a while on my end for some reason. That in itself would be fine if it wouldn't also directly affect games launched through Steam.

_z2co · on April 24, 2023

I also use Steam on Fedora, and I've not had any issues with Stardew Valley, Factorio, Celeste, N++, Undertale, and others. I remember having a brief issue with Portal, but I was able to resolve it. Overall, I've had a good experience.

sosodev · on April 24, 2023

It seems to work perfectly for some people. I’ve regularly had issues with the client not rendering at all, freezing, and crashing on Pop_OS 22.04 LTS with an nvidia GTX 1660ti.