The OpenMP device runtime library was originally written in CUDA. I ported that to HIP for amdgpu, discovered the upstream HIP compiler wasn't quite as solid as advertised, then ported it to OpenMP with some compiler intrinsics. The languages are all essentially C++ syntax with some spurious noise obfuscating LLVM IR. The libc effort has gone with freestanding C++ based on that experience, and we've now mostly fixed the ways that goes wrong.
You might also find raw c++ for device libraries saner to deal with than cuda. In particular, you don't need to jury-rig the thing to avoid spuriously embedding the GPU code in x64 ELF objects and/or pulling the binaries apart. Though if you're feeding the same device libraries to nvcc with #ifdef around the divergence, your hands are tied.
> You might also find raw c++ for device libraries saner to deal with than cuda.
Actually, we just compile all the device libraries to LLVM bitcode and be done with it. Then we can write them using all the clang-dialect, not-nvcc-emulating, C++23 we feel like, and it'll still work when someone imports them into their c++98 CUDA project from hell. :D
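For anyone following along, here is a rough sketch of what "raw C++ for a device library" can look like. It is not taken from the OpenMP runtime or libc sources: `lane_id_x` and `scale_by_two` are hypothetical names, and the host fallback branch exists only so the file also builds and runs on a CPU. The two builtins are the clang ones for reading the x thread index on amdgcn and nvptx.

```cpp
// Freestanding device-library sketch: no headers, no CUDA/HIP keywords.
// All function names here are hypothetical, for illustration only.

#if defined(__AMDGCN__)
// clang builtin: x index of the work-item within its workgroup.
static inline unsigned lane_id_x() { return __builtin_amdgcn_workitem_id_x(); }
#elif defined(__NVPTX__)
// clang builtin: reads the PTX %tid.x special register.
static inline unsigned lane_id_x() { return __nvvm_read_ptx_sreg_tid_x(); }
#else
// Host fallback (an assumption of this sketch): behave as a single thread 0.
static inline unsigned lane_id_x() { return 0; }
#endif

// Plain function with C linkage so callers in any dialect can link against
// the resulting bitcode. Each thread doubles the element at its own index.
extern "C" void scale_by_two(float *data, unsigned n) {
  unsigned i = lane_id_x();
  if (i < n)
    data[i] *= 2.0f;
}
```

Assuming a reasonably recent clang, something along the lines of `clang++ --target=amdgcn-amd-amdhsa -mcpu=gfx90a -nogpulib -ffreestanding -emit-llvm -c` should turn this into a bitcode file; treat the exact flags as a starting point rather than a recipe, since they vary by toolchain version.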