Ollama does a nice job of looking at how much VRAM the card has and tuning the number of GPU layers it offloads. Before that, I mostly just had to guess. It's still a heuristic, but I thought that was neat.
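To make the idea concrete, the heuristic is roughly "free VRAM, minus a reserve, divided by an assumed per-layer footprint." This is just my sketch of the concept, not Ollama's actual code; the per-layer size and reserve numbers are made up:

```cpp
// Rough sketch (not Ollama's implementation): query free VRAM and derive
// n_gpu_layers from an assumed per-layer footprint for the model.
#include <cuda_runtime.h>
#include <algorithm>
#include <cstdio>

int estimate_gpu_layers(size_t bytes_per_layer, int total_layers, size_t reserve_bytes) {
    size_t free_vram = 0, total_vram = 0;
    if (cudaMemGetInfo(&free_vram, &total_vram) != cudaSuccess) {
        return 0; // no usable GPU: keep everything on the CPU
    }
    size_t usable = free_vram > reserve_bytes ? free_vram - reserve_bytes : 0;
    int layers = static_cast<int>(usable / bytes_per_layer);
    return std::min(layers, total_layers);
}

int main() {
    // Hypothetical numbers: ~180 MiB per layer, 32 layers, and ~1 GiB held
    // back for the KV cache and scratch buffers.
    int n_gpu_layers = estimate_gpu_layers(180ull << 20, 32, 1ull << 30);
    std::printf("offloading %d layers\n", n_gpu_layers);
    return 0;
}
```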
I'm just using llama.cpp as a native library now, mainly for the direct access to more of llama's data structures, and because I have a somewhat unique sampler setup.
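For context, the native-library route looks roughly like this. The names are from a recent llama.h and the API moves fast, so check your checkout; the min-p/temperature chain is just an illustration, not my actual sampler:

```cpp
// Minimal sketch of using llama.cpp's C API directly: load a model with a
// chosen n_gpu_layers and build a custom sampler chain.
#include "llama.h"

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 24; // e.g. whatever the VRAM heuristic above decided

    llama_model * model = llama_load_model_from_file("model.gguf", mparams);
    if (!model) return 1;

    llama_context_params cparams = llama_context_default_params();
    llama_context * ctx = llama_new_context_with_model(model, cparams);

    // A custom sampler chain: min-p, then temperature, then the final draw.
    llama_sampler * chain = llama_sampler_chain_init(llama_sampler_chain_default_params());
    llama_sampler_chain_add(chain, llama_sampler_init_min_p(0.05f, 1));
    llama_sampler_chain_add(chain, llama_sampler_init_temp(0.7f));
    llama_sampler_chain_add(chain, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));

    // ... tokenize, llama_decode(), llama_sampler_sample(chain, ctx, -1), etc.

    llama_sampler_free(chain);
    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```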
- what's special about the memory allocation, and how might it help me?
- what are you now using instead of ollama?