It did somewhat reduce the total memory used. Now you can load the 30B model while only using ~20 GB of RAM, which is about the aggregate size of the 4-bit quantized weight files for that model. The real win is that you can kill the main inference binary and try another prompt, and it will start doing inference basically immediately instead of spending 10-15 seconds loading all the weights into RAM each time.
It's neither a memory saving nor a speed-up, really. The advantage of mmap is that you can treat a file on disk as a block of memory, so pages from it can be loaded (or unloaded) as necessary instead of doing one big upfront load into RAM. The benefit is that you can work with data that's bigger than your physical RAM, because the kernel can evict pages and read them back from disk when needed. Another benefit could be that if only a small part of the data is needed to compute something, the OS will automatically load only those pages, but it's unclear to me whether this is the case with LLaMA.
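To make the "treat a file as a block of memory" idea concrete, here's a minimal Python sketch using the standard `mmap` module. The names `make_demo_file` and `read_slice_via_mmap` are made up for illustration, and the temp file is just a stand-in for a weights file; this is not how llama.cpp itself does it.

```python
import mmap
import os
import tempfile

def make_demo_file(size=1 << 20):
    """Write a throwaway file with a repeating 0..255 byte pattern.

    Hypothetical stand-in for a large weights file.
    """
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(bytes(range(256)) * (size // 256))
    return path

def read_slice_via_mmap(path, offset, length):
    """Return `length` bytes starting at `offset` without read()-ing the whole file."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ makes the mapping read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Slicing copies only the requested bytes. The kernel faults in
            # the pages backing this range on demand.
            return mm[offset:offset + length]
```

Calling `read_slice_via_mmap(path, 2565, 4)` on the demo file touches only the page holding that offset (plus whatever the OS decides to read ahead), not the full 1 MiB.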
It's still somewhat faster if you benchmark it. I assume the OS is doing good enough prefetching in the mmap case to mostly hide the loads from disk. So it's not just hiding the initial load of 30gb from disk.
Obviously, if you're swapping because you don't have enough memory to hold the model in RAM, the mmap version is going to be much faster, since the kernel doesn't need to write anything out to swap: it can just discard the page and re-read it from the file if it's needed again later.
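A rough way to see the "discard and re-read" behaviour from user space is sketched below, assuming Python 3.8+ on a platform that exposes `mmap.madvise` and `MADV_DONTNEED` (the code skips the hint where they are missing). `make_scratch_file` and `drop_and_reread` are hypothetical names, and `madvise` is only advice to the kernel, so treat this as an illustration rather than a guarantee of what gets evicted.

```python
import mmap
import os
import tempfile

def make_scratch_file(size=1 << 16):
    """Create a throwaway file of random bytes (stand-in for model weights)."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(os.urandom(size))
    return path

def drop_and_reread(path):
    """Read a mapped range, advise the kernel to drop it, then read it again.

    Returns True if the bytes are identical both times.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            before = mm[:4096]
            # madvise() and MADV_DONTNEED are not available everywhere.
            if hasattr(mm, "madvise") and hasattr(mmap, "MADV_DONTNEED"):
                # The pages are clean and file-backed, so nothing has to be
                # written to swap; the file itself is the backing store.
                mm.madvise(mmap.MADV_DONTNEED)
            # Touching the range again just faults the data back in.
            after = mm[:4096]
            return before == after
```

The data comes back unchanged either way, which is the point: a read-only file mapping never needs swap space because the original file can always be read again.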
> So it's not just hiding the initial load of 30gb from disk.
The issue is typically that the initial load involves some sort of transformation: parsing, instantiating structures, etc. If you can arrange for the data to be stored on disk in the same format you need in memory, then you can skip that entire transformation phase.
I don't know if that's what's been done with llama.cpp though.
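As a small sketch of the "store it in the format you need in memory" idea: the Python below writes float32 values in native byte order and later views the mapped file directly as floats, with no parse step. The names `write_raw_floats` and `sum_mapped_floats` are hypothetical, and this says nothing about the actual llama.cpp file format.

```python
import array
import mmap
import os
import tempfile

def write_raw_floats(values):
    """Dump float32 values to a temp file exactly as they sit in memory."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        # Native byte order and size, no header, no text encoding.
        array.array("f", values).tofile(f)
    return path

def sum_mapped_floats(path):
    """Sum the float32 values by viewing the mapping in place."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Reinterpret the mapped bytes as floats without copying or parsing.
            with memoryview(mm) as raw:
                with raw.cast("f") as floats:
                    return sum(floats)
```

The trade-off is portability: a file written this way is only valid on machines with the same float size and byte order, which is why formats that skip the transformation step usually pin those details down explicitly.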
It is a speedup for me. When I run llama.cpp from the CLI the first time, it takes a very long time to load the model into memory. If the program exits or I stop it with Ctrl+C and start it again, it starts almost instantly.
That's down to caching. If you used your system to do something else for a while, you'd find those pages evicted and the performance back down to Earth. That's one of the things that makes mmap so useful, though. The system can take advantage of access patterns to dramatically improve performance.
Yes, makes sense. And it's great. Though honestly I'm not sure why it takes minutes to load a 23 GB model into RAM; it doesn't feel proportional to the load time of the smaller models.
Sort of, but without the duplication and initial wait to load. A traditional fat in-memory app would do this:
file (DISK) -> process active pages (RAM) -> paged out virtual memory if saturated (elsewhere on DISK)
Using mmap typically goes something like:
file (DISK) -> process active pages (RAM) -> released and cached (RAM) -> uncached if evicted (back to same place on DISK) OR back to process active pages from cache (RAM)
For the cost of fixing up some process page tables, the physical memory pages necessary can be brought back from the cache rather than read from disk. It's an orders-of-magnitude performance savings.
more like memory mis-reporting, since when you mmap a file in, IIRC that counts against page cache rather than memory usage (you can evict the page without causing write IO so the memory isn't "used")