I lost track since things move so quickly. Was there still memory savings just n...

eigenvalue · on April 3, 2023

It did somewhat reduce the total memory used. Now you can load the 30B model while only using ~20 GB of RAM, which is about the aggregate size of the 4-bit quantized weight files for that model. The real win is that you can kill the main inference binary and try another prompt, and it will start doing inference basically immediately instead of spending 10-15 seconds loading all the weights into RAM each time.

chpatrick · on April 3, 2023

It's neither a memory saving nor a speed-up, really. The advantage of mmap is that you can treat a file on disk as a block of memory, so pages from it can be loaded (or unloaded) as necessary instead of one big upfront load into RAM. The benefit is that you can work with data that's bigger than your physical RAM because the kernel can swap it back out to disk if needed. Another benefit could be that if only a small part of the data is needed to compute something, the OS will automatically load only those pages, but it's unclear to me whether this is the case with LLaMA.
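To make the "treat a file as a block of memory" point concrete, here is a minimal sketch using Python's standard mmap module. It is not how llama.cpp does it (that is C/C++); the file, its size, and the helper name read_slice_via_mmap are made up for illustration. The point is only that slicing the mapping touches the pages you ask for rather than reading the whole file up front.

```python
import mmap
import os
import tempfile


def read_slice_via_mmap(path, offset, length):
    """Return `length` bytes at `offset` without reading the whole file first."""
    with open(path, "rb") as f:
        # Length 0 maps the entire file; pages are faulted in only when touched.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[offset:offset + length]


# Hypothetical stand-in for a weights file: 16 MiB with a marker at the very end.
size = 16 * 1024 * 1024
marker = b"WEIGHTS"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.seek(size - len(marker))
    tmp.write(marker)
    demo_path = tmp.name

# Only the pages backing the last few bytes need to be resident for this read.
tail = read_slice_via_mmap(demo_path, size - len(marker), len(marker))
os.remove(demo_path)
```

Whether this helps for LLaMA depends on how much of the file each token's computation actually touches, which is the open question raised above.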

iforgotpassword · on April 3, 2023

It's still somewhat faster if you benchmark it. I assume the OS is doing good enough prefetching in the mmap case to mostly hide the loads from disk. So it's not just hiding the initial load of 30gb from disk.

Obviously if you're swapping because you don't have enough memory to hold the model in RAM, the mmap version is going to be much faster, since you don't need to swap anything out to disk but just discard the page and re-read from disk if you need it again later.

antonvs · on April 3, 2023

> So it's not just hiding the initial load of 30gb from disk.

The issue is typically that that initial load involves some sort of transformation - parsing, instantiating structures, etc. If you can arrange it so that the data is stored in the format you actually need it in memory, then you can skip that entire transformation phase.

I don’t know if that’s what’s been done with llama.cpp though.
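A rough sketch of that idea, assuming a toy file of raw native 32-bit floats (the helper names write_floats and sum_floats_mapped are hypothetical, and this is not llama.cpp's actual file format): because the bytes on disk already have the in-memory layout, the reader can view the mapping as floats directly, with no parsing or copying step.

```python
import array
import mmap
import os
import tempfile


def write_floats(path, values):
    """Store values as raw native 32-bit floats, the same layout used in memory."""
    with open(path, "wb") as f:
        array.array("f", values).tofile(f)


def sum_floats_mapped(path):
    """Sum the floats straight from the mapping; nothing is parsed or copied."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # Views must be released before the mapping closes, hence the nesting.
            with memoryview(mapped) as raw:
                with raw.cast("f") as floats:
                    return sum(floats)


# Values chosen to be exactly representable as 32-bit floats.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    floats_path = tmp.name
write_floats(floats_path, [0.5, 1.5, 2.0])
total = sum_floats_mapped(floats_path)
os.remove(floats_path)
```

A format that needs a transformation on load (text parsing, decompression, building per-tensor objects) cannot be used this way, which is why the on-disk layout matters.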

simion314 · on April 3, 2023

It is a speedup for me. When I run llama.cpp from the CLI for the first time, it takes a very long time to load the model into memory. If the program exits or I stop it with Ctrl+C and start it again, it starts almost instantly.

chaboud · on April 3, 2023

That's down to caching. If you used your system to do something else for a while, you'd find those pages evicted and the performance back down to Earth. That's one of the things that makes mmap so useful, though. The system can take advantage of access patterns to dramatically improve performance.

simion314 · on April 3, 2023

Yes, that makes sense, and it's great. Though honestly I'm not sure why it takes minutes to load a 23 GB model into RAM; it doesn't feel proportional to the load time of the smaller models.

toxik · on April 3, 2023

Regarding your last point: No, you need all of the weights all of the time.

Edit: except embedding weights but those are not the problem.

CyberDildonics · on April 3, 2023

They weren't asking about mmap, they were asking about the program itself.

rovr138 · on April 3, 2023

It's basically paging to disk.

Not necessarily memory savings, but the improvement here is that it will run on computers with less RAM because it can page to disk.

Not necessarily the number reported (since you do need to load chunks into RAM), but still lower.

detrites · on April 3, 2023

If that's the case then part of this may be the different interpretations of "memory".

One person's "paging the same memory requirement from disk to RAM" can be someone else's "requiring less memory/RAM".

chaboud · on April 3, 2023

Sort of, but without the duplication and initial wait to load. A traditional fat in-memory app would do this:

file (DISK) to process active pages (RAM) to paged out virtual memory if saturated (elsewhere on DISK)

Using mmap typically goes something like:

file (DISK) to process active pages (RAM) to released and cached (RAM) to uncached if evicted (back to same place on DISK) OR back to process active pages from cache (RAM)

For the cost of fixing up some process page tables, the physical memory pages necessary can be brought back from the cache rather than read from disk. It's an orders-of-magnitude performance savings.

hnav · on April 3, 2023

More like memory misreporting, since when you mmap a file in, IIRC that counts against page cache rather than memory usage (you can evict the page without causing write IO, so the memory isn't "used").

astrange · on April 3, 2023

It being safe to evict is why it's correct to not report it as "memory usage".

It's part of the program's working set but measuring that is a completely different story.