A similar situation happened to me too: "and 24GB RAM, this took 20 seconds to start up. After posing the question "What is the most common way of transportation in Amsterdam?", the Vicuna model began to generate its response word by word, taking *15 minutes* to complete the task"
I'm running Vicuna on a machine with a 6GB Nvidia card using llama.cpp. I get 300ms per token, which is totally usable. I tried to tune it using the -ngl switch but never got much better than that, and that was fine for interacting with it. I'm blown away by the performance increases on consumer hardware in the last few months.
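For anyone unfamiliar, -ngl (--n-gpu-layers) tells llama.cpp how many transformer layers to offload to the GPU; with only 6GB of VRAM you can fit part of the model and run the rest on CPU. A rough sketch of the kind of invocation I mean (binary name, model filename, and layer count are placeholders; adjust for your build and card):

```shell
# -m:   path to the quantized model file (placeholder name)
# -ngl: layers offloaded to the GPU; lower this if you hit out-of-memory
# -t:   CPU threads for the layers that stay on the CPU
# -n:   max tokens to generate
./main -m ./models/vicuna-7b-q4_0.gguf -ngl 20 -t 8 -n 256 \
  -p "What is the most common way of transportation in Amsterdam?"
```

The usual approach is to raise -ngl until the GPU runs out of memory, then back off a step; past that point the remaining time is dominated by the CPU layers, which matches my experience of diminishing returns.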
I tried several open LMs to test what they can do. Performance is very poor on a PC with a GPU, and the quality is far from good enough for commercial tasks. But for playing around and chatting like with a friend ("Hello, what do you think about cats?"), they can be OK.
Time is money, and OpenAI is pennies.