Depends on the model size. A model like GPT-3, with hundreds of billions of parameters, can do few-shot learning. You'll still pay for every token processed, and response times grow at least linearly with the size of your input.
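To make "few-shot" concrete: you just stuff a couple of worked examples into the prompt ahead of the real input. A rough sketch below (the sentiment task and model name are purely illustrative, and this uses the pre-1.0 openai Python client style). It also shows where the cost/latency hit comes from: every example is sent, billed, and processed on every single request.

    import openai

    # a few-shot prompt: worked examples followed by the new input.
    # all of this counts as input tokens on every request.
    prompt = """Classify the sentiment of each review.

    Review: The battery died after two days.
    Sentiment: negative

    Review: Setup took five minutes and it just works.
    Sentiment: positive

    Review: The screen is gorgeous but the speakers are tinny.
    Sentiment:"""

    response = openai.Completion.create(
        model="text-davinci-003",  # illustrative; any large completion model
        prompt=prompt,
        max_tokens=5,
        temperature=0,
    )
    print(response.choices[0].text.strip())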
Fine-tuning can get you similar results on smaller / faster models. The downside is you have to craft the dataset in the right way. There are trade-offs to both approaches but fwiw, I don't think Alpaca-7b can do few-shot learning.
Almost. If your dataset contains questions and answers about your own project's documentation, then yes. The UX around how to prompt a fine-tuned model depends on the format of the dataset it was trained on.
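Alpaca is a concrete example: it was trained on instruction/input/output records, so you get the best results by wrapping your question in the same template it saw during training. Roughly (the question itself is made up):

    # the prompt template Stanford Alpaca uses for instructions with no extra input
    ALPACA_TEMPLATE = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Response:\n"
    )

    prompt = ALPACA_TEMPLATE.format(
        instruction="How do I configure logging in myproject?"  # hypothetical question
    )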
One way to do this is to pass your documentation to a larger model (GPT-3.5 or an OSS equivalent) and have it generate the questions/answers. You can then use that dataset to fine-tune something like LLaMA to get conversational / relevant answers.
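A rough sketch of that pipeline, assuming `docs` is a list of documentation sections you've already split up (the prompt wording, the three-pairs-per-section choice, and the output filename are all arbitrary; this again uses the pre-1.0 openai client):

    import json
    import openai

    docs = ["...section 1 text...", "...section 2 text..."]  # your documentation chunks

    dataset = []
    for section in docs:
        # ask the larger model to invent Q/A pairs grounded in this section
        completion = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{
                "role": "user",
                "content": "Write 3 question/answer pairs about the documentation "
                           "below. Output one JSON object per line with 'question' "
                           "and 'answer' keys.\n\n" + section,
            }],
            temperature=0.7,
        )
        for line in completion.choices[0].message.content.splitlines():
            line = line.strip()
            try:
                pair = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip anything the model didn't format as asked
            if "question" not in pair or "answer" not in pair:
                continue
            # Alpaca-style record, ready for instruction fine-tuning
            dataset.append({
                "instruction": pair["question"],
                "input": "",
                "output": pair["answer"],
            })

    with open("finetune_data.json", "w") as f:
        json.dump(dataset, f, indent=2)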
to my understanding, fine-tuning is slow and would be a pain to keep updated. embeddings seem to be the way to go. i don't understand it well enough, but it seems that with the langchain framework you can create embeddings of your own data and submit them to the GPT API, and i believe embeddings work on a similar principle in llama. at least i did it with diffusers in stable diffusion.
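for what it's worth, the embeddings approach is retrieval rather than training: you embed your docs once, find the chunks closest to the question, and paste them into the prompt. a bare-bones version of what langchain automates (doc chunks and model names are just placeholders, pre-1.0 openai client):

    import numpy as np
    import openai

    def embed(text):
        # one vector per piece of text from the OpenAI embeddings endpoint
        resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
        return np.array(resp["data"][0]["embedding"])

    # pretend these are chunks of your documentation
    chunks = [
        "Install with `pip install mytool`.",
        "Configuration lives in mytool.yaml.",
        "Run `mytool serve` to start the server.",
    ]
    chunk_vectors = [embed(c) for c in chunks]

    question = "How do I install it?"
    q_vec = embed(question)

    # cosine similarity picks the most relevant chunk
    sims = [float(np.dot(q_vec, v) / (np.linalg.norm(q_vec) * np.linalg.norm(v)))
            for v in chunk_vectors]
    best = chunks[int(np.argmax(sims))]

    # stuff the retrieved chunk into the prompt; no fine-tuning involved
    answer = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": f"Answer using this context:\n{best}\n\nQuestion: {question}"}],
    )
    print(answer.choices[0].message.content)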
Can you comment on the '8bit version' from above? Does that mean these parameters are uint8s (converted from the original float16 params)? Looking in your pytorch code I see some float16 declarations.
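To clarify what I mean, the usual 8-bit setup I've seen (via bitsandbytes) stores the big linear-layer weights as int8 while embeddings, norms, and some activations stay in float16, which would explain float16 declarations showing up alongside 8-bit weights. No idea if that's what you're doing here, but something like:

    from transformers import AutoModelForCausalLM

    # int8 weights for the big matmuls, float16 for the rest;
    # requires the bitsandbytes package to be installed
    model = AutoModelForCausalLM.from_pretrained(
        "decapoda-research/llama-7b-hf",  # illustrative checkpoint name
        load_in_8bit=True,
        device_map="auto",
    )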
I've been running alpaca.cpp 13B locally and your 7B model performs much better than it does. I had assumed this was because alpaca.cpp converts the weights from float16 down to 4 bits, but is there some other fine-tuning you're doing that might also account for the better performance of chatLLaMA over alpaca.cpp?