It looks like the library in Rust is using `tract-onnx` to do the inference: htt...

ikhatri · on June 7, 2023

You can use the ONNX Runtime on CPU from Python or C++ too. It doesn't have to be Rust. And if you want GPU support, you can even run models saved in the ONNX format on Nvidia GPUs with the TensorRT runtime.
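For what it's worth, here is a rough sketch of how that looks from Python with the `onnxruntime` package. The provider names are the ones ONNX Runtime uses for TensorRT, CUDA and CPU. The helper names `pick_providers` and `load_session` are made up for illustration, and the model path is whatever you exported.

```python
def pick_providers(available,
                   preferred=("TensorrtExecutionProvider",
                              "CUDAExecutionProvider",
                              "CPUExecutionProvider")):
    """Return the preferred execution providers that are actually available,
    keeping the preference order (fastest first, CPU as the fallback)."""
    chosen = [p for p in preferred if p in available]
    if not chosen:
        raise RuntimeError(f"none of {preferred} found in {list(available)}")
    return chosen


def load_session(model_path):
    """Open an ONNX model with the best provider this install supports.

    Hypothetical helper: needs the third-party `onnxruntime` package.
    """
    import onnxruntime as ort  # imported lazily so pick_providers works without it

    providers = pick_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)
```

With a session in hand, inference is roughly `session.run(None, {session.get_inputs()[0].name: x})`, where `x` is a NumPy array with the shape and dtype the model expects.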

Honestly, while ggml is super cool, it started as a hobby project and you probably shouldn't use it in production. ONNX has been the de facto standard for ML inference for years. What it's missing (compared to ggml) is 2-6 bit inference, which is helpful for large-scale transformers on edge devices (and is what helped ggml gain adoption so fast).
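To give a feel for what low-bit inference means, here is a toy sketch of symmetric 4-bit quantization of one block of weights. This is not ggml's actual storage format or kernel, just the general idea: keep one float scale per block plus a small integer per weight, and multiply back at inference time. The function names are made up.

```python
def quantize_block_4bit(values):
    """Quantize one block of floats to signed 4-bit integers in [-8, 7].

    Returns (scale, ints). Toy scheme: one shared scale per block,
    chosen so the largest magnitude maps to 7.
    """
    amax = max(abs(v) for v in values)
    if amax == 0:
        return 0.0, [0] * len(values)
    scale = amax / 7.0
    ints = [max(-8, min(7, round(v / scale))) for v in values]
    return scale, ints


def dequantize_block(scale, ints):
    """Recover approximate floats from the scale and the 4-bit integers."""
    return [scale * q for q in ints]


weights = [0.0, 0.5, -1.0, 0.25]
scale, ints = quantize_block_4bit(weights)
approx = dequantize_block(scale, ints)
```

Each weight now needs 4 bits instead of 32, at the cost of a rounding error of at most half a scale step per weight.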

touisteur · on June 7, 2023

Intel OpenVINO is also quite punchy for CPU inference.

ikhatri · on June 7, 2023

Yeah, I've heard of it but never used it. Looks like they have a backend/runtime for ONNX models as well (https://pypi.org/project/onnxruntime-openvino/). Neat!

ONNX really is the universal format. If you can get your model exported to ONNX, running it on various platforms becomes much easier.*

*as long as every hardware platform supports the ops you use in your network and you're not doing anything too fancy/custom :P
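One way to check that caveat up front is to list the op types in the exported graph and diff them against what your target backend claims to support. A minimal sketch, assuming the third-party `onnx` package. The supported-ops set is something you would have to collect yourself from the backend's documentation, and the function names here are made up.

```python
def ops_in_model(model_path):
    """List the op type of every node in the top-level graph of an ONNX file.

    Hypothetical helper: needs the third-party `onnx` package, and does not
    descend into subgraphs (e.g. the bodies of If or Loop nodes).
    """
    import onnx  # imported lazily so unsupported_ops works without it

    model = onnx.load(model_path)
    return [node.op_type for node in model.graph.node]


def unsupported_ops(model_ops, supported_ops):
    """Return the sorted op types the model uses that the backend lacks."""
    return sorted(set(model_ops) - set(supported_ops))


# Example with made-up data: a custom op the backend does not know about.
missing = unsupported_ops(["Conv", "Relu", "MyCustomOp", "Conv"], {"Conv", "Relu"})
```

An empty result does not guarantee the model will run (opset versions and attribute support can still differ), but a non-empty one tells you early where the trouble will be.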

touisteur · on June 7, 2023

Yeah, I've only used it with networks in ONNX format (converted from TensorFlow or torch). I was looking for high-performance, low-latency / real-time inference, and the C and C++ APIs for OpenVINO are quite OK if you spend some time playing with them. I hope Intel keeps investing in it...

Edit: if you go through the ONNX intermediate format, you often need to perform some 'network surgery', both to clean up conversion cruft and to remove training-only stuff left in the network...
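To illustrate the kind of surgery meant here, a toy sketch that strips pass-through nodes (Dropout, Identity) from a graph and rewires their consumers. It works on plain dicts rather than real ONNX protobufs to stay self-contained; on a real model you would do the equivalent edits on `model.graph.node` with the `onnx` package. It assumes the nodes are in topological order and that only the first output of a removed node is used (so no one consumes Dropout's mask output).

```python
def strip_passthrough_nodes(nodes, graph_outputs, drop_ops=("Dropout", "Identity")):
    """Remove pass-through nodes and reconnect their consumers.

    nodes: list of {"op_type": str, "inputs": [str], "outputs": [str]},
           assumed to be in topological order.
    Returns (kept_nodes, new_graph_outputs).
    """
    rename = {}  # tensor name produced by a removed node -> its replacement
    kept = []
    for node in nodes:
        # Rewire inputs that pointed at an already-removed node.
        inputs = [rename.get(name, name) for name in node["inputs"]]
        if node["op_type"] in drop_ops:
            # At inference time the first output equals the first input.
            rename[node["outputs"][0]] = inputs[0]
            continue
        kept.append({**node, "inputs": inputs})
    new_outputs = [rename.get(name, name) for name in graph_outputs]
    return kept, new_outputs


graph = [
    {"op_type": "Gemm", "inputs": ["x", "w"], "outputs": ["h"]},
    {"op_type": "Dropout", "inputs": ["h"], "outputs": ["h_drop"]},
    {"op_type": "Relu", "inputs": ["h_drop"], "outputs": ["y"]},
]
cleaned, outputs = strip_passthrough_nodes(graph, ["y"])
```

After this, the Relu reads `h` directly and the Dropout node is gone. Note that if a removed node produced a graph output, this sketch renames that output to the upstream tensor, which you may not want in a real model.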