Running the #LLM #Llama 2 with #Rust + #Wasm on the #edge: https://www.secondstate.io/articles/fast-llm-inference/
It’s using #WasmEdge, a super lightweight runtime that can run on a wide range of hardware, and even directly as a #Docker #container runtime.
It supports the #WASI NN API, so it can run inference modules efficiently. The app is compiled only once, to #WASM, and can run anywhere WasmEdge + GGML are available.
Expect 25 tokens per second on a low-end M2 MacBook, 50 tokens per second on an Nvidia A10G. So pretty big "edge" devices.
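
A rough sketch of what running it looks like, assuming WasmEdge is installed with its WASI-NN GGML plugin, and that the `llama-chat.wasm` app and a quantized Llama 2 GGUF model (filenames here follow the Second State examples and are assumptions; adjust to your setup) sit in the working directory:

```shell
# Run the portable Wasm inference app with WasmEdge.
# --dir .:. maps the current directory into the Wasm sandbox so the app can read files;
# --nn-preload hands the GGUF model to the WASI-NN GGML backend under the name "default".
wasmedge --dir .:. \
  --nn-preload default:GGML:AUTO:llama-2-7b-chat.Q5_K_M.gguf \
  llama-chat.wasm
```

The same `.wasm` binary runs unchanged on a MacBook or an Nvidia box; only the WasmEdge runtime and its GGML plugin differ per platform.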

