A user has built their own offline personal AI assistant, named "Ghost"! The system uses RWKV-7 plus a vector database, runs efficiently on CPU, and handles the "context forgetting" problem better than Llama 3. Ghost can index a document almost instantly (e.g. a 50-page PDF) and recall details weeks later without re-reading it. Fully private and offline. Considering releasing it as an app!
#AI #LocalAI #OfflineAI #RWKV #PersonalAssistant

https://www.reddit.com/r/LocalLLaMA/comments/1

So with a local #SSM (State Space Model, e.g. #Mamba or #RWKV) I can snapshot contexts.

I've now got a general-purpose pipe/filter that can feed and read from any saved context, for sessions that are either stable and static, or dynamic (by re-saving the checkpoint).

Any of which can become an always ready CLI filter I can pipe stuff through.

*Could* make a local SSM model useful by providing instantly ready, small, named natural-language filters that work in the shell just like grep, awk, etc.
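A minimal sketch of what such a named filter could look like, assuming llama-cpp-python's `load_state()` and a checkpoint primed earlier with the filter's instructions (the model path, state filename, and prompt format below are placeholders):

```python
#!/usr/bin/env python3
# nlfilter.py -- pipe text through a named, pre-primed natural-language filter.
import pickle
import sys

def build_prompt(text: str) -> str:
    # The saved state has already "read" the filter's instructions,
    # so the runtime prompt only needs to carry the piped input.
    return f"Input:\n{text}\nOutput:\n"

def main():
    from llama_cpp import Llama
    name = sys.argv[1]  # e.g. "summarize"
    llm = Llama(model_path="rwkv7-7.2b-q8_0.gguf", n_gpu_layers=-1)  # placeholder path
    with open(f"{name}.state", "rb") as f:  # checkpoint made earlier via save_state()
        llm.load_state(pickle.load(f))
    out = llm(build_prompt(sys.stdin.read()), max_tokens=512)
    sys.stdout.write(out["choices"][0]["text"])

# To run for real:  cat notes.txt | python nlfilter.py summarize
# (requires a local GGUF model, so main() is left un-invoked here)
```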

Useful?

Don't know yet.

It's a useless analogy, but I can't help thinking of ANN weights as equivalent to synapses. I figure that puts the model that's now taken up long-term residence in my laptop's VRAM (7.2B #RWKV) at somewhere around the complexity of at least a small rodent.

On a related note: I *REALLY* want a wired USB mouse that is also a microphone, muted in normal orientation, but which un-mutes when rotated into speaking position as demonstrated by Scottie.

Responds in Majel Barrett Roddenberry's voice.

I'm just at the stage of experimenting and building tools to (hopefully) build tools with, but RWKV7 with "diegetic" prompting, using defined voice roles in a narrative transcript as priming, seems quite promising.
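One hypothetical rendering of that transcript-style priming: role definitions up front, then a transcript the model continues in-character. The exact format here is my own invention, not the author's; whatever wording works best for RWKV7 is an empirical question.

```python
def diegetic_prompt(roles, transcript, next_speaker):
    """Render role definitions plus a narrative transcript as priming text.

    roles: list of (name, description); transcript: list of (speaker, utterance).
    The model is expected to continue the transcript as `next_speaker`.
    """
    header = "\n".join(f"{name}: {description}" for name, description in roles)
    lines = "\n".join(f"{speaker}: {utterance}" for speaker, utterance in transcript)
    return f"Cast:\n{header}\n\nTranscript:\n{lines}\n{next_speaker}:"

# Example:
prompt = diegetic_prompt(
    roles=[("Computer", "the ship's calm, precise computer"),
           ("User", "a curious engineer")],
    transcript=[("User", "Computer, summarize yesterday's log.")],
    next_speaker="Computer")
```

Feeding the priming text once and snapshotting the state would then give a reusable "character" checkpoint.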

A saved context checkpoint weighs in at 34MB for this 7.2B model: https://huggingface.co/BlinkDL/rwkv7-g1/tree/main
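A sketch of how such a checkpoint could be written to and read from disk, assuming llama-cpp-python's `save_state()`/`load_state()` (the priming file and model path are placeholders):

```python
import pickle

def save_checkpoint(llm, path):
    # save_state() snapshots the model's context state after the tokens
    # evaluated so far; pickling it gives the ~34MB file on disk.
    with open(path, "wb") as f:
        pickle.dump(llm.save_state(), f)

def load_checkpoint(llm, path):
    with open(path, "rb") as f:
        llm.load_state(pickle.load(f))

# To prime and snapshot for real (requires a local GGUF model):
#   from llama_cpp import Llama
#   llm = Llama(model_path="rwkv7-g1-7.2b-q8_0.gguf", n_gpu_layers=-1)
#   llm.eval(llm.tokenize(open("priming.txt", "rb").read()))
#   save_checkpoint(llm, "primed.state")
```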

I'm running with a local llama.cpp build + Python llama_cpp wrapper. Inference is blazing fast on my 8GB laptop GPU.

TCP service for the model with /save /load commands for context checkpoints. #RWKV
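A sketch of such a TCP service, assuming llama-cpp-python's `save_state()`/`load_state()`; the line protocol ("/save NAME", "/load NAME", anything else is a prompt), port, and model path are all placeholders of my own:

```python
import pickle
import socketserver

def parse_command(line):
    """Split '/save NAME' or '/load NAME' off a line; anything else is a prompt."""
    if line.startswith("/save ") or line.startswith("/load "):
        cmd, _, arg = line.partition(" ")
        return cmd[1:], arg.strip()
    return None, line

def make_handler(llm):
    class Handler(socketserver.StreamRequestHandler):
        def handle(self):
            for raw in self.rfile:
                cmd, arg = parse_command(raw.decode().rstrip("\n"))
                if cmd == "save":
                    with open(arg + ".state", "wb") as f:
                        pickle.dump(llm.save_state(), f)
                elif cmd == "load":
                    with open(arg + ".state", "rb") as f:
                        llm.load_state(pickle.load(f))
                else:  # ordinary prompt: generate from the current state
                    out = llm(arg, max_tokens=256)
                    self.wfile.write(out["choices"][0]["text"].encode() + b"\n")
    return Handler

# To serve for real (requires a local GGUF model):
#   from llama_cpp import Llama
#   llm = Llama(model_path="rwkv7-g1-7.2b-q8_0.gguf", n_gpu_layers=-1)
#   socketserver.TCPServer(("127.0.0.1", 5555), make_handler(llm)).serve_forever()
```

Keeping the model resident behind a socket is what makes the instant-filter idea practical: clients pay only for state load and generation, never model load.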

Yes, LLMs are just overgrown auto-complete.

But auto-complete can be a useful tool.

Transformers with an O(L^2) context penalty are neat, but not practical or efficient.

State space models, with "unlimited context", O(1) per-token generation, and context captured in fixed-size memory states that can be saved and restored, are different.
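A back-of-envelope illustration of the difference, counting abstract per-token steps (attention lookbacks vs. fixed-size state updates), not real FLOPs:

```python
# Cost of generating token number i (1-indexed), in abstract steps:
def transformer_step(i):
    return i   # attention looks back over all i previous positions: O(L^2) total

def ssm_step(i):
    return 1   # fixed-size recurrent state update: O(1) per token, O(L) total

L = 10_000  # context length
transformer_total = sum(transformer_step(i) for i in range(1, L + 1))
ssm_total = sum(ssm_step(i) for i in range(1, L + 1))
# transformer_total = 50_005_000 steps vs ssm_total = 10_000 steps
```

And because the SSM's state is fixed-size, "resuming a context" is a constant-size load rather than re-reading the whole history.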

I *think* a 7B local SSM model *might* be a useful tool, with context specialists on small, discrete, scriptable tasks that justify a model living in VRAM at 8-bit quantization. #rwkv #llama

Finally! After I don't even know how much head-banging, I've got a State Space Model (RWKV7) running using llama.cpp with GPU offloading and memory states successfully saved and restored. And I *think* the same code can be minimally modified for Mamba (Mistral, Falcon).

Yay for infinite context length!

And for massive time and energy savings.

llama.cpp: https://github.com/ggml-org/llama.cpp

Llama CPP Python: https://pypi.org/project/llama-cpp-python/

RWKV7 live demo: https://huggingface.co/spaces/BlinkDL/RWKV-Gradio-2

#SSM #llama #rwkv #mamba

So I've been playing around with small language models locally: I grabbed and tried quite a number from different families to see how fast they generated, how they behaved, how much memory they needed, etc.

And ALL of them. 100% of the models I asked to tell me a joke, including RWKV7, chose the same joke:

"Why don't scientists trust atoms?"

"Because they make up everything."

How. Why. How is that ONE joke so over-represented in training data *everywhere*?

#rwkv #llm

So it occurs to me that #RWKV or #RetNet could both offer an alternative RAG implementation: saved memory states that have just read the priming information, used as the starting point for reading new prompts.
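The indexing side of that could look like the sketch below: prime the model on each document once, snapshot a named state, and at query time `load_state()` the relevant snapshot and feed only the question. This assumes llama-cpp-python-style `reset()`/`tokenize()`/`eval()`/`save_state()`; the directory layout is mine.

```python
import os
import pickle

def index_documents(llm, docs, state_dir="states"):
    """Prime the model on each document once and snapshot a named state.

    docs: dict mapping a short name to the document text.
    """
    os.makedirs(state_dir, exist_ok=True)
    for name, text in docs.items():
        llm.reset()  # fresh state per document
        llm.eval(llm.tokenize(text.encode("utf-8")))
        with open(os.path.join(state_dir, name + ".state"), "wb") as f:
            pickle.dump(llm.save_state(), f)
```

Unlike classic RAG, the retrieved "context" is already digested: loading a 34MB state is the whole cost of having read the document.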

With RWKV, there are no time oscillations. Wondering if loss on input1+input2, plus loss on mean_state(read_1, read_2) continuing to generate the target, could lead to composable memory: start from the mean of the states for the relevant priming documents and have grounding to process a session prompt?
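A toy sketch of the mean_state operation, with states represented here as dicts of flat float lists purely for illustration (the real recurrent state tensors would have to be pulled out of the runtime):

```python
def mean_state(states):
    """Element-wise average of recurrent states with identical layouts."""
    n = len(states)
    return {key: [sum(s[key][i] for s in states) / n
                  for i in range(len(states[0][key]))]
            for key in states[0]}

# The composable-memory experiment would start a session from
# mean_state([state_after_doc1, state_after_doc2]) and check whether
# generation stays grounded in both documents.
```

Whether an averaged state is anything like "having read both documents" is exactly the open question; nothing guarantees the state space is linear in that way.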

So I asked #GPT to explain #RWKV and when I asked for specifics in terms of layers and activations and such it spat out a #PyTorch implementation.

Read it. Liked it. Training it.

But as I try to decipher the papers for all the different RWKV V1..V7, I find that what GPT gave me is actually something like V1 plus per-dimension learned decay.

But it works and is simple enough to understand, though it's odd that there seems to have been some LLM improvisation in its design.

#GPT first clues me in to the existence of #RWKV and #RetNet, and then, when I ask for details on how each works, proceeds to spit out perfectly viable #PyTorch implementations of each...

Which *did* answer my questions beautifully...

But it also talked me into some experiments involving training one of each from scratch...

It turns out #LLMs love talking about implementing and training LLMs.

Effectively, reproduction? (any time they can find a willing partner with a GPU).