Unexpected Discovery:

Training a #RetNet model on the CPU isn't *nearly* as slow as I'd expected.

Yes, it's slower than the GPU, but probably by less than a factor of 10 - not hundreds of times slower.

Which means I can realistically move my model to the CPU periodically and train on nice *long* sequences.
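A minimal sketch of what I mean (a toy GRU as a stand-in model, not RetNet): run frequent short-sequence steps on the GPU when one is available, and occasionally hop to the CPU for long-sequence batches where RAM is plentiful. Plain SGD keeps the optimizer stateless across the device moves.

```python
import torch
import torch.nn as nn

# Toy model standing in for the real one; SGD has no per-parameter state,
# so moving the model between devices mid-training is safe.
model = nn.GRU(input_size=32, hidden_size=32, batch_first=True)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

def train_step(batch, device):
    model.to(device)                  # Module.to moves params in place
    out, _ = model(batch.to(device))
    loss = out.pow(2).mean()          # placeholder loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

device = "cuda" if torch.cuda.is_available() else "cpu"
short_loss = train_step(torch.randn(8, 64, 32), device)   # frequent, short
long_loss = train_step(torch.randn(2, 4096, 32), "cpu")   # occasional, long
```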

My expectations were warped by having moved straight from an Intel Atom netbook in 2016 to NVIDIA EC2 spot instances while playing with simple FFNs.

Never tried running or training on CPU since.

So it occurs to me that #RWKV and #RetNet could both offer an alternative RAG implementation: save the memory state right after the model has read the priming information, then start from that saved state when reading new prompts.
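A hedged sketch of that idea, with a GRU standing in for an RWKV/RetNet-style recurrent LM: read the priming document once, cache the final memory state, and initialize each new prompt from it instead of re-prepending the document every time.

```python
import torch
import torch.nn as nn

# Toy stand-in for a recurrent LM; the point is the saved-state workflow,
# not the architecture.
torch.manual_seed(0)
emb = nn.Embedding(1000, 32)
rnn = nn.GRU(32, 32, batch_first=True)

def read(tokens, state=None):
    """Run tokens through the model; return outputs and the final state."""
    out, state = rnn(emb(tokens), state)
    return out, state

priming_doc = torch.randint(0, 1000, (1, 256))  # the "retrieved" context
_, saved_state = read(priming_doc)              # cache this (e.g. to disk)

prompt = torch.randint(0, 1000, (1, 16))
out_primed, _ = read(prompt, saved_state)       # prompt grounded in the doc
out_cold, _ = read(prompt)                      # same prompt, no priming
primed_differs = not torch.allclose(out_primed, out_cold)
```

The saved state is a fixed-size tensor, so the "retrieval index" becomes a dictionary of document → state rather than document → text.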

With RWKV there are no time oscillations in the state. Wondering if training on the loss for input1+input2, plus the loss for continuing to generate the same target from mean_state(read_1, read_2), could lead to composable memory: start from the mean state of the relevant priming documents and have grounding to process a session prompt?
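The inference side of that idea can be sketched as follows (mean_state is the name from the post; the GRU is again a stand-in for RWKV, and whether the proposed training objective makes the mean actually behave like "read both documents" is the open question):

```python
import torch
import torch.nn as nn

# Toy stand-in model; illustrates composing memories by averaging states.
torch.manual_seed(0)
emb = nn.Embedding(1000, 32)
rnn = nn.GRU(32, 32, batch_first=True)

def read(tokens, state=None):
    out, state = rnn(emb(tokens), state)
    return out, state

doc1 = torch.randint(0, 1000, (1, 128))
doc2 = torch.randint(0, 1000, (1, 128))
_, s1 = read(doc1)
_, s2 = read(doc2)
mean_state = (s1 + s2) / 2                      # mean_state(read_1, read_2)

prompt = torch.randint(0, 1000, (1, 16))
out, final_state = read(prompt, mean_state)     # session grounded in both docs
```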

#GPT first clues me in to the existence of #RWKV and #RetNet, and then, when I ask for details on how each works, proceeds to spit out perfectly viable #PyTorch implementations of each...

Which *did* answer my questions beautifully...

But it also talked me into some experiments involving training one of each from scratch...

It turns out #LLMs love talking about implementing and training LLMs.

Effectively, reproduction? (any time they can find a willing partner with a GPU).

RetNet as a linear time-varying system - Qiita

The RetNet computation takes the following form: Y_i = X_i W_Q \sum_{j\in[0,i]} A^{i-j} W_K^\mathsf{T} X_j^\mathsf{T} X_j W_V, where…
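That formula can be implemented directly. A toy sketch of my own (not the Qiita article's code), with the decay matrix A reduced to a real scalar gamma - i.e. dropping RetNet's rotational part - showing that the recurrent state update S_i = gamma·S_{i-1} + K_i^T V_i with Y_i = Q_i S_i reproduces the parallel sum above:

```python
import torch

# Single-head retention with A = gamma * I (real scalar decay, no rotation).
torch.manual_seed(0)
d = 8
W_Q, W_K, W_V = (torch.randn(d, d) / d ** 0.5 for _ in range(3))
gamma = 0.9

def retention_recurrent(X):
    S = torch.zeros(d, d)                 # running state, one token at a time
    Y = []
    for x in X:
        q, k, v = x @ W_Q, x @ W_K, x @ W_V
        S = gamma * S + torch.outer(k, v)
        Y.append(q @ S)
    return torch.stack(Y)

def retention_parallel(X):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    i = torch.arange(X.shape[0])
    D = torch.tril(gamma ** (i[:, None] - i[None, :]).float())  # decay mask
    return (Q @ K.T * D) @ V

X = torch.randn(16, d)
same = torch.allclose(retention_recurrent(X), retention_parallel(X), atol=1e-4)
```

The O(1)-inference claim falls out of the recurrent form: the state S is a fixed d×d matrix regardless of sequence length.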
🌖 [2307.08621] Retentive Network: A Successor for Large Language Models
https://arxiv.org/abs/2307.08621
+ This Retentive Network (RetNet) concept sounds interesting and I'd like to learn more of the details.
#LLMs #RetNet #Transformer
Retentive Network: A Successor to Transformer for Large Language Models

In this work, we propose Retentive Network (RetNet) as a foundation architecture for large language models, simultaneously achieving training parallelism, low-cost inference, and good performance. We theoretically derive the connection between recurrence and attention. Then we propose the retention mechanism for sequence modeling, which supports three computation paradigms, i.e., parallel, recurrent, and chunkwise recurrent. Specifically, the parallel representation allows for training parallelism. The recurrent representation enables low-cost $O(1)$ inference, which improves decoding throughput, latency, and GPU memory without sacrificing performance. The chunkwise recurrent representation facilitates efficient long-sequence modeling with linear complexity, where each chunk is encoded parallelly while recurrently summarizing the chunks. Experimental results on language modeling show that RetNet achieves favorable scaling results, parallel training, low-cost deployment, and efficient inference. The intriguing properties make RetNet a strong successor to Transformer for large language models. Code will be available at https://aka.ms/retnet.

arXiv.org
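The third paradigm from the abstract, chunkwise recurrence, can be sketched in the same toy setting (my own illustrative code with scalar decay gamma and no rotation, assuming the sequence length is a multiple of the chunk size B): each chunk is encoded in parallel, while a fixed-size state carries the summary of earlier chunks.

```python
import torch

# Toy chunkwise retention; within-chunk work is parallel, a d x d state S
# recurrently summarizes previous chunks. Assumes len(X) % B == 0.
torch.manual_seed(0)
d, B = 8, 4
W_Q, W_K, W_V = (torch.randn(d, d) / d ** 0.5 for _ in range(3))
gamma = 0.9

def retention_recurrent(X):               # reference: token-by-token
    S, Y = torch.zeros(d, d), []
    for x in X:
        q, k, v = x @ W_Q, x @ W_K, x @ W_V
        S = gamma * S + torch.outer(k, v)
        Y.append(q @ S)
    return torch.stack(Y)

def retention_chunkwise(X):
    S, Y = torch.zeros(d, d), []
    t = torch.arange(B)
    D = torch.tril(gamma ** (t[:, None] - t[None, :]).float())
    for c in range(0, X.shape[0], B):
        Q, K, V = X[c:c+B] @ W_Q, X[c:c+B] @ W_K, X[c:c+B] @ W_V
        inner = (Q @ K.T * D) @ V                            # within chunk
        cross = (gamma ** (t + 1).float())[:, None] * (Q @ S)  # earlier chunks
        Y.append(inner + cross)
        S = gamma ** B * S + K.T @ ((gamma ** (B - 1 - t).float())[:, None] * V)
    return torch.cat(Y)

X = torch.randn(16, d)
chunk_matches = torch.allclose(retention_chunkwise(X), retention_recurrent(X),
                               atol=1e-4)
```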