➤ Retentive Network (RetNet): A Successor to Transformer for Large Language Models
✤ https://arxiv.org/abs/2307.08621
In this work, we propose Retentive Network (RetNet) as a foundation architecture for large language models, simultaneously achieving training parallelism, low-cost inference, and good performance. We theoretically derive the connection between recurrence and attention, then propose the retention mechanism for sequence modeling, which supports three computation paradigms: parallel, recurrent, and chunkwise recurrent. The parallel representation allows for training parallelism. The recurrent representation enables low-cost $O(1)$ inference, improving decoding throughput and latency and reducing GPU memory usage without sacrificing performance. The chunkwise recurrent representation enables efficient long-sequence modeling with linear complexity: each chunk is encoded in parallel while the chunks are summarized recurrently. Experimental results on language modeling show that RetNet achieves favorable scaling, parallel training, low-cost deployment, and efficient inference. These intriguing properties make RetNet a strong successor to Transformer for large language models. Code will be available at https://aka.ms/retnet.
+ The Retentive Network (RetNet) concept sounds intriguing; I'd like to learn more about the details.
#LargeLanguageModels #RetNet #Transformer
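The equivalence of the three computation paradigms is the core of the abstract, and it can be checked numerically. Below is a minimal NumPy sketch (my own illustration, not the paper's code) of single-head retention, showing that the parallel, recurrent, and chunkwise recurrent forms produce identical outputs. It omits the paper's xPos-style rotation, multi-scale per-head decays, group normalization, and gating; the decay `gamma` and all dimensions are toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4      # sequence length, head dimension (toy values)
gamma = 0.9      # decay factor (the paper fixes a distinct decay per head)

# Stand-ins for the projections Q = X W_Q, K = X W_K, V = X W_V
Q = rng.standard_normal((T, d))
K = rng.standard_normal((T, d))
V = rng.standard_normal((T, d))

# 1) Parallel form: (Q K^T ⊙ D) V with causal decay matrix D[n, m] = gamma^(n-m)
n = np.arange(T)
D = np.where(n[:, None] >= n[None, :],
             gamma ** (n[:, None] - n[None, :]), 0.0)
out_parallel = (Q @ K.T * D) @ V

# 2) Recurrent form: S_n = gamma * S_{n-1} + K_n^T V_n ; o_n = Q_n S_n
#    Only the d x d state S is carried between steps -> O(1) per-token inference.
S = np.zeros((d, d))
out_recurrent = np.zeros((T, d))
for t in range(T):
    S = gamma * S + np.outer(K[t], V[t])
    out_recurrent[t] = Q[t] @ S

# 3) Chunkwise recurrent form: within-chunk attention in parallel,
#    plus a cross-chunk term read from the recurrent state.
B = 3  # chunk size (assumes T % B == 0 for simplicity)
j = np.arange(B)
Db = np.where(j[:, None] >= j[None, :], gamma ** (j[:, None] - j[None, :]), 0.0)
S = np.zeros((d, d))
out_chunk = np.zeros((T, d))
for s in range(0, T, B):
    Qc, Kc, Vc = Q[s:s + B], K[s:s + B], V[s:s + B]
    inner = (Qc @ Kc.T * Db) @ Vc                      # within-chunk, parallel
    cross = (gamma ** (j + 1))[:, None] * (Qc @ S)     # contribution of past chunks
    out_chunk[s:s + B] = inner + cross
    # Roll the state forward one chunk: S <- gamma^B S + sum_j gamma^(B-1-j) K_j^T V_j
    S = (gamma ** B) * S + ((gamma ** (B - 1 - j))[:, None] * Kc).T @ Vc

# All three paradigms agree
assert np.allclose(out_parallel, out_recurrent)
assert np.allclose(out_parallel, out_chunk)
```

The recurrent form is what makes deployment cheap: decoding carries only the fixed-size state `S` instead of a growing key/value cache, while the chunkwise form amortizes that recurrence over blocks so long sequences can be trained with linear complexity.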