Retentive Network: A Successor to Transformer for Large Language Models

This is an exciting new paper that replaces attention in the Transformer architecture with a set of decomposable matrix operations, called retention. These retain the modeling capacity of Transformer models while allowing parallel training and efficient RNN-like inference, and they do not use a softmax.

At model sizes above 2B parameters it achieves lower perplexity than Transformer models, and it requires much less GPU memory and far fewer FLOPs than Transformers for inference.

Abstract:

In this work, we propose Retentive Network (RetNet) as a foundation architecture for large language models, simultaneously achieving training parallelism, low-cost inference, and good performance. We theoretically derive the connection between recurrence and attention. Then we propose the retention mechanism for sequence modeling, which supports three computation paradigms, i.e., parallel, recurrent, and chunkwise recurrent. Specifically, the parallel representation allows for training parallelism. The recurrent representation enables low-cost $O(1)$ inference, which improves decoding throughput, latency, and GPU memory without sacrificing performance. The chunkwise recurrent representation facilitates efficient long-sequence modeling with linear complexity, where each chunk is encoded parallelly while recurrently summarizing the chunks. Experimental results on language modeling show that RetNet achieves favorable scaling results, parallel training, low-cost deployment, and efficient inference. The intriguing properties make RetNet a strong successor to Transformer for large language models. Code will be available at https://aka.ms/retnet.
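To make the "parallel" and "recurrent" paradigms from the abstract concrete, here is a minimal sketch of my own in plain Python. It is not the authors' code. It assumes a simplified single-head retention with one scalar decay `gamma`, and it leaves out the position rotation, normalization, gating, and multi-scale heads used in the paper. The function names and the toy inputs are made up for illustration.

```python
# Simplified single-head retention (assumption: scalar decay, no rotation,
# no normalization). Q, K, V are lists of per-token vectors.

def retention_parallel(Q, K, V, gamma):
    """Parallel form: O = (Q K^T * D) V, with D[n][m] = gamma**(n-m) if n >= m else 0."""
    n_tokens, d_v = len(Q), len(V[0])
    out = []
    for n in range(n_tokens):
        row = [0.0] * d_v
        for m in range(n + 1):  # causal mask: only positions m <= n contribute
            score = sum(q * k for q, k in zip(Q[n], K[m])) * gamma ** (n - m)
            for j in range(d_v):
                row[j] += score * V[m][j]
        out.append(row)
    return out


def retention_recurrent(Q, K, V, gamma):
    """Recurrent form: S_n = gamma * S_{n-1} + K_n^T V_n, then o_n = Q_n S_n."""
    d_k, d_v = len(K[0]), len(V[0])
    # The state is a fixed d_k x d_v matrix; its size does not grow with sequence length.
    S = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for q, k, v in zip(Q, K, V):
        for i in range(d_k):
            for j in range(d_v):
                S[i][j] = gamma * S[i][j] + k[i] * v[j]
        out.append([sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)])
    return out


# Toy inputs (hypothetical values): 3 tokens, key and value dimension 2.
Q = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
K = [[1.0, 2.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gamma = 0.9

parallel_out = retention_parallel(Q, K, V, gamma)
recurrent_out = retention_recurrent(Q, K, V, gamma)
max_diff = max(
    abs(a - b)
    for p_row, r_row in zip(parallel_out, recurrent_out)
    for a, b in zip(p_row, r_row)
)
print("max difference between the two forms:", max_diff)
```

Both functions compute the same outputs up to floating-point rounding. The parallel form looks at all earlier tokens at once, which suits training. The recurrent form only carries the fixed-size state `S` from step to step, which is where the $O(1)$ per-token inference cost in the abstract comes from.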

missing, 10 months ago

If the claims here are true... wow, research and development are moving very quickly.

Lenguador, 10 months ago

This looks amazing, if true. The paper is claiming state of the art across literally every metric. Even in their ablation study the model outperforms all others.

I'm a bit suspicious that they don't extend their perplexity numbers to the 13B model or provide its hyperparameters, even though they reference it in the text and in their scaling table.

Code will be released in a week: https://github.com/microsoft/unilm/tree/master/retnet

KingsmanVince, 10 months ago

Unofficial implementation: https://github.com/Jamie-Stirling/RetNet

SSamDav, 10 months ago

Would love to know how it compares with Hyena on the LRA (Long Range Arena).
