Retentive Network: A Successor to Transformer for Large Language Models (arxiv.org)
This is an exciting new paper that replaces attention in the Transformer architecture with a set of decomposable matrix operations (no softmax), retaining the modeling capacity of Transformers while allowing parallel training and efficient RNN-like inference.
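The core idea behind the parallel/recurrent duality can be sketched with a toy single-head retention layer. This is a simplified illustration only, with made-up dimensions and decay value, and it omits the paper's xPos-style rotations, gating, and multi-scale heads; it just shows that a causal decay-masked matrix product (used for parallel training) and an O(1)-state recurrence (used at inference) compute the same outputs:

```python
import numpy as np

np.random.seed(0)
T, d = 5, 4          # toy sequence length and head dimension (illustrative)
gamma = 0.9          # exponential decay factor (illustrative value)
Q = np.random.randn(T, d)
K = np.random.randn(T, d)
V = np.random.randn(T, d)

# Parallel form (training): O = (Q K^T * D) V, where D is a causal mask
# whose (n, m) entry decays as gamma^(n-m) for m <= n and is 0 otherwise.
n = np.arange(T)
D = np.where(n[:, None] >= n[None, :],
             gamma ** (n[:, None] - n[None, :]), 0.0)
O_parallel = (Q @ K.T * D) @ V

# Recurrent form (inference): a single d x d state S summarizes the
# entire history, so each step costs O(d^2) regardless of sequence length.
S = np.zeros((d, d))
O_recurrent = np.zeros((T, d))
for t in range(T):
    S = gamma * S + np.outer(K[t], V[t])   # S_t = gamma * S_{t-1} + k_t^T v_t
    O_recurrent[t] = Q[t] @ S              # o_t = q_t S_t

assert np.allclose(O_parallel, O_recurrent)
```

Because there is no softmax coupling all positions together, the sum over past tokens factors into this running state, which is what makes the attention-free RNN-like decoding possible.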
Extending Context Window of Large Language Models via Positional Interpolation (arxiv.org)
An interesting technique for extending the context window of language models by fine-tuning on a small number of samples after pretraining.
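The trick, roughly, is to rescale positions rather than extrapolate them: instead of feeding RoPE position indices beyond the trained range, positions are linearly interpolated back into it. A minimal sketch, assuming standard RoPE frequencies and illustrative lengths (2048 trained, 8192 target):

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    # Standard RoPE frequencies: theta_i = base^(-2i/dim)
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(positions, inv_freq)   # shape: (num_positions, dim/2)

L_train, L_new, dim = 2048, 8192, 64       # illustrative lengths
scale = L_train / L_new                    # interpolation factor (0.25 here)

# Naive extrapolation: positions past L_train yield rotation angles the
# model never saw during pretraining.
angles_extrap = rope_angles(np.arange(L_new), dim)

# Positional interpolation: shrink the position indices so all 8192
# positions map into the trained angle range before computing rotations.
angles_interp = rope_angles(np.arange(L_new) * scale, dim)

# The largest interpolated angle stays inside the trained domain.
assert angles_interp.max() < L_train
```

After this rescaling, a brief fine-tuning pass adapts the model to the denser position spacing, which is why only a small number of samples is needed.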