RecurrentGemma: Moving Past Transformers for Efficient Open Language Models
Botev Aleksandar, De Soham, Smith Samuel L, Fernando Anushan, Muraru George-Cristian, Haroun Ruba, Berrada Leonard, Pascanu Razvan, Sessa Pier Giuseppe, Dadashi Robert, Hussenot Léonard, Ferret Johan, Girgin Sertan, Bachem Olivier, Andreev Alek, Kenealy Kathleen, Mesnard Thomas, Hardin Cassidy, Bhupatiraju Surya, Pathak Shreya, Sifre Laurent, Rivière Morgane, Kale Mihir Sanjay, Love Juliette, Tafti Pouya, Joulin Armand, Fiedel Noah, Senter Evan, Chen Yutian, Srinivasan Srivatsan, Desjardins Guillaume, Budden David, Doucet Arnaud, Vikram Sharad, Paszke Adam, Gale Trevor, Borgeaud Sebastian, Chen Charlie, Brock Andy, Paterson Antonia, Brennan Jenny, Risdal Meg, Gundluru Raj, Devanathan Nesh, Mooney Paul, Chauhan Nilay, Culliton Phil, Martins Luiz Gustavo, Bandy Elisa, Huntsperger David, Cameron Glenn, Zucker Arthur, Warkentin Tris, Peran Ludovic, Giang Minh, Ghahramani Zoubin, Farabet Clément, Kavukcuoglu Koray, Hassabis Demis, Hadsell Raia, Teh Yee Whye, De Freitas Nando. arXiv 2024
[Paper]
Attention Mechanism
Model Architecture
Pretraining Methods
Reinforcement Learning
Transformer
We introduce RecurrentGemma, a family of open language models which uses
Google’s novel Griffin architecture. Griffin combines linear recurrences with
local attention to achieve excellent performance on language. It has a
fixed-sized state, which reduces memory use and enables efficient inference on
long sequences. We provide two sizes of models, containing 2B and 9B
parameters, and provide pre-trained and instruction tuned variants for both.
Our models achieve comparable performance to similarly-sized Gemma baselines
despite being trained on fewer tokens.
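
The key efficiency claim in the abstract is the fixed-size recurrent state: unlike a transformer's KV cache, the memory needed during generation does not grow with sequence length. The sketch below is not the official RecurrentGemma implementation; it is a minimal, illustrative gated linear recurrence in JAX (function and variable names such as `linear_recurrence`, `x`, and `a` are assumptions for illustration) showing how a single fixed-size hidden vector can carry the sequence history.

```python
# Minimal sketch (illustrative only, not the official RecurrentGemma code)
# of a gated linear recurrence: the cached state is one fixed-size vector,
# so decoding memory stays constant as the sequence grows.
import jax
import jax.numpy as jnp


def linear_recurrence(x, a):
    """Scan h_t = a_t * h_{t-1} + (1 - a_t) * x_t over the time axis.

    x, a: arrays of shape (seq_len, hidden_dim); each a_t lies in (0, 1)
    and acts as a per-channel decay. Returns all hidden states; only the
    last one needs to be cached for autoregressive decoding.
    """
    def step(h_prev, inputs):
        x_t, a_t = inputs
        h_t = a_t * h_prev + (1.0 - a_t) * x_t
        return h_t, h_t

    h0 = jnp.zeros(x.shape[-1])
    _, hs = jax.lax.scan(step, h0, (x, a))
    return hs


# Toy usage: regardless of how many tokens have been processed, the
# carried state is a single (hidden_dim,) vector.
key = jax.random.PRNGKey(0)
seq_len, hidden_dim = 8, 4
x = jax.random.normal(key, (seq_len, hidden_dim))
a = jax.nn.sigmoid(jax.random.normal(key, (seq_len, hidden_dim)))  # decays in (0, 1)
states = linear_recurrence(x, a)
print(states.shape)  # (8, 4)
```

In the Griffin architecture such recurrent blocks are interleaved with local (sliding-window) attention, whose cache is also bounded by the window size, which is what keeps inference memory constant on long sequences.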
Similar Work