Gemma 2: Improving Open Language Models at a Practical Size
Gemma Team, Riviere Morgane, Pathak Shreya, Sessa Pier Giuseppe, Hardin Cassidy, Bhupatiraju Surya, Hussenot Léonard, Mesnard Thomas, Shahriari Bobak, Ramé Alexandre, Ferret Johan, Liu Peter, Tafti Pouya, Friesen Abe, Casbon Michelle, Ramos Sabela, Kumar Ravin, Lan Charline Le, Jerome Sammy, Tsitsulin Anton, Vieillard Nino, Stanczyk Piotr, Girgin Sertan, Momchev Nikola, Hoffman Matt, Thakoor Shantanu, Grill Jean-bastien, Neyshabur Behnam, Bachem Olivier, Walton Alanna, Severyn Aliaksei, Parrish Alicia, Ahmad Aliya, Hutchison Allen, Abdagic Alvin, Carl Amanda, Shen Amy, Brock Andy, Coenen Andy, Laforge Anthony, Paterson Antonia, Bastian Ben, Piot Bilal, Wu Bo, Royal Brandon, Chen Charlie, Kumar Chintu, Perry Chris, Welty Chris, Choquette-choo Christopher A., Sinopalnikov Danila, Weinberger David, Vijaykumar Dimple, Rogozińska Dominika, Herbison Dustin, Bandy Elisa, Wang Emma, Noland Eric, Moreira Erica, Senter Evan, Eltyshev Evgenii, Visin Francesco, Rasskin Gabriel, Wei Gary, Cameron Glenn, Martins Gus, Hashemi Hadi, Klimczak-plucińska Hanna, Batra Harleen, Dhand Harsh, Nardini Ivan, Mein Jacinda, Zhou Jack, Svensson James, Stanway Jeff, Chan Jetha, Zhou Jin Peng, Carrasqueira Joana, Iljazi Joana, Becker Jocelyn, Fernandez Joe, Van Amersfoort Joost, Gordon Josh, Lipschultz Josh, Newlan Josh, Ji Ju-yeong, Mohamed Kareem, Badola Kartikeya, Black Kat, Millican Katie, Mcdonell Keelin, Nguyen Kelvin, Sodhia Kiranbir, Greene Kish, Sjoesund Lars Lowe, Usui Lauren, Sifre Laurent, Heuermann Lena, Lago Leticia, Mcnealus Lilly, Soares Livio Baldini, Kilpatrick Logan, Dixon Lucas, Martins Luciano, Reid Machel, Singh Manvinder, Iverson Mark, Görner Martin, Velloso Mat, Wirth Mateo, Davidow Matt, Miller Matt, Rahtz Matthew, Watson Matthew, Risdal Meg, Kazemi Mehran, Moynihan Michael, Zhang Ming, Kahng Minsuk, Park Minwoo, Rahman Mofi, Khatwani Mohit, Dao Natalie, Bardoliwalla Nenshad, Devanathan Nesh, Dumai Neta, Chauhan Nilay, Wahltinez Oscar, Botarda Pankil, Barnes Parker, Barham Paul, Michel Paul, Jin Pengchong, Georgiev Petko, Culliton Phil, Kuppala Pradeep, Comanescu Ramona, Merhej Ramona, Jana Reena, Rokni Reza Ardeshir, Agarwal Rishabh, Mullins Ryan, Saadat Samaneh, Carthy Sara Mc, Perrin Sarah, Arnold Sébastien M. R., Krause Sebastian, Dai Shengyang, Garg Shruti, Sheth Shruti, Ronstrom Sue, Chan Susan, Jordan Timothy, Yu Ting, Eccles Tom, Hennigan Tom, Kocisky Tomas, Doshi Tulsee, Jain Vihan, Yadav Vikas, Meshram Vilobh, Dharmadhikari Vishal, Barkley Warren, Wei Wei, Ye Wenming, Han Woohyun, Kwon Woosuk, Xu Xiang, Shen Zhe, Gong Zhitao, Wei Zichuan, Cotruta Victor, Kirk Phoebe, Rao Anand, Giang Minh, Peran Ludovic, Warkentin Tris, Collins Eli, Barral Joelle, Ghahramani Zoubin, Hadsell Raia, Sculley D., Banks Jeanine, Dragan Anca, Petrov Slav, Vinyals Oriol, Dean Jeff, Hassabis Demis, Kavukcuoglu Koray, Farabet Clement, Buchatskaya Elena, Borgeaud Sebastian, Fiedel Noah, Joulin Armand, Kenealy Kathleen, Dadashi Robert, Andreev Alek. Arxiv 2024
[Paper]
Attention Mechanism
Distillation
Efficiency And Optimization
Model Architecture
Pretraining Methods
Reinforcement Learning
Transformer
In this work, we introduce Gemma 2, a new addition to the Gemma family of
lightweight, state-of-the-art open models, ranging in scale from 2 billion to
27 billion parameters. In this new version, we apply several known technical
modifications to the Transformer architecture, such as interleaving local and
global attention layers (Beltagy et al., 2020a) and grouped-query attention
(Ainslie et al., 2023). We also train the 2B and 9B models with knowledge
distillation (Hinton et al., 2015) instead of next-token prediction. The
resulting models deliver the best performance for their size, and even offer
competitive alternatives to models that are 2-3 times bigger. We release all
our models to the community.
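
The two techniques highlighted in the abstract can be illustrated with short sketches. First, a minimal, hypothetical layer-pattern helper showing how local sliding-window and global attention layers might be interleaved, with grouped-query attention sharing key/value heads across query heads. The function name, layer alternation, window size, and head counts are illustrative assumptions, not the paper's configuration:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AttentionLayerConfig:
    kind: str                   # "local_sliding" or "global"
    window_size: Optional[int]  # only used by local sliding-window layers
    num_query_heads: int
    num_kv_heads: int           # fewer KV than query heads => grouped-query attention


def interleaved_layer_configs(num_layers: int,
                              window_size: int = 4096,
                              num_query_heads: int = 16,
                              num_kv_heads: int = 8) -> List[AttentionLayerConfig]:
    """Alternate local sliding-window and global attention across layers."""
    return [
        AttentionLayerConfig(
            kind="local_sliding" if i % 2 == 0 else "global",
            window_size=window_size if i % 2 == 0 else None,
            num_query_heads=num_query_heads,
            num_kv_heads=num_kv_heads,
        )
        for i in range(num_layers)
    ]
```

Second, a minimal sketch of a knowledge-distillation objective, assuming a frozen teacher provides per-token logits and the student is trained to match the teacher's next-token distribution via KL divergence (PyTorch; the temperature and reduction choices are assumptions rather than the paper's recipe):

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student) over next-token distributions.

    Both logit tensors have shape [batch, seq_len, vocab_size]; the teacher
    is assumed frozen, so no gradient flows through its logits.
    """
    vocab = student_logits.size(-1)
    s = F.log_softmax(student_logits.reshape(-1, vocab) / temperature, dim=-1)
    t = F.log_softmax(teacher_logits.detach().reshape(-1, vocab) / temperature, dim=-1)
    # "batchmean" divides by the number of rows, i.e. averages per token here.
    kl = F.kl_div(s, t, log_target=True, reduction="batchmean")
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures
    # (Hinton et al., 2015).
    return kl * temperature ** 2
```

In a training loop, this term would replace (or be mixed with) the standard cross-entropy loss on ground-truth next tokens when optimizing the student.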
Similar Work