
BASE TTS: Lessons From Building A Billion-parameter Text-to-speech Model On 100K Hours Of Data

Łajszczak Mateusz, Cámbara Guillermo, Li Yang, Beyhan Fatih, Van Korlaar Arent, Yang Fan, Joly Arnaud, Martín-Cortinas Álvaro, Abbas Ammar, Michalski Adam, Moinet Alexis, Karlapati Sri, Muszyńska Ewa, Guo Haohan, Putrycz Bartosz, López Gambino Soledad, Yoo Kayeon, Sokolova Elena, Drugman Thomas. arXiv 2024

[Paper] [Code]    
GPT · Has Code · Model Architecture · Pretraining Methods · Tokenization · Transformer

We introduce a text-to-speech (TTS) model called BASE TTS, which stands for **B**ig **A**daptive **S**treamable TTS with **E**mergent abilities. BASE TTS is the largest TTS model to date, trained on 100K hours of public domain speech data, achieving a new state of the art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts raw text into discrete codes ("speechcodes"), followed by a convolution-based decoder that converts these speechcodes into waveforms in an incremental, streamable manner. Further, our speechcodes are built using a novel speech tokenization technique that features speaker ID disentanglement and compression with byte-pair encoding. Echoing the widely reported "emergent abilities" of large language models trained on increasing volumes of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences. We design and share a specialized dataset to measure these emergent abilities for text-to-speech. We showcase the state-of-the-art naturalness of BASE TTS by evaluating against baselines that include publicly available large-scale text-to-speech systems: YourTTS, Bark, and TortoiseTTS. Audio samples generated by the model can be heard at https://amazon-ltts-paper.com/.
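
The abstract describes a two-stage pipeline: an autoregressive Transformer that predicts discrete speechcodes from text, and a convolutional decoder that turns those codes into waveform samples chunk by chunk. The sketch below is a minimal, untrained PyTorch illustration of that shape only; the vocabulary sizes, model width, BOS convention, and greedy decoding loop are all assumptions for demonstration, and the paper's speaker-ID disentanglement and byte-pair-encoding compression of the speechcodes are omitted.

```python
import torch
import torch.nn as nn

TEXT_VOCAB = 256   # assumption: byte-level text vocabulary
CODE_VOCAB = 1024  # assumption: speechcode vocabulary size
D_MODEL = 512      # assumption: toy width, far below the 1B-parameter model


def causal_mask(t: int) -> torch.Tensor:
    # Upper-triangular -inf mask so each speechcode attends only to its past.
    return torch.triu(torch.full((t, t), float("-inf")), diagonal=1)


class SpeechcodeLM(nn.Module):
    """Stage 1: autoregressive Transformer mapping text tokens to speechcodes."""

    def __init__(self):
        super().__init__()
        self.text_emb = nn.Embedding(TEXT_VOCAB, D_MODEL)
        self.code_emb = nn.Embedding(CODE_VOCAB, D_MODEL)
        layer = nn.TransformerDecoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.head = nn.Linear(D_MODEL, CODE_VOCAB)

    def forward(self, text_ids, code_ids):
        memory = self.text_emb(text_ids)          # conditioning on the text
        tgt = self.code_emb(code_ids)
        h = self.decoder(tgt, memory, tgt_mask=causal_mask(code_ids.size(1)))
        return self.head(h)                       # next-speechcode logits


class SpeechcodeVocoder(nn.Module):
    """Stage 2: convolutional decoder turning speechcodes into waveform samples."""

    def __init__(self, upsample: int = 256):      # assumption: samples per code
        super().__init__()
        self.emb = nn.Embedding(CODE_VOCAB, D_MODEL)
        self.net = nn.Sequential(
            nn.ConvTranspose1d(D_MODEL, 128, kernel_size=upsample, stride=upsample),
            nn.ReLU(),
            nn.Conv1d(128, 1, kernel_size=7, padding=3),
            nn.Tanh(),
        )

    def forward(self, code_ids):
        x = self.emb(code_ids).transpose(1, 2)    # (B, D, T)
        return self.net(x).squeeze(1)             # (B, T * upsample)


lm, vocoder = SpeechcodeLM(), SpeechcodeVocoder()
text = torch.randint(0, TEXT_VOCAB, (1, 32))      # dummy text token ids
codes = torch.zeros(1, 1, dtype=torch.long)       # assumption: BOS speechcode = 0
with torch.no_grad():
    for _ in range(16):                           # greedy autoregressive decoding
        nxt = lm(text, codes)[:, -1:].argmax(-1)
        codes = torch.cat([codes, nxt], dim=1)
        # Streaming: each new run of codes could be vocoded as a chunk here.
    wave = vocoder(codes)                         # (1, 17 * 256) audio samples
```

Because the second stage is purely convolutional with a fixed receptive field, it can vocode speechcodes in chunks as the Transformer emits them, which is what makes the incremental, streamable synthesis described in the abstract possible.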

Similar Work