CAMB.AI introduces MARS5, a fully open-source (commercially usable) TTS model with breakthrough prosody and realism, available on our GitHub: https://www.github.com/camb-ai/mars5-tts
Watch our full release video here:
https://www.youtube.com/watch?v=bmJSLPYrKtE
Why is it different?
MARS5 can replicate a performance from just 2-3 seconds of reference audio in 140+ languages, even in extremely tough prosodic scenarios like sports commentary, movies, and anime; hard prosody that most closed-source and open-source TTS models struggle with today.
We're excited for you to try, build on, and use MARS5 for research and creative applications. Let us know your feedback on our Discord!
Highlights:
Training data: trained on 150K+ hours of data.
Params: 1.2B total (~750M AR / ~450M NAR)
Multilingual: open-sourcing English to begin with, but MARS5 is available in 140+ languages on camb.ai
Diversity in prosody: handles very hard prosodic elements like commentary, shouting, anime, etc.
The model follows a two-stage setup, operating on 6kbps encodec tokens. Concretely, it consists of a ~750M parameter autoregressive part (which we call the AR model) and a ~450M parameter non-autoregressive multinomial diffusion part (which we call the NAR model). The AR model iteratively predicts the most coarse (lowest level) codebook value for the encodec features, while the NAR model takes the AR output and infers the remaining codebook values in a discrete denoising diffusion task. Specifically, the NAR model is trained as a DDPM using a multinomial distribution on encodec features, effectively ‘inpainting’ the remaining codebook entries after the AR model has predicted the coarse codebook values.
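The two-stage pipeline above can be sketched in code. This is a minimal illustrative mock-up, not the real implementation: the function names, shapes, and the use of random stand-ins for both networks are assumptions for illustration (see the GitHub repo for the actual model); only the overall flow, AR predicts the coarsest codebook level, then NAR diffusion fills in the rest, reflects the description.

```python
import numpy as np

# Hypothetical sketch of MARS5's two-stage inference as described above.
# Model interfaces and constants are assumptions; random sampling stands in
# for the real ~750M AR and ~450M NAR networks.

N_CODEBOOKS = 8       # encodec at 6 kbps uses a hierarchy of residual codebooks
CODEBOOK_SIZE = 1024  # entries per codebook (typical for encodec)

def ar_predict_coarse(n_frames, rng):
    """Stage 1 (assumed interface): the AR model autoregressively predicts
    the coarsest (level-0) codebook index for each frame. A random draw
    replaces the real network's sampled softmax here."""
    coarse = np.empty(n_frames, dtype=np.int64)
    for t in range(n_frames):
        # real model: sample from a distribution conditioned on text,
        # reference audio, and the indices predicted so far
        coarse[t] = rng.integers(0, CODEBOOK_SIZE)
    return coarse

def nar_denoise_remaining(coarse, n_steps, rng):
    """Stage 2 (assumed interface): the NAR model runs multinomial
    (discrete) diffusion, iteratively refining the remaining codebook
    levels while level 0 stays fixed ('inpainting')."""
    n_frames = coarse.shape[0]
    # start from uniform noise over codebook indices for levels 1..7
    tokens = rng.integers(0, CODEBOOK_SIZE, size=(N_CODEBOOKS, n_frames))
    tokens[0] = coarse  # the coarse level from the AR model is never resampled
    for _ in range(n_steps):
        # real model: predict multinomial distributions for the noisy levels
        # and re-sample; a fresh random draw stands in for that step here
        tokens[1:] = rng.integers(0, CODEBOOK_SIZE,
                                  size=(N_CODEBOOKS - 1, n_frames))
    return tokens

rng = np.random.default_rng(0)
coarse = ar_predict_coarse(n_frames=75, rng=rng)
codes = nar_denoise_remaining(coarse, n_steps=4, rng=rng)
print(codes.shape)  # (8, 75): a full encodec token grid, ready to decode to audio
```

The design choice the sketch highlights: only the slow, sequential AR pass scales with output length token-by-token, while the NAR diffusion pass refines all frames of the remaining codebooks in parallel.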
The model was trained on a combination of publicly available datasets and data provided internally by our customers, which include large sports leagues and international creatives.
Links:
Discord: discord.gg/4GVdQ28cZC
Github: github.com/camb-ai/mars5-tts
Website: camb.ai
Youtube: youtube.com/@camb-ai