This paper introduces VALL-E 2, the latest advancement in neural codec language models that marks a milestone in zero-shot text-to-speech synthesis (TTS), achieving human parity for the first time.