Abstract:
Neural end-to-end text-to-speech (TTS) is superior to conventional statistical methods in many ways. However, the exposure bias problem, which arises from the mismatch between the training and inference processes in autoregressive models, remains an issue. It often leads to performance degradation, especially on out-of-domain test data. To address this problem, we study a novel decoding knowledge transfer strategy and propose a multi-teacher knowledge distillation (MT-KD) network for Tacotron-based TTS models. The idea is to pre-train two Tacotron2-based TTS teacher models in teacher forcing and scheduled sampling modes, respectively, and to transfer the pre-trained knowledge to a student model that performs free running decoding. In this way, the student model learns to emulate the behaviors of the two teachers while minimizing the mismatch between training and run-time inference. Experiments on both Chinese and English data show that the MT-KD system consistently outperforms competitive baselines in terms of naturalness, robustness and expressiveness on in-domain and out-of-domain test data. Furthermore, we show that knowledge distillation outperforms adversarial learning and data augmentation in addressing the exposure bias problem.
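The three decoding modes named in the abstract differ only in which frame is fed back to the autoregressive decoder at each step. The following is a minimal sketch under stated assumptions, not the Tacotron2 decoder: `toy_step`, `decode`, the scalar "frames", the start frame of 1.0 and the `sampling_prob` parameter are all hypothetical names and values chosen for illustration.

```python
import random


def toy_step(prev_frame, weight=0.9):
    """Hypothetical one-step decoder: predict the next frame from the previous one."""
    return weight * prev_frame


def decode(targets, mode, sampling_prob=0.5, seed=0):
    """Run the toy decoder over a sequence under one of three feeding modes.

    teacher_forcing:    always feed back the ground-truth frame
    scheduled_sampling: feed back the model's own prediction with probability
                        sampling_prob, otherwise the ground-truth frame
    free_running:       always feed back the model's own prediction
    """
    rng = random.Random(seed)
    prev = 1.0  # assumed start ("go") frame
    outputs = []
    for target in targets:
        pred = toy_step(prev)
        outputs.append(pred)
        if mode == "teacher_forcing":
            prev = target
        elif mode == "scheduled_sampling":
            prev = pred if rng.random() < sampling_prob else target
        elif mode == "free_running":
            prev = pred
        else:
            raise ValueError(f"unknown mode: {mode}")
    return outputs


if __name__ == "__main__":
    ground_truth = [1.0, 1.0, 1.0]
    for mode in ("teacher_forcing", "scheduled_sampling", "free_running"):
        print(mode, decode(ground_truth, mode))
```

In this toy setting the free running outputs drift away from the teacher forcing outputs as prediction errors are fed back, which is the training/inference mismatch that the abstract calls exposure bias. Inference always runs in free running mode because no ground-truth frames are available.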




Fig. 1: An illustration of our multi-teacher knowledge distillation (MT-KD) scheme for end-to-end TTS. A blue box indicates that its parameters are randomly initialized, while a green box indicates that its parameters are initialized from the parameters pre-trained in Phase I.
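The second phase of the scheme, in which a free running student is trained to emulate two pre-trained teachers, can be pictured with a toy objective. This page does not state the actual distillation loss, so the sketch below assumes a weighted mean squared distance between the student's output frames and each teacher's output frames, with equal weights by default. The names `mse` and `multi_teacher_loss` and the scalar frames are hypothetical.

```python
def mse(a, b):
    """Mean squared distance between two equal-length frame sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def multi_teacher_loss(student_out, teacher_outs, weights=None):
    """Assumed distillation objective: weighted sum of distances to each teacher.

    student_out:  frames produced by the student in free running mode
    teacher_outs: list of frame sequences, one per pre-trained teacher
    weights:      per-teacher weights; equal weighting if omitted (an assumption)
    """
    if weights is None:
        weights = [1.0 / len(teacher_outs)] * len(teacher_outs)
    return sum(w * mse(student_out, t) for w, t in zip(weights, teacher_outs))


if __name__ == "__main__":
    student = [1.0, 2.0]
    teacher_tf = [1.0, 2.0]  # stand-in for the teacher forcing teacher's output
    teacher_ss = [3.0, 2.0]  # stand-in for the scheduled sampling teacher's output
    print(multi_teacher_loss(student, [teacher_tf, teacher_ss]))
```

Because the student's outputs come from free running decoding, minimizing such a loss trains the student under the same conditions it meets at inference time.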




The in-domain and out-of-domain test sets can be found at cn, en.


Main Results



English Speech Samples:


(GL: Griffin-Lim algorithm [1]; PW: Parallel-WaveGAN vocoder [2])

Systems (columns): TF [3], SS [4], TF-GAN [5], SS-GAN [5], MT-KD, TF-KD [6], SS-KD, Ground Truth
Naturalness Evaluation
(1) Input Text: A very few years saw the birth of Roman character not only in Italy, but in Germany and France.
GL
PW
(2) Input Text: In fourteen sixty-five Sweynheim and Pannartz began printing in the monastery of Subiaco near Rome.
GL
PW
(3) Input Text: But which must certainly have come from the study of the twelfth or even the eleventh century MSS.
GL
PW
(4) Input Text: They printed very few books in this type, three only; but in their very first books in Rome, beginning with the year fourteen sixty-eight.
GL
PW
(5) Input Text: They discarded this for a more completely Roman and far less beautiful letter.
GL
PW
Robustness Evaluation
(1) Input Text: S D S D Pass zero - zero Fail - zero to zero - zero - zero Cancelled - fifty nine to three - two sixty four Total - fifty nine to three - two
GL ____
PW ____
(2) Input Text: Mine is here backslash backslash g a b e h a l l hyphen m o t h r a backslash S v r underscore O f f i c e s v r
GL ____
PW ____
(3) Input Text: S D S D Pass zero - zero Fail - zero to zero - zero - zero Cancelled - fifty nine to three - two sixty four Total - fifty nine to three - twoh t t p colon slash slash teams slash sites slash T A G slash default dot aspx As always , any feedback , comments ,
GL ____
PW ____
(4) Input Text: You can call me directly at four two five seven zero three seven three four four or my cell four two five four four four seven four seven four or send me a meeting request with all the appropriate information .
GL ____
PW ____
(5) Input Text: Failed zero point zero zero percent < one zero zero one zero zero zero zero Internal . Exchange . ContentFilter . BVT ContentFilter . BVT_log . xml Error ! Filename not specified .
GL ____
PW ____
Expressiveness Evaluation
(1) Input Text: For short, | do not remember every detail, | every mood. | How can I believe that there is still pure love in this world?
GL ____
PW ____
(2) Input Text: Honestly, | if you are not willing to sound stupid, | you don't deserve to be in love.
GL ____
PW ____
(3) Input Text: Be nice to people on the way up, | because you'll need them | on your way down.
GL ____
PW ____
(4) Input Text: Good things | come to those who smile. | Have you smile today? | Keep smiling.
GL ____
PW ____
(5) Input Text: A person who truly loves you | will never let you go, | no matter how hard | the situation is.
GL ____
PW ____


Chinese Speech Samples:


(GL: Griffin-Lim algorithm [1]; PW: Parallel-WaveGAN vocoder [2])

Systems (columns): TF [3], SS [4], TF-GAN [5], SS-GAN [5], MT-KD, TF-KD [6], SS-KD, Ground Truth
Naturalness Evaluation
(1) Input Text: 赵凌飞的话,反映了沈阳赛区所有奥运志愿者的共同心声。
GL
PW
(2) Input Text: 因为,我们所发出的力量必会因难度加大而减弱。
GL
PW
(3) Input Text: 我希望每个人都能够尊重我们的隐私。
GL
PW
(4) Input Text: 漫天的红霞使劲给两人增添气氛。
GL
PW
(5) Input Text: 晚上加完班开车回家,太累了,迷迷糊糊开着车,走一半的时候,"铛"一声!
GL
PW
Robustness Evaluation
(1) Input Text: 第一, 家庭整体实力不错,至少是收入稳定的中产阶级吧。
GL ____
PW ____
(2) Input Text: 第二, 父母非常负责, 常年支持孩子的体育项目, 这背后, 无论是钱, 精力, 时间的投入, 都是海量的。
GL ____
PW ____
(3) Input Text: 第三, 这个孩子能够坚持下来, 而且成绩不错, 说明这个学生既有恒心, 又有方法, 还获得过真实的成功。
GL ____
PW ____
(4) Input Text: 这说明, 你在内心深处充分相信资源是充足的, 是对未来充满信心的, 不只是思想上相信, 而且是本能上彻底相信。
GL ____
PW ____
(5) Input Text: 所以你看, 把一件事做成会有两个时间点, 第一是把事情做成了的那一天, 第二是在心理上真的认为这事做成了的那一天。
GL ____
PW ____
Expressiveness Evaluation
(1) Input Text: 神龟虽寿,| 犹有竟时。
GL ____
PW ____
(2) Input Text: 枯藤老树昏鸦, | 小桥流水人家, | 古道西风瘦马。
GL ____
PW ____
(3) Input Text: 夕阳西下, | 断肠人在天涯。
GL ____
PW ____
(4) Input Text: 在一个微信群看到一句话说, | 一个人成熟的重要标志, | 是在吃自助餐的时候 | 能不吃撑。
GL ____
PW ____
(5) Input Text: 我倒觉得未必是有意为之了, | 不过这件事倒确实是在提醒我们, | 定长期的事情, | 一定要考虑到长期的变量, | 没准会有奇妙的效果。
GL ____
PW ____


Section V: Evaluation on GST-Tacotron

English Speech Samples:


Systems (columns): GST-TF, GST-SS, GST-MT-KD, GST-TF-KD, GST-SS-KD, Ground Truth
Naturalness Evaluation
(1) Input Text: A very few years saw the birth of Roman character not only in Italy, but in Germany and France.
GL
PW
(2) Input Text: In fourteen sixty-five Sweynheim and Pannartz began printing in the monastery of Subiaco near Rome.
GL
PW
(3) Input Text: But which must certainly have come from the study of the twelfth or even the eleventh century MSS.
GL
PW
(4) Input Text: They printed very few books in this type, three only; but in their very first books in Rome, beginning with the year fourteen sixty-eight.
GL
PW
(5) Input Text: They discarded this for a more completely Roman and far less beautiful letter.
GL
PW
Robustness Evaluation
(1) Input Text: that the forms of printed letters should follow more or less closely those of the written character, and they followed them very closely.
GL
PW
(2) Input Text: The first books were printed in black letter, i.e. the letter which was a Gothic development of the ancient Roman character
GL
PW
(3) Input Text: and which developed more completely and satisfactorily on the side of the "lower-case" than the capital letters;
GL
PW
(4) Input Text: and was in fact the kind of letter used in the many splendid missals, psalters, etc., produced by printing in the fifteenth century.
GL
PW
(5) Input Text: But the first Bible actually dated (which also was printed at Maintz by Peter Schoeffer in the year 1462)
GL
PW
Expressiveness Evaluation
(1) Input Text: It must be said that it is in no way like the transition type of Subiaco,
GL
PW
(2) Input Text: A further development of the Roman letter took place at Venice.
GL
PW
(3) Input Text: John of Spires and his brother Vindelin, followed by Nicholas Jenson, began to print in that city.
GL
PW
(4) Input Text: their type is on the lines of the German and French rather than of the Roman printers.
GL
PW
(5) Input Text: After his death in the "fourteen eighties," or at least by 1490, printing in Venice had declined very much;
GL
PW


Chinese Speech Samples:


Systems (columns): GST-TF, GST-SS, GST-MT-KD, GST-TF-KD, GST-SS-KD, Ground Truth
Naturalness Evaluation
(1) Input Text: 赵凌飞的话,反映了沈阳赛区所有奥运志愿者的共同心声。
GL
PW
(2) Input Text: 因为,我们所发出的力量必会因难度加大而减弱。
GL
PW
(3) Input Text: 我希望每个人都能够尊重我们的隐私。
GL
PW
(4) Input Text: 漫天的红霞使劲给两人增添气氛。
GL
PW
(5) Input Text: 晚上加完班开车回家,太累了,迷迷糊糊开着车,走一半的时候,"铛"一声!
GL
PW
Robustness Evaluation
(1) Input Text: 遇到颠簸时,应听从乘务员的安全指令,回座位坐好。
GL
PW
(2) Input Text: 他在后面呆惯了,怕自己一插身后的人会不满,不敢排进去。
GL
PW
(3) Input Text: 傍晚七个小人回来了,白雪公主说:你们就是我命中的七个小矮人吧。
GL
PW
(4) Input Text: 每一个“火箭官”都顺利地升上去了,冉冉若星光。
GL
PW
(5) Input Text: 一种表示商品所有权的财物证券,也称商品证券,如提货单、交货单。
GL
PW
Expressiveness Evaluation
(1) Input Text: 每天把牢骚拿出来晒晒太阳,心情就不会缺钙。
GL
PW
(2) Input Text: 昨日,这名“伤者”与医生全部被警方依法刑事拘留。
GL
PW
(3) Input Text: 钱伟长想到上海来办学校是经过深思熟虑的。
GL
PW
(4) Input Text: 她见我一进门就骂,吃饭时也骂,骂得我抬不起头。
GL
PW
(5) Input Text: 李述德在离开之前,只说了一句 “柱驼杀父亲了”。
GL
PW

References
[1] Griffin, Daniel, and Jae Lim. Signal estimation from modified short-time Fourier transform. In IEEE Transactions on Acoustics, Speech, and Signal Processing, 1984, pp. 236-243.
[2] Yamamoto, Ryuichi, Eunwoo Song, and Jae-Min Kim. Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. In ICASSP 2020, pp. 6199-6203.
[3] Shen, Jonathan, et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In ICASSP 2018, pp. 4779-4783.
[4] Bengio, Samy, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In NeurIPS 2015, pp. 1171-1179.
[5] Guo, Haohan, et al. A new GAN-based end-to-end TTS training algorithm. In Interspeech 2019, pp. 1288-1292.
[6] Liu, Rui, et al. Teacher-student training for robust Tacotron-based TTS. In ICASSP 2020, pp. 6274-6278.