Abstract:
Emotional text-to-speech synthesis (ETTS) has seen much progress in recent years. However, the generated voice is often not perceptually identifiable by its intended emotion category. To address this problem, we propose a new interactive training paradigm for ETTS, denoted as i-ETTS, which seeks to directly improve emotion discriminability by interacting with a speech emotion recognition (SER) model. Moreover, we formulate an iterative training strategy with reinforcement learning to ensure the quality of i-ETTS optimization. Experimental results demonstrate that the proposed i-ETTS outperforms state-of-the-art baselines by rendering speech with more accurate emotion style. To the best of our knowledge, this is the first study of reinforcement learning in emotional text-to-speech synthesis.

Fig. 1: Block diagram of the proposed interactive paradigm for emotional text-to-speech synthesis with reinforcement learning.
We implement two baseline systems together with the proposed i-ETTS, as summarized below:
1. MTL-ETTS [1]: an emotional TTS model that jointly trains an auxiliary SER task with the TTS model;
2. CET-ETTS [2]: an emotional TTS model that uses two reference encoders with an SER module and a perceptual loss to enhance emotion discriminability;
3. i-ETTS: the proposed model, which optimizes the ETTS model with a reward function correlated with SER accuracy (see the sketch after this list).
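To make the reward-based optimization concrete, below is a minimal REINFORCE-style sketch of one training step, not the authors' actual implementation. It assumes a hypothetical ETTS interface (`etts_model.sample`) that returns a mel-spectrogram together with the log-probabilities of the sampled output, and a pre-trained SER classifier (`ser_model`) that returns emotion logits; the reward is the SER posterior of the target emotion, which correlates with SER accuracy as described above.

```python
# Illustrative sketch only: `etts_model`, `ser_model`, and their interfaces
# are hypothetical placeholders, not the paper's actual implementation.
import torch

def policy_gradient_step(etts_model, ser_model, optimizer, text, target_emotion):
    """One REINFORCE-style update: reward the ETTS model in proportion to
    how confidently an SER classifier recognizes the intended emotion."""
    # Synthesize a mel-spectrogram and keep the per-step log-probabilities
    # of the sampled output (assumed to be exposed by this hypothetical API).
    mel, log_probs = etts_model.sample(text, target_emotion)

    # Reward: SER posterior of the target emotion on the synthesized speech.
    # The SER model acts only as a fixed scorer, so no gradients flow into it.
    with torch.no_grad():
        emotion_posteriors = ser_model(mel).softmax(dim=-1)
        reward = emotion_posteriors[:, target_emotion]

    # REINFORCE objective: maximize expected reward by minimizing
    # -reward * (sum of log-probabilities of the sampled output).
    loss = -(reward * log_probs.sum(dim=-1)).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward.mean().item()
```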
Synthesized audio samples:
| Emotion | Sample (Speaker) | MTL-ETTS | CET-ETTS | i-ETTS |
|---|---|---|---|---|
| Neutral | 01 (Speaker: 0020) | (audio) | (audio) | (audio) |
| Neutral | 02 (Speaker: 0013) | (audio) | (audio) | (audio) |
| Neutral | 03 (Speaker: 0012) | (audio) | (audio) | (audio) |
| Happy | 01 (Speaker: 0018) | (audio) | (audio) | (audio) |
| Happy | 02 (Speaker: 0016) | (audio) | (audio) | (audio) |
| Happy | 03 (Speaker: 0012) | (audio) | (audio) | (audio) |
| Sad | 01 (Speaker: 0019) | (audio) | (audio) | (audio) |
| Sad | 02 (Speaker: 0018) | (audio) | (audio) | (audio) |
| Sad | 03 (Speaker: 0014) | (audio) | (audio) | (audio) |
| Surprise | 01 (Speaker: 0018) | (audio) | (audio) | (audio) |
| Surprise | 02 (Speaker: 0017) | (audio) | (audio) | (audio) |
| Surprise | 03 (Speaker: 0014) | (audio) | (audio) | (audio) |
| Angry | 01 (Speaker: 0020) | (audio) | (audio) | (audio) |
| Angry | 02 (Speaker: 0019) | (audio) | (audio) | (audio) |
| Angry | 03 (Speaker: 0017) | (audio) | (audio) | (audio) |
Audio samples from the training dataset:
| Speaker | Neutral | Happy | Sad | Surprise | Angry |
|---|---|---|---|---|---|
| 0011 [Male] | (audio) | (audio) | (audio) | (audio) | (audio) |
| 0012 [Male] | (audio) | (audio) | (audio) | (audio) | (audio) |
| 0013 [Male] | (audio) | (audio) | (audio) | (audio) | (audio) |
| 0014 [Male] | (audio) | (audio) | (audio) | (audio) | (audio) |
| 0015 [Female] | (audio) | (audio) | (audio) | (audio) | (audio) |
| 0016 [Female] | (audio) | (audio) | (audio) | (audio) | (audio) |
| 0017 [Female] | (audio) | (audio) | (audio) | (audio) | (audio) |
| 0018 [Female] | (audio) | (audio) | (audio) | (audio) | (audio) |
| 0019 [Female] | (audio) | (audio) | (audio) | (audio) | (audio) |
| 0020 [Male] | (audio) | (audio) | (audio) | (audio) | (audio) |
References
[1] Cai, Xiong, et al. "Emotion controllable speech synthesis using emotion-unlabeled dataset with the assistance of cross-domain speech emotion recognition." To appear at ICASSP 2021.
[2] Li, Tao, et al. "Controllable emotion transfer for end-to-end speech synthesis." 12th International Symposium on Chinese Spoken Language Processing (ISCSLP), IEEE, pp. 1-5, 2021.