Table 3. Objective evaluation results of one-shot multi-speaker TTS models

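| [TTS + speaker encoder] model | English, seen: P-MOS | English, seen: SECS | English, unseen: P-MOS | English, unseen: SECS | Korean, seen: P-MOS | Korean, seen: SECS | Korean, unseen: P-MOS | Korean, unseen: SECS |
|---|---|---|---|---|---|---|---|---|
| Ground truth | 3.53±0.06 | 0.98 | 3.80±0.02 | 0.98 | 4.09±0.04 | 0.99 | 3.85±0.03 | 0.99 |
| FastSpeech2+Speaker ID | 2.97±0.05 | 0.85 | ND | ND | 3.45±0.05 | 0.93 | ND | ND |
| FastSpeech2+GE2ESV | 2.94±0.05 | 0.88 | 2.44±0.02 | 0.89 | 3.57±0.05 | 0.94 | 3.43±0.03 | 0.85 |
| FastSpeech2+ResNet34SE | 2.75±0.05 | 0.78 | 2.29±0.02 | 0.80 | 3.52±0.05 | 0.93 | 3.35±0.02 | 0.83 |
| (Proposed) FastSpeech2+RawNet3 | 3.00±0.05 | 0.90 | 2.54±0.03 | 0.89 | 3.58±0.05 | 0.94 | 3.74±0.02 | 0.87 |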
TTS, text-to-speech; P-MOS, predicted mean opinion score (MOS); SECS, speaker embedding cosine similarity; GE2E, generalized end-to-end; ND, not detected.
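
The SECS columns report a cosine similarity between speaker embeddings. The snippet below is a minimal sketch of that computation, assuming the score compares an embedding of a reference utterance with an embedding of the synthesized utterance. The function name `cosine_similarity` and the 4-dimensional vectors are hypothetical and chosen only for illustration; the encoder used to extract embeddings for scoring, and any averaging over utterances, are not specified here.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Made-up embeddings (assumption): real speaker embeddings have
# hundreds of dimensions and come from a speaker encoder.
ref_embedding = [1.0, 0.0, 2.0, 0.0]  # reference utterance
syn_embedding = [2.0, 0.0, 4.0, 0.0]  # synthesized utterance

secs = cosine_similarity(ref_embedding, syn_embedding)
print(f"SECS = {secs:.2f}")
```

Because cosine similarity ignores vector magnitude, two embeddings pointing in the same direction score close to 1 and orthogonal embeddings score 0, so higher SECS values indicate that the synthesized voice is closer to the target speaker.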