Table 2. Subjective evaluation results of Korean one-shot multi-speaker TTS models

[TTS + speaker encoder] model   Seen NMOS  Seen SMOS  Unseen NMOS  Unseen SMOS
Ground truth                    4.34±0.06  4.72±0.02  4.32±0.04    4.72±0.02
FastSpeech2+Speaker ID          3.33±0.08  3.58±0.07  ND           ND
FastSpeech2+GE2ESV              3.48±0.06  3.67±0.07  3.12±0.05    2.83±0.07
FastSpeech2+ResNet34SE          3.28±0.08  3.43±0.13  3.23±0.05    3.03±0.07
(Proposed) FastSpeech2+RawNet3  3.66±0.06  3.97±0.04  3.36±0.04    3.16±0.04
TTS, text-to-speech; NMOS, naturalness mean opinion score; SMOS, similarity MOS; GE2E, generalized end-to-end; ND, not detected.