Table 2. Subjective evaluation results of Korean one-shot multi-speaker TTS

[TTS + speaker encoder] model	Seen		Unseen
[TTS + speaker encoder] model	NMOS	SMOS	NMOS	SMOS
Ground truth	4.34±0.06	4.72±0.02	4.32±0.04	4.72±0.02
FastSpeech2+Speaker ID	3.33±0.08	3.58±0.07	ND	ND
FastSpeech2+GE2ESV	3.48±0.06	3.67±0.07	3.12±0.05	2.83±0.07
FastSpeech2+ResNet34SE	3.28±0.08	3.43±0.13	3.23±0.05	3.03±0.07
(Proposed) FastSpeech2+RawNet3	3.66±0.06	3.97±0.04	3.36±0.04	3.16±0.04

TTS, text-to-speech; NMOS, naturalness mean opinion score; SMOS, similarity MOS; GE2E, generalized end-to-end; ND, not detected.