Table 3. Objective evaluation results of one-shot multi-speaker TTS models

[TTS + speaker encoder] model	English				Korean
	Seen		Unseen		Seen		Unseen
	P-MOS	SECS	P-MOS	SECS	P-MOS	SECS	P-MOS	SECS
Ground truth	3.53±0.06	0.98	3.80±0.02	0.98	4.09±0.04	0.99	3.85±0.03	0.99
FastSpeech2+Speaker ID	2.97±0.05	0.85	ND	ND	3.45±0.05	0.93	ND	ND
FastSpeech2+GE2ESV	2.94±0.05	0.88	2.44±0.02	0.89	3.57±0.05	0.94	3.43±0.03	0.85
FastSpeech2+ResNet34SE	2.75±0.05	0.78	2.29±0.02	0.80	3.52±0.05	0.93	3.35±0.02	0.83
(Proposed) FastSpeech2+RawNet3	3.00±0.05	0.90	2.54±0.03	0.89	3.58±0.05	0.94	3.74±0.02	0.87

TTS, text-to-speech; P-MOS, predicted MOS (mean opinion score); SECS, speaker embedding cosine similarity; GE2E, generalized end-to-end; ND, not detected.
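
For readers unfamiliar with the SECS metric, the following is a minimal sketch of how a speaker embedding cosine similarity can be computed between a reference utterance and a synthesized one. It assumes the two embeddings have already been extracted by some speaker verification model; the table does not state which model was used for scoring, so the function name `secs` and the toy vectors below are hypothetical and purely illustrative.

```python
import math


def secs(ref_embedding, syn_embedding):
    """Cosine similarity between two speaker embeddings (hypothetical helper)."""
    if len(ref_embedding) != len(syn_embedding):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(ref_embedding, syn_embedding))
    norm_ref = math.sqrt(sum(a * a for a in ref_embedding))
    norm_syn = math.sqrt(sum(b * b for b in syn_embedding))
    if norm_ref == 0.0 or norm_syn == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_ref * norm_syn)


# Toy 3-dimensional vectors (assumption: real speaker embeddings are far larger).
reference = [1.0, 2.0, 3.0]
synthesized = [2.0, 4.0, 6.0]  # same direction as the reference
print(secs(reference, synthesized))
```

Cosine similarity lies in [-1, 1], and values closer to 1 indicate that the synthesized speech is closer to the reference speaker in the embedding space.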