Table 1. Detailed hyper-parameters of the end-to-end text-to-speech (TTS) system

Spectral analysis: pre-emphasis 0.97; frame length 100 ms; overlap length 25 ms; window type Hann
Number of characters used: 80
Character embedding: 128 dimensions
Encoder CBHG:
    Conv1D bank: K=5, conv-k-64-ReLU
    Max pooling: stride=1, width=2
    Conv1D projections: conv-3-128-ReLU → conv-3-128-linear
    Highway network: 2 layers of FC-128-ReLU
    Bidirectional GRU: 128 cells
Encoder pre-net: FC-128-ReLU → Dropout(0.5) → FC-128-ReLU → Dropout(0.5)
Decoder pre-net: FC-128-ReLU → Dropout(0.5) → FC-128-ReLU → Dropout(0.5)
Decoder RNN: 2-layer residual GRU (256 cells)
Attention RNN: 1-layer GRU (256 cells)
Reduction factor (r): 4
Post-processing highway network: 2 layers of FC-256-ReLU
Silence threshold for trimming in preprocessing: 6 dB or below
Silence threshold for trimming synthesized speech: -40 dB or below
CBHG, 1-D convolution bank + highway network + bidirectional gated recurrent unit; Conv1D, 1-D convolution; FC, fully-connected; conv-k-c-ReLU, 1-D convolution with filter width k and c output channels, using ReLU (rectified linear unit) as the nonlinearity; GRU, gated recurrent unit; RNN, recurrent neural network; TTS, text-to-speech
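
As an illustration only, a few of the settings in Table 1 can be collected into a small configuration object, together with three helper computations that follow from them: converting the frame and overlap lengths to samples, applying the pre-emphasis filter y[n] = x[n] - 0.97 x[n-1], and counting decoder steps when r frames are emitted per step. This is a minimal sketch, not the implementation used for the system described here. The names `TTSConfig`, `ms_to_samples`, `preemphasize`, and `decoder_steps` are hypothetical, and the 16 kHz sampling rate is an assumption, since Table 1 does not state one.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TTSConfig:
    # Values copied from Table 1.
    preemphasis: float = 0.97
    frame_length_ms: float = 100.0
    overlap_ms: float = 25.0
    window: str = "hann"
    num_chars: int = 80
    embedding_dim: int = 128
    conv_bank_k: int = 5
    reduction_factor: int = 4
    # ASSUMPTION: Table 1 gives no sampling rate; 16 kHz is a placeholder.
    sample_rate_hz: int = 16000


def ms_to_samples(duration_ms, sample_rate_hz):
    """Convert a duration in milliseconds to a whole number of samples."""
    return int(round(duration_ms * sample_rate_hz / 1000.0))


def preemphasize(signal, coeff):
    """First-order pre-emphasis: y[n] = x[n] - coeff * x[n-1], y[0] = x[0]."""
    if not signal:
        return []
    out = [signal[0]]
    for prev, cur in zip(signal, signal[1:]):
        out.append(cur - coeff * prev)
    return out


def decoder_steps(n_frames, reduction_factor):
    """Decoder steps needed when each step emits `reduction_factor` frames."""
    return math.ceil(n_frames / reduction_factor)


cfg = TTSConfig()
frame_samples = ms_to_samples(cfg.frame_length_ms, cfg.sample_rate_hz)
overlap_samples = ms_to_samples(cfg.overlap_ms, cfg.sample_rate_hz)
steps_for_10_frames = decoder_steps(10, cfg.reduction_factor)
```

Under the assumed 16 kHz rate, the 100 ms frame corresponds to 1600 samples and the 25 ms overlap to 400 samples; with r = 4, ten spectrogram frames require three decoder steps.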