Hyperparameters | Values / WER | |||||
---|---|---|---|---|---|---|
MODEL | init | chainer | xavier uniform | xavier normal | kaiming uniform | kaiming normal |
42.0/35.1 | 17.3/14.0 | 17.7/13.1 | 17.6/13.4 | |||
warmup steps | 10,000 | 20,000 | 30,000 | 40,000 | ||
16.0/12.8 | 17.3/12.7 | 17.3/13.6 | ||||
keep nbest model | 5 | 10 | 15 | 20 | ||
17.3/12.7 | 16.9/13.0 | 16.9/13.4 | ||||
ctc weight | 0.1 | 0.2 | 0.3 | 0.4 | ||
17.3/13.6 | 17.0/13.1 | 16.5/13.0 | ||||
lsm weight | 0.1 | 0.2 | 0.3 | 0.4 | ||
17.3/13.2 | 17.5/12.9 | 18.0/13.3 | ||||
length normalized loss | true | false | ||||
17.7/14.0 | ||||||
ENCODER | attention heads | 1 | 2 | 4 | 8 | |
17.6/13.3 | 17.3/12.7 | 17.6/13.4 | ||||
linear units | 512 | 1,024 | 2,048 | 4,096 | ||
18.9/14.8 | 18.0/14.0 | 17.3/12.7 | ||||
num blocks | 2 | 4 | 6 | 8 | 12 | |
24.5/19.7 | 20.2/16.0 | 18.5/15.0 | 17.3/13.5 | |||
dropout rate | 0.0 | 0.1 | 0.2 | 0.3 | 0.4 | |
17.4/14.4 | 17.7/13.4 | 17.8/13.7 | 20.4/15.7 | |||
attention dropout rate | 0.0 | 0.1 | 0.2 | 0.3 | 0.4 | |
17.3/12.7 | 16.5/13.0 | 15.6/12.8 | 15.9/12.7 | |||
normalized before | true | false | ||||
14.1 | ||||||
DECODER | attention heads | 1 | 2 | 4 | 8 | |
17.3/13.0 | 17.1/12.9 | 17.6/12.8 | ||||
linear units | 512 | 1,024 | 2,048 | 4,096 | ||
17.7/13.5 | 17.5/13.4 | 17.2/12.7 | ||||
num blocks | 2 | 4 | 6 | 8 | 12 | |
19.7/16.0 | 17.2/13.6 | 17.3/12.7 | 16.3/12.5 | |||
dropout rate | 0.0 | 0.1 | 0.2 | 0.3 | 0.4 | |
16.5/13.3 | 16.9/13.8 | 16.3/13.6 | 17.5/13.5 | |||
self attention dropout rate | 0.0 | 0.1 | 0.2 | 0.3 | 0.4 | |
16.5/13.8 | 17.0/13.9 | 16.6/13.8 | 16.5/13.5 |
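
The value lists in the table can be written down as a sweep specification. The following is a minimal sketch, not the authors' tooling: it assumes each hyperparameter is varied on its own while the rest stay at a baseline. The key names simply mirror the table's row labels, and `placeholder_baseline` and `one_at_a_time` are hypothetical helpers. The table does not say which value is the baseline, so the sketch uses the first listed value purely as a placeholder.

```python
from copy import deepcopy

# Swept values transcribed from the table above. Key names mirror the table's
# row labels; they are not claimed to match any toolkit's config schema.
SWEEP = {
    "model": {
        "init": ["chainer", "xavier_uniform", "xavier_normal",
                 "kaiming_uniform", "kaiming_normal"],
        "warmup_steps": [10_000, 20_000, 30_000, 40_000],
        "keep_nbest_model": [5, 10, 15, 20],
        "ctc_weight": [0.1, 0.2, 0.3, 0.4],
        "lsm_weight": [0.1, 0.2, 0.3, 0.4],
        "length_normalized_loss": [True, False],
    },
    "encoder": {
        "attention_heads": [1, 2, 4, 8],
        "linear_units": [512, 1024, 2048, 4096],
        "num_blocks": [2, 4, 6, 8, 12],
        "dropout_rate": [0.0, 0.1, 0.2, 0.3, 0.4],
        "attention_dropout_rate": [0.0, 0.1, 0.2, 0.3, 0.4],
        "normalized_before": [True, False],
    },
    "decoder": {
        "attention_heads": [1, 2, 4, 8],
        "linear_units": [512, 1024, 2048, 4096],
        "num_blocks": [2, 4, 6, 8, 12],
        "dropout_rate": [0.0, 0.1, 0.2, 0.3, 0.4],
        "self_attention_dropout_rate": [0.0, 0.1, 0.2, 0.3, 0.4],
    },
}


def placeholder_baseline(sweep):
    """Take the first listed value of every hyperparameter.

    This is a placeholder only: the table does not identify the baseline.
    """
    return {
        group: {name: values[0] for name, values in params.items()}
        for group, params in sweep.items()
    }


def one_at_a_time(sweep, baseline):
    """Yield (group, name, value, config) with exactly one setting changed."""
    for group, params in sweep.items():
        for name, values in params.items():
            for value in values:
                if value == baseline[group][name]:
                    continue  # the baseline itself is not a variant
                config = deepcopy(baseline)
                config[group][name] = value
                yield group, name, value, config


if __name__ == "__main__":
    baseline = placeholder_baseline(SWEEP)
    for group, name, value, _ in one_at_a_time(SWEEP, baseline):
        print(f"{group}.{name} = {value}")
```

Each yielded `config` is a full nested dictionary, so it can be serialized or handed to a training script directly. Replacing `placeholder_baseline` with the real baseline settings is the only change needed to enumerate the actual variants.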