Hyperparameters | Values | | | |
---|---|---|---|---|
output size | | | | |
input layer | | | | |
normalized before | false | | | |
attention heads | 1 | 4 | 8 | |
linear units | 512 | 1,024 | 4,096 | |
num blocks | 2 | 4 | 6 | 8 |
dropout rate | 0.0 | 0.2 | 0.3 | 0.4 |
positional dropout rate | | | | |
attention dropout rate | 0.1 | 0.2 | 0.3 | 0.4 |
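
The candidate values in the table can be written down as a search space. The sketch below is illustrative only: it assumes the values are combined as a full grid, which the table itself does not state, and the snake_case key names are hypothetical. Rows whose values are blank in the table (output size, input layer, positional dropout rate) are left out rather than guessed.

```python
from itertools import product

# Candidate values copied from the table. Key names are hypothetical.
# Rows with no values in the table are omitted, not filled in.
search_space = {
    "normalized_before": [False],
    "attention_heads": [1, 4, 8],
    "linear_units": [512, 1024, 4096],
    "num_blocks": [2, 4, 6, 8],
    "dropout_rate": [0.0, 0.2, 0.3, 0.4],
    "attention_dropout_rate": [0.1, 0.2, 0.3, 0.4],
}


def iter_configs(space):
    """Yield one dict per combination, assuming a full grid over all values."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


configs = list(iter_configs(search_space))
print(len(configs))  # 576 = 1 * 3 * 3 * 4 * 4 * 4
```

If the values were instead sampled or tuned one hyperparameter at a time, the same dictionary could feed a random or coordinate-wise search in place of `product`.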