TY - GEN
T1 - AutoTrans: Automating Transformer Design via Reinforced Architecture Search
T2 - 10th CCF Conference on Natural Language Processing and Chinese Computing, NLPCC 2021
AU - Zhu, Wei
AU - Wang, Xiaoling
AU - Ni, Yuan
AU - Xie, Guotong
N1 - Publisher Copyright:
© 2021, Springer Nature Switzerland AG.
PY - 2021
Y1 - 2021
N2 - Though transformer architectures have shown dominance in many natural language understanding tasks, there are still unsolved issues in the training of transformer models, especially the need for a principled warm-up schedule, which has proven important for stable training, and the question of whether the task at hand prefers a scaled attention product or not. In this paper, we empirically explore automating the design choices in the transformer model, i.e., how to set layer-norm, whether to scale, the number of layers, the number of heads, the activation function, etc., so that one can obtain a transformer architecture that better suits the task at hand. Reinforcement learning is employed to navigate the search space, and special parameter-sharing strategies are designed to accelerate the search. It is shown that sampling a proportion of the training data per epoch during search helps to improve search quality. Experiments on CoNLL03, Multi-30k, and WMT-14 show that the searched transformer models can outperform the standard transformer. In particular, we show that our learned model can be trained more robustly with large learning rates without warm-up.
AB - Though transformer architectures have shown dominance in many natural language understanding tasks, there are still unsolved issues in the training of transformer models, especially the need for a principled warm-up schedule, which has proven important for stable training, and the question of whether the task at hand prefers a scaled attention product or not. In this paper, we empirically explore automating the design choices in the transformer model, i.e., how to set layer-norm, whether to scale, the number of layers, the number of heads, the activation function, etc., so that one can obtain a transformer architecture that better suits the task at hand. Reinforcement learning is employed to navigate the search space, and special parameter-sharing strategies are designed to accelerate the search. It is shown that sampling a proportion of the training data per epoch during search helps to improve search quality. Experiments on CoNLL03, Multi-30k, and WMT-14 show that the searched transformer models can outperform the standard transformer. In particular, we show that our learned model can be trained more robustly with large learning rates without warm-up.
KW - Neural architecture search
KW - Reinforcement learning
KW - Transformer network
UR - https://www.scopus.com/pages/publications/85118123443
U2 - 10.1007/978-3-030-88480-2_14
DO - 10.1007/978-3-030-88480-2_14
M3 - Conference contribution
AN - SCOPUS:85118123443
SN - 9783030884796
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 169
EP - 182
BT - Natural Language Processing and Chinese Computing - 10th CCF International Conference, NLPCC 2021, Proceedings
A2 - Wang, Lu
A2 - Feng, Yansong
A2 - Hong, Yu
A2 - He, Ruifang
PB - Springer Science and Business Media Deutschland GmbH
Y2 - 13 October 2021 through 17 October 2021
ER -