TY - GEN
T1 - Efficient Routing in Sparse Mixture-of-Experts
AU - Zareapoor, Masoumeh
AU - Shamsolmoali, Pourya
AU - Vesaghati, Fateme
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Sparse Mixture-of-Experts (MoE) architectures provide the distinct benefit of substantially expanding the model's parameter space without proportionally increasing the computational load on individual input tokens or samples. However, the efficacy of these models heavily depends on the routing strategy used to assign tokens to experts. Poor routing can lead to under-trained or overly specialized experts, diminishing overall model performance. Previous approaches have relied on the Top-k router, where each token is assigned to a subset of experts. In this paper, we propose a routing mechanism that replaces the Top-k router with regularized optimal transport, leveraging the Sinkhorn algorithm to optimize token-expert matching. We conducted a comprehensive evaluation comparing the pre-training efficiency of our model, using computational resources equivalent to those employed in the GShard and Switch Transformers gating mechanisms. The results demonstrate that our model expedites training convergence, achieving a speedup of over 2× compared to these baseline models. Moreover, under the same computational constraints, our model exhibits superior performance across eleven tasks from the GLUE and SuperGLUE benchmarks. We show that our model contributes to the optimization of token-expert matching in sparsely activated MoE models, offering substantial gains in both training efficiency and task performance.
KW - Deep neural networks
KW - Mixture-of-Experts
KW - Optimal transport
KW - Sinkhorn algorithm
UR - https://www.scopus.com/pages/publications/85204956294
U2 - 10.1109/IJCNN60899.2024.10650737
DO - 10.1109/IJCNN60899.2024.10650737
M3 - Conference contribution
AN - SCOPUS:85204956294
T3 - Proceedings of the International Joint Conference on Neural Networks
BT - 2024 International Joint Conference on Neural Networks, IJCNN 2024 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2024 International Joint Conference on Neural Networks, IJCNN 2024
Y2 - 30 June 2024 through 5 July 2024
ER -
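
The abstract above describes routing tokens to experts by solving an entropy-regularized optimal-transport problem with the Sinkhorn algorithm instead of a Top-k gate. The following is a minimal illustrative sketch of that idea on a token-expert score matrix; it is not the authors' implementation, and the function name sinkhorn, the uniform token/expert marginals, and the NumPy-based setup are assumptions made for illustration.

# Minimal sketch (assumed, not the paper's code): entropy-regularized optimal
# transport solved with Sinkhorn iterations to match tokens to experts.
import numpy as np

def sinkhorn(scores, epsilon=0.05, n_iters=50):
    # scores: (num_tokens, num_experts) raw router logits.
    # epsilon: entropic regularization; smaller values give harder assignments.
    # Subtract the max before exponentiating for numerical stability.
    K = np.exp((scores - scores.max()) / epsilon)         # OT kernel, (T, E)
    a = np.full(scores.shape[0], 1.0 / scores.shape[0])   # token marginal (uniform)
    b = np.full(scores.shape[1], 1.0 / scores.shape[1])   # expert marginal (balanced load)
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iters):                              # alternating row/column scaling
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]                    # transport plan, (T, E)

# Usage: the plan's columns sum to the uniform expert marginal, so each expert
# receives a balanced share of mass; a hard assignment is the argmax per token.
logits = np.random.default_rng(0).normal(size=(8, 4))     # 8 tokens, 4 experts
plan = sinkhorn(logits)
print(plan.sum(axis=0))       # ~[0.25, 0.25, 0.25, 0.25]
print(plan.argmax(axis=1))    # top-1 expert per token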