Efficient Routing in Sparse Mixture-of-Experts

Masoumeh Zareapoor, Pourya Shamsolmoali, Fateme Vesaghati

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

Abstract

Sparse Mixture-of-Experts (MoE) architectures substantially expand a model's parameter space without proportionally increasing the computation spent on each input token or sample. However, the efficacy of these models depends heavily on the routing strategy used to assign tokens to experts. Poor routing can lead to under-trained or overly specialized experts, diminishing overall model performance. Previous approaches have relied on the Top-k router, where each token is assigned to a subset of experts. In this paper, we propose a routing mechanism that replaces the Top-k router with regularized optimal transport, leveraging the Sinkhorn algorithm to optimize token-expert matching. We conducted a comprehensive evaluation of the pre-training efficiency of our model against the GShard and Switch Transformers gating mechanisms, using equivalent computational resources. The results demonstrate that our model expedites training convergence, achieving a speedup of over 2× compared to these baseline models. Moreover, under the same computational constraints, our model exhibits superior performance across eleven tasks from the GLUE and SuperGLUE benchmarks. We show that our model improves token-expert matching in sparsely-activated MoE models, offering substantial gains in both training efficiency and task performance.
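The routing idea described in the abstract can be illustrated with a minimal sketch of entropy-regularized optimal transport solved by Sinkhorn iterations. This is not the authors' implementation: the function name `sinkhorn_route`, the uniform token and expert marginals, the regularization strength `epsilon`, and the iteration count are all assumptions chosen for illustration.

```python
import numpy as np

def sinkhorn_route(logits, epsilon=1.0, n_iters=50):
    """Hypothetical helper: soft token-to-expert plan via Sinkhorn iterations.

    logits: (n_tokens, n_experts) router scores (higher = better match).
    Returns a non-negative plan whose columns each carry equal total mass.
    """
    n_tokens, n_experts = logits.shape
    # Gibbs kernel of the scores; subtracting the max keeps exp() stable.
    K = np.exp((logits - logits.max()) / epsilon)
    # Assumed target marginals: uniform over tokens and over experts.
    a = np.full(n_tokens, 1.0 / n_tokens)
    b = np.full(n_experts, 1.0 / n_experts)
    u = np.ones(n_tokens)
    v = np.ones(n_experts)
    for _ in range(n_iters):
        u = a / (K @ v)      # rescale rows toward the token marginal
        v = b / (K.T @ u)    # rescale columns toward the expert marginal
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
logits = rng.normal(size=(64, 4))
logits[:, 0] += 2.0          # skew scores so greedy routing favors expert 0
plan = sinkhorn_route(logits)
expert_load = plan.sum(axis=0)   # mass routed to each expert
```

Because the last update in each iteration rescales the columns, every expert receives the same share of the total mass in `plan`, even though the raw scores favor expert 0. How such a soft plan is turned into hard token assignments is a separate design choice that this sketch leaves open.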

Original language: English
Title of host publication: 2024 International Joint Conference on Neural Networks, IJCNN 2024 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798350359312
State: Published - 2024
Externally published: Yes
Event: 2024 International Joint Conference on Neural Networks, IJCNN 2024 - Yokohama, Japan
Duration: 30 Jun 2024 to 5 Jul 2024

Publication series

Name: Proceedings of the International Joint Conference on Neural Networks

Conference

Conference: 2024 International Joint Conference on Neural Networks, IJCNN 2024
Country/Territory: Japan
City: Yokohama
Period: 30/06/24 to 5/07/24

Keywords

  • Deep neural networks
  • Mixture-of-Experts
  • Optimal transport
  • Sinkhorn algorithm
