SparseLight: Dynamic gradient-optimized softmax for efficient transformer acceleration

Kai Zhang*, Chaoxiang Lan, Yazhang Xu, Zheyang Li, Wenming Tan, Ye Ren, Jilin Hu

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Benefiting from the self-attention mechanism's capability to capture long-range dependencies, Transformers have achieved outstanding performance across domains such as NLP, computer vision, and speech. Moreover, transformer-based Large Language Models (LLMs) have driven significant advancements in artificial intelligence. Within the self-attention mechanism, the softmax function plays a critical role in capturing the associations between tokens. However, hardware implementation of softmax is computationally expensive due to its exponential and division operations, especially on edge devices. Specifically, for long sequence inputs (e.g., over 8192 tokens), softmax accounts for over 23.5% of total computation time. In this paper, we propose SparseLight, a dynamic gradient-optimized sparse softmax method that can significantly accelerate Transformer inference. To mitigate the performance degradation caused by applying sparse softmax directly, we formulate softmax sparsity as a mathematical optimization problem and solve it via a gradient descent algorithm. However, Transformers often exhibit significant distributional differences across channels and tokens, causing gradient optimization to focus excessively on outliers and degrading performance. To tackle this issue, we introduce a balanced strategy that diminishes the effect of outliers across channels and tokens. Theoretical analysis based on the condition number of the Fisher information matrix demonstrates the effectiveness of our approach. Extensive experiments on vision and language tasks show that SparseLight serves as an efficient drop-in replacement for standard softmax. Evaluations on GPUs demonstrate that SparseLight achieves an 18% speedup on LLaMA2-7B, highlighting its potential for real-world deployment. The code will be publicly available soon.
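To illustrate the general idea of sparsifying softmax, the sketch below implements a simple top-k sparse softmax in NumPy: only the k largest logits per row receive probability mass, and the result is renormalized. This is a generic baseline for intuition only; SparseLight's dynamic, gradient-optimized sparsity and its channel/token balancing strategy are defined in the paper itself and are not reproduced here.

```python
import numpy as np

def sparse_softmax(scores, k):
    """Top-k sparse softmax: keep the k largest logits per row, zero the
    rest, and renormalize. Illustrative baseline only -- not SparseLight's
    gradient-optimized method."""
    scores = np.asarray(scores, dtype=np.float64)
    # Indices of the top-k logits in each row.
    topk_idx = np.argpartition(scores, -k, axis=-1)[..., -k:]
    mask = np.zeros(scores.shape, dtype=bool)
    np.put_along_axis(mask, topk_idx, True, axis=-1)
    # Numerically stable softmax restricted to the kept entries.
    masked = np.where(mask, scores, -np.inf)
    masked -= masked.max(axis=-1, keepdims=True)
    exp = np.exp(masked)
    return exp / exp.sum(axis=-1, keepdims=True)

# Only the two largest logits get nonzero probability.
probs = sparse_softmax([[2.0, 1.0, 0.1, -3.0]], k=2)
```

Because the dropped entries are exactly zero, the exponentials and the normalizing division need only run over the k retained positions, which is the source of the hardware savings the paper targets.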

Original language: English
Article number: 115050
Journal: Knowledge-Based Systems
Volume: 333
DOIs
State: Published - 30 Jan 2026

Keywords

  • Acceleration
  • Hardware-efficient
  • Neural networks
  • Softmax
  • Sparse
  • Transformers

