TY - JOUR
T1 - SparseLight
T2 - Dynamic gradient-optimized softmax for efficient transformer acceleration
AU - Zhang, Kai
AU - Lan, Chaoxiang
AU - Xu, Yazhang
AU - Li, Zheyang
AU - Tan, Wenming
AU - Ren, Ye
AU - Hu, Jilin
N1 - Publisher Copyright:
© 2025 Elsevier B.V.
PY - 2026/1/30
Y1 - 2026/1/30
N2 - Benefiting from the self-attention mechanism's capability to capture long-range dependencies, Transformers have achieved outstanding performance across diverse domains such as NLP, computer vision, and speech. Moreover, transformer-based Large Language Models (LLMs) have driven significant advancements in artificial intelligence. Within the self-attention mechanism, the softmax function plays a critical role in capturing the associations between different tokens. However, hardware implementation of softmax is computationally expensive due to its exponential and division operations, especially on edge devices. Specifically, for long sequence inputs (e.g., over 8192 tokens), softmax accounts for over 23.5% of total computation time. In this paper, we propose SparseLight, a dynamic gradient-optimized sparse softmax method that can significantly accelerate the inference of Transformers. To mitigate the performance degradation caused by directly applying sparse softmax, we formulate softmax sparsity as a mathematical optimization problem and solve it via a gradient descent algorithm. Unfortunately, Transformers often exhibit significant distributional differences across channels and tokens, causing the gradient optimization to focus excessively on outliers and degrading performance. To tackle this issue, a balanced strategy is introduced to diminish the effect of outliers across channels and tokens. Theoretical analysis based on the condition number of the Fisher information matrix demonstrates the effectiveness of our approach. Extensive experiments on vision and language tasks show that SparseLight serves as an efficient drop-in replacement for standard softmax. Evaluations on GPUs demonstrate that SparseLight achieves an 18% speedup on LLaMA2-7B, highlighting its potential for real-world deployment. The code will be publicly available soon.
AB - Benefiting from the self-attention mechanism's capability to capture long-range dependencies, Transformers have achieved outstanding performance across diverse domains such as NLP, computer vision, and speech. Moreover, transformer-based Large Language Models (LLMs) have driven significant advancements in artificial intelligence. Within the self-attention mechanism, the softmax function plays a critical role in capturing the associations between different tokens. However, hardware implementation of softmax is computationally expensive due to its exponential and division operations, especially on edge devices. Specifically, for long sequence inputs (e.g., over 8192 tokens), softmax accounts for over 23.5% of total computation time. In this paper, we propose SparseLight, a dynamic gradient-optimized sparse softmax method that can significantly accelerate the inference of Transformers. To mitigate the performance degradation caused by directly applying sparse softmax, we formulate softmax sparsity as a mathematical optimization problem and solve it via a gradient descent algorithm. Unfortunately, Transformers often exhibit significant distributional differences across channels and tokens, causing the gradient optimization to focus excessively on outliers and degrading performance. To tackle this issue, a balanced strategy is introduced to diminish the effect of outliers across channels and tokens. Theoretical analysis based on the condition number of the Fisher information matrix demonstrates the effectiveness of our approach. Extensive experiments on vision and language tasks show that SparseLight serves as an efficient drop-in replacement for standard softmax. Evaluations on GPUs demonstrate that SparseLight achieves an 18% speedup on LLaMA2-7B, highlighting its potential for real-world deployment. The code will be publicly available soon.
KW - Acceleration
KW - Hardware-efficient
KW - Neural networks
KW - Softmax
KW - Sparse
KW - Transformers
UR - https://www.scopus.com/pages/publications/105024565961
U2 - 10.1016/j.knosys.2025.115050
DO - 10.1016/j.knosys.2025.115050
M3 - Article
AN - SCOPUS:105024565961
SN - 0950-7051
VL - 333
JO - Knowledge-Based Systems
JF - Knowledge-Based Systems
M1 - 115050
ER -