TY - GEN
T1 - Length Generalization of Causal Transformers without Position Encoding
AU - Wang, Jie
AU - Ji, Tao
AU - Wu, Yuanbin
AU - Yan, Hang
AU - Gui, Tao
AU - Zhang, Qi
AU - Huang, Xuanjing
AU - Wang, Xiaoling
N1 - Publisher Copyright:
© 2024 Association for Computational Linguistics.
PY - 2024
Y1 - 2024
N2 - Generalizing to longer sentences is important for recent Transformer-based language models. Besides algorithms that manipulate explicit position features, the success of Transformers without position encoding (NoPE) provides a new way to overcome the challenge. In this paper, we study the length generalization property of NoPE. We find that although NoPE can extend to longer sequences than the commonly used explicit position encodings, it still has a limited context length. We identify a connection between the failure of NoPE's generalization and the distraction of attention distributions. We propose a parameter-efficient tuning method that searches for each attention head's best temperature hyper-parameter, which substantially expands NoPE's context size. Experiments on long-sequence language modeling, the synthetic passkey retrieval task, and real-world long-context tasks show that NoPE can achieve competitive performance with state-of-the-art length generalization algorithms. The source code is publicly accessible.
AB - Generalizing to longer sentences is important for recent Transformer-based language models. Besides algorithms that manipulate explicit position features, the success of Transformers without position encoding (NoPE) provides a new way to overcome the challenge. In this paper, we study the length generalization property of NoPE. We find that although NoPE can extend to longer sequences than the commonly used explicit position encodings, it still has a limited context length. We identify a connection between the failure of NoPE's generalization and the distraction of attention distributions. We propose a parameter-efficient tuning method that searches for each attention head's best temperature hyper-parameter, which substantially expands NoPE's context size. Experiments on long-sequence language modeling, the synthetic passkey retrieval task, and real-world long-context tasks show that NoPE can achieve competitive performance with state-of-the-art length generalization algorithms. The source code is publicly accessible.
UR - https://www.scopus.com/pages/publications/85205283608
U2 - 10.18653/v1/2024.findings-acl.834
DO - 10.18653/v1/2024.findings-acl.834
M3 - Conference contribution
AN - SCOPUS:85205283608
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 14024
EP - 14040
BT - The 62nd Annual Meeting of the Association for Computational Linguistics
A2 - Ku, Lun-Wei
A2 - Martins, Andre
A2 - Srikumar, Vivek
PB - Association for Computational Linguistics (ACL)
T2 - Findings of the 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024
Y2 - 11 August 2024 through 16 August 2024
ER -