TY - GEN
T1 - LIFBENCH
T2 - 63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025
AU - Wu, Xiaodong
AU - Wang, Minhao
AU - Liu, Yichen
AU - Shi, Xiaoming
AU - Yan, He
AU - Lu, Xiangju
AU - Zhu, Junmin
AU - Zhang, Wei
N1 - Publisher Copyright:
© 2025 Association for Computational Linguistics.
PY - 2025
Y1 - 2025
N2 - As Large Language Models (LLMs) evolve in natural language processing (NLP), their ability to stably follow instructions in long-context inputs has become critical for real-world applications. However, existing benchmarks seldom focus on instruction-following in long-context scenarios or stability on different inputs. To bridge this gap, we introduce LIFBENCH, a scalable dataset designed to evaluate LLMs' instruction-following capabilities and stability across long contexts. LIFBENCH comprises three long-context scenarios and eleven diverse tasks, featuring 2,766 instructions generated through an automated expansion method across three dimensions: length, expression, and variables. For evaluation, we propose LIFEVAL, a rubric-based assessment method that enables precise, automated scoring of complex LLM responses without reliance on LLM-assisted assessments or human judgment. This method allows for a comprehensive analysis of model performance and stability from multiple perspectives. We conduct detailed experiments on 20 prominent LLMs across six length intervals. Our work contributes LIFBENCH and LIFEVAL as robust tools for assessing LLM performance in complex and long-context settings, offering valuable insights to guide future advancements in LLM development.
AB - As Large Language Models (LLMs) evolve in natural language processing (NLP), their ability to stably follow instructions in long-context inputs has become critical for real-world applications. However, existing benchmarks seldom focus on instruction-following in long-context scenarios or stability on different inputs. To bridge this gap, we introduce LIFBENCH, a scalable dataset designed to evaluate LLMs' instruction-following capabilities and stability across long contexts. LIFBENCH comprises three long-context scenarios and eleven diverse tasks, featuring 2,766 instructions generated through an automated expansion method across three dimensions: length, expression, and variables. For evaluation, we propose LIFEVAL, a rubric-based assessment method that enables precise, automated scoring of complex LLM responses without reliance on LLM-assisted assessments or human judgment. This method allows for a comprehensive analysis of model performance and stability from multiple perspectives. We conduct detailed experiments on 20 prominent LLMs across six length intervals. Our work contributes LIFBENCH and LIFEVAL as robust tools for assessing LLM performance in complex and long-context settings, offering valuable insights to guide future advancements in LLM development.
UR - https://www.scopus.com/pages/publications/105021049148
M3 - Conference contribution
AN - SCOPUS:105021049148
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 16445
EP - 16468
BT - Long Papers
A2 - Che, Wanxiang
A2 - Nabende, Joyce
A2 - Shutova, Ekaterina
A2 - Pilehvar, Mohammad Taher
PB - Association for Computational Linguistics (ACL)
Y2 - 27 July 2025 through 1 August 2025
ER -