TY - JOUR
T1 - OSATG-GPT
T2 - Instruction-tuning large language models with open-source atomic tasks in GitHub
AU - Han, Fanyu
AU - Ma, Li
AU - Bi, Fenglin
AU - Wang, Yantong
AU - You, Mingdong
AU - Wang, Wei
AU - Peng, Jiaheng
AU - Xia, Xiaoya
N1 - Publisher Copyright:
© 2025 Elsevier Ltd. All rights are reserved, including those for text and data mining, AI training, and similar technologies.
PY - 2025
Y1 - 2025
AB - Across numerous application scenarios in Natural Language Processing (NLP), Large Language Models (LLMs) have demonstrated exceptional capabilities in text comprehension and generation, and they show significant potential across interdisciplinary fields. However, their effectiveness is constrained by the unique characteristics of the open-source ecosystem, so developing an LLM that generalizes across datasets and tasks, tailored specifically to that ecosystem, is an urgent research need. To address this challenge, this paper introduces open-source atomic tasks, defined as the intermediate tasks essential for solving complex objectives. These tasks are designed through strategies such as simplification, reversal, decomposition, and composition, enabling models to gradually acquire domain knowledge and understand task interdependencies. By integrating public resources with open-source atomic tasks, we construct OSE-Instruct, an instruction dataset for the open-source ecosystem. We first unify open-source atomic tasks within an instruction-tuning paradigm that reflects real-world developer behavior, and then develop OSATG-GPT at various parameter scales by fine-tuning the BLOOMZ backbone model on OSE-Instruct. This enables the model to learn fine-grained developer actions and the underlying task dependencies. Extensive experiments validate the effectiveness of OSATG-GPT against other advanced LLMs with larger parameter scales and highlight its advantages over GPT-4 on specific and complex open-source collaboration tasks.
KW - Atomic tasks
KW - BLOOMZ
KW - GitHub
KW - Instruction tuning
KW - Large language models
UR - https://www.scopus.com/pages/publications/105020593763
U2 - 10.1016/j.eswa.2025.129819
DO - 10.1016/j.eswa.2025.129819
M3 - Article
AN - SCOPUS:105020593763
SN - 0957-4174
JO - Expert Systems with Applications
JF - Expert Systems with Applications
M1 - 129819
ER -