OSATG-GPT: Instruction-tuning large language models with open-source atomic tasks in GitHub

Fanyu Han, Li Ma, Fenglin Bi, Yantong Wang, Mingdong You, Wei Wang*, Jiaheng Peng, Xiaoya Xia

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Across numerous application scenarios in Natural Language Processing (NLP), Large Language Models (LLMs) have demonstrated exceptional capabilities in text comprehension and generation, and they show significant potential across various interdisciplinary fields. However, their effectiveness is constrained by the unique characteristics of the open-source ecosystem. Developing an LLM that generalizes across datasets and tasks, specifically tailored to the open-source ecosystem, is therefore an urgent research need. To address this challenge, this paper introduces open-source atomic tasks, defined as intermediate tasks essential for solving complex objectives. These tasks are designed through strategies such as simplification, reversal, decomposition, and composition, enabling models to gradually acquire domain knowledge and understand task interdependencies. By integrating public resources with open-source atomic tasks, we construct OSE-Instruct, an instruction dataset for the open-source ecosystem. We unify open-source atomic tasks within an instruction-tuning paradigm that reflects real-world developer behavior, and develop OSATG-GPT at various parameter scales by fine-tuning the BLOOMZ backbone model on OSE-Instruct. This enables the model to learn fine-grained developer actions and the underlying task dependencies. Extensive experiments demonstrate the effectiveness of OSATG-GPT relative to other advanced LLMs with larger parameter scales, and highlight its advantages over GPT-4 on specific, complex open-source collaboration tasks.
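To make the decomposition strategy concrete, the sketch below shows how one complex GitHub task (issue triage) might be broken into atomic instruction-tuning records. All field names, the example task, and the helper function are illustrative assumptions for exposition; they are not the paper's actual OSE-Instruct schema.

```python
def build_atomic_records(issue_title, issue_body):
    """Hypothetical decomposition of a complex task (GitHub issue triage)
    into atomic instruction records, mirroring the paper's decomposition
    strategy at a purely illustrative level."""
    context = f"Title: {issue_title}\nBody: {issue_body}"
    # Each atomic instruction targets one intermediate step of the
    # complex objective; "output" would be filled from labeled data.
    atomic_instructions = [
        "Classify the type of this GitHub issue (bug, feature request, or question).",
        "Summarize the issue in one sentence.",
        "Suggest appropriate labels for the issue.",
    ]
    return [
        {"instruction": ins, "input": context, "output": None}
        for ins in atomic_instructions
    ]

records = build_atomic_records(
    "App crashes on startup",
    "After upgrading to v2.1 the app exits immediately.",
)
for record in records:
    print(record["instruction"])
```

Composition would work in the opposite direction, chaining such atomic records back into a multi-step collaboration task.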

Original language: English
Article number: 129819
Journal: Expert Systems with Applications
State: Accepted/In press - 2025

Keywords

  • Atomic tasks
  • BLOOMZ
  • GitHub
  • Instruction tuning
  • Large language models
