Bread: A Hybrid Approach for Instruction Data Mining Through Balanced Retrieval and Dynamic Data Sampling

  • Xinlin Zhuang
  • Xin Mao
  • Yuan Hao Jiang
  • Hongyi Wu
  • Shangqing Zhao
  • Li Cai
  • Shu Liu
  • Yang Chen
  • Yuxiang Song
  • Chenghao Jia
  • Yuhao Zhou
  • Man Lan*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

2 Scopus citations

Abstract

Recent advancements in Instruction Tuning (IT) have shown promise for aligning Large Language Models (LLMs) with users’ intentions, yet its efficacy is often compromised by dependence on high-quality datasets. Previous works have concentrated on aggregating or producing huge IT datasets through human labor or costly LLM APIs, approaches that lack adequate mechanisms to guarantee the quality of the resulting data. Moreover, training on such a large amount of IT data is both time-consuming and costly. To address these issues, we present Bread (Instruction Mining through Balanced REtrieval And Dynamic Data Sampling), a novel approach designed to minimize the requisite volume of IT data. Bread uses a two-stage strategy combining balanced retrieval and dynamic sampling to focus on data diversity and quality, offering a cost-saving solution without relying on any specific LLMs. Experimental results suggest that Bread outperforms baselines and shows great flexibility across various IT datasets and LLMs, thereby marking a step forward in efficient Instruction Tuning. Our code is available at https://github.com/mihara-bot/Bread.

Original language: English
Title of host publication: Natural Language Processing and Chinese Computing - 13th National CCF Conference, NLPCC 2024, Proceedings
Editors: Derek F. Wong, Zhongyu Wei, Muyun Yang
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 229-240
Number of pages: 12
ISBN (Print): 9789819794331
State: Published - 2025
Event: 13th CCF International Conference on Natural Language Processing and Chinese Computing, NLPCC 2024 - Hangzhou, China
Duration: 1 Nov 2024 - 3 Nov 2024

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 15360 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 13th CCF International Conference on Natural Language Processing and Chinese Computing, NLPCC 2024
Country/Territory: China
City: Hangzhou
Period: 1/11/24 - 3/11/24

Keywords

  • Data Selection
  • Instruction Tuning
  • Large Language Models
