EduDCM: A Novel Framework for Automatic Educational Dialogue Classification Dataset Construction via Distant Supervision and Large Language Models

  • Changyong Qi
  • Longwei Zheng*
  • Yuang Wei
  • Haoxin Xu
  • Peiji Chen
  • Xiaoqing Gu
  • *Corresponding author for this work

Research output: Contribution to journal, Article, peer-review

1 Scopus citation

Abstract

Educational dialogue classification is a critical task for analyzing classroom interactions and fostering effective teaching strategies. However, the scarcity of annotated data and the high cost of manual labeling pose significant challenges, especially in low-resource educational contexts. This article introduces the EduDCM framework, an original approach to addressing these challenges. EduDCM integrates distant supervision with Large Language Models (LLMs) to automate the construction of high-quality educational dialogue classification datasets. It reduces the noise typically associated with distant supervision by using LLMs for context-aware label generation and by incorporating heuristic alignment techniques. To validate the framework, we constructed the EduTalk dataset, which comprises diverse classroom dialogues labeled with pedagogical categories. Extensive experiments on EduTalk and publicly available datasets, combined with expert evaluations, confirm the superior quality of EduDCM-generated datasets. Models trained on EduDCM data achieved performance comparable to that of models trained on manually annotated datasets. Expert evaluations on a 5-point Likert scale show that EduDCM outperforms Template-Based Generation and Few-Shot GPT in annotation accuracy, category coverage, and consistency. These findings emphasize EduDCM’s novelty and its effectiveness in generating high-quality, scalable datasets for low-resource educational NLP tasks, thus reducing manual annotation effort.
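
The abstract does not spell out how distant supervision and LLM labels are combined, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a hypothetical cue-phrase table (`CUES`) as the distant-supervision source, a stand-in labeler (`toy_llm_label`) in place of a real LLM call, and a simple agreement filter as the "heuristic alignment" step. All names, categories, and cues are illustrative assumptions.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical cue phrases standing in for a distant-supervision source.
# Categories and cues are illustrative only, not taken from the paper.
CUES = {
    "question": ("why", "what", "?"),
    "feedback": ("well done", "good job", "correct"),
    "instruction": ("open your", "turn to", "write down"),
}


def distant_label(utterance: str) -> Optional[str]:
    """Return the first category whose cue appears in the utterance, else None."""
    text = utterance.lower()
    for label, cues in CUES.items():
        if any(cue in text for cue in cues):
            return label
    return None


def build_dataset(
    utterances: List[str],
    llm_label: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Keep only utterances where the weak label and the LLM label agree.

    Agreement filtering is an assumed alignment heuristic: it discards
    examples where the noisy cue-based label conflicts with the LLM.
    """
    kept = []
    for utterance in utterances:
        weak = distant_label(utterance)
        if weak is not None and weak == llm_label(utterance):
            kept.append((utterance, weak))
    return kept


def toy_llm_label(utterance: str) -> str:
    """Placeholder for an LLM call; returns canned labels for the demo."""
    canned = {
        "Why does ice float?": "question",
        "Well done, that is correct.": "feedback",
        "Please open your books.": "question",  # deliberate disagreement
    }
    return canned.get(utterance, "other")


SAMPLE = [
    "Why does ice float?",
    "Well done, that is correct.",
    "Please open your books.",
]

if __name__ == "__main__":
    for utterance, label in build_dataset(SAMPLE, toy_llm_label):
        print(f"{label}: {utterance}")
```

In this sketch the third utterance is dropped because the cue-based label and the stand-in LLM label conflict. A real pipeline might instead route such conflicts to a second LLM pass or to human review.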

Original language: English
Article number: 154
Journal: Applied Sciences (Switzerland)
Volume: 15
Issue number: 1
State: Published - Jan 2025

Keywords

  • distant supervision
  • educational dialogue classification
  • large language models
  • low-resource tasks
