TY - JOUR
T1 - Exploring retrieval-augmented generation for multi-label discipline classification of academic short texts
AU - Shang, Duxin
AU - Duan, Yufeng
AU - Bai, Ping
AU - Xie, Jiahong
N1 - Publisher Copyright:
© Akadémiai Kiadó Zrt 2025.
PY - 2025
Y1 - 2025
AB - The discipline classification of academic short texts can effectively promote bibliometric analysis of academic papers. Traditional classification methods face challenges such as data sparsity and limited annotation resources when handling academic short texts. Additionally, these methods exhibit significant limitations in computational complexity and interpretability. To address these issues, this paper proposes a Retrieval-Augmented Generation (RAG)-based multi-label classification framework for academic short texts. This framework enhances the input to generative models by retrieving relevant information from an external knowledge base, thereby enhancing both classification performance and interpretability. The framework comprises four core modules: knowledge base construction, retriever, prompt engineering, and large language model (LLM) invocation. Under this framework, we construct an academic text knowledge base containing multiple disciplines based on the Semantic Scholar Open Research Corpus (S2ORC) academic paper dataset. We also design targeted prompts to guide the Large Language Model in generating discipline classification labels and their justifications. Experimental results demonstrate that the RAG-based approach offers significant advantages in multi-label classification tasks for academic short texts. Compared to traditional deep learning models and standalone Large Language Models, RAG significantly reduces classification error rates and enhances label coverage and the accuracy of top-1 label predictions.
KW - Discipline classification
KW - Large-language models
KW - Multi-label classification
KW - Retrieval-augmented generation
KW - Short text classification
UR - https://www.scopus.com/pages/publications/105022249195
U2 - 10.1007/s11192-025-05472-2
DO - 10.1007/s11192-025-05472-2
M3 - Article
AN - SCOPUS:105022249195
SN - 0138-9130
JO - Scientometrics
JF - Scientometrics
ER -