TY - GEN
T1 - Aligning Large Language Models to a Domain-specific Graph Database for NL2GQL
AU - Liang, Yuanyuan
AU - Tan, Keren
AU - Xie, Tingyu
AU - Tao, Wenbiao
AU - Wang, Siyuan
AU - Lan, Yunshi
AU - Qian, Weining
N1 - Publisher Copyright: © 2024 ACM.
PY - 2024/10/21
Y1 - 2024/10/21
AB - Graph Databases (Graph DB) find extensive application across diverse domains such as finance, social networks, and medicine. Yet, the translation of Natural Language (NL) into the Graph Query Language (GQL), referred to as NL2GQL, poses significant challenges owing to its intricate and specialized nature. Some approaches have sought to utilize Large Language Models (LLMs) to address analogous tasks like text2SQL. Nonetheless, in the realm of NL2GQL tasks tailored to a particular domain, the absence of domain-specific NL-GQL data pairs adds complexity to aligning LLMs with the graph DB. To tackle this challenge, we present a well-defined pipeline. Initially, we use ChatGPT to generate NL-GQL data pairs, leveraging the provided graph DB and two mutual verification self-instruct methods which ensure consistency between NL and GQL. Subsequently, we employ the generated data to fine-tune LLMs, ensuring alignment between LLMs and the graph DB. Moreover, we find the importance of relevant schema in efficiently generating accurate GQLs. Thus, we introduce a method to extract relevant schema as the input context. We evaluate our method using two carefully constructed datasets derived from graph DBs in the finance and medicine domains, named FinGQL and MediGQL. Experimental results reveal that our approach significantly outperforms a set of baseline methods, with improvements of 5.90 and 6.36 absolute points on EM, and 6.00 and 7.09
KW - graph databases
KW - graph query language
KW - large language models
KW - natural language to graph query language
UR - https://www.scopus.com/pages/publications/85210021053
U2 - 10.1145/3627673.3679713
DO - 10.1145/3627673.3679713
M3 - Conference contribution
AN - SCOPUS:85210021053
T3 - International Conference on Information and Knowledge Management, Proceedings
SP - 1367
EP - 1377
BT - CIKM 2024 - Proceedings of the 33rd ACM International Conference on Information and Knowledge Management
PB - Association for Computing Machinery
T2 - 33rd ACM International Conference on Information and Knowledge Management, CIKM 2024
Y2 - 21 October 2024 through 25 October 2024
ER -