TY - GEN
T1 - MCFC
T2 - 13th CCF International Conference on Natural Language Processing and Chinese Computing, NLPCC 2024
AU - Li, Dongyang
AU - Ding, Ruixue
AU - Xie, Pengjun
AU - He, Xiaofeng
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025.
PY - 2025
Y1 - 2025
N2 - Information Retrieval (IR) pre-trained language models are trained on large-scale retrieval corpora to enhance their task-specific knowledge capacity. Previous works focus on general retrieval pre-training datasets, which cover inter-document and intra-document data, while paying less attention to clicked data, an important asset commonly adopted in the recommendation domain. However, utilizing this easily accessible clicked data is non-trivial: its large volume and insufficient refinement hamper model learning efficiency and risk distorting learning directions. In this paper, we propose a Momentum-Driven Clicked Feature Compressed Pre-trained Language Model for Information Retrieval (MCFC). Specifically, to maintain an effective learning pace on large amounts of data, we generalize multiple similar feature instances and compress the dispersed knowledge at the query granularity, a step named Multi-Instance Information Integration. Meanwhile, since more accurate relevance detection between queries and documents is needed against the coarse clicked-data background, we leverage a momentum-driven adjusting mechanism to refine the text representations, named Continuous Debiasing Calibration. Extensive experiments on downstream datasets validate the superiority of our work over other recent strong baselines.
KW - Clicked Feature
KW - Information Retrieval
KW - Pre-trained Language Model
UR - https://www.scopus.com/pages/publications/85209795894
U2 - 10.1007/978-981-97-9431-7_6
DO - 10.1007/978-981-97-9431-7_6
M3 - Conference contribution
AN - SCOPUS:85209795894
SN - 9789819794300
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 69
EP - 82
BT - Natural Language Processing and Chinese Computing - 13th National CCF Conference, NLPCC 2024, Proceedings
A2 - Wong, Derek F.
A2 - Wei, Zhongyu
A2 - Yang, Muyun
PB - Springer Science and Business Media Deutschland GmbH
Y2 - 1 November 2024 through 3 November 2024
ER -