Self-Supervised Query Reformulation for Code Search

  • Yuetian Mao
  • Chengcheng Wan
  • Yuze Jiang
  • Xiaodong Gu*

* Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

7 Scopus citations

Abstract

Automatic query reformulation is a widely used technique for enriching user requirements and improving the results of code search. It can be framed as a machine translation task, in which the objective is to rephrase a given query into a more comprehensive alternative. While showing promising results, training such a model typically requires a large parallel corpus of query pairs (i.e., the original query and a reformulated query), which online code search engines keep confidential and do not publish. This restricts its practicality in software development processes. In this paper, we propose SSQR, a self-supervised query reformulation method that does not rely on any parallel query corpus. Inspired by pre-trained models, SSQR treats query reformulation as a masked language modeling task conducted on an extensive unannotated corpus of queries. SSQR extends T5 (a sequence-to-sequence model based on Transformer) with a new pre-training objective named corrupted query completion (CQC), which randomly masks words within a complete query and trains T5 to predict the masked content. Subsequently, for a given query to be reformulated, SSQR identifies potential locations for expansion and leverages the pre-trained T5 model to generate appropriate content to fill these gaps. The selection of expansions is then based on the information gain associated with each candidate. Evaluation results demonstrate that SSQR significantly outperforms unsupervised baselines and achieves competitive performance compared to supervised methods.
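The two data-preparation steps the abstract describes (masking words in a complete query for CQC, and marking possible expansion locations in a query to be reformulated) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function names, the `max_span` parameter, the single-span masking policy, and the use of the T5-style sentinel string `<extra_id_0>` are all assumptions made here. The T5 model itself and the information-gain selection step are omitted.

```python
import random
from typing import List, Tuple

# T5-style sentinel token; assumed here purely for illustration.
MASK = "<extra_id_0>"


def corrupt_query(query: str, rng: random.Random, max_span: int = 2) -> Tuple[str, str]:
    """Mask one random contiguous span of words in a complete query.

    Returns (corrupted_query, masked_words), i.e. one hypothetical
    training pair for a corrupted-query-completion style objective.
    """
    words = query.split()
    if not words:
        raise ValueError("query must contain at least one word")
    span = rng.randint(1, min(max_span, len(words)))  # number of words to mask
    start = rng.randint(0, len(words) - span)         # first masked position
    target = " ".join(words[start:start + span])
    corrupted = words[:start] + [MASK] + words[start + span:]
    return " ".join(corrupted), target


def expansion_candidates(query: str) -> List[str]:
    """Place the mask at every gap: before, between, and after the words.

    Each returned string is a possible input for a trained model, which
    would then propose content for the masked position.
    """
    words = query.split()
    return [" ".join(words[:i] + [MASK] + words[i:]) for i in range(len(words) + 1)]


if __name__ == "__main__":
    rng = random.Random(0)
    print(corrupt_query("convert string to int", rng))
    for candidate in expansion_candidates("read file"):
        print(candidate)
```

In a full pipeline along the lines of the abstract, each candidate produced by `expansion_candidates` would be completed by the pre-trained model, and the completions would then be ranked by information gain to choose the final reformulation.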

Original language: English
Title of host publication: ESEC/FSE 2023 - Proceedings of the 31st ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering
Editors: Satish Chandra, Kelly Blincoe, Paolo Tonella
Publisher: Association for Computing Machinery, Inc
Pages: 363-374
Number of pages: 12
ISBN (Electronic): 9798400703270
DOIs
State: Published - 30 Nov 2023
Event: 31st ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2023 - San Francisco, United States
Duration: 3 Dec 2023 to 9 Dec 2023

Publication series

Name: ESEC/FSE 2023 - Proceedings of the 31st ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering

Conference

Conference: 31st ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2023
Country/Territory: United States
City: San Francisco
Period: 3/12/23 to 9/12/23

Keywords

  • Code Search
  • Query Reformulation
  • Self-supervised Learning
