VarGAN: Adversarial Learning of Variable Semantic Representations

Yalan Lin, Chengcheng Wan, Shuwen Bai, Xiaodong Gu

Research output: Contribution to journal › Article › peer-review

Abstract

Variable names are of critical importance in code representation learning. However, due to diverse naming conventions, variables often receive arbitrary names, leading to long-tail, out-of-vocabulary (OOV), and other well-known problems. While the Byte-Pair Encoding (BPE) tokenizer has addressed the surface-level recognition of low-frequency tokens, it does not address the inadequate training that code representation models give low-frequency identifiers, which results in an imbalanced distribution of rare and common identifiers. Consequently, code representation models struggle to capture the semantics of low-frequency variable names. In this paper, we propose VarGAN, a novel method for learning variable name representations. VarGAN strengthens the training of low-frequency variables through adversarial training. Specifically, we regard the code representation model as a generator responsible for producing vectors from source code. Additionally, we employ a discriminator that detects whether the code input to the generator contains low-frequency variables. This adversarial setup regularizes the distribution of rare variables, making them overlap with their corresponding high-frequency counterparts in the vector space. Experimental results demonstrate that VarGAN enables CodeBERT to generate code vectors that exhibit a more uniform distribution for both low- and high-frequency identifiers. On the IdBench benchmark, VarGAN improves similarity and relatedness scores by 8% compared to VarCLR. VarGAN is also validated in downstream tasks, where it exhibits enhanced capabilities in capturing token- and code-level semantics.
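
The generator/discriminator setup described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the frequency threshold, the toy corpus, the function names, and the sign-flipped generator loss are all assumptions chosen for illustration, and the abstract does not state the actual objective. The sketch labels each snippet by whether it contains a rare variable, scores a discriminator with binary cross-entropy, and rewards the generator when the discriminator fails.

```python
import math
from collections import Counter

def rare_variables(corpus, threshold=2):
    """Return variable names seen fewer than `threshold` times (threshold is an assumption)."""
    counts = Counter(name for snippet in corpus for name in snippet)
    return {name for name, c in counts.items() if c < threshold}

def contains_rare(snippet, rare):
    """Discriminator target: 1 if the snippet uses any rare variable, else 0."""
    return int(any(name in rare for name in snippet))

def bce(prob, label, eps=1e-12):
    """Binary cross-entropy for a single prediction."""
    return -(label * math.log(prob + eps) + (1 - label) * math.log(1 - prob + eps))

def discriminator_loss(probs, labels):
    """The discriminator tries to predict the rare/common label correctly."""
    return sum(bce(p, y) for p, y in zip(probs, labels)) / len(labels)

def generator_adversarial_loss(probs, labels):
    """Assumed sign-flipped objective: the generator gains when the discriminator is wrong."""
    return -discriminator_loss(probs, labels)

# Hypothetical corpus: each snippet is reduced to the variable names it uses.
corpus = [["count", "index"], ["count", "tmp_buf_x"], ["index", "count"]]
rare = rare_variables(corpus)
labels = [contains_rare(snippet, rare) for snippet in corpus]

# A discriminator that outputs 0.5 everywhere cannot tell rare from common snippets.
probs = [0.5, 0.5, 0.5]
d_loss = discriminator_loss(probs, labels)
g_loss = generator_adversarial_loss(probs, labels)
```

In a full training loop the probabilities would come from a discriminator applied to the code vectors produced by the representation model, and the two losses would be minimized in alternation. A discriminator stuck near 0.5 corresponds to the outcome the abstract describes, where rare and common variables overlap in the vector space.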

Original language: English
Pages (from-to): 1505-1517
Number of pages: 13
Journal: IEEE Transactions on Software Engineering
Volume: 50
Issue number: 6
State: Published - 1 Jun 2024

Keywords

  • Pre-trained language models
  • generative adversarial networks
  • identifier representation
  • variable name representation
