A synthetic subsampling and estimation procedure for imbalanced big data

  • Chen Guo
  • Yang Liu
  • Yan Fan
  • Yukun Liu*

* Corresponding author for this work

Research output: Contribution to journal, Article, peer-review

Abstract

Massive datasets with imbalanced binary outcomes are commonly seen in many areas. Existing optimal subsampling strategies largely overlook the binary and imbalance structure, leading to efficiency loss, and are usually built on inverse probability weighting (IPW), which is unstable if some probabilities are close to zero. In this paper, we propose a synthetic sampling and estimation procedure tailored for imbalanced big data. In the sampling stage, we derive the optimal case–control subsampling plan based on IPW. To overcome the instability of IPW for estimation, we propose a novel empirical likelihood weighting method based on a case–control sample. A real-data-based simulation study indicates that our synthetic subsampling and estimation procedure has smaller mean square error than existing estimation procedures.
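To make the inverse probability weighting (IPW) idea mentioned in the abstract concrete, here is a minimal Python sketch of a plain case–control subsample with IPW estimation of a covariate mean. It is not the authors' optimal subsampling plan, nor their empirical likelihood weighting method; the simulated logistic model, the function names, and the inclusion probabilities are all assumptions chosen only for illustration.

```python
import math
import random


def simulate_imbalanced(n, rng, intercept=-4.0):
    """Simulate (x, y) pairs with a rare positive outcome (hypothetical model)."""
    data = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        p = 1.0 / (1.0 + math.exp(-(intercept + x)))  # assumed logistic model
        y = 1 if rng.random() < p else 0
        data.append((x, y))
    return data


def case_control_subsample(data, p_case, p_control, rng):
    """Keep each case (y == 1) with probability p_case and each control
    (y == 0) with probability p_control; record the inclusion probability."""
    sample = []
    for x, y in data:
        pi = p_case if y == 1 else p_control
        if rng.random() < pi:
            sample.append((x, y, pi))
    return sample


def ipw_mean_x(sample, n):
    """Inverse-probability-weighted estimate of the full-data mean of x."""
    return sum(x / pi for x, _, pi in sample) / n


if __name__ == "__main__":
    rng = random.Random(0)
    full = simulate_imbalanced(20000, rng)
    # Illustrative choice: keep every case, keep 5% of controls.
    sub = case_control_subsample(full, p_case=1.0, p_control=0.05, rng=rng)
    print("subsample size:", len(sub))
    print("full-data mean of x:", sum(x for x, _ in full) / len(full))
    print("IPW estimate:", ipw_mean_x(sub, len(full)))
    print("largest weight:", max(1.0 / pi for _, _, pi in sub))
```

Each kept control carries weight 1/p_control, so shrinking p_control inflates the weights. This is the kind of instability the abstract attributes to IPW when some probabilities are close to zero.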

Original language: English
Article number: 155
Journal: Statistical Papers
Volume: 66
Issue number: 7
State: Published - Dec 2025

Keywords

  • Case–control sampling
  • Empirical likelihood weighting
  • Imbalanced data
  • Optimal subsampling
