Abstract
Massive datasets with imbalanced binary outcomes are common in many fields. Existing optimal subsampling strategies largely overlook the binary and imbalanced structure of such data, which leads to efficiency loss, and they are usually built on inverse probability weighting (IPW), which is unstable when some probabilities are close to zero. In this paper, we propose a synthetic sampling and estimation procedure tailored for imbalanced big data. In the sampling stage, we derive the optimal case–control subsampling plan based on IPW. To overcome the instability of IPW for estimation, we propose a novel empirical likelihood weighting method based on a case–control sample. A real-data-based simulation study indicates that our synthetic subsampling and estimation procedure has smaller mean squared error than existing estimation procedures.
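To make the baseline idea concrete, the following is a minimal sketch of plain case–control subsampling with IPW estimation for logistic regression. It is not the optimal subsampling plan or the empirical likelihood weighting method proposed in the paper. The function name `ipw_case_control_logistic`, the fixed `control_rate`, and the simulated data are all assumptions made for illustration only.

```python
import numpy as np

def ipw_case_control_logistic(X, y, control_rate, rng, n_iter=25):
    """Keep every case (y == 1), keep each control (y == 0) with probability
    control_rate, then fit logistic regression with IPW-weighted Newton steps.
    Hypothetical helper for illustration; not the paper's procedure."""
    n = len(y)
    # Inclusion probabilities: 1 for cases, control_rate for controls.
    pi = np.where(y == 1, 1.0, control_rate)
    keep = rng.random(n) < pi
    Xs, ys, w = X[keep], y[keep], 1.0 / pi[keep]

    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xs @ beta))
        # Weighted score and weighted information matrix.
        grad = Xs.T @ (w * (ys - p))
        hess = (Xs * (w * p * (1.0 - p))[:, None]).T @ Xs
        beta = beta + np.linalg.solve(hess, grad)
    return beta, int(keep.sum())

# Simulated imbalanced data (assumed setup: roughly 1% cases).
rng = np.random.default_rng(0)
n = 200_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
true_beta = np.array([-5.0, 1.0])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(int)

beta_hat, n_sub = ipw_case_control_logistic(X, y, control_rate=0.05, rng=rng)
print("subsample size:", n_sub, "estimate:", beta_hat)
```

The weights `1 / pi` are what become unstable when `control_rate` is very small, which is the issue the abstract raises about IPW-based estimation.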
| Original language | English |
|---|---|
| Article number | 155 |
| Journal | Statistical Papers |
| Volume | 66 |
| Issue number | 7 |
| State | Published - Dec 2025 |
Keywords
- Case–control sampling
- Empirical likelihood weighting
- Imbalanced data
- Optimal subsampling
Article title: A synthetic subsampling and estimation procedure for imbalanced big data