Retrospecting Available CPU Resources: SMT-Aware Scheduling to Prevent SLA Violations in Data Centers

  • Haoyu Liao
  • Tong Yu Liu
  • Jianmei Guo*
  • Bo Huang
  • Dingyu Yang
  • Jonathan Ding

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

The article focuses on an understudied yet fundamental problem: existing methods typically average the utilization of multiple hardware threads to evaluate the available CPU resources. However, on Simultaneous Multi-Threading (SMT) processors this approach can underestimate the actual usage of the underlying physical core, leading to an overestimation of the remaining resources. The overestimation propagates from the microarchitecture to operating systems and cloud schedulers, where it may misguide scheduling decisions, exacerbate CPU overcommitment, and increase Service Level Agreement (SLA) violations. To address this overestimation problem, we propose an SMT-aware and purely data-driven approach named Remaining CPU (RCPU) that reserves more CPU resources to restrict CPU overcommitment and prevent SLA violations. RCPU requires only a few modifications to existing cloud infrastructure and can be scaled up to large data centers. Extensive evaluations in the data center show that RCPU reduces SLA violations by 18% on average for 98% of all latency-sensitive applications. In a benchmarking experiment, RCPU improves accuracy by 69% in terms of Mean Absolute Error (MAE) compared to the state of the art.
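The gap between thread-averaged and core-level utilization can be illustrated with a minimal Python sketch. It is not the RCPU model: the function names, the sample utilizations, and the assumption that sibling threads are busy independently of each other are all hypothetical and chosen only to show why averaging can overstate the remaining capacity of a physical core.

```python
def averaged_utilization(thread_utils):
    """Mean utilization across the hardware threads of one physical core."""
    return sum(thread_utils) / len(thread_utils)


def core_busy_estimate(thread_utils):
    """Fraction of time at least one hardware thread is busy.

    Assumes the threads are busy independently of each other, which is
    an illustrative simplification and not the RCPU model.
    """
    idle = 1.0
    for u in thread_utils:
        idle *= 1.0 - u  # probability that every thread is idle
    return 1.0 - idle


threads = [0.6, 0.6]  # hypothetical utilization of two SMT siblings
avg = averaged_utilization(threads)
busy = core_busy_estimate(threads)
print(f"averaged:   used={avg:.2f}, remaining={1 - avg:.2f}")
print(f"core-level: used={busy:.2f}, remaining={1 - busy:.2f}")
```

Under these assumptions the averaged view reports 40% of the core as remaining, while the core-level view reports only 16%, which is the kind of overestimation the abstract describes.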

Original language: English
Pages (from-to): 67-83
Number of pages: 17
Journal: IEEE Transactions on Parallel and Distributed Systems
Volume: 36
Issue number: 1
DOIs
State: Published - 2025

Keywords

  • Cloud computing
  • QoS
  • SMT interference
  • data center
  • latency-sensitive applications
  • microarchitecture
