TCSA: Efficient Localization of Busy-Wait Synchronization Bugs for Latency-Critical Applications

Ning Li, Jianmei Guo*, Bo Huang, Yuyang Li, Yilei Zhang, Chengdong Li, Wenxin Huang

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

Busy-wait synchronization is often used in latency-critical applications to keep latency low. Unfortunately, performance bugs caused by thread contention in busy-wait synchronization may lead to request failures or even system crashes. Localizing these bugs is not trivial: one must pinpoint the exact moment of occurrence within a relatively long measurement period and, at the same time, identify candidate busy-wait threads among numerous concurrent threads. Existing methods often rely on hotspot-driven analysis of lock-related functions, but they still require extensive manual work to localize busy-wait threads. This paper proposes timing call stack analysis (TCSA), an efficient approach to localizing busy-wait synchronization bugs. The key idea is to time-serialize the function call stacks of applications and identify consecutive identical call stacks to catch busy-wait threads. TCSA can handle any application regardless of its programming language and identify various busy-wait patterns, including spinlocks, chaining spinlocks, futexes, and safepoint checks within the Java Virtual Machine. Compared to the state of the art, TCSA reduces the number of records to examine (e.g., threads and functions) by 1 to 3 orders of magnitude. TCSA has been deployed at a large cloud service provider, demonstrating its effectiveness, efficiency, and practicality in four real latency-critical applications.
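The key idea stated in the abstract, ordering sampled call stacks by time and looking for consecutive identical stacks, can be illustrated with a minimal sketch. This is not the authors' implementation: the `Sample` and `Run` types, the `find_identical_runs` function, and the `min_run` threshold are all hypothetical names introduced here, and the sketch assumes the input consists of on-CPU stack samples (otherwise a sleeping thread would also produce repeated stacks).

```python
from collections import defaultdict
from itertools import groupby
from typing import NamedTuple


class Sample(NamedTuple):
    """One sampled call stack (hypothetical record layout)."""
    timestamp: float
    thread_id: int
    stack: tuple  # function names, outermost frame first


class Run(NamedTuple):
    """A streak of consecutive identical stacks on one thread."""
    thread_id: int
    stack: tuple
    start: float
    end: float
    count: int


def find_identical_runs(samples, min_run=3):
    """Return streaks of >= min_run consecutive identical stacks per thread.

    Assumption: samples are on-CPU samples, so a long streak suggests
    the thread is spinning rather than sleeping.
    """
    # Split the samples by thread so each thread gets its own timeline.
    by_thread = defaultdict(list)
    for sample in samples:
        by_thread[sample.thread_id].append(sample)

    runs = []
    for thread_id, thread_samples in by_thread.items():
        # Time-serialize the call stacks of this thread.
        thread_samples.sort(key=lambda s: s.timestamp)
        # groupby merges only *adjacent* equal keys, which is exactly
        # the "consecutive identical call stacks" condition.
        for stack, group in groupby(thread_samples, key=lambda s: s.stack):
            streak = list(group)
            if len(streak) >= min_run:
                runs.append(Run(thread_id, stack, streak[0].timestamp,
                                streak[-1].timestamp, len(streak)))

    # Longest streaks first: these are the strongest busy-wait candidates.
    runs.sort(key=lambda r: r.count, reverse=True)
    return runs


# Illustrative data: thread 1 repeats one stack, thread 2 makes progress.
spin = ("main", "worker", "spin_lock")
demo = [Sample(float(t), 1, spin) for t in range(4)] + [
    Sample(0.0, 2, ("main", "parse")),
    Sample(1.0, 2, ("main", "handle")),
    Sample(2.0, 2, ("main", "reply")),
]
for run in find_identical_runs(demo):
    print(run.thread_id, run.count, run.start, run.end, run.stack[-1])
```

Each reported streak names both a thread and a time window, which mirrors the two localization tasks the abstract describes: finding the moment of occurrence and finding the candidate busy-wait threads. The threshold `min_run` is an arbitrary choice in this sketch; a real tool would need to tune it against the sampling interval.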

Original language: English
Pages (from-to): 297-309
Number of pages: 13
Journal: IEEE Transactions on Parallel and Distributed Systems
Volume: 35
Issue number: 2
State: Published - 1 Feb 2024

Keywords

  • Busy-wait synchronization
  • latency-critical applications
  • performance bug localization
  • timing call stack analysis
