TY - GEN
T1 - Scheduling Data Processing Pipelines for Incremental Training on MLP-based Recommendation Models
AU - Chen, Zihao
AU - Zhang, Chenyang
AU - Xu, Chen
AU - Zhang, Zhao
AU - Wang, Jiaqiang
AU - Qian, Weining
AU - Zhou, Aoying
N1 - Publisher Copyright:
© 2025 ACM.
PY - 2025/6/22
Y1 - 2025/6/22
N2 - Multi-layer Perceptron (MLP)-based models have been widely adopted by modern recommendation applications. In practice, industrial recommendation scenarios frequently launch continuous incremental training jobs with only one epoch to capture real-time user features. This kind of job is shorter than full training and spends a larger proportion of its time on feature processing. To fully utilize fragmented resources, our model engineering team at Tencent explores resource-constrained CPU clusters to perform such incremental training workloads. To improve the efficiency of such workloads, we identify scheduling optimizations that overlap feature processing and model training at the level of data processing pipelines. In particular, we propose an intra-pipeline scheduling strategy, which dynamically prefetches feature processing operators to fill the idle CPU time during the communication of embedding lookup. Furthermore, we propose an inter-pipeline scheduling strategy, which balances the resource demands of different pipelines: it prioritizes the execution of critical pipelines and overlaps their communication with the execution of non-critical pipelines. Based on these two scheduling strategies, we implement a novel incremental recommendation training framework called RECS on top of TensorFlow. In our experimental studies, RECS achieves a speedup of 1.36x over existing solutions on industrial workloads.
AB - Multi-layer Perceptron (MLP)-based models have been widely adopted by modern recommendation applications. In practice, industrial recommendation scenarios frequently launch continuous incremental training jobs with only one epoch to capture real-time user features. This kind of job is shorter than full training and spends a larger proportion of its time on feature processing. To fully utilize fragmented resources, our model engineering team at Tencent explores resource-constrained CPU clusters to perform such incremental training workloads. To improve the efficiency of such workloads, we identify scheduling optimizations that overlap feature processing and model training at the level of data processing pipelines. In particular, we propose an intra-pipeline scheduling strategy, which dynamically prefetches feature processing operators to fill the idle CPU time during the communication of embedding lookup. Furthermore, we propose an inter-pipeline scheduling strategy, which balances the resource demands of different pipelines: it prioritizes the execution of critical pipelines and overlaps their communication with the execution of non-critical pipelines. Based on these two scheduling strategies, we implement a novel incremental recommendation training framework called RECS on top of TensorFlow. In our experimental studies, RECS achieves a speedup of 1.36x over existing solutions on industrial workloads.
KW - data processing pipeline
KW - incremental training
KW - recommendation model
KW - scheduling
UR - https://www.scopus.com/pages/publications/105010184853
U2 - 10.1145/3722212.3724454
DO - 10.1145/3722212.3724454
M3 - Conference contribution
AN - SCOPUS:105010184853
T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data
SP - 350
EP - 363
BT - SIGMOD-Companion 2025 - Companion of the 2025 International Conference on Management of Data
A2 - Deshpande, Amol
A2 - Aboulnaga, Ashraf
A2 - Salimi, Babak
A2 - Chandramouli, Badrish
A2 - Howe, Bill
A2 - Loo, Boon Thau
A2 - Glavic, Boris
A2 - Curino, Carlo
A2 - Wang, Daisy Zhe
A2 - Suciu, Dan
A2 - Abadi, Daniel
A2 - Srivastava, Divesh
A2 - Wu, Eugene
A2 - Nawab, Faisal
A2 - Ilyas, Ihab
A2 - Naughton, Jeffrey
A2 - Rogers, Jennie
A2 - Patel, Jignesh
A2 - Arulraj, Joy
A2 - Yang, Jun
A2 - Echihabi, Karima
A2 - Ross, Kenneth
A2 - Daudjee, Khuzaima
A2 - Lakshmanan, Laks
A2 - Garofalakis, Minos
A2 - Riedewald, Mirek
A2 - Mokbel, Mohamed
A2 - Ouzzani, Mourad
A2 - Kennedy, Oliver
A2 - Papotti, Paolo
A2 - Alvaro, Peter
A2 - Bailis, Peter
A2 - Miller, Renee
A2 - Roy, Senjuti Basu
A2 - Melnik, Sergey
A2 - Idreos, Stratos
A2 - Roy, Sudeepa
A2 - Rekatsinas, Theodoros
A2 - Leis, Viktor
A2 - Zhou, Wenchao
A2 - Gatterbauer, Wolfgang
A2 - Ives, Zack
PB - Association for Computing Machinery
T2 - 2025 ACM SIGMOD/PODS International Conference on Management of Data, SIGMOD-Companion 2025
Y2 - 22 June 2025 through 27 June 2025
ER -