TY - GEN
T1 - Efficient Matrix Computation for SGD-Based Algorithms on Apache Spark
AU - Han, Baokun
AU - Chen, Zihao
AU - Xu, Chen
AU - Zhou, Aoying
N1 - Publisher Copyright: © 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.
PY - 2022
Y1 - 2022
N2 - With the growth of matrix sizes in large-scale data analysis, a series of Spark-based distributed matrix computation systems have emerged. Typically, these systems split a matrix into matrix blocks and store these blocks in an RDD. To implement matrix operations, these systems manipulate the matrices by applying coarse-grained RDD operations. That is, these systems load the entire RDD to get a subset of the matrix blocks. This may cause redundant I/O when running SGD-based algorithms, since SGD samples only a mini-batch of data. Moreover, these systems typically employ a hash scheme to partition matrix blocks, which is oblivious to the sampling semantics. In this work, we propose sampling-aware data loading, which uses fine-grained RDD operations to avoid loading partitions without sampled data, so as to decrease the redundant I/O. Moreover, we exploit a semantic-based partition scheme, which gathers sampled blocks into the same partitions, to further reduce the number of accessed partitions. We modify SystemDS to implement Emacs, efficient matrix computation for SGD-based algorithms on Apache Spark. Our experimental results show that Emacs outperforms existing Spark-based matrix computation systems by 37%.
AB - With the growth of matrix sizes in large-scale data analysis, a series of Spark-based distributed matrix computation systems have emerged. Typically, these systems split a matrix into matrix blocks and store these blocks in an RDD. To implement matrix operations, these systems manipulate the matrices by applying coarse-grained RDD operations. That is, these systems load the entire RDD to get a subset of the matrix blocks. This may cause redundant I/O when running SGD-based algorithms, since SGD samples only a mini-batch of data. Moreover, these systems typically employ a hash scheme to partition matrix blocks, which is oblivious to the sampling semantics. In this work, we propose sampling-aware data loading, which uses fine-grained RDD operations to avoid loading partitions without sampled data, so as to decrease the redundant I/O. Moreover, we exploit a semantic-based partition scheme, which gathers sampled blocks into the same partitions, to further reduce the number of accessed partitions. We modify SystemDS to implement Emacs, efficient matrix computation for SGD-based algorithms on Apache Spark. Our experimental results show that Emacs outperforms existing Spark-based matrix computation systems by 37%.
KW - Distributed system
KW - Matrix computation
KW - Redundancy IO reduction
UR - https://www.scopus.com/pages/publications/85129868664
U2 - 10.1007/978-3-031-00123-9_25
DO - 10.1007/978-3-031-00123-9_25
M3 - Conference contribution
AN - SCOPUS:85129868664
SN - 9783031001222
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 309
EP - 324
BT - Database Systems for Advanced Applications - 27th International Conference, DASFAA 2022, Proceedings
A2 - Bhattacharya, Arnab
A2 - Lee Mong Li, Janice
A2 - Agrawal, Divyakant
A2 - Reddy, P. Krishna
A2 - Mohania, Mukesh
A2 - Mondal, Anirban
A2 - Goyal, Vikram
A2 - Uday Kiran, Rage
PB - Springer Science and Business Media Deutschland GmbH
T2 - 27th International Conference on Database Systems for Advanced Applications, DASFAA 2022
Y2 - 11 April 2022 through 14 April 2022
ER -