
Efficient Matrix Computation for SGD-Based Algorithms on Apache Spark

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-reviewed

Abstract

As matrix sizes grow in large-scale data analysis, a series of Spark-based distributed matrix computation systems have emerged. Typically, these systems split a matrix into blocks and store the blocks in an RDD. To implement matrix operations, they manipulate matrices through coarse-grained RDD operations; that is, they load the entire RDD even when only a subset of the matrix blocks is needed. This can cause redundant I/O when running SGD-based algorithms, since SGD samples only a mini-batch of the data. Moreover, these systems typically use a hash scheme to partition matrix blocks, which is oblivious to the sampling semantics. In this work, we propose sampling-aware data loading, which uses fine-grained RDD operations to skip partitions that contain no sampled data and thereby reduce redundant I/O. We also exploit a semantic-based partition scheme, which gathers sampled blocks into the same partitions, to further reduce the number of partitions accessed. We modify SystemDS to implement Emacs, a system for efficient matrix computation for SGD-based algorithms on Apache Spark. Our experimental results show that Emacs outperforms existing Spark-based matrix computation systems by 37%.
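The partitioning argument in the abstract can be illustrated with a small, Spark-free simulation. This is only a sketch under stated assumptions, not the Emacs or SystemDS implementation: the block counts, the modulo stand-in for a hash partitioner, the contiguous "range" placement standing in for a sampling-aware scheme, and every function name below are hypothetical. It also assumes that a mini-batch consists of consecutive row blocks.

```python
# Toy model: count how many partitions must be read to serve one mini-batch.
# All names and sizes are illustrative assumptions, not taken from the paper.

NUM_BLOCKS = 64                 # assumed number of row blocks in the matrix
NUM_PARTITIONS = 8              # assumed number of RDD-like partitions
BLOCKS_PER_PARTITION = NUM_BLOCKS // NUM_PARTITIONS


def hash_partition(block_id: int) -> int:
    """Modulo placement, a stand-in for a sampling-oblivious hash scheme."""
    return block_id % NUM_PARTITIONS


def range_partition(block_id: int) -> int:
    """Contiguous placement: neighbouring blocks share a partition."""
    return block_id // BLOCKS_PER_PARTITION


def sample_mini_batch(start: int, batch_blocks: int) -> list[int]:
    """Assumed sampling semantics: a window of consecutive row blocks."""
    return list(range(start, start + batch_blocks))


def partitions_to_load(sampled_blocks, partitioner) -> set[int]:
    """Fine-grained loading: only partitions holding a sampled block are read."""
    return {partitioner(b) for b in sampled_blocks}


batch = sample_mini_batch(start=8, batch_blocks=4)    # blocks 8, 9, 10, 11
hash_parts = partitions_to_load(batch, hash_partition)
range_parts = partitions_to_load(batch, range_partition)

print("coarse-grained load reads:", NUM_PARTITIONS, "partitions")
print("pruned, hash placement:   ", len(hash_parts), "partitions")
print("pruned, range placement:  ", len(range_parts), "partitions")
```

In this toy setting, pruning alone already avoids reading partitions without sampled blocks, and a placement that keeps the sampled blocks together lowers the count further, which is the effect the abstract attributes to combining fine-grained loading with a semantic-based partition scheme.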

Original language: English
Title of host publication: Database Systems for Advanced Applications - 27th International Conference, DASFAA 2022, Proceedings
Editors: Arnab Bhattacharya, Janice Lee Mong Li, Divyakant Agrawal, P. Krishna Reddy, Mukesh Mohania, Anirban Mondal, Vikram Goyal, Rage Uday Kiran
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 309-324
Number of pages: 16
ISBN (Print): 9783031001222
DOI
Publication status: Published - 2022
Event: 27th International Conference on Database Systems for Advanced Applications, DASFAA 2022 - Virtual, Online
Duration: 11 Apr 2022 to 14 Apr 2022

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 13245 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 27th International Conference on Database Systems for Advanced Applications, DASFAA 2022
Location: Virtual, Online
Period: 11/04/22 to 14/04/22
