Efficient Matrix Computation for SGD-Based Algorithms on Apache Spark

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Scopus citations

Abstract

As matrix sizes grow in large-scale data analysis, a series of Spark-based distributed matrix computation systems have emerged. Typically, these systems split a matrix into blocks and store the blocks in an RDD. To implement matrix operations, they manipulate the matrices through coarse-grained RDD operations; that is, they load the entire RDD even to obtain a subset of the matrix blocks. This can cause redundant IO when running SGD-based algorithms, since SGD samples only a mini-batch of the data. Moreover, these systems typically employ a hash scheme to partition matrix blocks, which is oblivious to the sampling semantics. In this work, we propose sampling-aware data loading, which uses fine-grained RDD operations to avoid loading partitions that contain no sampled data, thereby reducing redundant IO. Moreover, we exploit a semantic-based partition scheme, which gathers sampled blocks into the same partitions, to further reduce the number of accessed partitions. We modify SystemDS to implement Emacs, efficient matrix computation for SGD-based algorithms on Apache Spark. Our experimental results show that Emacs outperforms existing Spark-based matrix computation systems by 37%.
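The effect of the partition scheme on the number of accessed partitions can be illustrated with a small simulation. This is a minimal sketch in plain Python, not the Emacs or SystemDS implementation: it assumes the mini-batch is a contiguous run of row blocks, models the hash scheme as a simple modulo on the block index, and stands in for the semantic-based scheme with contiguous range partitioning. All function names here are hypothetical.

```python
from collections import defaultdict


def hash_partition(block_ids, num_partitions):
    """Assign each row-block index to a partition by modulo (assumed hash scheme)."""
    parts = defaultdict(list)
    for b in block_ids:
        parts[b % num_partitions].append(b)
    return dict(parts)


def range_partition(block_ids, num_partitions):
    """Place contiguous runs of row-block indices in the same partition.

    Stand-in for a sampling-aware scheme, assuming batches are contiguous.
    """
    ids = sorted(block_ids)
    per_part = -(-len(ids) // num_partitions)  # ceiling division
    parts = defaultdict(list)
    for pos, b in enumerate(ids):
        parts[pos // per_part].append(b)
    return dict(parts)


def partitions_touched(parts, sampled):
    """Return the ids of partitions holding at least one sampled block."""
    sampled = set(sampled)
    return sorted(p for p, blocks in parts.items() if sampled.intersection(blocks))


blocks = range(16)    # 16 row blocks of the matrix
batch = [4, 5, 6, 7]  # one mini-batch: four contiguous row blocks

hashed = hash_partition(blocks, 4)
ranged = range_partition(blocks, 4)

print("hash :", partitions_touched(hashed, batch))
print("range:", partitions_touched(ranged, batch))
```

Under these assumptions the modulo scheme scatters the four sampled blocks across all four partitions, while the contiguous scheme keeps them in one, so a loader that reads only partitions containing sampled blocks would read a quarter of the data for this batch.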

Original language: English
Title of host publication: Database Systems for Advanced Applications - 27th International Conference, DASFAA 2022, Proceedings
Editors: Arnab Bhattacharya, Janice Lee Mong Li, Divyakant Agrawal, P. Krishna Reddy, Mukesh Mohania, Anirban Mondal, Vikram Goyal, Rage Uday Kiran
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 309-324
Number of pages: 16
ISBN (Print): 9783031001222
DOIs
State: Published - 2022
Event: 27th International Conference on Database Systems for Advanced Applications, DASFAA 2022 - Virtual, Online
Duration: 11 Apr 2022 to 14 Apr 2022

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 13245 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 27th International Conference on Database Systems for Advanced Applications, DASFAA 2022
City: Virtual, Online
Period: 11/04/22 to 14/04/22

Keywords

  • Distributed system
  • Matrix computation
  • Redundancy IO reduction
