Efficient Distributed Smith-Waterman Algorithm Based on Apache Spark

  • Bo Xu
  • Changlong Li
  • Hang Zhuang
  • Jiali Wang
  • Qingfeng Wang
  • Xuehai Zhou

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

16 Scopus citations

Abstract

The Smith-Waterman algorithm, which produces the optimal local alignment between a pair of sequences, is widely used as a key component in bioinformatics. It is more sensitive than heuristic approaches, but also more time-consuming. To speed it up, Single-Instruction Multiple-Data (SIMD) instructions have been used to parallelize the algorithm with a data-parallel strategy. However, SIMD-based Smith-Waterman (SW) algorithms show limited scalability. Moreover, recent next-generation sequencing machines generate sequences at an unprecedented rate, so faster implementations of sequence alignment algorithms are needed to keep pace. In this paper, we present CloudSW, an efficient distributed Smith-Waterman algorithm that leverages Apache Spark and SIMD instructions. To make the distributed Smith-Waterman algorithm easy to integrate into third-party software, we provide application programming interfaces (APIs) as a cloud service. The experimental results demonstrate that 1) CloudSW has outstanding performance, achieving speedups of up to 3.29 times over DSW and 621 times over SparkSW, and 2) CloudSW has excellent scalability, achieving up to 529 giga cell updates per second (GCUPS) in protein database search with 50 nodes in Aliyun Cloud, which is the highest performance reported so far, as far as we know.
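
For readers unfamiliar with the two technical terms in the abstract, the following is a minimal Python sketch of a Smith-Waterman score computation and of the GCUPS metric. It is not CloudSW's implementation: it is a plain scalar version with no SIMD and no Spark, and the function names, the scoring values (match 2, mismatch -1) and the linear gap penalty are assumptions chosen for illustration. Protein database search typically uses a substitution matrix and affine gap penalties instead.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of a and b.

    Illustrative scoring only (assumed values, linear gap penalty).
    Keeps two rows of the dynamic-programming matrix at a time.
    """
    prev = [0] * (len(b) + 1)   # scores for the previous row
    best = 0
    for ch_a in a:
        curr = [0]              # first column is always 0
        for j, ch_b in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            up = prev[j] + gap          # gap in b
            left = curr[j - 1] + gap    # gap in a
            score = max(0, diag, up, left)  # 0 restarts the local alignment
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def gcups(query_len, database_residues, seconds):
    """Giga cell updates per second: matrix cells filled per second / 1e9."""
    return query_len * database_residues / seconds / 1e9
```

Each cell depends only on its upper, left and upper-left neighbours, so comparing one query against many database sequences is a set of independent tasks. Under the same assumptions, `gcups(len(query), total_database_residues, elapsed_seconds)` gives the throughput figure in the units the abstract reports.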

Original language: English
Title of host publication: Proceedings - 2017 IEEE 10th International Conference on Cloud Computing, CLOUD 2017
Editors: Geoffrey C. Fox
Publisher: IEEE Computer Society
Pages: 608-615
Number of pages: 8
ISBN (Electronic): 9781538619933
State: Published - 8 Sep 2017
Externally published: Yes
Event: 10th IEEE International Conference on Cloud Computing, CLOUD 2017 - Honolulu, United States
Duration: 25 Jun 2017 to 30 Jun 2017

Publication series

Name: IEEE International Conference on Cloud Computing, CLOUD
Volume: 2017-June
ISSN (Print): 2159-6182
ISSN (Electronic): 2159-6190

Conference

Conference: 10th IEEE International Conference on Cloud Computing, CLOUD 2017
Country/Territory: United States
City: Honolulu
Period: 25/06/17 to 30/06/17

Keywords

  • Alluxio
  • Apache Spark
  • Distributed Smith-Waterman algorithm
  • HDFS
  • SIMD instructions
  • Scalability
