Efficient String Sort with Multi-Character Encoding and Adaptive Sampling

Research output: Contribution to journalConference articlepeer-review

1 Scopus citations

Abstract

Sorting plays a fundamental role in computer science. It has far reaching applications in database operations and data science tasks. An important class of sorting keys are strings and among all string sorting methods, radix sort is a simple but effective algorithm. Many works have been studied to accelerate radix string sort. One typical approach is to process multiple characters in each sorting pass. However, this approach incurs the crucial issue of the radix being too large. To address the problem, we introduce a novel multi-character encoding based method that can significantly reduce the radix. This new encoding scheme takes advantage of the sparse alphabet space usage as well as the sparsity of distinguishing prefixes of the inputs which are commonly seen in real-world datasets. Combining the effective encoding scheme with an adaptive sampling process to generate the encoding efficiently, our proposed sorting algorithm essentially blends radix sort with sample sort and achieves substantial improvement over other sorting approaches. The results on both real datasets and synthetic datasets show that our method yields an average 4.85× performance improvement over C++ STL sort[21], 1.47× improvement over the state-of-the-art Radix Sort on strings implementation[19] and 2.55× over the multikey quicksort[6]. Preliminary tests in a multi-core environment also show it is competitive or better than the most recent parallel string sorting algorithm pS5[8] which demonstrates the scalability of our method.

Original languageEnglish
Pages (from-to)872-884
Number of pages13
JournalProceedings of the ACM SIGMOD International Conference on Management of Data
DOIs
StatePublished - 2021
Externally publishedYes
Event2021 International Conference on Management of Data, SIGMOD 2021 - Virtual, Online, China
Duration: 20 Jun 202125 Jun 2021

Keywords

  • radix sort
  • sample sort
  • sorting
  • string dataset

Fingerprint

Dive into the research topics of 'Efficient String Sort with Multi-Character Encoding and Adaptive Sampling'. Together they form a unique fingerprint.

Cite this