Efficient Mining Multi-Mers in a Variety of Biological Sequences

  • Jingsong Zhang
  • Jianmei Guo
  • Ming Zhang
  • Xiangtian Yu
  • Xiaoqing Yu
  • Weifeng Guo
  • Tao Zeng*
  • Luonan Chen*

* Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

6 Scopus citations

Abstract

Counting the occurrence frequency of each k-mer in a biological sequence is a preliminary yet important step in many bioinformatics applications. However, most k-mer counting algorithms rely on a given k to produce single-length k-mers, which is inefficient when a sequence must be analyzed for several values of k. Moreover, existing k-mer counters focus more on DNA and RNA sequences and less on protein sequences. In practice, the analysis of k-mers in protein sequences can provide substantial biological insights into structure, function, and evolution. To this end, an efficient algorithm, called MulMer (Multiple-Mer mining), is proposed to mine k-mers of various lengths, termed multi-mers, via an inverted-index technique, which is orders of magnitude faster than the conventional forward-index methods. Moreover, to the best of our knowledge, MulMer is the first algorithm able to mine multi-mers in a variety of sequences, including DNA, RNA, and protein sequences.
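
The abstract does not describe MulMer's internals, so the following is only a minimal sketch of the general idea of mining k-mers of several lengths with an inverted index, not the authors' implementation. It assumes the index maps each k-mer to the list of its start positions and that (k+1)-mers are obtained by extending those positions by one symbol. The function name `mine_multimers` and its parameters are hypothetical.

```python
from collections import defaultdict


def mine_multimers(seq, k_min, k_max):
    """Count all k-mers of seq for every k in [k_min, k_max].

    Illustrative sketch only (not the published MulMer algorithm).
    An inverted index maps each k-mer to its start positions; the index
    for length k+1 is built by extending each stored position by one
    symbol, so the sequence is not rescanned from scratch for each k.
    """
    if not 1 <= k_min <= k_max:
        raise ValueError("require 1 <= k_min <= k_max")

    # Level 1: inverted index from each symbol to its positions.
    index = defaultdict(list)
    for pos, symbol in enumerate(seq):
        index[symbol].append(pos)

    counts = {}
    k = 1
    while True:
        if k >= k_min:
            # The count of a k-mer is the length of its position list.
            counts[k] = {mer: len(positions) for mer, positions in index.items()}
        if k == k_max:
            break
        # Extend every occurrence by the symbol that follows it.
        extended = defaultdict(list)
        for mer, positions in index.items():
            for pos in positions:
                end = pos + k
                if end < len(seq):
                    extended[mer + seq[end]].append(pos)
        index = extended
        k += 1
    return counts


if __name__ == "__main__":
    # Works on any alphabet: DNA, RNA, or protein strings.
    for k, table in mine_multimers("ACGTACGT", 2, 3).items():
        print(k, table)
```

Because the sketch treats the sequence as a plain string, the same call works for a protein sequence such as `mine_multimers("MKVMKV", 2, 2)`. It keeps every position list in memory, which a production counter would likely avoid.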

Original language: English
Article number: 8341507
Pages (from-to): 949-958
Number of pages: 10
Journal: IEEE/ACM Transactions on Computational Biology and Bioinformatics
Volume: 17
Issue number: 3
DOIs
State: Published - 1 May 2020
Externally published: Yes

Keywords

  • Sequential pattern mining
  • biological sequence analysis
  • inverted index
  • k-mer counting
  • k-mers of various lengths
