Learning to hash for indexing big data - A survey

Jun Wang, Wei Liu, Sanjiv Kumar, Shih Fu Chang

Research output: Contribution to journalReview articlepeer-review

446 Scopus citations

Abstract

The explosive growth in Big Data has attracted much attention in designing efficient indexing and search methods recently. In many critical applications such as large-scale search and pattern matching, finding the nearest neighbors to a query is a fundamental research problem. However, the straightforward solution using exhaustive comparison is infeasible due to the prohibitive computational complexity and memory requirement. In response, approximate nearest neighbor (ANN) search based on hashing techniques has become popular due to its promising performance in both efficiency and accuracy. Prior randomized hashing methods, e.g., locality-sensitive hashing (LSH), explore data-independent hash functions with random projections or permutations. Although having elegant theoretic guarantees on the search quality in certain metric spaces, performance of randomized hashing has been shown insufficient in many real-world applications. As a remedy, new approaches incorporating data-driven learning methods in development of advanced hash functions have emerged. Such learning-to-hash methods exploit information such as data distributions or class labels when optimizing the hash codes or functions. Importantly, the learned hash codes are able to preserve the proximity of neighboring data in the original feature spaces in the hash code spaces. The goal of this paper is to provide readers with systematic understanding of insights, pros, and cons of the emerging techniques. We provide a comprehensive survey of the learning-to-hash framework and representative techniques of various types, including unsupervised, semisupervised, and supervised. In addition, we also summarize recent hashing approaches utilizing the deep learning models. Finally, we discuss the future direction and trends of research in this area.

Original languageEnglish
Article number7360966
Pages (from-to)34-57
Number of pages24
JournalProceedings of the IEEE
Volume104
Issue number1
DOIs
StatePublished - Jan 2016

Keywords

  • Artificial neural networks
  • Big data
  • Binary codes
  • Indexing
  • Semantics
  • Tagging
  • Twitter

Fingerprint

Dive into the research topics of 'Learning to hash for indexing big data - A survey'. Together they form a unique fingerprint.

Cite this