TY - GEN
T1 - Columnar Formatted Inverted Index for Highly-Paralleled, Vectorized Query Processing
AU - Zhao, Weichen
AU - Zhao, Minghao
AU - Hu, Huiqi
AU - Qian, Weining
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Inverted index is a basic tool in many data-intensive applications. Though numerous efforts have been made on efficient inverted index-based query processing, existing schemes do not achieve the expected performance for modern data centers, in which servers are equipped with powerful CPUs and relatively large memory. Through comprehensive measurement studies, we identify the root course is that the data formats for index representation make it unfeasible to design efficient query execution approaches on top of it, which results in poor parallel query support and waste CPU computation. Driven by the findings, we propose to reconcile the in-memory index as columnar structures. To enable this idea, we construct the compact columnar format (i.e., Cocoa) that achieves both desirable space efficiency and maintains the capability for efficient searching support. With Cocoa, we design an efficient query executing scheme that utilizes vectorized batch processing to avoid frequent branch prediction, as well as clause enumeration with pruning to save the overhead of intermediate batch materialization. We build an open-source system VeloSearch to embody our design; experimental results show that VeloSearch achieves 30× better performance compared with state-of-the-art search libraries such as Lucene and Tantivy.
AB - Inverted index is a basic tool in many data-intensive applications. Though numerous efforts have been made on efficient inverted index-based query processing, existing schemes do not achieve the expected performance for modern data centers, in which servers are equipped with powerful CPUs and relatively large memory. Through comprehensive measurement studies, we identify the root course is that the data formats for index representation make it unfeasible to design efficient query execution approaches on top of it, which results in poor parallel query support and waste CPU computation. Driven by the findings, we propose to reconcile the in-memory index as columnar structures. To enable this idea, we construct the compact columnar format (i.e., Cocoa) that achieves both desirable space efficiency and maintains the capability for efficient searching support. With Cocoa, we design an efficient query executing scheme that utilizes vectorized batch processing to avoid frequent branch prediction, as well as clause enumeration with pruning to save the overhead of intermediate batch materialization. We build an open-source system VeloSearch to embody our design; experimental results show that VeloSearch achieves 30× better performance compared with state-of-the-art search libraries such as Lucene and Tantivy.
KW - columnar format
KW - indexing
KW - query processing
UR - https://www.scopus.com/pages/publications/105015498595
U2 - 10.1109/ICDE65448.2025.00138
DO - 10.1109/ICDE65448.2025.00138
M3 - 会议稿件
AN - SCOPUS:105015498595
T3 - Proceedings - International Conference on Data Engineering
SP - 1800
EP - 1813
BT - Proceedings - 2025 IEEE 41st International Conference on Data Engineering, ICDE 2025
PB - IEEE Computer Society
T2 - 41st IEEE International Conference on Data Engineering, ICDE 2025
Y2 - 19 May 2025 through 23 May 2025
ER -