TY - GEN
T1 - Chronos
T2 - 2015 31st IEEE International Conference on Data Engineering, ICDE 2015
AU - Gu, Ling
AU - Zhou, Minqi
AU - Zhang, Zhenjie
AU - Shan, Ming Chien
AU - Zhou, Aoying
AU - Winslett, Marianne
N1 - Publisher Copyright:
© 2015 IEEE.
PY - 2015/5/26
Y1 - 2015/5/26
N2 - In the coming big data era, stress test to IT systems under extreme data volume is crucial to the adoption of computing technologies in every corner of the cyber world. Appropriately generated benchmark datasets provide the possibility for administrators to evaluate the capacity of the systems when real datasets hard obtained have not extreme cases. Traditional benchmark data generators, however, mainly target at producing relation tables of arbitrary size following fixed distributions. The output of such generators are insufficient when it is used to measure the stability of the architecture with extremely dynamic and heavy workloads, caused by complicated/hiden factors in the generation mechanism of real world, e.g. dependency between stocks in the trading market and collaborative human behaviors on the social network. In this paper, we present a new framework, called Chronos, to support new demands on streaming data benchmarking, by generating and simulating realistic and fast data streams in an elastic manner. Given a small group of samples with timestamps, Chronos reproduces new data streams with similar characteristics of the samples, preserving column-wise correlations, temporal dependency and order statistics of the snapshot distributions at the same time. To achieve such realistic requirements, we propose 1) a column decomposition optimization technique to partition the original relation table into small sub-tables with minimal correlation information loss, 2) a generative and extensible model based on Latent Dirichlet Allocation to capture temporal dependency while preserving order statistics of the snapshot distribution, and 3) a new generation and assembling method to efficiently build tuples following the expected distribution on the snapshots. To fulfill the vision of elasticity, we also present a new parallel stream data generation mechanism, facilitating distributed nodes to collaboratively generate tuples with minimal synchronization overhead and excellent load balancing. Our extensive experimental studies on real world data domains confirm the efficiency and effectiveness of Chronos on stream benchmark generation and simulation.
AB - In the coming big data era, stress test to IT systems under extreme data volume is crucial to the adoption of computing technologies in every corner of the cyber world. Appropriately generated benchmark datasets provide the possibility for administrators to evaluate the capacity of the systems when real datasets hard obtained have not extreme cases. Traditional benchmark data generators, however, mainly target at producing relation tables of arbitrary size following fixed distributions. The output of such generators are insufficient when it is used to measure the stability of the architecture with extremely dynamic and heavy workloads, caused by complicated/hiden factors in the generation mechanism of real world, e.g. dependency between stocks in the trading market and collaborative human behaviors on the social network. In this paper, we present a new framework, called Chronos, to support new demands on streaming data benchmarking, by generating and simulating realistic and fast data streams in an elastic manner. Given a small group of samples with timestamps, Chronos reproduces new data streams with similar characteristics of the samples, preserving column-wise correlations, temporal dependency and order statistics of the snapshot distributions at the same time. To achieve such realistic requirements, we propose 1) a column decomposition optimization technique to partition the original relation table into small sub-tables with minimal correlation information loss, 2) a generative and extensible model based on Latent Dirichlet Allocation to capture temporal dependency while preserving order statistics of the snapshot distribution, and 3) a new generation and assembling method to efficiently build tuples following the expected distribution on the snapshots. To fulfill the vision of elasticity, we also present a new parallel stream data generation mechanism, facilitating distributed nodes to collaboratively generate tuples with minimal synchronization overhead and excellent load balancing. Our extensive experimental studies on real world data domains confirm the efficiency and effectiveness of Chronos on stream benchmark generation and simulation.
UR - https://www.scopus.com/pages/publications/84940826802
U2 - 10.1109/ICDE.2015.7113276
DO - 10.1109/ICDE.2015.7113276
M3 - 会议稿件
AN - SCOPUS:84940826802
T3 - Proceedings - International Conference on Data Engineering
SP - 101
EP - 112
BT - 2015 IEEE 31st International Conference on Data Engineering, ICDE 2015
PB - IEEE Computer Society
Y2 - 13 April 2015 through 17 April 2015
ER -