Abstract
Rapidly expanding data volumes have pushed traditional OLAP into the era of large-scale analysis. The major features of this era are high storage density, heavy workloads, and large-scale storage and processing capacity. Both traditional parallel databases and the popular MapReduce technique face critical issues of performance and parallel processing efficiency when running big data analytical processing in a large-scale parallel processing framework. The performance of star-schema-based OLAP with star-joins is limited by processing complexity and network transmission cost in parallel processing. This paper analyzes in depth the storage model and workload characteristics of OLAP, and proposes optimization mechanisms and implementation techniques for the most fundamental SPJGA-OLAP subset, covering storage, processing, distribution, network transmission, and distributed buffering. Technical feasibility is evaluated with the widely accepted TPC-H industrial benchmark and the SSB academic benchmark. This paper proposes a parallel OLAP framework centered on the predicate-vector DDTA-JOIN, which replaces diverse join execution plans with normalized predicate-vector processing and enables a one-size-fits-all OLAP model for both centralized processing and large-scale parallel processing by taking advantage of modern hardware and minimizing network transmission and processing costs. An analysis of the storage cost and network transmission cost of the distribution mechanism is given for datasets of 1TB and 100TB. The technical feasibility and parallel processing efficiency are verified by OLAP cost model analysis and experiments on real data.
| Original language | English |
|---|---|
| Pages (from-to) | 1936-1946 |
| Number of pages | 11 |
| Journal | Jisuanji Xuebao/Chinese Journal of Computers |
| Volume | 34 |
| Issue number | 10 |
| State | Published - Oct 2011 |
| Externally published | Yes |
Keywords
- Big data analytical processing
- OLAP
- Predicate-vector
- Star schema