Optimizations and Heuristics to improve Compression in Columnar Database Systems

Published as arXiv preprint arXiv:1609.07823, 2016

Recommended citation: Jayanth, Jayanth. "Optimizations and Heuristics to improve Compression in Columnar Database Systems." arXiv preprint arXiv:1609.07823 (2016). http://arxiv.org/pdf/1609.07823v1

In-memory columnar databases have become mainstream over the last decade. They have greatly sped up the processing of large volumes of data through multi-core parallelism and in-memory compression, thereby eliminating the usual bottlenecks associated with disk-based databases. For scenarios where the data volume grows into terabytes and petabytes, keeping all the data in memory is exorbitantly expensive. Hence, the data is compressed efficiently using different algorithms, so that it fits in memory while query processing can still exploit multi-core parallelization. Several compression methods have been studied for compressing the column array after dictionary encoding. In this paper, we present two novel optimizations of existing compression techniques, Block Size Optimized Cluster Encoding and Block Size Optimized Indirect Encoding, which perform better than their predecessors. Finally, we propose heuristics for choosing the best encoding among common compression schemes.
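
For readers unfamiliar with the terms in the abstract, the following is a minimal sketch of dictionary encoding followed by plain (baseline) cluster encoding. It is not the paper's block-size-optimized variant. The function names, the fixed `block_size`, and the sample column are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple


def dictionary_encode(column: List[str]) -> Tuple[List[str], List[int]]:
    """Replace each value with its index in a sorted dictionary of distinct values."""
    dictionary = sorted(set(column))
    value_to_id: Dict[str, int] = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [value_to_id[v] for v in column]


def cluster_encode(ids: List[int], block_size: int) -> Tuple[List[int], List[bool]]:
    """Collapse each full block holding a single repeated value ID into one entry.

    Returns the compressed ID array and one flag per block that records
    whether the block was collapsed (assumed layout, for illustration).
    """
    compressed: List[int] = []
    collapsed: List[bool] = []
    for start in range(0, len(ids), block_size):
        block = ids[start:start + block_size]
        if len(block) == block_size and len(set(block)) == 1:
            compressed.append(block[0])   # store the repeated ID once
            collapsed.append(True)
        else:
            compressed.extend(block)      # keep the block unchanged
            collapsed.append(False)
    return compressed, collapsed


def cluster_decode(compressed: List[int], collapsed: List[bool],
                   block_size: int, length: int) -> List[int]:
    """Rebuild the original value-ID array from the compressed form."""
    ids: List[int] = []
    pos = 0
    for is_collapsed in collapsed:
        if is_collapsed:
            ids.extend([compressed[pos]] * block_size)
            pos += 1
        else:
            size = min(block_size, length - len(ids))
            ids.extend(compressed[pos:pos + size])
            pos += size
    return ids


if __name__ == "__main__":
    column = ["DE"] * 4 + ["FR", "US", "FR", "US"] + ["US"] * 4
    dictionary, ids = dictionary_encode(column)
    compressed, collapsed = cluster_encode(ids, block_size=4)
    print(dictionary, compressed, collapsed)
```

In this sketch the compression gain depends entirely on how many blocks contain a single value, which in turn depends on the chosen block size. That dependence is the kind of parameter the block-size optimizations named in the abstract target.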
