TY - GEN
T1 - Learning Bit Allocations for Z-Order Layouts in Analytic Data Systems
AU - Gao, Jenny
AU - Ding, Jialin
AU - Sudhir, Sivaprasad
AU - Madden, Samuel
N1 - Publisher Copyright:
© 2024 Owner/Author.
PY - 2024/6/9
Y1 - 2024/6/9
N2 - To improve the performance of scanning and filtering, modern analytic data systems such as Amazon Redshift and Databricks Delta Lake give users the ability to sort a table using a Z-order, which maps each row to a "Z-value"by interleaving the binary representations of the row's attributes, then sorts rows by their Z-values. These Z-order layouts essentially sort the table by multiple columns simultaneously and can achieve superior performance to single-column sort orders when the user's queries filter over multiple columns. However, the user shoulders the burden of manually selecting the columns to include in the Z-order, and a poor choice of columns can significantly degrade performance. Furthermore, these systems treat all columns included in the Z-order as equally important, which often does not result in the best performance due to the unequal impact that different columns have on query performance. In this work, we investigate the performance impact of using Z-orders that place unequal importance on columns: instead of using an equal number of bits from each column in the Z-value interleaving, we allow unequal bit allocation. We introduce a technique that uses Bayesian optimization to automatically learn the best bit allocation for a Z-order layout on a given dataset and query workload. Z-order layouts using our learned bit allocations outperform equal-bit Z-orders by up to 1.6× in query runtime and up to 2× in rows scanned.
AB - To improve the performance of scanning and filtering, modern analytic data systems such as Amazon Redshift and Databricks Delta Lake give users the ability to sort a table using a Z-order, which maps each row to a "Z-value"by interleaving the binary representations of the row's attributes, then sorts rows by their Z-values. These Z-order layouts essentially sort the table by multiple columns simultaneously and can achieve superior performance to single-column sort orders when the user's queries filter over multiple columns. However, the user shoulders the burden of manually selecting the columns to include in the Z-order, and a poor choice of columns can significantly degrade performance. Furthermore, these systems treat all columns included in the Z-order as equally important, which often does not result in the best performance due to the unequal impact that different columns have on query performance. In this work, we investigate the performance impact of using Z-orders that place unequal importance on columns: instead of using an equal number of bits from each column in the Z-value interleaving, we allow unequal bit allocation. We introduce a technique that uses Bayesian optimization to automatically learn the best bit allocation for a Z-order layout on a given dataset and query workload. Z-order layouts using our learned bit allocations outperform equal-bit Z-orders by up to 1.6× in query runtime and up to 2× in rows scanned.
UR - https://www.scopus.com/pages/publications/85198471749
UR - https://www.scopus.com/inward/citedby.url?scp=85198471749&partnerID=8YFLogxK
U2 - 10.1145/3663742.3663975
DO - 10.1145/3663742.3663975
M3 - Conference contribution
AN - SCOPUS:85198471749
T3 - Proceedings of the 7th International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, aiDM 2024, In conjunction with the 2024 ACM SIGMOD/PODS Conference
BT - Proceedings of the 7th International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, aiDM 2024, In conjunction with the 2024 ACM SIGMOD/PODS Conference
PB - Association for Computing Machinery, Inc
T2 - 7th International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, aiDM 2024
Y2 - 14 June 2024 through 14 June 2024
ER -