TY - GEN
T1 - Automated Multidimensional Data Layouts in Amazon Redshift
AU - Ding, Jialin
AU - Abrams, Matt
AU - Bandyopadhyay, Sanghita
AU - Di Palma, Luciano
AU - Ji, Yanzhu
AU - Pagano, Davide
AU - Paliwal, Gopal
AU - Parchas, Panos
AU - Pfeil, Pascal
AU - Polychroniou, Orestis
AU - Saxena, Gaurav
AU - Shah, Aamer
AU - Voloder, Amina
AU - Xiao, Sherry
AU - Zhang, Davis
AU - Kraska, Tim
N1 - Publisher Copyright:
© 2024 Owner/Author.
PY - 2024/6/9
Y1 - 2024/6/9
N2 - Analytic data systems typically use data layouts to improve the performance of scanning and filtering data. Common data layout techniques include single-column sort keys, compound sort keys, and more complex multidimensional data layouts such as the Z-order. An appropriately-selected data layout over a table, in combination with metadata such as zone maps, enables the system to skip irrelevant data blocks when scanning the table, which reduces the amount of data scanned and improves query performance. In this paper, we introduce Multidimensional Data Layouts (MDDL), a new data layout technique which outperforms existing data layout techniques for query workloads with repetitive scan filters. Unlike existing data layout approaches, which typically sort tables based on columns, MDDL sorts tables based on a collection of predicates, which enables a much higher degree of specialization to the user's workload. We additionally introduce an algorithm for automatically learning the best MDDL for each table based on telemetry collected from the historical workload. We implemented MDDL within Amazon Redshift. Benchmarks on internal datasets and workloads show that MDDL achieves up to 85% reduction in end-to-end workload runtime compared to using traditional column-based data layout techniques. MDDL is, to the best of our knowledge, the first data layout technique in a commercial product that sorts based on predicates and automatically learns the best predicates.
KW - analytic database
KW - data warehouse
KW - machine learning
KW - sort key
UR - https://www.scopus.com/pages/publications/85196374604
UR - https://www.scopus.com/pages/publications/85196374604#tab=citedBy
U2 - 10.1145/3626246.3653379
DO - 10.1145/3626246.3653379
M3 - Conference contribution
AN - SCOPUS:85196374604
T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data
SP - 55
EP - 67
BT - SIGMOD-Companion 2024 - Companion of the 2024 International Conference on Management of Data
PB - Association for Computing Machinery
T2 - 2024 International Conference on Management of Data, SIGMOD 2024
Y2 - 9 June 2024 through 15 June 2024
ER -