TY - JOUR
T1 - Foundations of empirical memory consistency testing
AU - Kirkham, Jake
AU - Sorensen, Tyler
AU - Tureci, Esin
AU - Martonosi, Margaret
N1 - Funding Information:
We thank the anonymous reviewers for their valuable feedback which greatly improved the clarity of the paper. We thank Sreepathi Pai for the use of the GPU Zoo at the University of Rochester, which was our main source of GPUs for this work. We thank Mariusz Merecki and Jacek Jankowski at Intel for their feedback and insights around the the Iris. We thank the Khronos SPIR Memory Model TSG, (especially Rob Simpson, Alan Baker, David Neto, Jeff Bolz, Nicolai Hähnle, Graeme Leese, Brian Sumner) for their support and feedback on this work over the last year. The early-stage equipment for this project was partially funded by the MacCracken Independent Work/Senior Thesis Fund from the School of Engineering and Applied Science (SEAS) at Princeton University. This material is based upon work supported by the National Science Foundation under Grant No. 1739674. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Publisher Copyright:
© 2020 Owner/Author.
PY - 2020/11/13
Y1 - 2020/11/13
N2 - Modern memory consistency models are complex, and it is difficult to reason about the relaxed behaviors that current systems allow. Programming languages, such as C and OpenCL, offer a memory model interface that developers can use to safely write concurrent applications. This abstraction provides functional portability across any platform that implements the interface, regardless of differences in the underlying systems. This powerful abstraction hinges on the ability of the system to correctly implement the interface. Many techniques for memory consistency model validation use empirical testing, which has been effective at uncovering undocumented behaviors and even finding bugs in trusted compilation schemes. Memory model testing consists of small concurrent unit tests called "litmus tests". In these tests, certain observations, including potential bugs, are exceedingly rare, as they may only be triggered by precise interleaving of system steps in a complex processor, which is probabilistic in nature. Thus, each test must be run many times in order to provide a high level of confidence in its coverage. In this work, we rigorously investigate empirical memory model testing. In particular, we propose methodologies for navigating complex stressing routines and analyzing large numbers of testing observations. Using these insights, we can more efficiently tune stressing parameters, which can lead to higher confidence results at a faster rate. We emphasize the need for such approaches by performing a meta-study of prior work, which reveals results with low reproducibility and inefficient use of testing time. Our investigation is presented alongside empirical data. We believe that OpenCL targeting GPUs is a pragmatic choice in this domain as there exists a variety of different platforms to test, from large HPC servers to power-efficient edge devices. The tests presented in the work span 3 GPUs from 3 different vendors. We show that our methodologies are applicable across the GPUs, despite significant variances in the results. Concretely, our results show: lossless speedups of more than 5× in tuning using data peeking; a definition of portable stressing parameters which loses only 12% efficiency when generalized across our domain; a priority order of litmus tests for tuning. We stress test a conformance test suite for the OpenCL 2.0 memory model and discover a bug in Intel's compiler. Our methods are evaluated on the other two GPUs using mutation testing. We end with recommendations for official memory model conformance tests.
AB - Modern memory consistency models are complex, and it is difficult to reason about the relaxed behaviors that current systems allow. Programming languages, such as C and OpenCL, offer a memory model interface that developers can use to safely write concurrent applications. This abstraction provides functional portability across any platform that implements the interface, regardless of differences in the underlying systems. This powerful abstraction hinges on the ability of the system to correctly implement the interface. Many techniques for memory consistency model validation use empirical testing, which has been effective at uncovering undocumented behaviors and even finding bugs in trusted compilation schemes. Memory model testing consists of small concurrent unit tests called "litmus tests". In these tests, certain observations, including potential bugs, are exceedingly rare, as they may only be triggered by precise interleaving of system steps in a complex processor, which is probabilistic in nature. Thus, each test must be run many times in order to provide a high level of confidence in its coverage. In this work, we rigorously investigate empirical memory model testing. In particular, we propose methodologies for navigating complex stressing routines and analyzing large numbers of testing observations. Using these insights, we can more efficiently tune stressing parameters, which can lead to higher confidence results at a faster rate. We emphasize the need for such approaches by performing a meta-study of prior work, which reveals results with low reproducibility and inefficient use of testing time. Our investigation is presented alongside empirical data. We believe that OpenCL targeting GPUs is a pragmatic choice in this domain as there exists a variety of different platforms to test, from large HPC servers to power-efficient edge devices. The tests presented in the work span 3 GPUs from 3 different vendors. We show that our methodologies are applicable across the GPUs, despite significant variances in the results. Concretely, our results show: lossless speedups of more than 5× in tuning using data peeking; a definition of portable stressing parameters which loses only 12% efficiency when generalized across our domain; a priority order of litmus tests for tuning. We stress test a conformance test suite for the OpenCL 2.0 memory model and discover a bug in Intel's compiler. Our methods are evaluated on the other two GPUs using mutation testing. We end with recommendations for official memory model conformance tests.
KW - GPUs
KW - OpenCL
KW - autotuning
KW - conformance testing
KW - memory consistency
UR - http://www.scopus.com/inward/record.url?scp=85097581502&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85097581502&partnerID=8YFLogxK
U2 - 10.1145/3428294
DO - 10.1145/3428294
M3 - Article
AN - SCOPUS:85097581502
SN - 2475-1421
VL - 4
JO - Proceedings of the ACM on Programming Languages
JF - Proceedings of the ACM on Programming Languages
IS - OOPSLA
M1 - 226
ER -