TY - GEN
T1 - FlyMC
T2 - 14th European Conference on Computer Systems, EuroSys 2019
AU - Lukman, Jeffrey F.
AU - Ke, Huan
AU - Stuardo, Cesar A.
AU - Suminto, Riza O.
AU - Kurniawan, Daniar H.
AU - Simon, Dikaimin
AU - Priambada, Satria
AU - Tian, Chen
AU - Ye, Feng
AU - Leesatapornwongsa, Tanakorn
AU - Gupta, Aarti
AU - Lu, Shan
AU - Gunawi, Haryadi S.
N1 - Funding Information:
We thank Peter Druschel, our shepherd, and the anonymous reviewers for their tremendous feedback and helpful comments. This material was supported by funding from the NSF (grant Nos. CNS-1350499, CNS-1526304, CNS-1405959, CNS-1563956) as well as generous donations from Huawei, Dell EMC, Google Faculty Research Award, NetApp Faculty Fellowship, and CERES Center for Unstoppable Computing. The experiments in this paper were performed in the Utah Emulab [78], the University of Chicago River [28] and Chameleon [15] testbeds.
Publisher Copyright:
© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2019/3/25
Y1 - 2019/3/25
N2 - We present a fast and scalable testing approach for datacenter/cloud systems such as Cassandra, Hadoop, Spark, and ZooKeeper. The uniqueness of our approach is in its ability to overcome the path/state-space explosion problem in testing workloads with complex interleavings of messages and faults. We introduce three powerful algorithms: state symmetry, event independence, and parallel flips, which collectively make our approach on average 16× (up to 78×) faster than other state-of-the-art solutions. We have integrated our techniques with 8 popular datacenter systems, successfully reproduced 12 old bugs, and found 10 new bugs, all without random walks or manual checkpoints.
AB - We present a fast and scalable testing approach for datacenter/cloud systems such as Cassandra, Hadoop, Spark, and ZooKeeper. The uniqueness of our approach is in its ability to overcome the path/state-space explosion problem in testing workloads with complex interleavings of messages and faults. We introduce three powerful algorithms: state symmetry, event independence, and parallel flips, which collectively make our approach on average 16× (up to 78×) faster than other state-of-the-art solutions. We have integrated our techniques with 8 popular datacenter systems, successfully reproduced 12 old bugs, and found 10 new bugs, all without random walks or manual checkpoints.
KW - Availability
KW - Distributed Concurrency Bugs
KW - Distributed Systems
KW - Reliability
KW - Software Model Checking
UR - http://www.scopus.com/inward/record.url?scp=85063859202&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85063859202&partnerID=8YFLogxK
U2 - 10.1145/3302424.3303986
DO - 10.1145/3302424.3303986
M3 - Conference contribution
AN - SCOPUS:85063859202
T3 - Proceedings of the 14th EuroSys Conference 2019
BT - Proceedings of the 14th EuroSys Conference 2019
PB - Association for Computing Machinery, Inc
Y2 - 25 March 2019 through 28 March 2019
ER -