Modern scalable distributed systems are designed to be partition-tolerant. They are often required to scale elastically with increasing request load and to provide seamless service even when some servers malfunction. Partition tolerance enables such systems to withstand arbitrary loss of messages as "perceived" by the communicating nodes. However, partition tolerance and robustness are not tested rigorously in practice. Severe system-level design defects often stay hidden even after deployment, possibly resulting in loss of revenue or customer satisfaction. We propose a novel, rigorous perturbation-based testing framework, named SETSUDO, targeted especially at exposing system-level defects in scalable distributed systems. It applies perturbations (i.e., controlled changes) from the environment of a system during testing, and leverages awareness of system-internal states to precisely control their timing. It uses a flexible instrumentation framework to select relevant internal states and to implement the system code for the perturbations. It also provides a test policy language framework, in which high-level sequences of perturbation scenarios are converted automatically into system-level test code. This test code is woven automatically into the application code during testing, and any observed defects are reported. We have implemented our perturbation testing framework and evaluated it on several open-source projects, where it exposed known defects as well as some previously unknown ones. Our framework leverages small-scale testing and avoids the upfront infrastructure costs typically needed for large-scale stress testing.