We present a scalable and memory-efficient framework for kernel ridge regression. We exploit the inherent rank deficiency of the kernel ridge regression matrix by constructing an approximation that relies on a hierarchy of low-rank factorizations of tunable accuracy, rather than on leverage scores or other subsampling techniques. Without ever decompressing the kernel matrix approximation, we propose factorization and solve methods to compute the weights for a given set of training and test data. We show that our method performs an optimal number of operations, O(r^2 n), with respect to the number of training samples (n) due to the underlying numerical low-rank (r) structure of the kernel matrix. Furthermore, each algorithm is also presented in the context of a massively parallel computer system, exploiting two levels of concurrency that account for both intra-node shared-memory and inter-node distributed-memory parallelism. In addition, we present a variety of experiments using popular datasets, both small and large, to show that our approach provides sufficient accuracy in comparison with state-of-the-art methods and with the exact (i.e., non-approximated) kernel ridge regression method. For datasets on the order of 10^6 data points, we show that our framework strong-scales to 10^3 cores. Finally, we provide a Python interface to the scikit-learn library so that scikit-learn can leverage our high-performance solver library to achieve much-improved performance and memory footprint.
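For reference, the exact (non-approximated) kernel ridge regression baseline that the framework accelerates can be sketched as follows. This is a minimal dense-solve illustration, not the paper's hierarchical low-rank method; the function names (`rbf_kernel`, `krr_fit`, `krr_predict`) and the choice of an RBF kernel are assumptions for the sketch. The dense solve costs O(n^3) and stores the full n-by-n kernel matrix, which is precisely the cost and memory footprint the proposed O(r^2 n) factorization avoids.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    """Gaussian (RBF) kernel matrix: K[i, j] = exp(-gamma * ||x_i - z_j||^2)."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def krr_fit(X_train, y_train, lam=1e-3, gamma=1.0):
    """Exact KRR: solve (K + lam * I) w = y for the weights w.

    Forms the full n x n kernel matrix, so cost is O(n^3) time and
    O(n^2) memory -- the bottleneck that low-rank approximation targets.
    """
    n = X_train.shape[0]
    K = rbf_kernel(X_train, X_train, gamma)
    return np.linalg.solve(K + lam * np.eye(n), y_train)

def krr_predict(X_test, X_train, w, gamma=1.0):
    """Predict via the cross-kernel: y_hat = K(test, train) @ w."""
    return rbf_kernel(X_test, X_train, gamma) @ w
```

A hierarchical low-rank scheme replaces the dense `K` above with a compressed representation of numerical rank r, so the factorization and solve never materialize the full matrix.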