SLAQ: Quality-driven scheduling for distributed machine learning

Haoyu Zhang, Logan Stafman, Andrew Or, Michael J. Freedman

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

118 Scopus citations

Abstract

Training machine learning (ML) models with large datasets can incur significant resource contention on shared clusters. This training typically involves many iterations that continually improve the quality of the model. Yet in exploratory settings, better models can be obtained faster by directing resources to jobs with the most potential for improvement. We describe SLAQ, a cluster scheduling system for approximate ML training jobs that aims to maximize the overall job quality. When allocating cluster resources, SLAQ explores the quality-runtime trade-offs across multiple jobs to maximize system-wide quality improvement. To do so, SLAQ leverages the iterative nature of ML training algorithms by collecting quality and resource usage information from concurrent jobs, and then generating highly tailored quality-improvement predictions for future iterations. Experiments show that SLAQ achieves an average quality improvement of up to 73% and an average delay reduction of up to 44% on a large set of ML training jobs, compared to resource fairness schedulers.
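The allocation idea in the abstract can be illustrated with a minimal sketch. This is not SLAQ's implementation: the function `allocate_greedy`, the per-job predictor callables, and the exponential gain curves are all hypothetical, chosen only to show how resources might be steered toward the jobs with the largest predicted quality improvement. It assumes each job exposes a predictor that maps a number of resource units to a predicted quality gain with diminishing returns.

```python
import heapq
import math
from typing import Callable, Dict


def allocate_greedy(
    predict_gain: Dict[str, Callable[[int], float]],
    total_units: int,
) -> Dict[str, int]:
    """Hand out resource units one at a time to the job whose next unit
    has the largest predicted quality gain.

    `predict_gain[job](n)` is a hypothetical predictor: the quality
    improvement expected if the job runs its next round on n units.
    """
    alloc = {job: 0 for job in predict_gain}
    # heapq is a min-heap, so store negated marginal gains.
    heap = [(-(f(1) - f(0)), job) for job, f in predict_gain.items()]
    heapq.heapify(heap)
    for _ in range(total_units):
        if not heap:
            break
        neg_gain, job = heapq.heappop(heap)
        if -neg_gain <= 0:
            break  # the best remaining job is not predicted to improve
        alloc[job] += 1
        n = alloc[job]
        f = predict_gain[job]
        # Re-insert the job with the marginal gain of its next unit.
        heapq.heappush(heap, (-(f(n + 1) - f(n)), job))
    return alloc


# Illustrative predictors (assumed shapes, not measured data):
# a freshly started job with much room to improve, and one close to convergence.
jobs = {
    "new_job": lambda n: 0.5 * (1 - math.exp(-0.5 * n)),
    "nearly_converged": lambda n: 0.01 * (1 - math.exp(-0.5 * n)),
}
allocation = allocate_greedy(jobs, total_units=10)
print(allocation)
```

In this toy setting the freshly started job receives most of the units, whereas a resource fairness scheduler would split them evenly regardless of predicted improvement.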

Original language: English (US)
Title of host publication: SoCC 2017 - Proceedings of the 2017 Symposium on Cloud Computing
Publisher: Association for Computing Machinery, Inc
Pages: 390-404
Number of pages: 15
ISBN (Electronic): 9781450350280
DOIs
State: Published - Sep 24 2017
Event: 2017 Symposium on Cloud Computing, SoCC 2017 - Santa Clara, United States
Duration: Sep 24 2017 to Sep 27 2017

Publication series

Name: SoCC 2017 - Proceedings of the 2017 Symposium on Cloud Computing

Other

Other: 2017 Symposium on Cloud Computing, SoCC 2017
Country/Territory: United States
City: Santa Clara
Period: 9/24/17 to 9/27/17

All Science Journal Classification (ASJC) codes

  • Computational Theory and Mathematics
  • Theoretical Computer Science

Keywords

  • Approximate computing
  • Machine learning
  • Quality
  • Resource management
  • Scheduling
