SLAQ: Quality-driven scheduling for distributed machine learning

Haoyu Zhang, Logan Stafman, Andrew Or, Michael J. Freedman

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

11 Citations (Scopus)

Abstract

Training machine learning (ML) models with large datasets can incur significant resource contention on shared clusters. This training typically involves many iterations that continually improve the quality of the model. Yet in exploratory settings, better models can be obtained faster by directing resources to jobs with the most potential for improvement. We describe SLAQ, a cluster scheduling system for approximate ML training jobs that aims to maximize the overall job quality. When allocating cluster resources, SLAQ explores the quality-runtime trade-offs across multiple jobs to maximize system-wide quality improvement. To do so, SLAQ leverages the iterative nature of ML training algorithms, by collecting quality and resource usage information from concurrent jobs, and then generating highly-tailored quality-improvement predictions for future iterations. Experiments show that SLAQ achieves an average quality improvement of up to 73% and an average delay reduction of up to 44% on a large set of ML training jobs, compared to resource fairness schedulers.
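The scheduling idea the abstract describes — predict each job's near-term quality improvement from its recent loss history, then direct resources to the jobs with the most potential — can be sketched as a greedy allocation loop. The `Job` class, the geometric-decay prediction in `predicted_gain`, and the `allocate` loop below are illustrative assumptions, not SLAQ's actual prediction model or implementation.

```python
# Hedged sketch of quality-driven allocation in the spirit of SLAQ:
# give each unit of cluster capacity to the job whose predicted marginal
# quality (loss) improvement is currently largest.

from dataclasses import dataclass


@dataclass
class Job:
    name: str
    loss_history: list  # per-iteration loss values, most recent last
    allocation: int = 0

    def predicted_gain(self) -> float:
        """Crude next-iteration loss-reduction estimate: assume the
        improvement shrinks geometrically, with the decay ratio taken
        from the last two observed deltas."""
        h = self.loss_history
        if len(h) < 3:
            return float("inf")  # too little history: explore this job
        d1, d2 = h[-2] - h[-1], h[-3] - h[-2]
        ratio = d1 / d2 if d2 > 0 else 0.5
        return max(d1 * ratio, 0.0)


def allocate(jobs, total_units: int) -> dict:
    """Greedy allocation: each unit goes to the job with the largest
    predicted gain, discounted by the units it has already received."""
    for job in jobs:
        job.allocation = 0
    for _ in range(total_units):
        best = max(jobs, key=lambda j: j.predicted_gain() / (j.allocation + 1))
        best.allocation += 1
    return {j.name: j.allocation for j in jobs}


jobs = [
    Job("converged", [1.0, 0.99, 0.985, 0.983]),  # loss curve nearly flat
    Job("improving", [5.0, 3.0, 2.0, 1.4]),       # still improving quickly
]
print(allocate(jobs, 10))  # the still-improving job receives most units
```

Under this (assumed) model, the nearly-converged job is starved in favor of the one whose loss is still dropping — the quality-runtime trade-off the paper exploits, in contrast to a fairness scheduler that would split capacity evenly.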

Original language: English (US)
Title of host publication: SoCC 2017 - Proceedings of the 2017 Symposium on Cloud Computing
Publisher: Association for Computing Machinery, Inc
Pages: 390-404
Number of pages: 15
ISBN (Electronic): 9781450350280
DOI: 10.1145/3127479.3127490
State: Published - Sep 24 2017
Event: 2017 Symposium on Cloud Computing, SoCC 2017 - Santa Clara, United States
Duration: Sep 24 2017 - Sep 27 2017

Publication series

Name: SoCC 2017 - Proceedings of the 2017 Symposium on Cloud Computing

Other

Other: 2017 Symposium on Cloud Computing, SoCC 2017
Country: United States
City: Santa Clara
Period: 9/24/17 - 9/27/17


All Science Journal Classification (ASJC) codes

  • Computational Theory and Mathematics
  • Theoretical Computer Science

Keywords

  • Approximate computing
  • Machine learning
  • Quality
  • Resource management
  • Scheduling

Cite this

Zhang, H., Stafman, L., Or, A., & Freedman, M. J. (2017). SLAQ: Quality-driven scheduling for distributed machine learning. In SoCC 2017 - Proceedings of the 2017 Symposium on Cloud Computing (pp. 390-404). (SoCC 2017 - Proceedings of the 2017 Symposium on Cloud Computing). Association for Computing Machinery, Inc. https://doi.org/10.1145/3127479.3127490