Effectively Using Public Data in Privacy Preserving Machine Learning

Milad Nasr, Saeed Mahloujifar, Xinyu Tang, Prateek Mittal, Amir Houmansadr

Research output: Contribution to journal › Conference article › peer-review

1 Scopus citation


Differentially private (DP) machine learning techniques are notorious for their degradation of model utility (e.g., they degrade classification accuracy). A recent line of work has demonstrated that leveraging public data can improve the trade-off between privacy and utility when training models with DP guarantees. In this work, we further explore the potential of using public data in DP models, showing that utility gains can in fact be significantly higher than what was shown in prior work. Specifically, we introduce DOPE-SGD, a modified DP-SGD algorithm that leverages public data during its training. DOPE-SGD uses public data in two complementary ways: (1) it uses advanced augmentation techniques that leverage public data to generate synthetic data that is effectively embedded in multiple steps of the training pipeline; (2) it uses a modified gradient clipping mechanism in DP-SGD to change the origin of gradient vectors using the information inferred from available public data, thereby boosting utility. We also introduce a technique to ensemble intermediate DP models by leveraging the post-processing property of differential privacy to further improve the accuracy of the predictions. Our experimental results demonstrate the effectiveness of our approach in improving the state of the art in DP machine learning across multiple datasets, network architectures, and application domains. For instance, assuming access to 2,000 public images, and for a privacy budget of ε = 2, δ = 10⁻⁵, our technique achieves an accuracy of 75.1% on CIFAR10, significantly higher than the 68.1% achieved by the state of the art.
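The abstract says only that the modified clipping step changes the origin of gradient vectors using information from public data; it does not give the exact rule. The sketch below shows one plausible reading, not the authors' implementation: each per-example private gradient is clipped by its distance from a gradient computed on public data (instead of its distance from zero), Gaussian noise is added to the sum, and the public gradient is added back. All names (`clip_around_origin`, `private_step_gradient`, `clip_norm`, `noise_multiplier`) are hypothetical, gradients are plain Python lists, and privacy accounting is omitted.

```python
import math
import random


def l2_norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def clip_around_origin(grad, origin, clip_norm):
    """Return (grad - origin), rescaled so its L2 norm is at most clip_norm.

    In standard DP-SGD the origin is the zero vector. Here the origin is
    assumed to be a gradient estimated from public data.
    """
    diff = [g - o for g, o in zip(grad, origin)]
    norm = l2_norm(diff)
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [d * scale for d in diff]


def private_step_gradient(per_example_grads, public_grad, clip_norm,
                          noise_multiplier, rng):
    """Noisy average gradient for one step, clipped around public_grad.

    Assumption: public_grad does not depend on private data, so each private
    example changes the summed clipped differences by at most clip_norm.
    """
    dim = len(public_grad)
    total = [0.0] * dim
    for grad in per_example_grads:
        clipped = clip_around_origin(grad, public_grad, clip_norm)
        total = [t + c for t, c in zip(total, clipped)]

    # Gaussian noise calibrated to the clipping norm, as in DP-SGD.
    sigma = noise_multiplier * clip_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]

    # Average the noisy differences and shift back by the public gradient.
    n = len(per_example_grads)
    return [public_grad[j] + noisy[j] / n for j in range(dim)]


if __name__ == "__main__":
    rng = random.Random(0)
    public_grad = [1.0, 1.0]                    # hypothetical public-data gradient
    private_grads = [[1.2, 0.9], [5.0, -3.0]]   # hypothetical per-example gradients
    print(private_step_gradient(private_grads, public_grad,
                                clip_norm=1.0, noise_multiplier=1.1, rng=rng))
```

Under this reading, a private gradient that already lies close to the public gradient passes through clipping almost unchanged, so less of its signal is lost than when clipping around zero with the same norm bound.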

Original language: English (US)
Pages (from-to): 25718-25732
Number of pages: 15
Journal: Proceedings of Machine Learning Research
State: Published - 2023
Event: 40th International Conference on Machine Learning, ICML 2023 - Honolulu, United States
Duration: Jul 23 2023 to Jul 29 2023

All Science Journal Classification (ASJC) codes

  • Artificial Intelligence
  • Software
  • Control and Systems Engineering
  • Statistics and Probability

