TY - JOUR
T1 - Machine Learning in Environmental Research
T2 - Common Pitfalls and Best Practices
AU - Zhu, Jun Jie
AU - Yang, Meiqi
AU - Ren, Zhiyong Jason
N1 - Funding Information: The authors are thankful for the financial support from the Paul L. Busch Award from the Water Research Foundation, Bioenergy Technologies Office of the U.S. Department of Energy (Project No. EE0009269), and Andlinger Center for Energy and the Environment at Princeton University.
N1 - Publisher Copyright: © 2023 American Chemical Society
PY - 2023
Y1 - 2023
N2 - Machine learning (ML) is increasingly used in environmental research to process large data sets and decipher complex relationships between system variables. However, due to the lack of familiarity and methodological rigor, inadequate ML studies may lead to spurious conclusions. In this study, we synthesized literature analysis with our own experience and provided a tutorial-like compilation of common pitfalls along with best practice guidelines for environmental ML research. We identified more than 30 key items and provided evidence-based data analysis based on 148 highly cited research articles to exhibit the misconceptions of terminologies, proper sample size and feature size, data enrichment and feature selection, randomness assessment, data leakage management, data splitting, method selection and comparison, model optimization and evaluation, and model explainability and causality. By analyzing good examples on supervised learning and reference modeling paradigms, we hope to help researchers adopt more rigorous data preprocessing and model development standards for more accurate, robust, and practicable model uses in environmental research and applications.
KW - Machine learning
KW - causality
KW - data leakage
KW - data preprocessing
KW - environmental research
KW - hyperparameter optimization
KW - model explainability
KW - supervised learning
UR - http://www.scopus.com/inward/record.url?scp=85164670060&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85164670060&partnerID=8YFLogxK
U2 - 10.1021/acs.est.3c00026
DO - 10.1021/acs.est.3c00026
M3 - Review article
C2 - 37384597
AN - SCOPUS:85164670060
SN - 0013-936X
JO - Environmental Science and Technology
JF - Environmental Science and Technology
ER -