TY - GEN
T1 - Marvolo
T2 - European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML PKDD 2023
AU - Wong, Mike
AU - Raff, Edward
AU - Holt, James
AU - Netravali, Ravi
N1 - Publisher Copyright:
© 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.
PY - 2023
Y1 - 2023
N2 - Data acquisition for ML-driven malware detection is challenging. While large commercial datasets exist, they are prohibitively expensive. On the other hand, an entity (e.g., a bank or government), may be targeted with unique malware, but the data samples available will never be sufficient to train a bespoke ML-based detector. While data augmentation has been a key component in improving deep learning models by providing requisite diversity for generalization, it has proven far more challenging for malware detection. The main challenges are that (1) determining the augmentations to make is not straightforward, (2) operations are on binaries rather than source code (which is not available), complicating correctness and understanding, and (3) labeling new files mandates expensive binary reverse engineering. We present Marvolo for creating realistic, semantics preserving transformations that mimic the code alterations made by malware authors in practice, allowing us to generate augmented data on raw binary files. This also enables Marvolo to safely propagate labels to newly-generated data. Across several malware datasets and recent ML-based detectors, Marvolo improves accuracy and AUC by up to 5% and 10% respectively, while boosting efficiency by 79x by avoiding redundant computation.
AB - Data acquisition for ML-driven malware detection is challenging. While large commercial datasets exist, they are prohibitively expensive. On the other hand, an entity (e.g., a bank or government), may be targeted with unique malware, but the data samples available will never be sufficient to train a bespoke ML-based detector. While data augmentation has been a key component in improving deep learning models by providing requisite diversity for generalization, it has proven far more challenging for malware detection. The main challenges are that (1) determining the augmentations to make is not straightforward, (2) operations are on binaries rather than source code (which is not available), complicating correctness and understanding, and (3) labeling new files mandates expensive binary reverse engineering. We present Marvolo for creating realistic, semantics preserving transformations that mimic the code alterations made by malware authors in practice, allowing us to generate augmented data on raw binary files. This also enables Marvolo to safely propagate labels to newly-generated data. Across several malware datasets and recent ML-based detectors, Marvolo improves accuracy and AUC by up to 5% and 10% respectively, while boosting efficiency by 79x by avoiding redundant computation.
UR - http://www.scopus.com/inward/record.url?scp=85174449842&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85174449842&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-43412-9_16
DO - 10.1007/978-3-031-43412-9_16
M3 - Conference contribution
AN - SCOPUS:85174449842
SN - 9783031434112
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 270
EP - 285
BT - Machine Learning and Knowledge Discovery in Databases
A2 - Koutra, Danai
A2 - Plant, Claudia
A2 - Gomez Rodriguez, Manuel
A2 - Baralis, Elena
A2 - Bonchi, Francesco
PB - Springer Science and Business Media Deutschland GmbH
Y2 - 18 September 2023 through 22 September 2023
ER -