Marvolo: Programmatic Data Augmentation for Deep Malware Detection

Mike Wong, Edward Raff, James Holt, Ravi Netravali

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Data acquisition for ML-driven malware detection is challenging. While large commercial datasets exist, they are prohibitively expensive. On the other hand, an entity (e.g., a bank or government) may be targeted with unique malware, but the data samples available will never be sufficient to train a bespoke ML-based detector. While data augmentation has been a key component in improving deep learning models by providing requisite diversity for generalization, it has proven far more challenging for malware detection. The main challenges are that (1) determining the augmentations to make is not straightforward, (2) operations are on binaries rather than source code (which is not available), complicating correctness and understanding, and (3) labeling new files mandates expensive binary reverse engineering. We present Marvolo for creating realistic, semantics-preserving transformations that mimic the code alterations made by malware authors in practice, allowing us to generate augmented data on raw binary files. This also enables Marvolo to safely propagate labels to newly generated data. Across several malware datasets and recent ML-based detectors, Marvolo improves accuracy and AUC by up to 5% and 10%, respectively, while boosting efficiency by 79x by avoiding redundant computation.

Original language: English (US)
Title of host publication: Machine Learning and Knowledge Discovery in Databases
Subtitle of host publication: Research Track - European Conference, ECML PKDD 2023, Proceedings
Editors: Danai Koutra, Claudia Plant, Manuel Gomez Rodriguez, Elena Baralis, Francesco Bonchi
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 270-285
Number of pages: 16
ISBN (Print): 9783031434112
State: Published - 2023
Event: European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML PKDD 2023 - Turin, Italy
Duration: Sep 18 2023 → Sep 22 2023

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 14169 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML PKDD 2023
Country/Territory: Italy
City: Turin
Period: 9/18/23 → 9/22/23

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • General Computer Science
