De-anonymizing programmers via code stylometry

Aylin Caliskan-Islam, Richard Harang, Andrew Liu, Arvind Narayanan, Clare Voss, Fabian Yamaguchi, Rachel Greenstadt

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

167 Scopus citations

Abstract

Source code authorship attribution is a significant privacy threat to anonymous code contributors. However, it may also enable attribution of successful attacks from code left behind on an infected system, or aid in resolving copyright, copyleft, and plagiarism issues in the programming fields. In this work, we investigate machine learning methods to de-anonymize source code authors of C/C++ using coding style. Our Code Stylometry Feature Set is a novel representation of coding style that uses properties derived from abstract syntax trees. Our random forest and abstract syntax tree-based approach attributes more authors (1,600 and 250) with significantly higher accuracy (94% and 98%, respectively) on a larger data set (Google Code Jam) than has been previously achieved. Furthermore, these novel features are robust, difficult to obfuscate, and can be used in other programming languages, such as Python. We also find that (i) code resulting from difficult programming tasks is easier to attribute than code from easier tasks and (ii) skilled programmers (who can complete the more difficult tasks) are easier to attribute than less skilled programmers.
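
As a rough illustration of what "properties derived from abstract syntax trees" can look like, the following minimal sketch computes AST node-type frequencies and maximum tree depth for a Python snippet using the standard `ast` module. This is not the paper's Code Stylometry Feature Set (which targets C/C++ and is considerably richer); the function name `ast_style_features` and the choice of features are assumptions made for this example, and the classification step is omitted.

```python
import ast
from collections import Counter


def ast_style_features(source: str) -> dict:
    """Return simple syntactic features for one Python source string.

    Hypothetical helper for illustration only: relative frequency of
    each AST node type, plus the maximum depth of the tree.
    """
    tree = ast.parse(source)

    # Count how often each AST node type occurs in the whole tree.
    node_counts = Counter(type(node).__name__ for node in ast.walk(tree))
    total = sum(node_counts.values())

    def depth(node: ast.AST) -> int:
        # Depth of a node is 1 plus the depth of its deepest child.
        children = list(ast.iter_child_nodes(node))
        return 1 + max((depth(child) for child in children), default=0)

    # Normalize counts so that programs of different length are comparable.
    features = {f"freq_{name}": count / total for name, count in node_counts.items()}
    features["max_depth"] = depth(tree)
    return features


# Two functionally equivalent snippets written in different styles.
sample_a = "for i in range(3):\n    print(i)\n"
sample_b = "i = 0\nwhile i < 3:\n    print(i)\n    i += 1\n"

features_a = ast_style_features(sample_a)
features_b = ast_style_features(sample_b)
```

The two snippets print the same values but yield different feature dictionaries (one contains a `For` node, the other a `While` node), which is the kind of stylistic signal a classifier can learn from. In a fuller pipeline, such dictionaries would be turned into fixed-length vectors and passed to a random forest classifier, for example scikit-learn's `RandomForestClassifier`.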

Original language: English (US)
Title of host publication: Proceedings of the 24th USENIX Security Symposium
Publisher: USENIX Association
Pages: 255-270
Number of pages: 16
ISBN (Electronic): 9781931971232
State: Published - Jan 1 2015
Event: 24th USENIX Security Symposium - Washington, United States
Duration: Aug 12 2015 to Aug 14 2015

Publication series

Name: Proceedings of the 24th USENIX Security Symposium

Conference

Conference: 24th USENIX Security Symposium
Country/Territory: United States
City: Washington
Period: 8/12/15 to 8/14/15

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Information Systems
  • Safety, Risk, Reliability and Quality
