From dirt to shovels: fully automatic tool generation from ad hoc data

Kathleen Fisher, David Walker, Kenny Q. Zhu, Peter White

Research output: Contribution to journal > Article > peer-review

25 Scopus citations

Abstract

An ad hoc data source is any semi-structured data source for which useful data analysis and transformation tools are not readily available. Such data must be queried, transformed and displayed by systems administrators, computational biologists, financial analysts and hosts of others on a regular basis. In this paper, we demonstrate that it is possible to generate a suite of useful data processing tools, including a semi-structured query engine, several format converters, a statistical analyzer and data visualization routines directly from the ad hoc data itself, without any human intervention. The key technical contribution of the work is a multiphase algorithm that automatically infers the structure of an ad hoc data source and produces a format specification in the PADS data description language. Programmers wishing to implement custom data analysis tools can use such descriptions to generate printing and parsing libraries for the data. Alternatively, our software infrastructure will push these descriptions through the PADS compiler, creating format-dependent modules that, when linked with format-independent algorithms for analysis and transformation, result in fully functional tools. We evaluate the performance of our inference algorithm, showing it scales linearly in the size of the training data, completing in seconds, as opposed to the hours or days it takes to write a description by hand. We also evaluate the correctness of the algorithm, demonstrating that generating accurate descriptions often requires less than 5% of the available data.
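
To give a rough feel for what "inferring the structure of an ad hoc data source" can mean, the following is a minimal toy sketch. It is not the multiphase algorithm from the paper and it does not emit PADS syntax. It assumes a hypothetical set of coarse token classes (`Int`, `Word`, `White`), reduces each line to its sequence of classes, and reports a single struct-like shape when all lines agree or a union of shapes otherwise. The names `tokenize`, `shape`, and `infer_description` and the output notation are invented for illustration.

```python
import re
from collections import Counter

# Coarse token classes, tried in order. These classes are an assumption made
# for illustration; they are not the token set used in the paper.
TOKEN_PATTERNS = [
    ("Int", re.compile(r"\d+")),
    ("Word", re.compile(r"[A-Za-z_]+")),
    ("White", re.compile(r"\s+")),
]


def tokenize(line):
    """Split a line into (class, text) pairs; unmatched characters become literals."""
    tokens, pos = [], 0
    while pos < len(line):
        for name, pattern in TOKEN_PATTERNS:
            match = pattern.match(line, pos)
            if match:
                tokens.append((name, match.group()))
                pos = match.end()
                break
        else:
            # No class matched at this position: treat the character as a literal.
            tokens.append(("Lit", line[pos]))
            pos += 1
    return tokens


def shape(tokens):
    """Keep the text of literals, but only the class of every other token."""
    return tuple(f"'{text}'" if name == "Lit" else name for name, text in tokens)


def infer_description(lines):
    """Return an informal struct/union description of the line shapes seen."""
    counts = Counter(shape(tokenize(line)) for line in lines)
    # Most frequent shape first; ties broken by the shape itself for determinism.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    branches = ["struct { " + "; ".join(s) + " }" for s, _ in ordered]
    if len(branches) == 1:
        return branches[0]
    return "union { " + " | ".join(branches) + " }"


sample = ["12 alice", "7 bob"]
print(infer_description(sample))  # struct { Int; White; Word }
```

A real inference system has to cope with far more variation than this (optional fields, nested records, arrays of unknown length), which is presumably why the abstract describes the approach as multiphase; the sketch only shows the basic move from raw lines to a candidate description.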

Original language: English (US)
Pages (from-to): 421-434
Number of pages: 14
Journal: ACM SIGPLAN Notices
Volume: 43
Issue number: 1
State: Published - Jan 2008

All Science Journal Classification (ASJC) codes

  • General Computer Science

Keywords

  • Ad hoc data
  • Data description languages
  • Grammar induction
  • Tool generation
