TY - GEN
T1 - From dirt to shovels
T2 - 35th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL'08
AU - Fisher, Kathleen
AU - Walker, David
AU - Zhu, Kenny Q.
AU - White, Peter
PY - 2008
Y1 - 2008
N2 - An ad hoc data source is any semistructured data source for which useful data analysis and transformation tools are not readily available. Such data must be queried, transformed and displayed by systems administrators, computational biologists, financial analysts and hosts of others on a regular basis. In this paper, we demonstrate that it is possible to generate a suite of useful data processing tools, including a semi-structured query engine, several format converters, a statistical analyzer and data visualization routines directly from the ad hoc data itself, without any human intervention. The key technical contribution of the work is a multi-phase algorithm that automatically infers the structure of an ad hoc data source and produces a format specification in the PADS data description language. Programmers wishing to implement custom data analysis tools can use such descriptions to generate printing and parsing libraries for the data. Alternatively, our software infrastructure will push these descriptions through the PADS compiler, creating format-dependent modules that, when linked with format-independent algorithms for analysis and transformation, result in fully functional tools. We evaluate the performance of our inference algorithm, showing it scales linearly in the size of the training data - completing in seconds, as opposed to the hours or days it takes to write a description by hand. We also evaluate the correctness of the algorithm, demonstrating that generating accurate descriptions often requires less than 5% of the available data.
AB - An ad hoc data source is any semistructured data source for which useful data analysis and transformation tools are not readily available. Such data must be queried, transformed and displayed by systems administrators, computational biologists, financial analysts and hosts of others on a regular basis. In this paper, we demonstrate that it is possible to generate a suite of useful data processing tools, including a semi-structured query engine, several format converters, a statistical analyzer and data visualization routines directly from the ad hoc data itself, without any human intervention. The key technical contribution of the work is a multi-phase algorithm that automatically infers the structure of an ad hoc data source and produces a format specification in the PADS data description language. Programmers wishing to implement custom data analysis tools can use such descriptions to generate printing and parsing libraries for the data. Alternatively, our software infrastructure will push these descriptions through the PADS compiler, creating format-dependent modules that, when linked with format-independent algorithms for analysis and transformation, result in fully functional tools. We evaluate the performance of our inference algorithm, showing it scales linearly in the size of the training data - completing in seconds, as opposed to the hours or days it takes to write a description by hand. We also evaluate the correctness of the algorithm, demonstrating that generating accurate descriptions often requires less than 5% of the available data.
KW - ad hoc data
KW - data description languages
KW - grammar induction
KW - tool generation
UR - http://www.scopus.com/inward/record.url?scp=84865636979&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84865636979&partnerID=8YFLogxK
U2 - 10.1145/1328438.1328488
DO - 10.1145/1328438.1328488
M3 - Conference contribution
AN - SCOPUS:84865636979
SN - 9781595936899
T3 - Conference Record of the Annual ACM Symposium on Principles of Programming Languages
SP - 421
EP - 434
BT - POPL'08 - Proceedings of the 35th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages
Y2 - 7 January 2008 through 12 January 2008
ER -