TY - GEN
T1 - Ad hoc data and the token ambiguity problem
AU - Xi, Qian
AU - Fisher, Kathleen
AU - Walker, David
AU - Zhu, Kenny Q.
PY - 2009
Y1 - 2009
N2 - PADS is a declarative language used to describe the syntax and semantic properties of ad hoc data sources such as financial transactions, server logs and scientific data sets. The PADS compiler reads these descriptions and generates a suite of useful data processing tools such as format translators, parsers, printers and even a query engine, all customized to the ad hoc data format in question. Recently, however, to further improve the productivity of programmers that manage ad hoc data sources, we have turned to using PADS as an intermediate language in a system that first infers a PADS description directly from example data and then passes that description to the original compiler for tool generation. A key subproblem in the inference engine is the token ambiguity problem - the problem of determining which substrings in the example data correspond to complex tokens such as dates, URLs, or comments. In order to solve the token ambiguity problem, the paper studies the relative effectiveness of three different statistical models for tokenizing ad hoc data. It also shows how to incorporate these models into a general and effective format inference algorithm. In addition to using a declarative language (PADS) as a key intermediate form, we have implemented the system as a whole in ML.
AB - PADS is a declarative language used to describe the syntax and semantic properties of ad hoc data sources such as financial transactions, server logs and scientific data sets. The PADS compiler reads these descriptions and generates a suite of useful data processing tools such as format translators, parsers, printers and even a query engine, all customized to the ad hoc data format in question. Recently, however, to further improve the productivity of programmers that manage ad hoc data sources, we have turned to using PADS as an intermediate language in a system that first infers a PADS description directly from example data and then passes that description to the original compiler for tool generation. A key subproblem in the inference engine is the token ambiguity problem - the problem of determining which substrings in the example data correspond to complex tokens such as dates, URLs, or comments. In order to solve the token ambiguity problem, the paper studies the relative effectiveness of three different statistical models for tokenizing ad hoc data. It also shows how to incorporate these models into a general and effective format inference algorithm. In addition to using a declarative language (PADS) as a key intermediate form, we have implemented the system as a whole in ML.
UR - http://www.scopus.com/inward/record.url?scp=70350686658&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=70350686658&partnerID=8YFLogxK
U2 - 10.1007/978-3-540-92995-6_7
DO - 10.1007/978-3-540-92995-6_7
M3 - Conference contribution
AN - SCOPUS:70350686658
SN - 3540929940
SN - 9783540929949
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 91
EP - 106
BT - Practical Aspects of Declarative Languages - 11th International Symposium, PADL 2009, Proceedings
T2 - 11th International Symposium on Practical Aspects of Declarative Languages, PADL 2009
Y2 - 19 January 2009 through 20 January 2009
ER -