Ad hoc data and the token ambiguity problem

Qian Xi, Kathleen Fisher, David Walker, Kenny Q. Zhu

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

2 Scopus citations

Abstract

PADS is a declarative language used to describe the syntax and semantic properties of ad hoc data sources such as financial transactions, server logs and scientific data sets. The PADS compiler reads these descriptions and generates a suite of useful data processing tools such as format translators, parsers, printers and even a query engine, all customized to the ad hoc data format in question. Recently, however, to further improve the productivity of programmers that manage ad hoc data sources, we have turned to using PADS as an intermediate language in a system that first infers a PADS description directly from example data and then passes that description to the original compiler for tool generation. A key subproblem in the inference engine is the token ambiguity problem - the problem of determining which substrings in the example data correspond to complex tokens such as dates, URLs, or comments. In order to solve the token ambiguity problem, the paper studies the relative effectiveness of three different statistical models for tokenizing ad hoc data. It also shows how to incorporate these models into a general and effective format inference algorithm. In addition to using a declarative language (PADS) as a key intermediate form, we have implemented the system as a whole in ML.

Original language: English (US)
Title of host publication: Practical Aspects of Declarative Languages - 11th International Symposium, PADL 2009, Proceedings
Pages: 91-106
Number of pages: 16
State: Published - 2009
Event: 11th International Symposium on Practical Aspects of Declarative Languages, PADL 2009 - Savannah, GA, United States
Duration: Jan 19 2009 to Jan 20 2009

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 5418 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • General Computer Science
