TY - GEN
T1 - Semantics and algorithms for data-dependent grammars
AU - Jim, Trevor
AU - Mandelbaum, Yitzhak
AU - Walker, David
N1 - Copyright:
Copyright 2010 Elsevier B.V., All rights reserved.
PY - 2010
Y1 - 2010
N2 - We present the design and theory of a new parsing engine, YAKKER, capable of satisfying the many needs of modern programmers and modern data processing applications. In particular, our new parsing engine handles (1) full scannerless context-free grammars with (2) regular expressions as right-hand sides for defining nonterminals. YAKKER also includes (3) facilities for binding variables to intermediate parse results and (4) using such bindings within arbitrary constraints to control parsing. These facilities allow the kind of data-dependent parsing commonly needed in systems applications, particularly those that operate over binary data. In addition, (5) nonterminals may be parameterized by arbitrary values, which gives the system good modularity and abstraction properties in the presence of data-dependent parsing. Finally, (6) legacy parsing libraries,such as sophisticated libraries for dates and times, may be directly incorporated into parser specifications. We illustrate the importance and utility of this rich collection of features by presenting its use on examples ranging from difficult programming language grammars to web server logs to binary data specification. We also show that our grammars have important compositionality properties and explain why such properties areimportant in modern applications such as automatic grammar induction. In terms of technical contributions, we provide a traditional high-level semantics for our new grammar formalization and show how to compile grammars into non deterministic automata. These automata are stack-based, somewhat like conventional push-down automata,but are also equipped with environments to track data-dependent parsing state. We prove the correctness of our translation of data-dependent grammars into these new automata and then show how to implement the automata efficiently using a variation of Earley's parsing algorithm.
AB - We present the design and theory of a new parsing engine, YAKKER, capable of satisfying the many needs of modern programmers and modern data processing applications. In particular, our new parsing engine handles (1) full scannerless context-free grammars with (2) regular expressions as right-hand sides for defining nonterminals. YAKKER also includes (3) facilities for binding variables to intermediate parse results and (4) using such bindings within arbitrary constraints to control parsing. These facilities allow the kind of data-dependent parsing commonly needed in systems applications, particularly those that operate over binary data. In addition, (5) nonterminals may be parameterized by arbitrary values, which gives the system good modularity and abstraction properties in the presence of data-dependent parsing. Finally, (6) legacy parsing libraries,such as sophisticated libraries for dates and times, may be directly incorporated into parser specifications. We illustrate the importance and utility of this rich collection of features by presenting its use on examples ranging from difficult programming language grammars to web server logs to binary data specification. We also show that our grammars have important compositionality properties and explain why such properties areimportant in modern applications such as automatic grammar induction. In terms of technical contributions, we provide a traditional high-level semantics for our new grammar formalization and show how to compile grammars into non deterministic automata. These automata are stack-based, somewhat like conventional push-down automata,but are also equipped with environments to track data-dependent parsing state. We prove the correctness of our translation of data-dependent grammars into these new automata and then show how to implement the automata efficiently using a variation of Earley's parsing algorithm.
KW - Ambiguous grammars
KW - Automata
KW - Context-sensitive grammars
KW - Data-dependent grammars
KW - EBNF
KW - Earley parsing
KW - L-attributed grammars
KW - Regular expressions
KW - Regular right-sides
KW - Scannerless parsing
KW - Semantic predicates
KW - Transducers
UR - http://www.scopus.com/inward/record.url?scp=77950871101&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77950871101&partnerID=8YFLogxK
U2 - 10.1145/1706299.1706347
DO - 10.1145/1706299.1706347
M3 - Conference contribution
AN - SCOPUS:77950871101
SN - 9781605584799
T3 - Conference Record of the Annual ACM Symposium on Principles of Programming Languages
SP - 417
EP - 430
BT - POPL'10 - Proceedings of the 37th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages
T2 - 37th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL'10
Y2 - 17 January 2010 through 23 January 2010
ER -