Efficient learning of context-free grammars from positive structural examples
In this paper, we introduce a new normal form for context-free grammars, called reversible context-free grammars, for the problem of learning context-free grammars from positive-only examples. A context-free grammar G = (N, Σ, P, S) is said to be reversible if (1) A → α and B → α in P implies A = B and (2) A → αBβ and A → αCβ in P implies B = C. We show that the class of reversible context-free grammars can be identified in the limit from positive samples of structural descriptions, and that there exists an efficient algorithm to identify them from such samples, where a structural description of a context-free grammar is an unlabelled derivation tree of the grammar. This implies that if positive structural examples of a reversible context-free grammar for the target language are available to the learning algorithm, the full class of context-free languages can be learned efficiently from positive samples.
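As a minimal sketch of the two conditions in this definition (the production encoding and names below are illustrative assumptions, not from the paper), a reversibility check over a finite set of productions might look like:

```python
# Minimal sketch: check the two reversibility conditions from the definition
# above on a finite set of productions. A production is a (lhs, rhs) pair with
# rhs a tuple of symbols; `nonterminals` says which symbols are nonterminals.
# The encoding and names are illustrative assumptions, not from the paper.

def is_reversible(productions, nonterminals):
    # Condition (1): A -> alpha and B -> alpha in P implies A = B,
    # i.e. no two distinct nonterminals share a right-hand side.
    rhs_to_lhs = {}
    for lhs, rhs in productions:
        if rhs in rhs_to_lhs and rhs_to_lhs[rhs] != lhs:
            return False
        rhs_to_lhs[rhs] = lhs

    # Condition (2): A -> alpha B beta and A -> alpha C beta in P implies
    # B = C, i.e. the context (A, alpha, beta) determines the nonterminal.
    contexts = {}
    for lhs, rhs in productions:
        for i, sym in enumerate(rhs):
            if sym in nonterminals:
                key = (lhs, rhs[:i], rhs[i + 1:])  # (A, alpha, beta)
                if key in contexts and contexts[key] != sym:
                    return False
                contexts[key] = sym
    return True

# Example: S -> aSb | ab passes both conditions.
print(is_reversible([("S", ("a", "S", "b")), ("S", ("a", "b"))], {"S"}))  # True
```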
Synthesizing Program Input Grammars
We present an algorithm for synthesizing a context-free grammar encoding the
language of valid program inputs from a set of input examples and blackbox
access to the program. Our algorithm addresses shortcomings of existing grammar
inference algorithms, which both severely overgeneralize and are prohibitively
slow. Our implementation, GLADE, leverages the grammar synthesized by our
algorithm to fuzz test programs with structured inputs. We show that GLADE
substantially increases the incremental coverage on valid inputs compared to
two baseline fuzzers.
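The abstract does not detail the fuzzing step; as a rough illustration of how a synthesized grammar can drive structured input generation, here is a minimal random-expansion sketch (the toy grammar, rule encoding, and depth cap are assumptions, not GLADE's actual representation or algorithm):

```python
import random

# Rough sketch of grammar-based input generation: expand nonterminals by
# picking random productions, with a depth cap biasing toward short rules so
# generation terminates. This only illustrates fuzzing from a synthesized
# grammar; it is not GLADE's algorithm or data format.

GRAMMAR = {
    "EXPR": [["NUM"], ["EXPR", "+", "EXPR"], ["(", "EXPR", ")"]],
    "NUM": [["0"], ["1"], ["2"]],
}

def generate(symbol, depth=0, max_depth=8):
    if symbol not in GRAMMAR:            # terminal symbol: emit as-is
        return symbol
    rules = GRAMMAR[symbol]
    if depth >= max_depth:               # past the cap, take the shortest rule
        rules = [min(rules, key=len)]
    return "".join(generate(s, depth + 1, max_depth)
                   for s in random.choice(rules))

for _ in range(5):
    print(generate("EXPR"))              # candidate structured inputs to fuzz with
```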
Inducing Probabilistic Grammars by Bayesian Model Merging
We describe a framework for inducing probabilistic grammars from corpora of
positive samples. First, samples are incorporated by adding ad hoc rules
to a working grammar; subsequently, elements of the model (such as states or
nonterminals) are merged to achieve generalization and a more compact
representation. The choice of what to merge and when to stop is governed by the
Bayesian posterior probability of the grammar given the data, which formalizes
a trade-off between a close fit to the data and a default preference for
simpler models ("Occam's Razor"). The general scheme is illustrated using three
types of probabilistic grammars: Hidden Markov models, class-based n-grams,
and stochastic context-free grammars.
Comment: To appear in Grammatical Inference and Applications, Second International Colloquium on Grammatical Inference; Springer Verlag, 1994. 13 pages.
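As a loose sketch of the search loop described here, a posterior-guided greedy merge might proceed as below; the prior, the likelihood, and the `grammar` interface (`rules`, `nonterminals`, `merge`, `log_likelihood`) are placeholders I assume for illustration, not the paper's actual definitions.

```python
import itertools

# Loose sketch of Bayesian model merging: repeatedly merge the pair of
# nonterminals (or states) that most improves the posterior
# log P(G) + log P(D | G), and stop when no merge helps. The prior, the
# likelihood, and the `grammar` interface are placeholders, not the paper's
# actual definitions.

def log_prior(grammar):
    # Occam-style preference for compact models: penalize the rule count.
    return -len(grammar.rules)

def log_posterior(grammar, data):
    return log_prior(grammar) + grammar.log_likelihood(data)

def model_merging(grammar, data):
    best = log_posterior(grammar, data)
    improved = True
    while improved:
        improved = False
        for a, b in itertools.combinations(grammar.nonterminals, 2):
            candidate = grammar.merge(a, b)    # collapse a and b into one symbol
            score = log_posterior(candidate, data)
            if score > best:
                grammar, best, improved = candidate, score, True
                break                          # greedy: restart with the merged model
    return grammar
```

The paper's actual priors over grammar structure and parameters are more refined than the rule-count penalty used in this sketch.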
Wrapper Maintenance: A Machine Learning Approach
The proliferation of online information sources has led to an increased use
of wrappers for extracting data from Web sources. While most of the previous
research has focused on quick and efficient generation of wrappers, the
development of tools for wrapper maintenance has received less attention. This
is an important research problem because Web sources often change in ways that
prevent the wrappers from extracting data correctly. We present an efficient
algorithm that learns structural information about data from positive examples
alone. We describe how this information can be used for two wrapper maintenance
applications: wrapper verification and reinduction. The wrapper verification
system detects when a wrapper is not extracting correct data, usually because
the Web source has changed its format. The reinduction algorithm automatically
recovers from changes in the Web source by identifying data on Web pages so
that a new wrapper may be generated for this source. To validate our approach,
we monitored 27 wrappers over a period of a year. The verification algorithm
correctly discovered 35 of the 37 wrapper changes, and made 16 mistakes,
resulting in precision of 0.73 and recall of 0.95. We validated the reinduction
algorithm on ten Web sources. We were able to successfully reinduce the
wrappers, obtaining precision and recall values of 0.90 and 0.80 on the data
extraction task.
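As a minimal illustration of verification driven by structural information learned from positive examples alone (the token classes, pattern representation, names, and threshold below are assumptions, not the paper's algorithm), one might check newly extracted values against token-type patterns learned from earlier, known-good extractions:

```python
import re

# Minimal sketch of pattern-based wrapper verification: learn coarse
# token-type descriptions of a field from positive examples alone, then flag
# extractions that no longer match them. The token classes, names, and
# threshold are illustrative assumptions, not the paper's algorithm.

TOKEN_CLASSES = [
    ("NUMBER", re.compile(r"\d+$")),
    ("CAPITALIZED", re.compile(r"[A-Z][a-z]+$")),
    ("ALLCAPS", re.compile(r"[A-Z]+$")),
    ("ALPHA", re.compile(r"[A-Za-z]+$")),
]

def token_type(token):
    for name, pattern in TOKEN_CLASSES:
        if pattern.match(token):
            return name
    return "OTHER"

def learn_patterns(examples):
    # Collect the token-type sequences observed in the positive examples.
    return {tuple(token_type(t) for t in ex.split()) for ex in examples}

def verify(extracted_values, patterns, threshold=0.8):
    # The wrapper is considered healthy if most extracted values still match
    # one of the learned patterns.
    ok = sum(tuple(token_type(t) for t in v.split()) in patterns
             for v in extracted_values)
    return ok / max(len(extracted_values), 1) >= threshold

prices = ["USD 120", "USD 95", "USD 300"]
patterns = learn_patterns(prices)                     # {("ALLCAPS", "NUMBER")}
print(verify(["USD 110", "USD 42"], patterns))        # True: format unchanged
print(verify(["page not found", "error"], patterns))  # False: likely format change
```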
Overfitting in Synthesis: Theory and Practice (Extended Version)
In syntax-guided synthesis (SyGuS), a synthesizer's goal is to automatically
generate a program belonging to a grammar of possible implementations that
meets a logical specification. We investigate a common limitation across
state-of-the-art SyGuS tools that perform counterexample-guided inductive
synthesis (CEGIS). We empirically observe that as the expressiveness of the
provided grammar increases, the performance of these tools degrades
significantly.
We claim that this degradation is not only due to a larger search space, but
also due to overfitting. We formally define this phenomenon and prove
no-free-lunch theorems for SyGuS, which reveal a fundamental tradeoff between
synthesizer performance and grammar expressiveness.
A standard approach to mitigate overfitting in machine learning is to run
multiple learners with varying expressiveness in parallel. We demonstrate that
this insight can immediately benefit existing SyGuS tools. We also propose a
novel single-threaded technique called hybrid enumeration that interleaves
different grammars and outperforms the winner of the 2018 SyGuS competition
(Inv track), solving more problems and achieving a mean speedup.
Comment: 24 pages (5 pages of appendices), 7 figures, includes proofs of theorems.
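The hybrid enumeration technique itself is more involved; as a minimal illustration of the underlying idea of interleaving searches over grammars of varying expressiveness (the enumerator interface and the toy specification below are assumptions, not the paper's algorithm or the SyGuS tool interfaces):

```python
import itertools

# Minimal sketch of interleaving learners of varying expressiveness:
# round-robin over candidate enumerators, ordered from least to most
# expressive grammar, returning the first candidate that satisfies the
# specification. This illustrates the idea only; it is not the paper's
# hybrid enumeration algorithm.

def interleaved_search(enumerators, satisfies_spec, max_steps=100000):
    pools = [iter(e) for e in enumerators]
    for _ in range(max_steps):
        for pool in pools:
            candidate = next(pool, None)       # None once a pool is exhausted
            if candidate is not None and satisfies_spec(candidate):
                return candidate
    return None

# Toy usage: synthesize f(x) = 2x from a constants-only "grammar" and an
# affine-terms "grammar"; the spec and enumerators here are illustrative.
spec = lambda f: all(f(x) == 2 * x for x in range(5))
constants = ((lambda x, c=c: c) for c in itertools.count())
affine = ((lambda x, a=a, b=b: a * x + b)
          for a in range(10) for b in range(10))
print(interleaved_search([constants, affine], spec)(3))  # prints 6
```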