11,474 research outputs found
The Unsupervised Acquisition of a Lexicon from Continuous Speech
We present an unsupervised learning algorithm that acquires a
natural-language lexicon from raw speech. The algorithm is based on the optimal
encoding of symbol sequences in an MDL framework, and uses a hierarchical
representation of language that overcomes many of the problems that have
stymied previous grammar-induction procedures. The forward mapping from symbol
sequences to the speech stream is modeled using features based on articulatory
gestures. We present results on the acquisition of lexicons and language models
from raw speech, text, and phonetic transcripts, and demonstrate that our
algorithm compares very favorably to other reported results with respect to
segmentation performance and statistical efficiency.Comment: 27 page technical repor
Synthesizing Program Input Grammars
We present an algorithm for synthesizing a context-free grammar encoding the
language of valid program inputs from a set of input examples and blackbox
access to the program. Our algorithm addresses shortcomings of existing grammar
inference algorithms, which both severely overgeneralize and are prohibitively
slow. Our implementation, GLADE, leverages the grammar synthesized by our
algorithm to fuzz test programs with structured inputs. We show that GLADE
substantially increases the incremental coverage on valid inputs compared to
two baseline fuzzers
- …