The Unsupervised Acquisition of a Lexicon from Continuous Speech
We present an unsupervised learning algorithm that acquires a
natural-language lexicon from raw speech. The algorithm is based on the optimal
encoding of symbol sequences in an MDL framework, and uses a hierarchical
representation of language that overcomes many of the problems that have
stymied previous grammar-induction procedures. The forward mapping from symbol
sequences to the speech stream is modeled using features based on articulatory
gestures. We present results on the acquisition of lexicons and language models
from raw speech, text, and phonetic transcripts, and demonstrate that our
algorithm compares very favorably to other reported results with respect to
segmentation performance and statistical efficiency. Comment: 27 page technical report
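As a rough illustration of what "optimal encoding of symbol sequences in an MDL framework" can mean, the sketch below scores two candidate segmentations of the same unsegmented text with a toy two-part code: the bits needed to spell out the lexicon plus the bits needed to encode the text as a sequence of lexicon entries. This is not the algorithm of the paper. The function name `description_length`, the uniform character code, the unigram model, and the toy text are all assumptions made for the example, and the hierarchical representation described in the abstract is ignored.

```python
import math
from collections import Counter


def description_length(tokens, alphabet_size=26):
    """Toy two-part MDL cost, in bits, of a segmentation (hypothetical helper).

    Part 1: cost of listing every distinct token, spelled character by
            character with a uniform code over the alphabet plus one
            end-of-entry symbol.
    Part 2: cost of the token sequence under the empirical unigram
            distribution, i.e. -log2 p(token) bits per occurrence.
    """
    counts = Counter(tokens)
    total = sum(counts.values())

    bits_per_char = math.log2(alphabet_size + 1)
    lexicon_bits = sum((len(entry) + 1) * bits_per_char for entry in counts)

    data_bits = -sum(c * math.log2(c / total) for c in counts.values())
    return lexicon_bits + data_bits


# The same unsegmented text, split two different ways.
text = "thedogthecat" * 20
as_words = ["the", "dog", "the", "cat"] * 20
as_chars = list(text)

print(f"word lexicon:      {description_length(as_words):.1f} bits")
print(f"character lexicon: {description_length(as_chars):.1f} bits")
```

On this toy text the word-level lexicon gives the shorter total code, which is the kind of signal an MDL learner would use to prefer it. A code this simple is also minimized by the degenerate lexicon that stores the whole repeated phrase as a single entry, so it should be read only as a demonstration of the cost function, not of a workable search procedure.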
Unsupervised Language Acquisition
This thesis presents a computational theory of unsupervised language
acquisition, precisely defining procedures for learning language from ordinary
spoken or written utterances, with no explicit help from a teacher. The theory
is based heavily on concepts borrowed from machine learning and statistical
estimation. In particular, learning takes place by fitting a stochastic,
generative model of language to the evidence. Much of the thesis is devoted to
explaining conditions that must hold for this general learning strategy to
arrive at linguistically desirable grammars. The thesis introduces a variety of
technical innovations, among them a common representation for evidence and
grammars, and a learning strategy that separates the ``content'' of linguistic
parameters from their representation. Algorithms based on it suffer from few of
the search problems that have plagued other computational approaches to
language acquisition.
The theory has been tested on problems of learning vocabularies and grammars
from unsegmented text and continuous speech, and mappings between sound and
representations of meaning. It performs extremely well on various objective
criteria, acquiring knowledge that causes it to assign almost exactly the same
structure to utterances as humans do. This work has application to data
compression, language modeling, speech recognition, machine translation,
information retrieval, and other tasks that rely on either structural or
stochastic descriptions of language. Comment: PhD thesis, 133 pages
Methods for Parallelizing Search Paths in Phrasing
Many search problems are commonly solved with combinatoric algorithms that unnecessarily duplicate and serialize work at considerable computational expense. There are techniques available that can eliminate redundant computations and perform remaining operations concurrently, effectively reducing the branching factors of these algorithms. This thesis applies these techniques to the problem of parsing natural language. The result is an efficient programming language that can reduce some of the expense associated with principle-based parsing and other search problems. The language is used to implement various natural language parsers, and the improvements are compared to those that result from implementing more deterministic theories of language processing.
On Hilberg's Law and Its Links with Guiraud's Law
Hilberg (1990) supposed that finite-order excess entropy of a random human
text is proportional to the square root of the text length. Assuming that
Hilberg's hypothesis is true, we derive Guiraud's law, which states that the
number of word types in a text is greater than proportional to the square root
of the text length. Our derivation is based on some mathematical conjecture in
coding theory and on several experiments suggesting that words can be defined
approximately as the nonterminals of the shortest context-free grammar for the
text. Such an operational definition of words can be applied even to texts
deprived of spaces, which do not allow for Mandelbrot's ``intermittent
silence'' explanation of Zipf's and Guiraud's laws. In contrast to
Mandelbrot's, our model assumes some probabilistic long-memory effects in human
narration and might be capable of explaining Menzerath's law. Comment: To appear in Journal of Quantitative Linguistics
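The two laws named in this abstract can be written compactly as follows. The symbols are assumptions introduced here, not notation taken from the paper: n is the text length, E(n) the finite-order excess entropy, V(n) the number of word types, and c an unspecified positive constant. The implication is the one the abstract claims, and it holds only under the coding-theory conjecture and the experimental evidence the abstract mentions.

```latex
\[
  \underbrace{E(n) \propto \sqrt{n}}_{\text{Hilberg's hypothesis}}
  \qquad\Longrightarrow\qquad
  \underbrace{V(n) \gtrsim c\,\sqrt{n}}_{\text{Guiraud's law}}
\]
```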
Search for physics beyond the standard model in events with τ leptons, jets, and large transverse momentum imbalance in pp collisions at sqrt(s) = 7 TeV
A search for physics beyond the standard model is performed with events having one or more hadronically decaying τ leptons, highly energetic jets, and large transverse momentum imbalance. The data sample corresponds to an integrated luminosity of 4.98 fb^-1 of proton-proton collisions at sqrt(s) = 7 TeV collected with the CMS detector at the LHC in 2011. The number of observed events is consistent with predictions for standard model processes. Lower limits on the mass of the gluino in supersymmetric models are determined.
Combined search for the quarks of a sequential fourth generation
Results are presented from a search for a fourth generation of quarks
produced singly or in pairs in a data set corresponding to an integrated
luminosity of 5 inverse femtobarns recorded by the CMS experiment at the LHC in
2011. A novel strategy has been developed for a combined search for quarks of
the up and down type in decay channels with at least one isolated muon or
electron. Limits on the mass of the fourth-generation quarks and the relevant
Cabibbo-Kobayashi-Maskawa matrix elements are derived in the context of a
simple extension of the standard model with a sequential fourth generation of
fermions. The existence of mass-degenerate fourth-generation quarks with masses
below 685 GeV is excluded at 95% confidence level for minimal off-diagonal
mixing between the third- and the fourth-generation quarks. With a mass
difference of 25 GeV between the quark masses, the obtained limit on the masses
of the fourth-generation quarks shifts by about +/- 20 GeV. These results
significantly reduce the allowed parameter space for a fourth generation of
fermions. Comment: Replaced with published version. Added journal reference and DOI
Search for supersymmetry in events with b-quark jets and missing transverse energy in pp collisions at 7 TeV
Results are presented from a search for physics beyond the standard model
based on events with large missing transverse energy, at least three jets, and
at least one, two, or three b-quark jets. The study is performed using a sample
of proton-proton collision data collected at sqrt(s) = 7 TeV with the CMS
detector at the LHC in 2011. The integrated luminosity of the sample is 4.98
inverse femtobarns. The observed number of events is found to be consistent
with the standard model expectation, which is evaluated using control samples
in the data. The results are used to constrain cross sections for the
production of supersymmetric particles decaying to b-quark-enriched final
states in the context of simplified model spectra. Comment: Submitted to Physical Review