196 research outputs found

    The Unsupervised Acquisition of a Lexicon from Continuous Speech

    We present an unsupervised learning algorithm that acquires a natural-language lexicon from raw speech. The algorithm is based on the optimal encoding of symbol sequences in an MDL framework, and uses a hierarchical representation of language that overcomes many of the problems that have stymied previous grammar-induction procedures. The forward mapping from symbol sequences to the speech stream is modeled using features based on articulatory gestures. We present results on the acquisition of lexicons and language models from raw speech, text, and phonetic transcripts, and demonstrate that our algorithm compares very favorably to other reported results with respect to segmentation performance and statistical efficiency. Comment: 27 page technical report

    Unsupervised Language Acquisition

    This thesis presents a computational theory of unsupervised language acquisition, precisely defining procedures for learning language from ordinary spoken or written utterances, with no explicit help from a teacher. The theory is based heavily on concepts borrowed from machine learning and statistical estimation. In particular, learning takes place by fitting a stochastic, generative model of language to the evidence. Much of the thesis is devoted to explaining conditions that must hold for this general learning strategy to arrive at linguistically desirable grammars. The thesis introduces a variety of technical innovations, among them a common representation for evidence and grammars, and a learning strategy that separates the "content" of linguistic parameters from their representation. Algorithms based on it suffer from few of the search problems that have plagued other computational approaches to language acquisition. The theory has been tested on problems of learning vocabularies and grammars from unsegmented text and continuous speech, and mappings between sound and representations of meaning. It performs extremely well on various objective criteria, acquiring knowledge that causes it to assign almost exactly the same structure to utterances as humans do. This work has application to data compression, language modeling, speech recognition, machine translation, information retrieval, and other tasks that rely on either structural or stochastic descriptions of language. Comment: PhD thesis, 133 pages

    Methods for Parallelizing Search Paths in Phrasing

    Many search problems are commonly solved with combinatoric algorithms that unnecessarily duplicate and serialize work at considerable computational expense. There are techniques available that can eliminate redundant computations and perform remaining operations concurrently, effectively reducing the branching factors of these algorithms. This thesis applies these techniques to the problem of parsing natural language. The result is an efficient programming language that can reduce some of the expense associated with principle-based parsing and other search problems. The language is used to implement various natural language parsers, and the improvements are compared to those that result from implementing more deterministic theories of language processing.

    On Hilberg's Law and Its Links with Guiraud's Law

    Hilberg (1990) supposed that finite-order excess entropy of a random human text is proportional to the square root of the text length. Assuming that Hilberg's hypothesis is true, we derive Guiraud's law, which states that the number of word types in a text is greater than proportional to the square root of the text length. Our derivation is based on a mathematical conjecture in coding theory and on several experiments suggesting that words can be defined approximately as the nonterminals of the shortest context-free grammar for the text. Such an operational definition of words can be applied even to texts deprived of spaces, which do not allow for Mandelbrot's "intermittent silence" explanation of Zipf's and Guiraud's laws. In contrast to Mandelbrot's, our model assumes some probabilistic long-memory effects in human narration and might be capable of explaining Menzerath's law. Comment: To appear in Journal of Quantitative Linguistics

    Combined search for the quarks of a sequential fourth generation

    Results are presented from a search for a fourth generation of quarks produced singly or in pairs in a data set corresponding to an integrated luminosity of 5 inverse femtobarns recorded by the CMS experiment at the LHC in 2011. A novel strategy has been developed for a combined search for quarks of the up and down type in decay channels with at least one isolated muon or electron. Limits on the mass of the fourth-generation quarks and the relevant Cabibbo-Kobayashi-Maskawa matrix elements are derived in the context of a simple extension of the standard model with a sequential fourth generation of fermions. The existence of mass-degenerate fourth-generation quarks with masses below 685 GeV is excluded at 95% confidence level for minimal off-diagonal mixing between the third- and the fourth-generation quarks. With a mass difference of 25 GeV between the quark masses, the obtained limit on the masses of the fourth-generation quarks shifts by about +/- 20 GeV. These results significantly reduce the allowed parameter space for a fourth generation of fermions. Comment: Replaced with published version. Added journal reference and DOI

    Search for supersymmetry in events with b-quark jets and missing transverse energy in pp collisions at 7 TeV

    Results are presented from a search for physics beyond the standard model based on events with large missing transverse energy, at least three jets, and at least one, two, or three b-quark jets. The study is performed using a sample of proton-proton collision data collected at sqrt(s) = 7 TeV with the CMS detector at the LHC in 2011. The integrated luminosity of the sample is 4.98 inverse femtobarns. The observed number of events is found to be consistent with the standard model expectation, which is evaluated using control samples in the data. The results are used to constrain cross sections for the production of supersymmetric particles decaying to b-quark-enriched final states in the context of simplified model spectra. Comment: Submitted to Physical Review