190 research outputs found

    Grasp: Randomised Semiring Parsing

    We present a suite of algorithms for inference tasks over (finite and infinite) context-free sets. For generality and clarity, we have chosen the framework of semiring parsing, with support for the most common semirings (e.g. Forest, Viterbi, k-best and Inside). We see parsing from the more general viewpoint of weighted deduction, allowing for arbitrary weighted finite-state input, and provide implementations of both bottom-up (CKY-inspired) and top-down (Earley-inspired) algorithms. We focus on approximate inference by Monte Carlo methods and provide implementations of ancestral sampling and slice sampling. In principle, sampling methods can deal with models whose independence assumptions are weaker than what is feasible by standard dynamic programming. We envision applications such as monolingual constituency parsing, synchronous parsing, context-free models of reordering for machine translation, and machine translation decoding.
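The semiring view described in the abstract above can be illustrated with a minimal sketch. This is not Grasp's code or API: the `Semiring` class, the toy grammar (`BINARY`, `LEXICAL`) and the `cky` function are hypothetical names chosen for illustration, and the grammar is assumed to be in Chomsky normal form with plain string input rather than weighted finite-state input. The point is only that one bottom-up (CKY-style) chart algorithm yields the total probability of all derivations or the probability of the best derivation depending on which semiring is plugged in.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Semiring:
    """A semiring: an additive and a multiplicative operation with identities."""
    plus: Callable[[float, float], float]
    times: Callable[[float, float], float]
    zero: float
    one: float


# Inside semiring: sums over derivations (total probability of the input).
INSIDE = Semiring(plus=lambda a, b: a + b, times=lambda a, b: a * b, zero=0.0, one=1.0)
# Viterbi semiring: maximises over derivations (probability of the best one).
VITERBI = Semiring(plus=max, times=lambda a, b: a * b, zero=0.0, one=1.0)

# Hypothetical toy PCFG in Chomsky normal form (an assumption for illustration):
#   S -> S S  with probability 0.4
#   S -> 'a'  with probability 0.6
BINARY = {("S", ("S", "S")): 0.4}
LEXICAL = {("S", "a"): 0.6}


def cky(words, semiring, goal="S"):
    """Bottom-up chart parsing; returns the semiring value of `goal` over the whole input."""
    n = len(words)
    # chart[i, k, X] holds the semiring value of nonterminal X spanning words[i:k].
    chart = defaultdict(lambda: semiring.zero)

    # Width-1 spans come from lexical rules.
    for i, word in enumerate(words):
        for (lhs, terminal), prob in LEXICAL.items():
            if terminal == word:
                chart[i, i + 1, lhs] = semiring.plus(chart[i, i + 1, lhs], prob)

    # Wider spans combine two adjacent sub-spans through a binary rule.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            for j in range(i + 1, k):
                for (lhs, (left, right)), prob in BINARY.items():
                    children = semiring.times(chart[i, j, left], chart[j, k, right])
                    score = semiring.times(prob, children)
                    chart[i, k, lhs] = semiring.plus(chart[i, k, lhs], score)

    return chart[0, n, goal]


if __name__ == "__main__":
    sentence = ["a", "a", "a"]
    print("Inside :", cky(sentence, INSIDE))
    print("Viterbi:", cky(sentence, VITERBI))
```

Under this toy grammar the three-word input has two derivations of equal probability, so the Inside value is twice the Viterbi value (up to floating-point rounding). Swapping in a different semiring changes what the chart computes without touching the parsing loop, which is the design choice the abstract emphasises.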

    Unsupervised Structure Induction for Natural Language Processing

    Ph.D., Doctor of Philosophy

    Probabilistic Modelling of Morphologically Rich Languages

    This thesis investigates how the sub-structure of words can be accounted for in probabilistic models of language. Such models play an important role in natural language processing tasks such as translation or speech recognition, but often rely on the simplistic assumption that words are opaque symbols. This assumption does not fit morphologically complex languages well, where words can have rich internal structure and sub-word elements are shared across distinct word forms. Our approach is to encode basic notions of morphology into the assumptions of three different types of language models, with the intention that leveraging shared sub-word structure can improve model performance and help overcome the data sparsity that arises from morphological processes. In the context of n-gram language modelling, we formulate a new Bayesian model that relies on the decomposition of compound words to attain better smoothing, and we develop a new distributed language model that learns vector representations of morphemes and leverages them to link together morphologically related words. In both cases, we show that accounting for word sub-structure improves the models' intrinsic performance and provides benefits when applied to other tasks, including machine translation. We then shift the focus beyond the modelling of word sequences and consider models that automatically learn what the sub-word elements of a given language are, given an unannotated list of words. We formulate a novel model that can learn discontiguous morphemes in addition to the more conventional contiguous morphemes that most previous models are limited to. This approach is demonstrated on Semitic languages, and we find that modelling discontiguous sub-word structures leads to improvements in the task of segmenting words into their contiguous morphemes. Comment: DPhil thesis, University of Oxford, submitted and accepted 2014. http://ora.ox.ac.uk/objects/uuid:8df7324f-d3b8-47a1-8b0b-3a6feb5f45c

    Acta Cybernetica : Volume 15. Number 2.


    Substring-based Machine Translation

    Machine translation is traditionally formulated as the transduction of strings of words from the source to the target language. As a result, additional lexical processing steps such as morphological analysis, transliteration, and tokenization are required to process the internal structure of words and to help cope with the data-sparsity issues that occur when simply dividing words according to white space. In this paper, we take a different approach: not dividing lexical processing and translation into two steps, but simply viewing translation as a single transduction between character strings in the source and target languages. In particular, we demonstrate that the key to achieving accuracies on a par with word-based translation in the character-based framework is the use of a many-to-many alignment strategy that can accurately capture correspondences between arbitrary substrings. We build on the alignment method proposed in Neubig et al. (2011), improving its efficiency and accuracy with a focus on character-based translation. Using a many-to-many aligner imbued with these improvements, we demonstrate that the traditional framework of phrase-based machine translation sees large gains in accuracy over character-based translation with more naive alignment methods, and achieves comparable results to word-based translation for two distant language pairs.

    Fundamental Approaches to Software Engineering

    This open access book constitutes the proceedings of the 23rd International Conference on Fundamental Approaches to Software Engineering, FASE 2020, which took place in Dublin, Ireland, in April 2020, and was held as part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2020. The 23 full papers, 1 tool paper and 6 testing competition papers presented in this volume were carefully reviewed and selected from 81 submissions. The papers cover topics such as requirements engineering, software architectures, specification, software quality, validation, verification of functional and non-functional properties, model-driven development and model transformation, software processes, security and software evolution.

    Framework for Human Computer Interaction for Learning Dialogue Strategies using Controlled Natural Language in Information Systems

    Spoken language systems are set to have a tremendous impact on real-world applications, be it healthcare enquiries, public transportation systems or airline booking systems, while maintaining the language ethnicity for interaction among users across the globe. These systems can interact with the user in the different languages that the system supports. Normally, when one person interacts with another, many non-verbal cues guide the dialogue, and all the utterances have a contextual relationship that manages the dialogue as it is mixed by the two speakers. Human-computer interaction has a wide impact on the design of applications and has become an emerging area of interest for researchers. All of us are witness to an explosive electronic revolution in which gadgets and gizmos surround us, advanced not only in power, design and applications but also in ease of access: user-friendly interfaces are designed so that we can easily use and control all the functionality of the devices. Speech is one of the most intuitive forms of interaction that humans use, and it provides potential benefits such as hands-free access to machines, ergonomics and greater efficiency of interaction. Yet the design of speech-based interfaces has long been an expert job. Much research has gone into building real spoken dialogue systems that can interact with humans by voice and help perform various tasks as humans do. The last two decades have seen advanced research in automatic speech recognition, dialogue management, text-to-speech synthesis and natural language processing for various applications, with positive results. This dissertation proposes to apply machine learning (ML) techniques to the problem of optimizing the selection of the dialogue management strategy in the design of a Spoken Dialogue System prototype.
    Although automatic speech recognition and system-initiated dialogues, in which the system expects an answer in the form of 'yes' or 'no', have already been applied to Spoken Dialogue Systems (SDS), no real attempt has been made to use those techniques to design a new system from scratch. In this dissertation, we propose some novel ideas to ease the design of Spoken Dialogue Systems and to give novices access to voice technologies. A framework is proposed for simulating and evaluating dialogues and for learning optimal dialogue strategies in a controlled natural language. The simulation process is based on a probabilistic description of a dialogue and on the stochastic modelling of both the artificial NLP modules composing an SDS and the user. This probabilistic model is based on a set of parameters that can be tuned from prior knowledge of the discourse or learned from data. The evaluation is part of the simulation process and is based on objective measures provided by each module. Finally, the simulation environment is connected to a learning agent that uses the supplied evaluation metrics as an objective function in order to generate an optimal behaviour for the SDS.