
    Theoretische Informatica (Theoretical Computer Science)

    This note briefly describes what theoretical computer science is and how it is practised in teaching and research by the Theoretische Informatica group at the Universiteit Twente.

    Supervised Training on Synthetic Languages: A Novel Framework for Unsupervised Parsing

    This thesis focuses on unsupervised dependency parsing: parsing sentences of a language into dependency trees without access to training data for that language. Unlike most prior work, which uses unsupervised learning to estimate the parsing parameters, we estimate the parameters by supervised training on synthetic languages. Our parsing framework has three major components: synthetic language generation gives a rich set of training languages by mix-and-match over the real languages; surface-form feature extraction maps an unparsed corpus of a language into a fixed-length vector that serves as the syntactic signature of that language; and, finally, language-agnostic parsing incorporates the syntactic signature during parsing so that the decision on each word token is informed by the general syntax of the target language. The fundamental question we try to answer is whether useful information about the syntax of a language can be inferred from its surface-form evidence (an unparsed corpus). This is the same question implicitly asked by previous work on unsupervised parsing, which assumes only that an unparsed corpus is available for the target language. We show that, indeed, useful features of the target language can be extracted automatically from an unparsed corpus consisting only of gold part-of-speech (POS) sequences. Providing these features to our neural parser enables it to parse sequences like those in the corpus. Strikingly, our system has no supervision in the target language. Rather, it is a multilingual system that is trained end-to-end on a variety of other languages, so it learns a feature extractor that works well. This thesis contains several large-scale experiments requiring hundreds of thousands of CPU-hours. To our knowledge, this is the largest study of unsupervised parsing yet attempted. We show experimentally across multiple languages that: (1) features computed from the unparsed corpus improve parsing accuracy; (2) including thousands of synthetic languages in the training yields further improvement; and (3) despite being computed from unparsed corpora, our learned task-specific features beat previous works' interpretable typological features, which require parsed corpora or expert categorization of the language.
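
    As a rough illustration of the surface-form feature extraction step described above (a minimal sketch of our own, not the thesis code), the following Python snippet maps an unparsed corpus of gold POS sequences to a fixed-length vector of POS-bigram relative frequencies that could serve as a language's syntactic signature; the tag inventory and vector layout are illustrative assumptions.

        from collections import Counter
        from itertools import product

        # Illustrative POS inventory (an assumption, not the thesis setup).
        POS_TAGS = ["NOUN", "VERB", "ADJ", "ADV", "PRON", "ADP", "DET",
                    "CONJ", "NUM", "PRT", "X", "."]

        def surface_form_signature(corpus):
            """Map a corpus of POS-tag sequences to a fixed-length vector.

            corpus: iterable of sentences, each a list of POS tags.
            Returns relative bigram frequencies, one entry per ordered tag
            pair, so every corpus yields a vector of the same length.
            """
            bigrams = Counter()
            for sentence in corpus:
                padded = ["<S>"] + list(sentence) + ["</S>"]
                for left, right in zip(padded, padded[1:]):
                    bigrams[(left, right)] += 1
            total = sum(bigrams.values()) or 1
            pairs = product(["<S>"] + POS_TAGS, POS_TAGS + ["</S>"])
            return [bigrams[pair] / total for pair in pairs]

        # Tiny example corpus of two POS sequences.
        corpus = [["DET", "NOUN", "VERB", "."],
                  ["PRON", "VERB", "DET", "ADJ", "NOUN", "."]]
        signature = surface_form_signature(corpus)
        print(len(signature))  # fixed dimensionality regardless of corpus size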

    A general scheme for some deterministically parsable grammars and their strong equivalents

    In recent years there have been many attempts to fill the gap between the classes of LL(k) and LR(k) grammars with new classes of deterministically parsable grammars. Almost always, the introduction of a new class was accompanied by a parsing method and/or a grammatical transformation fitting the following scheme. When parsers were at the centre of the investigation, the new method was designed to possess certain advantages over existing ones. Where transformations were concerned, the intention was to produce methods for transforming grammars into "more easily" parsable ones. The problem of finding classes of context-free grammars that can be transformed to LL(k) grammars has received much attention, and parsing strategies and associated classes of grammars generating LL(k) languages have been studied extensively. An equally interesting class is the class of strict deterministic grammars, a subclass of the LR(0) grammars with elegant theoretical properties; generalizations of this concept have been introduced by Friede and Pittl. The purpose of this paper is to show how the above-mentioned classes of grammars can be dealt with within a general framework.
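
    As a hedged illustration of why such transformations matter (our own example, not taken from the paper): direct left recursion is the classic property that keeps an otherwise deterministic grammar out of LL(k), and detecting it is the first step of the standard transformation into an LL-parsable form. The grammar encoding below is an illustrative assumption.

        # Minimal sketch: detect direct left recursion in a context-free
        # grammar given as a dict of nonterminal -> list of right-hand sides.

        def directly_left_recursive(grammar):
            """Return the nonterminals A with a production A -> A alpha."""
            return {
                lhs
                for lhs, productions in grammar.items()
                for rhs in productions
                if rhs and rhs[0] == lhs
            }

        # E -> E + T | T ; T -> id   (LR(1) but not LL(k) for any k)
        expr_grammar = {
            "E": [["E", "+", "T"], ["T"]],
            "T": [["id"]],
        }
        print(directly_left_recursive(expr_grammar))  # {'E'}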

    Optimizing scientific communication : the role of relative clauses as markers of complexity in English and German scientific writing between 1650 and 1900

    The aim of this thesis is to show that both scientific English and German have become increasingly optimized for scientific communication from 1650 to 1900 by adapting the usage of relative clauses as markers of grammatical complexity. While the lexico-grammatical changes in terms of features and their frequency distribution in scientific writing during this period are well documented, in the present work we are interested in the underlying factors driving these changes and how they affect efficient scientific communication. As the scientific register emerges and evolves, it continuously adapts to the changing communicative needs posed by extra-linguistic pressures arising from the scientific community and its achievements. We assume that, over time, scientific language maintains communicative efficiency by balancing lexico-semantic expansion with a reduction in (lexico-)grammatical complexity on different linguistic levels. This is based on the idea that linguistic complexity affects processing difficulty and, in turn, communicative efficiency. To achieve optimization, complexity is adjusted on the level of lexico-grammar, which is related to expectation-based processing cost, and syntax, which is linked to working memory-based processing cost. We conduct five corpus-based studies comparing English and German scientific writing to general language. The first two investigate the development of relative clauses in terms of lexico-grammar, measuring the paradigmatic richness and syntagmatic predictability of relativizers as indicators of expectation-based processing cost. The results confirm that both levels undergo a reduction in complexity over time. The other three studies focus on the syntactic complexity of relative clauses, investigating syntactic intricacy, locality, and accessibility. Results show that intricacy and locality decrease, leading to lower grammatical complexity and thus mitigating memory-based processing cost. However, accessibility is not a factor of complexity reduction over time. Our studies reveal a register-specific diachronic complexity reduction in scientific language both in lexico-grammar and syntax. The cross-linguistic comparison shows that English is more advanced in its register-specific development while German lags behind due to a later establishment of the vernacular as a language of scientific communication. This work is supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-ID 232722074 – SFB 110.
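
    A minimal sketch of the kind of measures such corpus studies might use (our own illustration; the thesis's actual operationalisation may differ): the entropy of the relativizer distribution as a proxy for paradigmatic richness, and the mean surprisal of a relativizer given the preceding word as a proxy for syntagmatic predictability. The toy counts are invented.

        import math
        from collections import Counter

        def relativizer_entropy(counts):
            """Shannon entropy (bits) of the relativizer distribution:
            higher entropy = a richer, less predictable paradigm."""
            total = sum(counts.values())
            return -sum((c / total) * math.log2(c / total)
                        for c in counts.values() if c > 0)

        def mean_surprisal(bigram_counts, context_counts):
            """Average -log2 P(relativizer | preceding word) over observed
            (preceding word, relativizer) tokens."""
            total = sum(bigram_counts.values())
            return sum(n * -math.log2(n / context_counts[prev])
                       for (prev, _rel), n in bigram_counts.items()) / total

        relativizers = Counter({"which": 120, "that": 80, "who": 40, "whose": 10})
        bigrams = Counter({("study", "which"): 30, ("study", "that"): 10,
                           ("those", "who"): 20})
        contexts = Counter({"study": 40, "those": 20})
        print(relativizer_entropy(relativizers), mean_surprisal(bigrams, contexts))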

    Meaning versus Grammar

    This volume investigates the complicated relationship between grammar, computation, and meaning in natural languages. It details conditions under which meaning-driven processing of natural language is feasible, discusses an operational and accessible implementation of the grammatical cycle for Dutch, and offers analyses of a number of further conjectures about constituency and entailment in natural language.

    An evaluation of Lolita and related natural language processing systems

    This research addresses the question "how do we evaluate systems like LOLITA?" LOLITA is the Natural Language Processing (NLP) system under development at the University of Durham, intended as a platform for building NL applications. We are therefore interested in questions of evaluation for such general NLP systems. The thesis has two parts. The first, and main, part concerns the participation of LOLITA in the Sixth Message Understanding Conference (MUC-6). The MUC-relevant portion of LOLITA is described in detail. The adaptation of LOLITA for MUC-6 is discussed, including work undertaken by the author. Performance on a specimen article is analysed qualitatively and in detail, with anonymous comparisons to competitors' output. We also examine current LOLITA performance. A template comparison tool was implemented to aid these analyses. The overall scores are then considered. A methodology for analysis is discussed, and a comparison made with current scores. The comparison tool is used to analyse how systems performed relative to each other. One method, Correctness Analysis, was particularly interesting: it provides a characterisation of task difficulty and indicates how systems approached a task. Finally, MUC-6 itself is analysed; in particular, we consider the methodology and ways of interpreting the results. Several criticisms of MUC-6 are made, along with suggestions for future MUC-style events. The second part considers evaluation from the point of view of general systems. A literature review shows a lack of serious work on this aspect of evaluation. A first-principles discussion of evaluation, starting from a view of NL systems as a particular kind of software, raises several interesting points for single-task evaluation. No evaluations could be suggested for general systems; their value was seen as primarily economic. That is, we are unable to analyse their linguistic capability directly.
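
    A minimal sketch of MUC-style slot-level template scoring (an illustration under our own simplifying assumptions, not the template comparison tool described above): count correct, missing and spurious slot fills for a pair of templates and derive precision and recall.

        # Each template is modelled as a dict of slot -> set of fills; this
        # encoding is an illustrative assumption.

        def score_templates(system, gold):
            correct = missing = spurious = 0
            for slot in set(system) | set(gold):
                sys_fills = system.get(slot, set())
                gold_fills = gold.get(slot, set())
                correct += len(sys_fills & gold_fills)
                missing += len(gold_fills - sys_fills)
                spurious += len(sys_fills - gold_fills)
            precision = correct / (correct + spurious) if correct + spurious else 0.0
            recall = correct / (correct + missing) if correct + missing else 0.0
            return precision, recall

        gold = {"ORG": {"IBM"}, "POST": {"vice president"}, "PERSON": {"J. Smith"}}
        system = {"ORG": {"IBM"}, "PERSON": {"Smith"}}
        print(score_templates(system, gold))  # (0.5, 0.333...)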

    Modelling Incremental Self-Repair Processing in Dialogue.

    Self-repairs, where speakers repeat themselves, reformulate or restart what they are saying, are pervasive in human dialogue. These phenomena provide a window into real-time human language processing. For explanatory adequacy, a model of dialogue must include mechanisms that account for them. Artificial dialogue agents also need this capability for more natural interaction with human users. This thesis investigates the structure of self-repair and its function in the incremental construction of meaning in interaction. A corpus study shows how the range of self-repairs seen in dialogue cannot be accounted for by looking at surface form alone. More particularly, it analyses a string-alignment approach and shows how it is insufficient, provides requirements for a suitable model of incremental context, and gives an ontology of self-repair function. An information-theoretic model is developed which addresses these issues, along with a system that automatically detects self-repairs and edit terms on transcripts incrementally with minimal latency, achieving state-of-the-art results. Additionally, it is shown to have practical use in the psychiatric domain. The thesis goes on to present a dialogue model to interpret and generate repaired utterances incrementally. When processing repaired rather than fluent utterances, it achieves the same degree of incremental interpretation and incremental representation. Practical implementation methods are presented for an existing dialogue system. Finally, a more pragmatically oriented approach is presented to model self-repairs in a psycholinguistically plausible way. This is achieved by extending the dialogue model to include a probabilistic semantic framework to perform incremental inference in a reference resolution domain. The thesis concludes that at least as fine-grained a model of context as word-by-word is required for realistic models of self-repair, and that context must include linguistic action sequences and information update effects. The way dialogue participants process self-repairs to make inferences in real time, rather than filter out their disfluency effects, has been modelled formally and in practical systems. This work was supported by an Engineering and Physical Sciences Research Council (EPSRC) Doctoral Training Account (DTA) scholarship from the School of Electronic Engineering and Computer Science at Queen Mary University of London.
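
    A minimal sketch of a purely surface-level incremental repair flagger (a naive baseline of our own, far simpler than the information-theoretic model described above): it consumes a transcript word by word and flags edit terms and immediate verbatim restarts with zero lookahead; the edit-term list is an illustrative assumption.

        EDIT_TERMS = {"uh", "um", "er", "sorry"}  # illustrative list

        def incremental_repair_flags(words):
            """Yield (word, flag) pairs incrementally, word by word."""
            previous = None
            for word in words:
                flag = None
                if word.lower() in EDIT_TERMS:
                    flag = "edit-term"
                elif previous is not None and word.lower() == previous:
                    flag = "repeat-restart"
                previous = word.lower()
                yield word, flag

        utterance = "I go uh I go to the the shop".split()
        for word, flag in incremental_repair_flags(utterance):
            print(word, flag or "")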

    Profiling large-scale lazy functional programs

    The LOLITA natural language processing system is an example of the ever-increasing number of large-scale systems written entirely in a functional programming language. The system consists of over 50,000 lines of Haskell code and is able to perform a number of tasks such as semantic and pragmatic analysis of text, context scanning and query analysis. Such a system is more useful if the results are calculated in real time; the efficiency of such a system is therefore paramount. For the past three years we have used the profiling tools supplied with the Haskell compilers GHC and HBC to analyse and reason about our programming solutions, and have achieved good results; however, our experience has shown that the profiling life-cycle is often too long to make a detailed analysis of a large system possible, and the profiling results are often misleading. A profiling system is developed which allows three types of functionality not previously found in a profiler for lazy functional programs. Firstly, the profiler is able to produce results based on an accurate method of cost inheritance. We have found that this reduces the possibility of the programmer obtaining misleading profiling results. Secondly, the programmer is able to explore the results after the execution of the program. This is done by selecting and deselecting parts of the program using a post-processor. This greatly reduces the analysis time, as no further compilation, execution or profiling of the program is needed. Finally, the new profiling system allows the user to examine aspects of the run-time call structure of the program. This is useful in the analysis of the run-time behaviour of the program. Previous attempts at extending the results produced by a profiler in such a way have failed due to the exceptionally high overheads. Exploration of the overheads produced by the new profiling scheme shows that typical overheads in profiling the LOLITA system are a 10% increase in compilation time, a 7% increase in executable size, and a 70% run-time overhead. These overheads mean a considerable saving in time in the detailed analysis of profiling a large, lazy functional program.
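
    A minimal sketch of the cost-inheritance idea (our own illustration in Python, not the Haskell profiler described above): a function's inherited cost is the cost incurred in its own code plus the inherited costs of everything it calls, accumulated over the call structure. The toy call tree and costs are invented.

        def inherited_costs(call_tree, own_cost, root):
            """call_tree: dict mapping a function to the functions it calls
            (assumed acyclic); own_cost: dict mapping a function to the cost
            incurred in its own code. Returns inherited costs for the root
            and everything reachable from it."""
            result = {}

            def visit(fn):
                if fn not in result:
                    result[fn] = own_cost.get(fn, 0) + sum(
                        visit(child) for child in call_tree.get(fn, []))
                return result[fn]

            visit(root)
            return result

        call_tree = {"main": ["parse", "analyse"], "analyse": ["lookup"]}
        own_cost = {"main": 5, "parse": 40, "analyse": 10, "lookup": 25}
        print(inherited_costs(call_tree, own_cost, "main"))
        # {'parse': 40, 'lookup': 25, 'analyse': 35, 'main': 80}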