
    Elimination of Spurious Ambiguity in Transition-Based Dependency Parsing

    We present a novel technique to remove spurious ambiguity from transition systems for dependency parsing. The technique chooses a canonical sequence of transition operations (a computation) for a given dependency tree, and it applies to a large class of bottom-up transition systems, including, for instance, those of Nivre (2004) and Attardi (2006).
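    As a concrete illustration of how a canonical computation can be obtained (a minimal sketch of a standard static oracle for the arc-standard system, not the paper's general construction), the following fragment maps a gold projective dependency tree to exactly one transition sequence by fixing the preference order LEFT-ARC > RIGHT-ARC > SHIFT. The arc-standard system and the encoding of trees as head arrays are assumptions of the sketch.

```python
# Minimal sketch: derive one canonical arc-standard transition sequence from
# a gold projective dependency tree by fixing a preference order among the
# applicable transitions (LEFT-ARC, then RIGHT-ARC, then SHIFT).

def canonical_sequence(heads):
    """heads[i] is the head of word i; index 0 is an artificial root."""
    n = len(heads)                                   # words are 1..n-1
    pending = {i: {j for j in range(1, n) if heads[j] == i} for i in range(n)}
    stack, buffer, ops = [0], list(range(1, n)), []
    while buffer or len(stack) > 1:
        if len(stack) >= 2:
            top, below = stack[-1], stack[-2]
            # LEFT-ARC: 'below' depends on 'top' and has collected all its children
            if heads[below] == top and not pending[below]:
                ops.append(("LEFT-ARC", top, below))
                pending[top].discard(below)
                del stack[-2]
                continue
            # RIGHT-ARC: 'top' depends on 'below' and has collected all its children
            if heads[top] == below and not pending[top]:
                ops.append(("RIGHT-ARC", below, top))
                pending[below].discard(top)
                stack.pop()
                continue
        if not buffer:
            raise ValueError("tree is not projective (or is malformed)")
        ops.append(("SHIFT", buffer[0]))
        stack.append(buffer.pop(0))
    return ops

# "She saw stars": 1=She, 2=saw, 3=stars; "saw" heads both "She" and "stars".
print(canonical_sequence([0, 2, 0, 2]))
```

    Because the preference order is fixed, every projective tree is mapped to exactly one computation, which is the sense in which spurious ambiguity disappears.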

    A syntactic language model based on incremental CCG parsing

    Syntactically enriched language models (parsers) constitute a promising component in applications such as machine translation and speech recognition. To maintain a useful level of accuracy, existing parsers are non-incremental and must span a combinatorially growing space of possible structures as every input word is processed. This prohibits their incorporation into standard linear-time decoders. In this paper, we present an incremental, linear-time dependency parser based on Combinatory Categorial Grammar (CCG) and classification techniques. We devise a deterministic transform of CCGbank canonical derivations into incremental ones, and train our parser on this data. We find that a cascaded, incremental version provides an appealing balance between efficiency and accuracy.
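    The following is only a schematic of the control regime of such an incremental parser: one classifier decision per input word in a single left-to-right pass, which is what keeps parsing linear-time. The action inventory and the predict() stub are hypothetical placeholders, not the CCG categories, features, or derivation transform used in the paper.

```python
# Schematic of a deterministic, classifier-driven incremental parse loop:
# exactly one predicted action per word, hence linear time in sentence length.

from dataclasses import dataclass

@dataclass
class Node:
    label: str
    children: tuple = ()

def predict(stack, word):
    """Stand-in for a trained classifier over parser-state features."""
    return "COMBINE" if stack else "START"          # toy rule, not a real model

def parse_incrementally(words):
    stack = []
    for word in words:                              # single left-to-right pass
        leaf = Node(word)
        if predict(stack, word) == "COMBINE" and stack:
            # Fold the new word into the previous partial analysis so that a
            # single connected structure is maintained after every word.
            stack[-1] = Node("X", (stack[-1], leaf))
        else:                                       # START a new partial constituent
            stack.append(leaf)
    return stack

print(parse_incrementally("the parser reads words incrementally".split()))
```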

    Generalizing input-driven languages: theoretical and practical benefits

    Regular languages (RL) are the simplest family in Chomsky's hierarchy. Thanks to their simplicity, they enjoy various useful algebraic and logic properties that have been successfully exploited in many application fields. Practically all of their related problems are decidable, so they support automatic verification algorithms, and they can be recognized in real time. Context-free languages (CFL) are another major family, well suited to formalizing programming, natural, and many other classes of languages; their increased generative power w.r.t. RL, however, causes the loss of several closure properties and of the decidability of important problems, and they require more complex parsing algorithms. Thus, various subclasses thereof have been defined with different goals, ranging from efficient, deterministic parsing to closure properties, logic characterization, and automatic verification techniques. Among CFL subclasses, the so-called structured ones, i.e., those where the typical tree structure is visible in the sentences, exhibit many of the algebraic and logic properties of RL, whereas deterministic CFL have been thoroughly exploited in compiler construction and other application fields. After surveying and comparing the main properties of these various language families, we return to operator precedence languages (OPL), an old family through which R. Floyd pioneered deterministic parsing, and we show that they offer unexpected properties in two fields so far investigated in totally independent ways: they enable parsing parallelization more effectively than traditional sequential parsers, and they exhibit the same algebraic and logic properties so far obtained only for less expressive language families.
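    As a small illustration of Floyd's operator precedence approach (a toy reconstruction, not taken from the surveyed work), the sketch below parses expressions over a hand-built precedence matrix for the grammar E -> E + E | E * E | n. Because every shift/reduce decision depends only on the relation between two adjacent terminals, chunks of input can in principle be parsed independently, which is what makes OPL parsing amenable to parallelization.

```python
# Minimal sketch of Floyd-style operator-precedence parsing for the toy
# grammar E -> E + E | E * E | n.  The precedence matrix is built by hand
# for this grammar only and is meant purely as an illustration.

YIELDS, EQUALS, TAKES = "<", "=", ">"

PREC = {        # PREC[a][b]: relation between stack terminal a and input b
    "#": {"n": YIELDS, "+": YIELDS, "*": YIELDS, "#": EQUALS},
    "+": {"n": YIELDS, "+": TAKES,  "*": YIELDS, "#": TAKES},
    "*": {"n": YIELDS, "+": TAKES,  "*": TAKES,  "#": TAKES},
    "n": {"n": None,   "+": TAKES,  "*": TAKES,  "#": TAKES},
}
PRODUCTIONS = {("n",), ("E", "+", "E"), ("E", "*", "E")}

def op_parse(tokens):
    """Return the sequence of handles reduced while recognizing `tokens`."""
    inp = list(tokens) + ["#"]
    stack, handles, i = [("#", None)], [], 0
    while True:
        top = next(s for s, _ in reversed(stack) if s in PREC)   # topmost terminal
        b = inp[i]
        if top == "#" and b == "#":
            return handles
        rel = PREC[top].get(b)
        if rel in (YIELDS, EQUALS):                  # shift, remembering the relation
            stack.append((b, rel))
            i += 1
        elif rel == TAKES:                           # reduce the handle on top of the stack
            handle = []
            while True:
                sym, mark = stack.pop()
                handle.append(sym)
                if sym in PREC and mark == YIELDS:
                    break
            if stack[-1][0] not in PREC:             # nonterminal to the left of the handle
                handle.append(stack.pop()[0])
            handle = tuple(reversed(handle))
            if handle not in PRODUCTIONS:
                raise SyntaxError(f"cannot reduce {handle}")
            handles.append(handle)
            stack.append(("E", None))
        else:
            raise SyntaxError(f"no relation between {top!r} and {b!r}")

print(op_parse("n + n * n".split()))
# [('n',), ('n',), ('n',), ('E', '*', 'E'), ('E', '+', 'E')]
```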

    A Formal View on Training of Weighted Tree Automata by Likelihood-Driven State Splitting and Merging

    The use of computers and algorithms to deal with human language, in both spoken and written form, is summarized by the term natural language processing (NLP). Modeling language in a way that is suitable for computers plays an important role in NLP. One idea is to use formalisms from theoretical computer science for that purpose. For example, one can try to find an automaton that captures the valid written sentences of a language. Finding such an automaton from examples is called training. In this work, we also consider the structure of sentences by making use of trees. We use weighted tree automata (WTA) in order to deal with such tree structures. These devices assign weights to trees in order to, for example, distinguish between good and bad structures. The well-known expectation-maximization (EM) algorithm can be used to train the weights of a WTA while its state behavior stays fixed. To adapt the state behavior of a WTA, state splitting, i.e., dividing a state into several new states, and state merging, i.e., replacing several states by a single new state, can be used. State splitting, state merging, and the EM algorithm have already been combined into the state splitting and merging algorithm, which was successfully applied in practice. In our work, we formalized this approach in order to show properties of the algorithm. We also examined a new approach, the count-based state merging algorithm, which relies exclusively on state merging. When dealing with trees, another important tool is binarization, a strategy to encode arbitrary trees by binary trees. For each of three different binarizations we showed that WTA together with the binarization are as powerful as weighted unranked tree automata (WUTA). We also showed that this is still true if only probabilistic WTA and probabilistic WUTA are considered.
    Contents: How to Read This Thesis
    1. Introduction: 1.1. The Contributions and the Structure of This Work
    2. Preliminaries: 2.1. Sets, Relations, Functions, Families, and Extrema; 2.2. Algebraic Structures; 2.3. Formal Languages
    3. Language Formalisms: 3.1. Context-Free Grammars (CFGs); 3.2. Context-Free Grammars with Latent Annotations (CFG-LAs); 3.3. Weighted Tree Automata (WTAs); 3.4. Equivalences of WCFG-LAs and WTAs
    4. Training of WTAs: 4.1. Probability Distributions; 4.2. Maximum Likelihood Estimation; 4.3. Probabilities and WTAs; 4.4. The EM Algorithm for WTAs; 4.5. Inside and Outside Weights; 4.6. Adaption of the Estimation of Corazza and Satta [CS07] to WTAs
    5. State Splitting and Merging: 5.1. State Splitting and Merging for Weighted Tree Automata (5.1.1. Splitting Weights and Probabilities; 5.1.2. Merging Probabilities); 5.2. The State Splitting and Merging Algorithm (5.2.1. Finding a Good π-Distributor; 5.2.2. Notes About the Berkeley Parser); 5.3. Conclusion and Further Research
    6. Count-Based State Merging: 6.1. Preliminaries; 6.2. The Likelihood of the Maximum Likelihood Estimate and Its Behavior While Merging; 6.3. The Count-Based State Merging Algorithm (6.3.1. Further Adjustments for Practical Implementations); 6.4. Implementation of Count-Based State Merging; 6.5. Experiments with Artificial Automata and Corpora (6.5.1. The Artificial Automata; 6.5.2. Results); 6.6. Experiments with the Penn Treebank; 6.7. Comparison to the Approach of Carrasco, Oncina, and Calera-Rubio [COC01]; 6.8. Conclusion and Further Research
    7. Binarization: 7.1. Preliminaries; 7.2. Relating WSTAs and WUTAs via Binarizations (7.2.1. Left-Branching Binarization; 7.2.2. Right-Branching Binarization; 7.2.3. Mixed Binarization); 7.3. The Probabilistic Case (7.3.1. Additional Preliminaries About WSAs; 7.3.2. Constructing an Out-Probabilistic WSA from a Converging WSA; 7.3.3. Binarization and Probabilistic Tree Automata); 7.4. Connection to the Training Methods in Previous Chapters; 7.5. Conclusion and Further Research
    A. Proofs for Preliminaries; B. Proofs for Training of WTAs; C. Proofs for State Splitting and Merging; D. Proofs for Count-Based State Merging
    Bibliography; List of Algorithms; List of Figures; List of Tables; Index; Table of Variable Name
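    As a small illustration of the binarization idea mentioned above, the sketch below encodes unranked trees as binary trees by folding the children list from the left with an auxiliary @ symbol, and decodes them back losslessly. It is a generic left-folding encoding under the assumption that @ does not occur in the input trees; the exact left-branching, right-branching, and mixed binarizations studied in the thesis may be defined differently.

```python
# Minimal sketch of a left-folding binarization of unranked trees and its
# inverse, showing that the encoding into binary trees is lossless.

from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Tree:
    label: str
    children: Tuple["Tree", ...] = ()

def binarize(t: Tree) -> Tree:
    """Fold the children list from the left into a binary comb, using the
    auxiliary label '@' (assumed not to occur in the input trees)."""
    kids = [binarize(c) for c in t.children]
    if len(kids) <= 2:
        return Tree(t.label, tuple(kids))
    acc = Tree("@", (kids[0], kids[1]))
    for k in kids[2:-1]:
        acc = Tree("@", (acc, k))
    return Tree(t.label, (acc, kids[-1]))

def debinarize(t: Tree) -> Tree:
    """Inverse of binarize: splice the '@' nodes back into a flat children list."""
    def flatten(u: Tree):
        if u.label == "@":
            return [c for child in u.children for c in flatten(child)]
        return [debinarize(u)]
    return Tree(t.label, tuple(c for child in t.children for c in flatten(child)))

t = Tree("S", (Tree("a"), Tree("b"), Tree("c"), Tree("d")))
b = binarize(t)
assert debinarize(b) == t      # the encoding is lossless
print(b)
```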

    A New Approach to LL and LR Parsing

    The aim of this thesis is to create a new, efficient parsing method by combining the LL and LR approaches. For demonstration purposes, a new programming language modelled on PHP is designed. The language is divided into sections, and for each section the more suitable of the two methods is chosen; the individual methods are described in the context of the two underlying strategies, top-down and bottom-up parsing. A separate syntax analyser is created for each section. The thesis provides the complete theoretical basis for constructing all of the syntax analysers and parsing tables used here. Finally, the analysers are connected together, which completes the practical demonstration of the method. In conclusion, the contributions of this work are discussed, such as the more efficient kind of parsing and the modularity of the approach, together with the usability of the proposed method for speeding up development and compilation, and suggestions for further research in this area.
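    A minimal sketch of the hybrid idea, using a toy assignment grammar rather than the thesis's PHP-like language: a top-down (LL-style) statement parser hands expression sections to a bottom-up shift-reduce sub-parser driven by operator precedences, then resumes. The grammar, tokenizer, and function names are illustrative assumptions, not the thesis's construction.

```python
# Schematic hybrid parser: recursive descent (top-down) at the statement
# level, shift-reduce (bottom-up) for the expression sections.

import re

def tokenize(src):
    return re.findall(r"[A-Za-z_]\w*|[=+*;]", src) + ["$"]

# ---- bottom-up part: shift-reduce expression parsing by precedence --------
PREC = {"+": 1, "*": 2}

def parse_expr(tokens, i):
    """Parse an expression starting at tokens[i]; return (tree, next index)."""
    output, ops = [], []                  # operand stack and operator stack
    def reduce_top():
        op = ops.pop()
        right, left = output.pop(), output.pop()
        output.append((op, left, right))
    while tokens[i] not in (";", "$"):
        tok = tokens[i]
        if tok in PREC:
            while ops and PREC[ops[-1]] >= PREC[tok]:
                reduce_top()              # reduce while the stack has higher precedence
            ops.append(tok)               # then shift the operator
        else:
            output.append(tok)            # shift an identifier
        i += 1
    while ops:
        reduce_top()
    return output[0], i

# ---- top-down part: LL-style recursive descent over statements ------------
def parse_stmt(tokens, i=0):
    """stmt -> id '=' expr ';'  (each step driven by one lookahead token)."""
    name = tokens[i]
    assert tokens[i + 1] == "="                     # crude error handling for the sketch
    expr, i = parse_expr(tokens, i + 2)             # hand off to the bottom-up parser
    assert tokens[i] == ";"
    return ("assign", name, expr), i + 1

print(parse_stmt(tokenize("total = a + b * c;"))[0])
# ('assign', 'total', ('+', 'a', ('*', 'b', 'c')))
```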

    Formal Languages and Compilation


    Tabulation for multi-purpose partial parsing

    Efficient partial parsing systems (chunkers) are urgently required by various natural language application areas, because such parsers always produce a partially parsed text even when the text does not fully fit the existing lexica and grammars. The availability of partially parsed corpora is necessary for extracting various kinds of information that may then be fed back into those systems, increasing their processing power. In this paper, we propose an efficient partial parsing scheme based on chart parsing that is flexible enough to support both normal parsing tasks and the diagnosis, in previously obtained partial parses, of the possible causes (kinds of faults) that led to a partial rather than a complete parse. Using the built-in tabulation capabilities of the DyALog system, we implemented a partial parser that runs as fast as the best non-deterministic parsers. We also elaborate on the implementation of two different grammar formalisms: Definite Clause Grammars (DCG) extended with head declarations, and Bound Movement Grammars (BMG).
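    A minimal sketch of chart-based partial parsing (a generic CYK-style tabulation, not the DyALog/DCG system described above): every constituent that can be built is recorded in the chart, so even input that the grammar cannot fully cover still yields a cover of maximal chunks. The toy grammar, lexicon, and sentence are assumptions of the sketch.

```python
# Minimal sketch: a CYK-style chart tabulates all recognizable spans, and a
# greedy pass over the chart extracts a partial parse (chunk cover) even when
# no complete analysis of the sentence exists.

from collections import defaultdict

# Toy grammar in CNF-like form: binary rules plus a lexical mapping.
LEXICON = {"the": {"Det"}, "cat": {"N"}, "sleeps": {"V"}, "green": {"Adj"}}
BINARY  = {("Det", "N"): "NP", ("Adj", "N"): "N", ("NP", "V"): "S"}

def chart_parse(words):
    n = len(words)
    chart = defaultdict(set)                     # chart[i, j]: labels over words[i:j]
    for i, w in enumerate(words):
        chart[i, i + 1] |= LEXICON.get(w, set())
    for width in range(2, n + 1):                # tabulate longer spans bottom-up
        for i in range(n - width + 1):
            for k in range(i + 1, i + width):
                for a in chart[i, k]:
                    for b in chart[k, i + width]:
                        if (a, b) in BINARY:
                            chart[i, i + width].add(BINARY[a, b])
    return chart

def chunk_cover(words, chart):
    """Greedy left-to-right cover by the longest recognized spans (chunks)."""
    cover, i = [], 0
    while i < len(words):
        j = max((j for j in range(len(words), i, -1) if chart[i, j]), default=i + 1)
        label = next(iter(chart[i, j]), "UNK")
        cover.append((label, words[i:j]))
        i = j
    return cover

words = "the green cat sleeps quickly".split()   # 'quickly' is unknown to the toy lexicon
print(chunk_cover(words, chart_parse(words)))
# [('S', ['the', 'green', 'cat', 'sleeps']), ('UNK', ['quickly'])]
```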

    SR(s, k) parsers: A class of shift-reduce bounded-context parsers

    We introduce a class of bottom-up parsers called SR(s, k) parsers, where s denotes a stack bound and k the lookahead bound. The importance of this class of parsers lies in the fact that states are formed by the union of item sets associated with a canonical LR(k) parser. One of our major results states that the class of SR(s, k) grammars properly includes the (s, k)-weak precedence grammars but is properly included within the class of (s, k)-bounded right-context grammars.
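    A hand-built sketch of the SR(s, k) control regime for a toy grammar with s = 2 and k = 1: every shift/reduce decision inspects at most the top two stack symbols and one lookahead token. The decision table below is written by hand for illustration only; it is not derived from canonical LR(k) item sets as in the paper.

```python
# Bounded-context shift-reduce sketch (s = 2, k = 1) for the toy grammar
#     E -> E + T | T,    T -> a

GRAMMAR = {1: ("E", ["E", "+", "T"]), 2: ("E", ["T"]), 3: ("T", ["a"])}

def decide(top2, look):
    """Return ('shift',), ('reduce', rule) or ('accept',) from bounded context."""
    if top2[-1] == "a":
        return ("reduce", 3)                      # T -> a
    if top2 == ("+", "T"):
        return ("reduce", 1)                      # E -> E + T
    if top2[-1] == "T":
        return ("reduce", 2)                      # E -> T
    if top2 == ("#", "E") and look == "$":
        return ("accept",)
    if look in ("a", "+"):
        return ("shift",)
    raise SyntaxError(f"stuck with stack context {top2} and lookahead {look!r}")

def sr_parse(tokens):
    stack, inp, rules = ["#"], list(tokens) + ["$"], []
    while True:
        action = decide(tuple(stack[-2:]), inp[0])   # at most 2 stack symbols, 1 lookahead
        if action[0] == "accept":
            return rules                             # reductions in bottom-up order
        if action[0] == "shift":
            stack.append(inp.pop(0))
        else:
            lhs, rhs = GRAMMAR[action[1]]
            del stack[-len(rhs):]                    # pop the whole right-hand side
            stack.append(lhs)
            rules.append(action[1])

print(sr_parse("a + a + a".split()))   # [3, 2, 3, 1, 3, 1]
```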