8 research outputs found

    Weakly supervised parsing with rules

    Get PDF
    International audienceThis work proposes a new research direction to address the lack of structures in traditional n-gram models. It is based on a weakly supervised dependency parser that can model speech syntax without relying on any annotated training corpus. La- beled data is replaced by a few hand-crafted rules that encode basic syntactic knowledge. Bayesian inference then samples the rules, disambiguating and combining them to create complex tree structures that maximize a discriminative model's posterior on a target unlabeled corpus. This posterior encodes sparse se- lectional preferences between a head word and its dependents. The model is evaluated on English and Czech newspaper texts, and is then validated on French broadcast news transcriptions

    Automatic grammar induction from free text using insights from cognitive grammar

    Get PDF
    Automatic identification of the grammatical structure of a sentence is useful in many Natural Language Processing (NLP) applications such as Document Summarisation, Question Answering systems and Machine Translation. With the availability of syntactic treebanks, supervised parsers have been developed successfully for many major languages. However, for low-resourced minority languages with fewer digital resources, this poses more of a challenge. Moreover, there are a number of syntactic annotation schemes motivated by different linguistic theories and formalisms which are sometimes language specific and they cannot always be adapted for developing syntactic parsers across different language families. This project aims to develop a linguistically motivated approach to the automatic induction of grammatical structures from raw sentences. Such an approach can be readily adapted to different languages including low-resourced minority languages. We draw the basic approach to linguistic analysis from usage-based, functional theories of grammar such as Cognitive Grammar, Computational Paninian Grammar and insights from psycholinguistic studies. Our approach identifies grammatical structure of a sentence by recognising domain-independent, general, cognitive patterns of conceptual organisation that occur in natural language. It also reflects some of the general psycholinguistic properties of parsing by humans - such as incrementality, connectedness and expectation. Our implementation has three components: Schema Definition, Schema Assembly and Schema Prediction. Schema Definition and Schema Assembly components were implemented algorithmically as a dictionary and rules. An Artificial Neural Network was trained for Schema Prediction. By using Parts of Speech tags to bootstrap the simplest case of token level schema definitions, a sentence is passed through all the three components incrementally until all the words are exhausted and the entire sentence is analysed as an instance of one final construction schema. The order in which all intermediate schemas are assembled to form the final schema can be viewed as the parse of the sentence. Parsers for English and Welsh (a low-resource minority language) were developed using the same approach with some changes to the Schema Definition component. We evaluated the parser performance by (a) Quantitative evaluation by comparing the parsed chunks against the constituents in a phrase structure tree (b) Manual evaluation by listing the range of linguistic constructions covered by the parser and by performing error analysis on the parser outputs (c) Evaluation by identifying the number of edits required for a correct assembly (d) Qualitative evaluation based on Likert scales in online surveys

    Posterior Sparsity in Unsupervised Dependency Parsing

    Get PDF
    A strong inductive bias is essential in unsupervised grammar induction. In this paper, we explore a particular sparsity bias in dependency grammars that encourages a small number of unique dependency types. We use part-of-speech (POS) tags to group dependencies by parent-child types and investigate sparsity-inducing penalties on the posterior distributions of parent-child POS tag pairs in the posterior regularization (PR) framework of Graça et al. (2007). In experiments with 12 different languages, we achieve significant gains in directed accuracy over the standard expectation maximization (EM) baseline for 9 of the languages, with an average accuracy improvement of 6%. Further, we show that for 8 out of 12 languages, the new method outperforms models based on standard Bayesian sparsity-inducing parameter priors, with an average improvement of 4%. On English text in particular, we show that our approach improves performance over other state of the art techniques

    Unsupervised grammar induction with Combinatory Categorial Grammars

    Get PDF
    Language is a highly structured medium for communication. An idea starts in the speaker's mind (semantics) and is transformed into a well formed, intelligible, sentence via the specific syntactic rules of a language. We aim to discover the fingerprints of this process in the choice and location of words used in the final utterance. What is unclear is how much of this latent process can be discovered from the linguistic signal alone and how much requires shared non-linguistic context, knowledge, or cues. Unsupervised grammar induction is the task of analyzing strings in a language to discover the latent syntactic structure of the language without access to labeled training data. Successes in unsupervised grammar induction shed light on the amount of syntactic structure that is discoverable from raw or part-of-speech tagged text. In this thesis, we present a state-of-the-art grammar induction system based on Combinatory Categorial Grammars. Our choice of syntactic formalism enables the first labeled evaluation of an unsupervised system. This allows us to perform an in-depth analysis of the system’s linguistic strengths and weaknesses. In order to completely eliminate reliance on any supervised systems, we also examine how performance is affected when we use induced word clusters instead of gold-standard POS tags. Finally, we perform a semantic evaluation of induced grammars, providing unique insights into future directions for unsupervised grammar induction systems
    corecore