Weakly supervised parsing with rules
This work proposes a new research direction to address the lack of structure in traditional n-gram models. It is based on a weakly supervised dependency parser that can model speech syntax without relying on any annotated training corpus. Labeled data is replaced by a few hand-crafted rules that encode basic syntactic knowledge. Bayesian inference then samples the rules, disambiguating and combining them to create complex tree structures that maximize a discriminative model's posterior on a target unlabeled corpus. This posterior encodes sparse selectional preferences between a head word and its dependents. The model is evaluated on English and Czech newspaper texts, and is then validated on French broadcast news transcriptions.
Cross-Lingual Transfer of Natural Language Processing Systems
Accurate natural language processing systems rely heavily on annotated datasets. In the absence of such datasets, transfer methods can help to develop a model by transferring annotations from one or more rich-resource languages to the target language of interest. These methods are generally divided into two approaches: 1) annotation projection from translation data, aka parallel data, using supervised models in rich-resource languages, and 2) direct model transfer from annotated datasets in rich-resource languages.
In this thesis, we demonstrate different methods for transfer of dependency parsers and sentiment analysis systems. We propose an annotation projection method that performs well in the scenarios for which a large amount of in-domain parallel data is available. We also propose a method which is a combination of annotation projection and direct transfer that can leverage a minimal amount of information from a small out-of-domain parallel dataset to develop highly accurate transfer models. Furthermore, we propose an unsupervised syntactic reordering model to improve the accuracy of dependency parser transfer for non-European languages. Finally, we conduct a diverse set of experiments for the transfer of sentiment analysis systems in different data settings.
A summary of our contributions is as follows:
* We develop accurate dependency parsers using parallel text in an annotation projection framework. We make use of the fact that the density of word alignments is a valuable indicator of reliability in annotation projection.
* We develop accurate dependency parsers in the absence of a large amount of parallel data. We use the Bible data, which is orders of magnitude smaller than a conventional parallel dataset, to provide minimal cues for creating cross-lingual word representations. Our model is also capable of boosting the performance of annotation projection with a large amount of parallel data. Our model develops cross-lingual word representations to go beyond the traditional delexicalized direct transfer methods. Moreover, we propose a simple but effective word translation approach that brings in explicit lexical features from the target language in our direct transfer method.
* We develop different syntactic reordering models that can change the source treebanks in rich-resource languages, thus preventing the learning of a wrong model for an unrelated language. Our experimental results show substantial improvements for non-European languages.
* We develop transfer methods for sentiment analysis in different data availability scenarios. We show that we can leverage cross-lingual word embeddings to create accurate sentiment analysis systems in the absence of annotated data in the target language of interest.
We believe that the novelties we introduce in this thesis demonstrate the usefulness of transfer methods. This is appealing in practice, especially since we suggest eliminating the requirement for annotating new datasets for low-resource languages, a process that is expensive, if not impossible.
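The annotation projection idea in the contributions above can be illustrated with a minimal sketch. It assumes a one-to-one word alignment and uses a simple density threshold as the reliability filter; the function names, the `min_density` parameter, and its value are hypothetical and are not taken from the thesis.

```python
from typing import Dict, List, Optional


def alignment_density(alignment: Dict[int, int], target_len: int) -> float:
    """Fraction of target tokens that are aligned to some source token."""
    if target_len == 0:
        return 0.0
    return len(alignment) / target_len


def project_heads(
    source_heads: List[int], alignment: Dict[int, int], target_len: int
) -> List[Optional[int]]:
    """Project source dependency heads onto the target sentence.

    source_heads[i] is the head index of source token i (-1 marks the root).
    alignment maps a target index to its aligned source index (assumed one-to-one).
    A target token gets None when its head cannot be projected.
    """
    source_to_target = {s: t for t, s in alignment.items()}
    projected: List[Optional[int]] = [None] * target_len
    for t, s in alignment.items():
        head = source_heads[s]
        if head == -1:
            projected[t] = -1  # the root stays the root
        elif head in source_to_target:
            projected[t] = source_to_target[head]
    return projected


def project_if_dense(
    source_heads: List[int],
    alignment: Dict[int, int],
    target_len: int,
    min_density: float = 0.8,  # hypothetical threshold, chosen for illustration
) -> Optional[List[Optional[int]]]:
    """Keep a projected tree only when the alignment is dense enough."""
    if alignment_density(alignment, target_len) < min_density:
        return None
    return project_heads(source_heads, alignment, target_len)
```

Sentences that fail the density check are simply dropped here; a real system could instead keep partial trees and weight them by density.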
Automatic grammar induction from free text using insights from cognitive grammar
Automatic identification of the grammatical structure of a sentence is useful in many Natural Language
Processing (NLP) applications such as Document Summarisation, Question Answering systems and
Machine Translation. With the availability of syntactic treebanks, supervised parsers have been
developed successfully for many major languages. However, for low-resourced minority languages with
fewer digital resources, building such parsers is more of a challenge. Moreover, there are a number of syntactic
annotation schemes motivated by different linguistic theories and formalisms which are sometimes
language specific and they cannot always be adapted for developing syntactic parsers across different
language families.
This project aims to develop a linguistically motivated approach to the automatic induction of
grammatical structures from raw sentences. Such an approach can be readily adapted to different
languages including low-resourced minority languages. We draw the basic approach to linguistic analysis
from usage-based, functional theories of grammar such as Cognitive Grammar, Computational Paninian
Grammar and insights from psycholinguistic studies. Our approach identifies grammatical structure of a
sentence by recognising domain-independent, general, cognitive patterns of conceptual organisation
that occur in natural language. It also reflects some of the general psycholinguistic properties of parsing
by humans - such as incrementality, connectedness and expectation.
Our implementation has three components: Schema Definition, Schema Assembly and Schema
Prediction. Schema Definition and Schema Assembly components were implemented algorithmically as
a dictionary and rules. An Artificial Neural Network was trained for Schema Prediction. By using
part-of-speech tags to bootstrap the simplest case of token-level schema definitions, a sentence is passed
through all the three components incrementally until all the words are exhausted and the entire
sentence is analysed as an instance of one final construction schema. The order in which all intermediate
schemas are assembled to form the final schema can be viewed as the parse of the sentence. Parsers
for English and Welsh (a low-resource minority language) were developed using the same approach with
some changes to the Schema Definition component. We evaluated parser performance by (a)
quantitative evaluation, comparing the parsed chunks against the constituents in a phrase structure
tree; (b) manual evaluation, listing the range of linguistic constructions covered by the parser and
performing error analysis on the parser outputs; (c) counting the number of edits
required for a correct assembly; and (d) qualitative evaluation based on Likert scales in online surveys.
Posterior Sparsity in Unsupervised Dependency Parsing
A strong inductive bias is essential in unsupervised grammar induction. In this paper, we explore a particular sparsity bias in dependency grammars that encourages a small number of unique dependency types. We use part-of-speech (POS) tags to group dependencies by parent-child types and investigate sparsity-inducing penalties on the posterior distributions of parent-child POS tag pairs in the posterior regularization (PR) framework of Graça et al. (2007). In experiments with 12 different languages, we achieve significant gains in directed accuracy over the standard expectation maximization (EM) baseline for 9 of the languages, with an average accuracy improvement of 6%. Further, we show that for 8 out of 12 languages, the new method outperforms models based on standard Bayesian sparsity-inducing parameter priors, with an average improvement of 4%. On English text in particular, we show that our approach improves performance over other state-of-the-art techniques.
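The two quantities this abstract relies on can be sketched in a few lines. Directed accuracy is the fraction of tokens whose predicted head matches the gold head. For the sparsity penalty, the sketch assumes an ℓ1/ℓ∞-style form: take the largest posterior seen for each parent-child POS pair, then sum over pairs, so that using few distinct pairs is cheap. That specific form, the input layout, and the function names are assumptions made for illustration, not details stated in the abstract.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def directed_accuracy(predicted_heads: List[int], gold_heads: List[int]) -> float:
    """Fraction of tokens whose predicted head index equals the gold head index."""
    if len(predicted_heads) != len(gold_heads):
        raise ValueError("predicted and gold head lists must have equal length")
    if not gold_heads:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted_heads, gold_heads))
    return correct / len(gold_heads)


def sparsity_penalty(edge_posteriors: Iterable[Tuple[str, str, float]]) -> float:
    """Assumed l1/l-infinity penalty over parent-child POS tag pairs.

    edge_posteriors yields (parent_tag, child_tag, posterior) triples, one per
    candidate edge in the corpus. The penalty is the sum, over tag pairs, of
    the maximum posterior observed for that pair.
    """
    max_per_pair: Dict[Tuple[str, str], float] = defaultdict(float)
    for parent_tag, child_tag, posterior in edge_posteriors:
        pair = (parent_tag, child_tag)
        max_per_pair[pair] = max(max_per_pair[pair], posterior)
    return sum(max_per_pair.values())
```

Under this form, a second occurrence of an already-used tag pair adds nothing to the penalty unless its posterior exceeds the current maximum, which is what pushes the model toward reusing a small set of dependency types.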
Unsupervised grammar induction with Combinatory Categorial Grammars
Language is a highly structured medium for communication. An idea starts in the speaker's mind (semantics) and is transformed into a well-formed, intelligible sentence via the specific syntactic rules of a language. We aim to discover the fingerprints of this process in the choice and location of words used in the final utterance. What is unclear is how much of this latent process can be discovered from the linguistic signal alone and how much requires shared non-linguistic context, knowledge, or cues.
Unsupervised grammar induction is the task of analyzing strings in a language to discover the latent syntactic structure of the language without access to labeled training data. Successes in unsupervised grammar induction shed light on the amount of syntactic structure that is discoverable from raw or part-of-speech tagged text. In this thesis, we present a state-of-the-art grammar induction system based on Combinatory Categorial Grammars. Our choice of syntactic formalism enables the first labeled evaluation of an unsupervised system. This allows us to perform an in-depth analysis of the system’s linguistic strengths and weaknesses. In order to completely eliminate reliance on any supervised systems, we also examine how performance is affected when we use induced word clusters instead of gold-standard POS tags. Finally, we perform a semantic evaluation of induced grammars, providing unique insights into future directions for unsupervised grammar induction systems.
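For readers unfamiliar with the formalism, the sketch below shows the two basic combinators of Combinatory Categorial Grammar, forward and backward application. It illustrates the formalism only and says nothing about how the thesis induces categories; the `Functor` class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Functor:
    """A complex category: it yields `result` after consuming `argument`."""
    result: "Category"
    slash: str  # "/" takes its argument on the right, "\\" on the left
    argument: "Category"


# A category is either atomic (a string such as "S" or "NP") or a Functor.
Category = Union[str, Functor]


def forward_apply(left: Category, right: Category) -> Optional[Category]:
    """Forward application: a rightward-looking functor consumes `right`."""
    if isinstance(left, Functor) and left.slash == "/" and left.argument == right:
        return left.result
    return None


def backward_apply(left: Category, right: Category) -> Optional[Category]:
    """Backward application: a leftward-looking functor consumes `left`."""
    if isinstance(right, Functor) and right.slash == "\\" and right.argument == left:
        return right.result
    return None


# A transitive verb first takes an object NP on its right,
# then a subject NP on its left, and yields a sentence S.
transitive_verb = Functor(Functor("S", "\\", "NP"), "/", "NP")
verb_phrase = forward_apply(transitive_verb, "NP")  # verb + object
sentence = backward_apply("NP", verb_phrase)        # subject + verb phrase
```

Because each category names both what it consumes and what it produces, every derivation step carries a label, which is the property the abstract credits for making a labeled evaluation possible.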