Automatic Extraction of Subcategorization from Corpora
We describe a novel technique and implemented system for constructing a
subcategorization dictionary from textual corpora. Each dictionary entry
encodes the relative frequency of occurrence of a comprehensive set of
subcategorization classes for English. An initial experiment, on a sample of 14
verbs which exhibit multiple complementation patterns, demonstrates that the
technique achieves accuracy comparable to previous approaches, which are all
limited to a highly restricted set of subcategorization classes. We also
demonstrate that a subcategorization dictionary built with the system improves
the accuracy of a parser by an appreciable amount.
Le DM, a French Dictionary for NooJ
This paper presents the DM, a new dictionary for French. Freely available resources are selectively used to obtain lexical lemmas, from which morphological grammars generate about 538,000 base forms. Evaluation of the DM on a corpus shows that it bears comparison with the previous NooJ delaf dictionary.
Lexical comprehension and production in Alexia system
Vocabulary is very important in language learning. Studies have shown that the dictionary is used very often in written comprehension tasks, yet its usefulness is not always obvious. In this paper we discuss the improvements electronic dictionaries can provide over classical paper ones. In lexical access, they help the learner by making the selection of and search for relevant information easier, thereby improving efficiency of use. Our system, Alexia, contains lexical information designed specifically for learners. In lexical production, automatic processing gives us broad possibilities. We show how we use an analyser and a parser to build new kinds of pedagogical activities.
Complex Annotations with NooJ
NooJ associates each text with a Text Annotation Structure, in which each recognized linguistic unit is represented by an annotation. Annotations store the position of the text units they represent, their length, and linguistic information. NooJ can represent and process complex annotations, such as those that represent units inside word forms, as well as those that are discontinuous. We demonstrate how to use NooJ's morphological, lexical, and syntactic tools to formalize and process these complex annotations.
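For instance (our own illustration of the kind of structure the abstract describes, with notation only approximating NooJ's lexical symbols): the single word form "cannot" can carry two annotations, one for the modal and one for the negation, each recorded with its own position and length inside the word form:

    cannot  ->  two annotations inside one word form:
                <can,V>   at position i,   length 3
                <not,ADV> at position i+3, length 3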
Can Subcategorisation Probabilities Help a Statistical Parser?
Research into the automatic acquisition of lexical information from corpora
is starting to produce large-scale computational lexicons containing data on
the relative frequencies of subcategorisation alternatives for individual
verbal predicates. However, the empirical question of whether this type of
frequency information can in practice improve the accuracy of a statistical
parser has not yet been answered. In this paper we describe an experiment with
a wide-coverage statistical grammar and parser for English and
subcategorisation frequencies acquired from ten million words of text which
shows that this information can significantly improve parse accuracy.
Learning Language from a Large (Unannotated) Corpus
A novel approach to the fully automated, unsupervised extraction of
dependency grammars and associated syntax-to-semantic-relationship mappings
from large text corpora is described. The suggested approach builds on the
authors' prior work with the Link Grammar, RelEx and OpenCog systems, as well
as on a number of prior papers and approaches from the statistical language
learning literature. If successful, this approach would enable the mining of
all the information needed to power a natural language comprehension and
generation system, directly from a large, unannotated corpus.
A knowledge-based approach to multiwords processing in machine translation: the English-Italian dictionary of multiwords
This poster presents a knowledge-based approach to the identification and translation of multiword expressions (MWEs) from English to Italian. The main assumption of the proposed methodology is that the proper treatment of MWEs in MT calls for a computational approach that is, at least partially, knowledge-based and, in particular, grounded in an explicit linguistic description of MWEs, using both a dictionary and a set of rules.
Empirical approaches bring interesting, complementary, robustness-oriented solutions, but taken alone they can hardly cope with this complex linguistic phenomenon, for various reasons. For instance, statistical approaches fail to identify and process infrequent MWEs in texts or, conversely, fail to recognise strings of words as single meaning units even when they are very frequent.
Furthermore, MWEs change continuously both in number and in internal structure with idiosyncratic morphological, syntactic, semantic, pragmatic and translational behaviours.
The hypothesis is that a linguistic approach can complement probabilistic methodologies and help identify and translate MWEs correctly, since hand-crafted, linguistically motivated resources, in the form of electronic dictionaries and local grammars, yield accurate and reliable results for NLP purposes.
The methodology adopted for this research work is mainly based on the following elements:
• an NLP environment which allows the development and testing of the linguistic resources;
• an electronic E-I MWE dictionary, based on an accurate linguistic description that accounts for different types of MWEs and their semantic properties by means of well-defined steps: identification, interpretation, disambiguation and, finally, application;
• a set of local grammars.
We will provide details about the methodology that can be applied to the identification and translation of MWEs.
1. NooJ: an NLP environment for the development and testing of MWE linguistic resources
NooJ is a freeware linguistic-engineering development platform used to develop large-coverage formalised descriptions of natural languages and apply them to large corpora, in real time.
The knowledge bases used by this tool are electronic dictionaries (simple words, MWEs and frozen expressions) and grammars, represented by organised sets of graphs, which formalise various linguistic aspects such as semi-frozen phenomena (local grammars), syntax (grammars for phrases and full sentences) and semantics (named entity recognition, transformational analysis). NooJ's linguistic engine includes several computational devices, such as Finite-State Transducers (FSTs), Finite-State Automata (FSAs), Recursive Transition Networks (RTNs), Enhanced Recursive Transition Networks (ERTNs), Regular Expressions (RegExs) and Context-Free Grammars (CFGs), used both to formalise linguistic phenomena and to parse texts.
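As a minimal sketch of what such a resource can look like (our own illustration, not one of the poster's grammars), a NooJ-style regular expression over lexical symbols can describe a discontinuous phrasal verb with an intervening object:

    <give> <DET> <N> up

Applied to an annotated corpus, a pattern of this kind would match sequences such as "gave the project up", with <give> standing for any inflected form of the verb and <DET> <N> for a simple object noun phrase.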
NooJ is a tool that is particularly suitable for processing different types of MWEs, and several experiments have already been carried out in this area: for instance, Machonis (2007 and 2008), Anastasiadis, Papadopoulou & Gavriilidou (2011), Aoughlis (2011) and Vietri (2008). These are only a few examples of the various analyses performed in the last few years on MWEs using NooJ as an NLP development and testing environment.
2. The Dictionary of English-Italian MWEs
The EIMWE.dic is a dictionary used to represent and recognise various types of MWEs.
This dictionary is based on a contrastive English-Italian analysis of continuous and discontinuous MWEs with different degrees of co-occurrence variability, different degrees of compositionality and different syntactic structures.
The translation of MWEs requires knowledge of the correct equivalent in the target language, which is hardly ever the result of a literal translation. Given their arbitrariness, MT has to rely on the availability of ready-made equivalents in both languages in order to translate accurately.
Each entry of the dictionary is given a coherent linguistic description consisting of:
• the grammatical category of each constituent of the MWE: noun (N), verb (V), adjective (A), preposition (PREP), determiner (DET), adverb (ADV), conjunction (CONJ);
• one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to nominalise them), preceded by the tag +FLX;
• one or more syntactic properties (e.g. “+transitive” or +N0VN1PREPN2);
• one or more semantic properties (e.g. distributional classes such as “+Human”, domain classes such as “+Politics”);
• the translation into Italian.
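As a rough illustration of how such a description could be serialised (our own sketch in NooJ-style .dic notation; the paradigm names and the +IT translation attribute are hypothetical, not taken from the poster):

    # hypothetical EIMWE.dic entries, for illustration only
    give up,V+FLX=GIVEUP+transitive+IT="rinunciare"
    red tape,N+FLX=TABLE+Abst+Politics+IT="burocrazia"

Here +FLX points to an inflectional paradigm so that forms such as "gave up" are also recognised, while the syntactic, semantic and translation attributes carry the information needed at transfer time.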
The EIMWE.dic contains different types of MWE POS patterns. The main part of the dictionary consists of phrasal verbs, support verb constructions, idiomatic expressions and collocations. In the poster, the main verb structures are explained with examples extracted from the British National Corpus, from the Internet by means of the WebCorp LSE application, or with our own examples, together with their Italian translations. Finally, the corresponding dictionary entry for each example of an MWE POS pattern is provided.
Inducing grammars from linguistic universals and realistic amounts of supervision
The best-performing NLP models to date are learned from large volumes of manually annotated data. For tasks like part-of-speech tagging and grammatical parsing, high performance can be achieved with plentiful supervised data. However, such resources are extremely costly to produce, making them an unlikely option for building NLP tools in under-resourced languages or domains. This dissertation is concerned with reducing the annotation required to learn NLP models, with the goal of opening up the range of domains and languages to which NLP technologies may be applied. In this work, we explore the possibility of learning from a degree of supervision that is at or close to the amount that could reasonably be collected from annotators for a particular domain or language that currently has none. We show that just a small amount of annotation input, even an amount that can be collected in just a few hours, can provide enormous advantages if we have learning algorithms that can appropriately exploit it. This work presents new algorithms, models, and approaches designed to learn grammatical information from weak supervision. In particular, we look at ways of intersecting a variety of different forms of supervision in complementary ways, thus lowering the overall annotation burden. Sources of information include tag dictionaries, morphological analyzers, constituent bracketings, and partial tree annotations, as well as unannotated corpora. For example, we present algorithms that are able to combine faster-to-obtain type-level annotation with unannotated text to remove the need for slower-to-obtain token-level annotation. Much of this dissertation describes work on Combinatory Categorial Grammar (CCG), a grammatical formalism notable for its use of structured, logic-backed categories that describe how each word and constituent fits into the overall syntax of the sentence. This work shows how linguistic universals intrinsic to the CCG formalism itself can be encoded as Bayesian priors to improve learning.
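To make the formalism concrete (a standard textbook illustration, not an example drawn from the dissertation itself): in CCG a transitive verb such as "sees" bears the category (S\NP)/NP, and a derivation combines such categories step by step:

    John        sees        Mary
     NP      (S\NP)/NP       NP
             -------------------  > (forward application)
                   S\NP
    ----------------------------  < (backward application)
                    S

The verb first consumes the object NP to its right, yielding the verb-phrase category S\NP, which then consumes the subject NP to its left to yield a sentence S.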