1,810 research outputs found

    Automatic Extraction of Subcategorization from Corpora

    Full text link
    We describe a novel technique and implemented system for constructing a subcategorization dictionary from textual corpora. Each dictionary entry encodes the relative frequency of occurrence of a comprehensive set of subcategorization classes for English. An initial experiment, on a sample of 14 verbs which exhibit multiple complementation patterns, demonstrates that the technique achieves accuracy comparable to previous approaches, which are all limited to a highly restricted set of subcategorization classes. We also demonstrate that a subcategorization dictionary built with the system improves the accuracy of a parser by an appreciable amount. Comment: 8 pages; requires aclap.sty. To appear in ANLP-97.
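To make the idea concrete, here is a minimal sketch, in Python, of the kind of data such a dictionary entry could hold: a verb mapped to the relative frequencies of its subcategorization classes. The frame labels and numbers are invented for illustration, not taken from the paper.

```python
# Hypothetical subcategorization dictionary: each entry records the
# relative frequency of each subcategorization class for a verb.
# Frame names and frequencies below are illustrative only.
SUBCAT_DICT = {
    "believe": {
        "NP": 0.55,         # "believe the story"
        "S-comp": 0.40,     # "believe (that) she left"
        "NP-to-INF": 0.05,  # "believe him to be honest"
    },
}

def frame_frequency(verb: str, frame: str) -> float:
    """Relative frequency of a subcategorization frame; 0.0 if unseen."""
    return SUBCAT_DICT.get(verb, {}).get(frame, 0.0)

print(frame_frequency("believe", "S-comp"))  # 0.4
```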

    Le DM, a French Dictionary for NooJ

    Get PDF
    This paper presents the DM, a new dictionary for French. Freely available resources are selectively used to obtain lexical lemmas, from which morphological grammars generate about 538,000 baseforms. Evaluation of the DM on a corpus shows that it stands comparison with the previous NooJ delaf dictionary.
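As a rough illustration of the lemma-to-baseform expansion described above (not the DM's actual morphological grammars), an inflectional paradigm can be modelled as a set of endings attached to a stem; the paradigm below is a simplified toy for French first-group verbs.

```python
# Toy paradigm: unique present-tense endings for French "-er" verbs.
# A real morphological grammar covers far more tenses and exceptions.
PARADIGMS = {
    "V-er": ["e", "es", "ons", "ez", "ent"],
}

def generate_forms(lemma: str, paradigm: str) -> list[str]:
    """Strip the infinitive ending and attach each paradigm suffix."""
    stem = lemma[:-2]  # drop "-er"
    return [stem + suffix for suffix in PARADIGMS[paradigm]]

print(generate_forms("parler", "V-er"))
# ['parle', 'parles', 'parlons', 'parlez', 'parlent']
```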

    Lexical comprehension and production in Alexia system

    No full text
    In language learning, vocabulary is very important. Studies have shown that dictionaries are used very often in written comprehension tasks; however, their utility is not always obvious. In this paper we discuss the improvements electronic dictionaries can provide over classical paper ones. In lexical access, they help the learner by making the selection of and search for relevant information easier, thereby improving the efficiency of usage. Our system, Alexia, contains lexical information designed specifically for learners. In lexical production, computers offer broad possibilities through automatic processing. We show how we use an analyser and a parser to create new kinds of pedagogical activities.

    Complex Annotations with NooJ

    Get PDF
    NooJ associates each text with a Text Annotation Structure, in which each recognized linguistic unit is represented by an annotation. Annotations store the position of the text units they represent, their length, and linguistic information. NooJ can represent and process complex annotations, such as those that represent units inside word forms, as well as those that are discontinuous. We demonstrate how to use NooJ's morphological, lexical, and syntactic tools to formalize and process these complex annotations.
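A minimal sketch, under assumptions, of what such an annotation might carry: the position and length of the annotated text unit plus its linguistic information, with a discontinuous unit modelled as a list of (position, length) spans. The class and field names are hypothetical, not NooJ's actual representation.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    # (start, length) pairs; more than one span marks a discontinuous unit
    spans: list[tuple[int, int]]
    info: str  # linguistic information, e.g. lemma plus category tags

text = "She asked him out"
# "asked ... out" annotated as one discontinuous phrasal-verb unit:
ann = Annotation(spans=[(4, 5), (14, 3)], info="ask out,V+PartV")
print(" + ".join(text[s:s + l] for s, l in ann.spans))  # asked + out
```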

    Can Subcategorisation Probabilities Help a Statistical Parser?

    Full text link
    Research into the automatic acquisition of lexical information from corpora is starting to produce large-scale computational lexicons containing data on the relative frequencies of subcategorisation alternatives for individual verbal predicates. However, the empirical question of whether this type of frequency information can in practice improve the accuracy of a statistical parser has not yet been answered. In this paper we describe an experiment with a wide-coverage statistical grammar and parser for English and subcategorisation frequencies acquired from ten million words of text which shows that this information can significantly improve parse accuracy. Comment: 9 pages, uses colacl.sty.
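One simple way such frequencies could be folded into a parser's ranking (an illustration of the general idea, not the paper's actual model) is to re-score each candidate parse with the log relative frequency of the subcategorisation frame assigned to each verb.

```python
import math

# Illustrative frequencies; "give" prefers the ditransitive frame.
SUBCAT = {"give": {"NP-NP": 0.6, "NP-PP": 0.4, "NP": 0.0}}

def rescore(parse_score: float, frames: list[tuple[str, str]]) -> float:
    """Add log frame frequencies (floored for unseen frames) to a base score."""
    total = parse_score
    for verb, frame in frames:
        p = SUBCAT.get(verb, {}).get(frame, 0.0)
        total += math.log(max(p, 1e-6))  # floor avoids log(0)
    return total

# The ditransitive analysis outranks one that drops an argument:
print(rescore(-10.0, [("give", "NP-NP")]) > rescore(-10.0, [("give", "NP")]))
```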

    Learning Language from a Large (Unannotated) Corpus

    Full text link
    A novel approach to the fully automated, unsupervised extraction of dependency grammars and associated syntax-to-semantic-relationship mappings from large text corpora is described. The suggested approach builds on the authors' prior work with the Link Grammar, RelEx and OpenCog systems, as well as on a number of prior papers and approaches from the statistical language learning literature. If successful, this approach would enable the mining of all the information needed to power a natural language comprehension and generation system, directly from a large, unannotated corpus. Comment: 29 pages, 5 figures; research proposal.
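As a taste of the statistical building blocks such proposals draw on (this is a generic sketch, not the authors' system), word pairs can be scored by pointwise mutual information, from which dependency-like links can later be induced.

```python
import math
from collections import Counter
from itertools import combinations

def build_pmi(sentences: list[list[str]]):
    """Count words and co-occurring pairs, return a PMI scoring function."""
    words, pairs, n = Counter(), Counter(), 0
    for sent in sentences:
        n += len(sent)
        words.update(sent)
        pairs.update(frozenset(p) for p in combinations(set(sent), 2))
    def pmi(a: str, b: str) -> float:
        c = pairs[frozenset((a, b))]
        # PMI = log P(a,b) / (P(a) P(b)); -inf if the pair never co-occurs
        return math.log(c * n / (words[a] * words[b])) if c else float("-inf")
    return pmi

pmi = build_pmi([["the", "cat", "sat"], ["the", "dog", "sat"]])
print(pmi("the", "sat"), pmi("cat", "dog"))  # ~1.10 vs -inf
```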

    A knowledge-based approach to multiwords processing in machine translation: the English-Italian dictionary of multiwords

    Get PDF
    This poster presents a knowledge-based approach to the identification and translation of multiword expressions (MWEs) from English to Italian. The main assumption of the proposed methodology is that the proper treatment of MWEs in MT calls for a computational approach that is, at least partially, knowledge-based, and in particular grounded on an explicit linguistic description of MWEs, using both a dictionary and a set of rules. Empirical approaches bring interesting, complementary, robustness-oriented solutions, but taken alone they can hardly cope with this complex linguistic phenomenon, for various reasons. For instance, statistical approaches fail to identify and process infrequent MWEs in texts or, conversely, fail to recognise strings of words as single meaning units even when they are very frequent. Furthermore, MWEs change continuously both in number and in internal structure, with idiosyncratic morphological, syntactic, semantic, pragmatic and translational behaviours. The hypothesis is that a linguistic approach can complement probabilistic methodologies in identifying and translating MWEs correctly, since hand-crafted, linguistically motivated resources, in the form of electronic dictionaries and local grammars, obtain accurate and reliable results for NLP purposes. The methodology adopted for this research work is mainly based on the following elements:
    • an NLP environment which allows the development and testing of the linguistic resources;
    • an electronic E-I MWE dictionary, based on an accurate linguistic description that accounts for different types of MWEs and their semantic properties by means of well-defined steps: identification, interpretation, disambiguation and, finally, application;
    • a set of local grammars.
    We provide details below about the methodology that can be applied to the identification and translation of MWEs.

    1. NooJ: an NLP environment for the development and testing of MWE linguistic resources
    NooJ is a freeware linguistic-engineering development platform used to develop large-coverage formalised descriptions of natural languages and apply them to large corpora in real time. The knowledge bases used by this tool are electronic dictionaries (simple words, MWEs and frozen expressions) and grammars, represented by organised sets of graphs, which formalise various linguistic aspects such as semi-frozen phenomena (local grammars), syntax (grammars for phrases and full sentences) and semantics (named-entity recognition, transformational analysis). NooJ's linguistic engine includes several computational devices used both to formalise linguistic phenomena and to parse texts, such as FSTs, FSAs, Recursive Transition Networks (RTNs), Enhanced Recursive Transition Networks (ERTNs), Regular Expressions (RegExs) and Context-Free Grammars (CFGs). NooJ is particularly suitable for processing different types of MWEs, and several experiments have already been carried out in this area: for instance, Machonis (2007 and 2008), Anastasiadis, Papadopoulou & Gavriilidou (2011), Aoughlis (2011) and Vietri (2008). These are only a few examples of the various analyses performed in the last few years on MWEs using NooJ as an NLP development and testing environment.

    2. The Dictionary of English-Italian MWEs
    The EIMWE.dic is a dictionary used to represent and recognise various types of MWEs.
    This dictionary is based on a contrastive English-Italian analysis of continuous and discontinuous MWEs with different degrees of co-occurrence variability, of word compositionality, and different syntactic structures. The translation of MWEs requires knowledge of the correct equivalent in the target language, which is hardly ever the result of a literal translation. Given their arbitrariness, MT has to rely on the availability of ready solutions in both languages in order to perform an accurate translation. Each entry of the dictionary is given a coherent linguistic description consisting of:
    • the grammatical category of each constituent of the MWE: noun (N), verb (V), adjective (A), preposition (PREP), determiner (DET), adverb (ADV), conjunction (CONJ);
    • one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to nominalise them), preceded by the tag +FLX;
    • one or more syntactic properties (e.g. “+transitive” or +N0VN1PREPN2);
    • one or more semantic properties (e.g. distributional classes such as “+Human”, domain classes such as “+Politics”);
    • the translation into Italian.
    The EIMWE.dic contains different types of MWE POS patterns. The main part of the dictionary consists of phrasal verbs, support-verb constructions, idiomatic expressions and collocations. In the poster, the main verb structures are explained with examples extracted from the British National Corpus, from the Internet by means of the WebCorp LSE application, or with our own examples, together with their Italian translations. Finally, the corresponding dictionary entry for each example of an MWE POS pattern is provided.
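A hedged sketch of the information listed above for a single entry; the Python field names and the sample MWE are illustrative, not the EIMWE.dic's actual encoding.

```python
# One hypothetical entry carrying the fields the abstract describes.
MWE_ENTRY = {
    "mwe": "take into account",
    "pos_pattern": ["V", "PREP", "N"],  # category of each constituent
    "inflection": "+FLX=TAKE",          # paradigm tag: only "take" inflects
    "syntax": ["+N0VN1PREPN2"],         # argument-structure property
    "semantics": ["+Human"],            # distributional class of the subject
    "translation_it": "prendere in considerazione",
}

print(MWE_ENTRY["mwe"], "->", MWE_ENTRY["translation_it"])
```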

    Natural language software registry (second edition)

    Get PDF