40 research outputs found

    Exploiting Arabic Diacritization for High Quality Automatic Annotation

    Get PDF
    International audienceWe present a novel technique for Arabic morphological annotation. The technique utilizes diacritization to produce morphological annotations of quality comparable to human annotators. Although Arabic text is generally written without diacritics, diacritization is already available for large corpora of Arabic text in several genres. Furthermore, diacritization can be generated at a low cost for new text as it does not require specialized training beyond what educated Arabic typists know. The basic approach is to enrich the input to a state-of-the-art Arabic morphological analyzer with word diacritics (full or partial) to enhance its performance. When applied to fully diacritized text, our approach produces annotations with an accuracy of over 97% on lemma, part-of-speech, and tokenization combined

    Proceedings

    Get PDF
    Proceedings of the Ninth International Workshop on Treebanks and Linguistic Theories. Editors: Markus Dickinson, Kaili Müürisep and Marco Passarotti. NEALT Proceedings Series, Vol. 9 (2010), 268 pages. © 2010 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/15891

    Nodalida 2005 - proceedings of the 15th NODALIDA conference

    Get PDF

    A tree does not make a well-formed sentence: Improving syntactic string-to-tree statistical machine translation with more linguistic knowledge

    Get PDF
    AbstractSynchronous context-free grammars (SCFGs) can be learned from parallel texts that are annotated with target-side syntax, and can produce translations by building target-side syntactic trees from source strings. Ideally, producing syntactic trees would entail that the translation is grammatically well-formed, but in reality, this is often not the case. Focusing on translation into German, we discuss various ways in which string-to-tree translation models over- or undergeneralise. We show how these problems can be addressed by choosing a suitable parser and modifying its output, by introducing linguistic constraints that enforce morphological agreement and constrain subcategorisation, and by modelling the productive generation of German compounds

    Representation and parsing of multiword expressions

    Get PDF
    This book consists of contributions related to the definition, representation and parsing of MWEs. These reflect current trends in the representation and processing of MWEs. They cover various categories of MWEs such as verbal, adverbial and nominal MWEs, various linguistic frameworks (e.g. tree-based and unification-based grammars), various languages including English, French, Modern Greek, Hebrew, Norwegian), and various applications (namely MWE detection, parsing, automatic translation) using both symbolic and statistical approaches

    Current trends

    Get PDF
    Deep parsing is the fundamental process aiming at the representation of the syntactic structure of phrases and sentences. In the traditional methodology this process is based on lexicons and grammars representing roughly properties of words and interactions of words and structures in sentences. Several linguistic frameworks, such as Headdriven Phrase Structure Grammar (HPSG), Lexical Functional Grammar (LFG), Tree Adjoining Grammar (TAG), Combinatory Categorial Grammar (CCG), etc., offer different structures and combining operations for building grammar rules. These already contain mechanisms for expressing properties of Multiword Expressions (MWE), which, however, need improvement in how they account for idiosyncrasies of MWEs on the one hand and their similarities to regular structures on the other hand. This collaborative book constitutes a survey on various attempts at representing and parsing MWEs in the context of linguistic theories and applications

    An investigation into deviant morphology : issues in the implementation of a deep grammar for Indonesian

    Get PDF
    This thesis investigates deviant morphology in Indonesian for the implementation of a deep grammar. In particular we focus on the implementation of the verbal suffix -kan. This suffix has been described as having many functions, which alter the kinds of arguments and the number of arguments the verb takes (Dardjowidjojo 1971; Chung 1976; Arka 1993; Vamarasi 1999; Kroeger 2007; Son and Cole 2008). Deep grammars or precision grammars (Butt et al. 1999a; Butt et al. 2003; Bender et al. 2011) have been shown to be useful for natural language processing (NLP) tasks, such as machine translation and generation (Oepen et al. 2004; Cahill and Riester 2009; Graham 2011), and information extraction (MacKinlay et al. 2012), demonstrating the need for linguistically rich information to aid NLP tasks. Although these linguistically-motivated grammars are invaluable resources to the NLP community, the biggest drawback is the time required for the manual creation and curation of the lexicon. Our work aims to expedite this process by applying methods to assign syntactic information to kan-affixed verbs automatically. The method we employ exploits the hypothesis that semantic similarity is tightly connected with syntactic behaviour (Levin 1993). Our endeavour in automatically acquiring verbal information for an Indonesian deep grammar poses a number of lingustic challenges. First of all Indonesian verbs exhibit voice marking that is characteristic of the subgrouping of its language family. In order to be able to characterise verbal behaviour in Indonesian, we first need to devise a detailed analysis of voice for implementation. Another challenge we face is the claim that all open class words in Indonesian, at least as it is spoken in some varieties (Gil 1994; Gil 2010), cannot linguistically be analysed as being distinct from each other. That is, there is no distiction between nouns, verbs or adjectives in Indonesian, and all word from the open class categories should be analysed uniformly. This poses difficulties in implementing a grammar in a linguistically motivated way, as well discovering syntactic behaviour of verbs, if verbs cannot be distinguished from nouns. As part of our investigation we conduct experiments to verify the need to employ word class categories, and we find that indeed these are linguistically motivated labels in Indonesian. Through our investigation into deviant morphological behaviour, we gain a better characterisation of the morphosyntactic effects of -kan, and we discover that, although Indonesian has been labelled as a language with no open word class distinctions, word classes can be established as being linguistically-motivated

    Distributional Semantic Models of Attribute Meaning in Adjectives and Nouns

    Get PDF
    Attributes such as SIZE, WEIGHT or COLOR are at the core of conceptualization, i.e., the formal representation of entities or events in the real world. In natural language, formal attributes find their counterpart in attribute nouns which can be used in order to generalize over individual properties (e.g., 'big' or 'small' in case of SIZE, 'blue' or 'red' in case of COLOR). In order to ascribe such properties to entities or events, adjective-noun phrases are a very frequent linguistic pattern (e.g., 'a blue shirt', 'a big lion'). In these constructions, attribute meaning is conveyed only implicitly, i.e., without being overtly realized at the phrasal surface. This thesis is about modeling attribute meaning in adjectives and nouns in a distributional semantics framework. This implies the acquisition of meaning representations for adjectives, nouns and their phrasal combination from corpora of natural language text in an unsupervised manner, without tedious handcrafting or manual annotation efforts. These phrase representations can be used to predict implicit attribute meaning from adjective-noun phrases -- a problem which will be referred to as attribute selection throughout this thesis. The approach to attribute selection proposed in this thesis is framed in structured distributional models. We model adjective and noun meanings as distinct semantic vectors in the same semantic space spanned by attributes as dimensions of meaning. Based on these word representations, we make use of vector composition operations in order to construct a phrase representation from which the most prominent attribute(s) being expressed in the compositional semantics of the adjective-noun phrase can be selected by means of an unsupervised selection function. This approach not only accounts for the linguistic principle of compositionality that underlies adjective-noun phrases, but also avoids inherent sparsity issues that result from the fact that the relationship between an adjective, a noun and a particular attribute is rarely explicitly observed in corpora. The attribute models developed in this thesis aim at a reconciliation of the conflict between specificity and sparsity in distributional semantic models. For this purpose, we compare various instantiations of attribute models capitalizing on pattern-based and dependency-based distributional information as well as attribute-specific latent topics induced from a weakly supervised adaptation of Latent Dirichlet Allocation. Moreover, we propose a novel framework of distributional enrichment in order to enhance structured vector representations by incorporating additional lexical information from complementary distributional sources. In applying distributional enrichment to distributional attribute models, we follow the idea to augment structured representations of adjectives and nouns to centroids of their nearest neighbours in semantic space, while keeping the principle of meaning representation along structured, interpretable dimensions intact. We evaluate our attribute models in several experiments on the attribute selection task framed for various attribute inventories, ranging from a thoroughly confined set of ten core attributes up to a large-scale set of 260 attributes. Our results show that large-scale attribute selection from distributional vector representations that have been acquired in an unsupervised setting is a challenging endeavor that can be rendered more feasible by restricting the semantic space to confined subsets of attributes. Beyond quantitative evaluation, we also provide a thorough analysis of performance factors (based on linear regression) that influence the effectiveness of a distributional attribute model for attribute selection. This investigation reflects strengths and weaknesses of the model and sheds light on the impact of a variety of linguistic factors involved in attribute selection, e.g., the relative contribution of adjective and noun meaning. In conclusion, we consider our work on attribute selection as an instructive showcase for applying methods from distributional semantics in the broader context of knowledge acquisition from text in order to alleviate issues that are related to implicitness and sparsity

    Distributional Semantic Models of Attribute Meaning in Adjectives and Nouns

    Get PDF
    Hartung M. Distributional Semantic Models of Attribute Meaning in Adjectives and Nouns. Heidelberg: Universität Heidelberg; 2015
    corecore