3 research outputs found

    Refining Implicit Argument Annotation for UCCA

    Predicate-argument structure analysis is a central component in meaning representations of text. The fact that some arguments are not explicitly mentioned in a sentence gives rise to ambiguity in language understanding and makes it difficult for machines to interpret text correctly. However, few resources represent implicit roles for NLU, and existing studies in NLP make only coarse distinctions between categories of arguments omitted from linguistic form. This paper proposes a typology for fine-grained implicit argument annotation on top of Universal Conceptual Cognitive Annotation's foundational layer. The proposed implicit argument categorisation is driven by theories of implicit role interpretation and consists of six types: Deictic, Generic, Genre-based, Type-identifiable, Non-specific, and Iterated-set. We exemplify our design by revisiting part of the UCCA EWT corpus, providing a new dataset annotated with the refinement layer, and conducting a comparative analysis with other schemes. Comment: DMR 202

    A Unified Annotation Scheme for the Semantic/Pragmatic Components of Definiteness

    We present a definiteness annotation scheme that captures the semantic, pragmatic, and discourse information associated with noun phrases, which we call communicative functions. A survey of the linguistics literature suggests that definiteness does not express a single communicative function but is a grammaticalization of many such functions, for example, identifiability, familiarity, uniqueness, and specificity. Our annotation scheme unifies ideas from previous research on definiteness while attempting to remove redundancy. The scheme encodes the communicative functions of definiteness rather than the grammatical forms of definiteness. We assume that the communicative functions are largely maintained across languages while the grammaticalization of this information may vary. Corpora that are annotated using communicative functions can be used to train classifiers, offering data-driven insights into the grammaticalization of definiteness in different languages. We release our annotated corpora for English and Hindi as well as sample annotations for Hebrew and Russian, together with an annotation manual.

    A quantitative and qualitative analysis of competing motivations interacting in the placement of finite relative clauses in Hindi

    Hindi has an unmarked SOV order (it is a verb-final language), but constituents can be arranged in different orders. While earlier studies have focused on the rich set of word order variations, alternations at the clausal level have not received much attention (see Manetta 2012). Hindi finite RCs present an ideal case study for investigating clausal ordering because they can optionally occupy one of three positions: at the left edge of the main clause (left-peripheral or correlative), at the right edge of the main clause (right-peripheral or extraposed), or immediately after the noun phrase they modify (adnominal). This dissertation applies quantitative and qualitative methods to corpus data to investigate how grammatical weight, linear distance, and information structure interact with syntactic locality to determine the position of the relative clause at the left and right peripheries. These factors were drawn from previous studies on Hindi RCs (Dayal 1996; Srivastav 1991; Bhatt 2003; among others), as well as from studies on other word/clause order phenomena in English and German, especially on relative clause extraposition (Francis 2010; Francis & Michaelis 2011; Strunk 2010). This dissertation argues that regardless of the syntactic analysis of these constructions, i.e. movement or base-generation adjunction, speakers have three main possible constructions to choose from when conveying a message. This selection is not random, but rather motivated by syntactic and non-syntactic factors. In particular, the present corpus study investigates the following questions: (a) what factor(s) influence the choice of one ordering over another in the production of finite relative clauses in Hindi; (b) what function(s) can clause ordering alternation serve, particularly in the two cases of discontinuous dependencies at the left and right peripheries; and (c) can we predict a preference for any of these constructions based on particular factors?
The corpus comprised 2,000 sentences containing at least one finite relative construction, extracted from a set of 353 monolingual written Hindi texts from the EMILLE/CIIL Corpus (Lancaster University and the Central Institute of Indian Languages). The data were analysed using a combination of statistical methods to determine which factors have an effect on ordering alternations and whether there were interactions between them. A multinomial logistic regression was selected as the prediction model (cf. the binary regression models in Francis & Michaelis 2016 and Strunk 2014). The predictive power of the model was also tested by means of a confusion (error) matrix, using R (R Development Core Team 2017). The results of the corpus study confirmed that several competing factors have an effect on the placement of finite relative clauses in Hindi. The findings confirmed Hawkins' (1994; 2004) claim that syntactic locality and grammatical weight are stronger predictors than discourse factors in determining ordering variations. Although discourse factors such as definiteness, givenness, and restrictiveness do not have a strong effect in predicting relative clause configurations, the data show interactions between them and syntactic locality and grammatical weight. Furthermore, the Principle of Minimize Domains (Hawkins 1994; 2004) and the Principle of End-Weight (Quirk et al. 1972) successfully account for the asymmetries reported in previous studies (Srivastav 1991; Dayal 1996), particularly the repetition of the nominal head inside and outside the RC, the demonstrative requirement, the availability of multi-heading, stacking/coordination phenomena, and restrictiveness. Another interesting finding was that Hindi, like English, prefers short-before-long sequences, in contrast with other verb-final languages such as Japanese and Korean, which prefer long-before-short (cf. Hawkins 1994; 2004).
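The abstract does not give the model's actual predictors, coefficients, or code (the analysis was run in R). Purely as an illustration of the modelling setup it describes, the sketch below fits a multinomial (softmax) logistic regression by gradient descent in plain Python on invented toy data: "length" and "dist" are hypothetical numeric proxies for grammatical weight and linear distance, and the three outcome classes stand in for the adnominal, left-peripheral, and right-peripheral positions.

```python
import math
import random

# Illustrative only: the predictors and the generative rule below are
# invented stand-ins, not the dissertation's actual variables or data.
random.seed(0)
data = []
for _ in range(200):
    length = random.gauss(8, 3)   # hypothetical RC length in words (weight proxy)
    dist = random.gauss(4, 2)     # hypothetical head-RC distance (locality proxy)
    if length > 10:               # invented rule: heavy RCs extrapose
        label = 2                 # right-peripheral
    elif dist > 5:
        label = 1                 # left-peripheral
    else:
        label = 0                 # adnominal
    data.append(([1.0, length, dist], label))   # leading 1.0 = intercept term

K, D = 3, 3                        # number of classes, feature dimensions
W = [[0.0] * D for _ in range(K)]  # one weight vector per class

def softmax(scores):
    m = max(scores)                           # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(x):
    return softmax([sum(wi * xi for wi, xi in zip(w, x)) for w in W])

# Batch gradient descent on the multinomial log-likelihood.
lr, epochs = 0.03, 2000
for _ in range(epochs):
    grad = [[0.0] * D for _ in range(K)]
    for x, y in data:
        p = predict_proba(x)
        for k in range(K):
            err = p[k] - (1.0 if k == y else 0.0)
            for d in range(D):
                grad[k][d] += err * x[d]
    for k in range(K):
        for d in range(D):
            W[k][d] -= lr * grad[k][d] / len(data)

# Training-set accuracy of the fitted model.
accuracy = sum(
    max(range(K), key=lambda k: predict_proba(x)[k]) == y for x, y in data
) / len(data)
```

The point of the multinomial (rather than binary) formulation is that all three construction choices compete in a single model, so every prediction distributes probability across adnominal, left-peripheral, and right-peripheral simultaneously.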
Hindi also tends to place discourse-given NPs before discourse-new ones (Gupta 1986; Gundel 1989). In terms of predicting the structures that speakers will use, the confusion matrix showed a higher success rate in predicting right-peripheral constructions from their discursive and structural characteristics: 370 constructions were correctly matched with the original, whereas 56 were incorrectly predicted as left-peripheral constructions, and zero instances were incorrectly predicted as adnominal. On the other hand, adnominal relatives were incorrectly predicted as right-peripheral constructions in 51 instances, with only one correct match. Left-peripheral relatives were correctly matched 154 times, incorrectly matched with an adnominal construction once, and incorrectly matched with a right-peripheral construction 134 times. I argue that there are several possible reasons why the model was more successful at predicting right-peripheral relatives than the other two types. For instance, the number of tokens is larger for the right-peripheral type, so the model had more input for this construction. Also, right-peripheral relatives differ more distinctly from the other two types in terms of the quantitative factors considered; in other words, adnominal and left-peripheral relatives do not differ significantly on those quantitative factors. Finally, it is possible that the distinction between adnominal and left-peripheral constructions depends more heavily on qualitative factors than on quantitative ones; because the former were not available, the model was not able to correctly predict the occurrence of these constructions. One advantage of a multinomial logistic regression model is that it considers the totality of the independent variables when calculating the risk ratio, emulating a "real-life" situation where the speaker has access to all sorts of information (syntactic, semantic, processing, etc.).
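The per-class counts above can be assembled into a confusion matrix and summarised numerically. A small sketch using only the figures given in the abstract (one cell, adnominal predicted as left-peripheral, is not stated and is assumed to be zero, since the abstract reports 51 right-peripheral errors and one correct match for that class):

```python
# Rows = actual construction, columns = model prediction, counts as
# reported in the abstract. The adnominal -> left-peripheral cell is an
# assumption (0): the abstract mentions only 51 errors and 1 correct match.
confusion = {
    "adnominal":        {"adnominal": 1, "left-peripheral": 0,   "right-peripheral": 51},
    "left-peripheral":  {"adnominal": 1, "left-peripheral": 154, "right-peripheral": 134},
    "right-peripheral": {"adnominal": 0, "left-peripheral": 56,  "right-peripheral": 370},
}

# Per-class recall: correct predictions over all instances of that class.
recall = {actual: row[actual] / sum(row.values()) for actual, row in confusion.items()}

# Overall accuracy: diagonal total over grand total.
overall = sum(row[a] for a, row in confusion.items()) / sum(
    n for row in confusion.values() for n in row.values()
)

for actual, r in recall.items():
    print(f"{actual:>16}: recall {r:.3f}")
print(f"overall accuracy: {overall:.3f}")
```

On these counts the asymmetry described in the text is stark: recall is about 0.87 for right-peripheral relatives, about 0.53 for left-peripheral ones, and near zero (1/52) for adnominal ones.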
Nevertheless, if there is interaction between some of the independent variables, the model is claimed to overweight some of the probabilities. The sample size, however, did not permit stronger claims about these overweighting effects, if any. Other quantitative approaches, such as clustering or neural networks, could be implemented in future research to test whether prediction improves for the other two types of constructions. Another interesting contribution of this dissertation is that the corpus data supported locality effects (cf. Kothari 2010). This provides evidence for Hawkins' (2004) prediction that different methods can lead to different patterns of results in the investigation of grammatical weight and syntactic locality. Finally, the present study contributes to the debate on Hindi relative clauses by presenting evidence of non-syntactic factors intervening in the syntactic phenomenon of relativization, and by accounting for the different properties associated with the three types of relatives from a non-syntactic perspective. It provides a systematic analysis of syntactic and non-syntactic factors using production corpus data. This kind of data expanded the range of possible constructions included in earlier studies.