
    TWITTIRÒ: an Italian Twitter Corpus with a Multi-layered Annotation for Irony

    Given the difficulties that still affect the correct identification of irony in Sentiment Analysis tasks, in this paper we describe the main issues that emerged during the development of a novel Italian resource annotated for irony. The project mainly consists of applying to the Twitter corpus TWITTIRÒ a multi-layered scheme for the fine-grained annotation of irony, proposed in a multilingual setting and previously applied to French and English datasets (Karoui et al. 2017). In applying the annotation to this corpus, we outline and discuss the issues and peculiarities that emerged in using the semantic scheme for Twitter messages in Italian, thus also shedding some light on future directions from a multilingual and cross-language perspective. In particular, we present an analysis of the annotation process and of the distribution of the labels in each layer of the scheme. This is supported by a discussion of the outcome of the annotation carried out by native Italian speakers during the development of the corpus, including an in-depth discussion of the inter-annotator agreement and of the sources of disagreement. The result is a novel gold-standard corpus for irony detection in Italian, which enriches the range of multilingual datasets available for this challenging task and is ready to be used as a benchmark in automatic irony detection experiments and evaluation campaigns.
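The inter-annotator agreement analysis mentioned in this abstract can be illustrated with Cohen's kappa, a standard chance-corrected agreement coefficient. A minimal sketch follows; the label values and annotation sequences are invented for illustration, and the paper may report different coefficients or label sets.

```python
# Cohen's kappa between two annotators' label sequences: observed agreement
# corrected for the agreement expected by chance from each annotator's
# marginal label distribution.
from collections import Counter

def cohens_kappa(ann_a, ann_b):
    assert len(ann_a) == len(ann_b)
    n = len(ann_a)
    observed = sum(a == b for a, b in zip(ann_a, ann_b)) / n
    counts_a, counts_b = Counter(ann_a), Counter(ann_b)
    labels = set(ann_a) | set(ann_b)
    expected = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Toy binary irony annotations from two hypothetical annotators.
a = ["irony", "irony", "none", "none", "irony", "none"]
b = ["irony", "none", "none", "none", "irony", "irony"]
print(round(cohens_kappa(a, b), 3))  # → 0.333
```

Values near 0 indicate chance-level agreement; values near 1 indicate near-perfect agreement.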

    Argumentation Mining in User-Generated Web Discourse

    The goal of argumentation mining, an evolving research field in computational linguistics, is to design methods capable of analyzing people's argumentation. In this article, we go beyond the state of the art in several ways. (i) We deal with actual Web data and take up the challenges given by the variety of registers, multiple domains, and unrestricted noisy user-generated Web discourse. (ii) We bridge the gap between normative argumentation theories and argumentation phenomena encountered in actual data by adapting an argumentation model tested in an extensive annotation study. (iii) We create a new gold standard corpus (90k tokens in 340 documents) and experiment with several machine learning methods to identify argument components. We offer the data, source codes, and annotation guidelines to the community under free licenses. Our findings show that argumentation mining in user-generated Web discourse is a feasible but challenging task.
    Comment: Cite as: Habernal, I. & Gurevych, I. (2017). Argumentation Mining in User-Generated Web Discourse. Computational Linguistics 43(1), pp. 125-17
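Argument component identification, as described in this abstract, is typically framed as classifying text spans into categories such as claims and premises. The paper uses trained machine learning models; the sketch below is only a stand-in keyword heuristic (markers and labels invented for illustration) to show the shape of the task.

```python
# Toy argument-component labeler: a real system would use a trained
# classifier over lexical, syntactic, and discourse features.
PREMISE_MARKERS = {"because", "since", "for example"}
CLAIM_MARKERS = {"should", "must", "in my opinion"}

def label_sentence(sentence):
    s = sentence.lower()
    if any(m in s for m in CLAIM_MARKERS):
        return "claim"
    if any(m in s for m in PREMISE_MARKERS):
        return "premise"
    return "none"

print(label_sentence("Schools should ban homework"))   # → claim
print(label_sentence("because it causes stress"))      # → premise
```

Such keyword heuristics fail on the noisy, unrestricted Web registers the paper targets, which is exactly why learned models are needed.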

    To Normalize, or Not to Normalize: The Impact of Normalization on Part-of-Speech Tagging

    Does normalization help Part-of-Speech (POS) tagging accuracy on noisy, non-canonical data? To the best of our knowledge, little is known about the actual impact of normalization in a real-world scenario, where gold error detection is not available. We investigate the effect of automatic normalization on POS tagging of tweets. We also compare normalization to strategies that leverage large amounts of unlabeled data kept in its raw form. Our results show that normalization helps, but does not consistently add value beyond word embedding layer initialization alone. The latter approach yields a tagging model that is competitive with a state-of-the-art Twitter tagger.
    Comment: In WNUT 201
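One common normalization strategy for tweets, of the kind this abstract evaluates, is lexicon-based replacement of non-canonical spellings before tagging. The sketch below is a hypothetical illustration; the lexicon entries are invented and the paper's actual normalization system is more sophisticated (automatic, not a fixed dictionary).

```python
# Dictionary-based lexical normalization: map non-canonical tokens to
# canonical forms, leaving unknown tokens untouched.
NORM_LEXICON = {"u": "you", "r": "are", "gr8": "great", "2day": "today"}

def normalize(tokens):
    return [NORM_LEXICON.get(t.lower(), t) for t in tokens]

tweet = "u r gr8".split()
print(normalize(tweet))  # → ['you', 'are', 'great']
```

A POS tagger trained on canonical text can then be run on the normalized token sequence; the paper's finding is that this helps, but not consistently more than initializing the tagger's embedding layer from large raw Twitter data.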

    Domain transfer for deep natural language generation from abstract meaning representations

    Stochastic natural language generation systems that are trained from labelled datasets are often domain-specific in their annotation and in their mapping from semantic input representations to lexical-syntactic outputs. As a result, learnt models fail to generalize across domains, heavily restricting their usability beyond single applications. In this article, we focus on the problem of domain adaptation for natural language generation. We show how linguistic knowledge from a source domain, for which labelled data is available, can be adapted to a target domain by reusing training data across domains. As a key to this, we propose to employ abstract meaning representations as a common semantic representation across domains. We model natural language generation as a long short-term memory recurrent neural network encoder-decoder, in which one recurrent neural network learns a latent representation of a semantic input, and a second recurrent neural network learns to decode it to a sequence of words. We show that the learnt representations can be transferred across domains and can be leveraged effectively to improve training on new unseen domains. Experiments in three different domains and with six datasets demonstrate that the lexical-syntactic constructions learnt in one domain can be transferred to new domains and achieve up to 75-100% of the performance of in-domain training. This is based on objective metrics such as BLEU and semantic error rate, and on a subjective human rating study. Training a policy from prior knowledge from a different domain is consistently better than pure in-domain training by up to 10%.
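BLEU, one of the objective metrics this abstract reports, scores a generated sentence by n-gram overlap with a reference. The sketch below is a simplified corpus-free version (no smoothing, single reference); production evaluations typically use a standard implementation rather than this toy.

```python
# Simplified sentence-level BLEU: geometric mean of clipped n-gram
# precisions (n = 1..4), times a brevity penalty for short candidates.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())   # clipped n-gram matches
        total = max(sum(cand.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

cand = "the cat sat on the mat".split()
print(round(bleu(cand, cand), 3))  # → 1.0
```

An identical candidate and reference score 1.0; any mismatch lowers some n-gram precision and hence the score.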

    Webbasierte linguistische Forschung: Möglichkeiten und Begrenzungen beim Umgang mit Massendaten (Web-based Linguistic Research: Possibilities and Limitations in Dealing with Mass Data)

    Over the past ten to fifteen years, web-based methods of sociological research have emerged alongside classical methods such as interviews, observations and experiments, and linguistic research is increasingly relying upon them as well. This paper provides an overview of three web-based approaches, i.e. online surveys, crowd-sourcing and web-based corpus analyses. Examples from specific projects serve to reflect upon these methods, address their potential and limitations, and offer a critical appraisal. Internet-based empirical research produces vast and highly diverse quantities of (speaker-based or textual) data, presenting linguistic research with new opportunities and challenges. New procedures are required to make effective use of these resources.

    Getting Past the Language Gap: Innovations in Machine Translation

    In this chapter, we review state-of-the-art machine translation systems and discuss innovative methods for machine translation, highlighting the most promising techniques and applications. Machine translation (MT) has benefited from a revitalization in the last 10 years or so, after a period of relatively slow activity. In 2005 the field received a jumpstart when a powerful, complete experimental package for building MT systems from scratch became freely available as a result of the unified efforts of the MOSES international consortium. Around the same time, hierarchical methods were introduced by Chinese researchers, which allowed the introduction and use of syntactic information in translation modeling. Furthermore, advances in the related field of computational linguistics, which made off-the-shelf taggers and parsers readily available, helped give MT an additional boost. Yet there is still more progress to be made. For example, MT will be enhanced greatly when both syntax and semantics are on board: this still presents a major challenge, though many advanced research groups are currently pursuing ways to meet it head-on. The next generation of MT will consist of a collection of hybrid systems. The outlook is also good for the mobile environment, as we look forward to more advanced and improved technologies that enable Speech-To-Speech machine translation on hand-held devices, i.e. speech recognition and speech synthesis. We review all of these developments and point out, in the final section, some of the most promising research avenues for the future of MT.

    Are We There Yet?: The Development of a Corpus Annotated for Social Acts in Multilingual Online Discourse

    We present the AAWD and AACD corpora, a collection of discussions drawn from Wikipedia talk pages and small-group IRC discussions in English, Russian and Mandarin. Our datasets are annotated with labels capturing two kinds of social acts: alignment moves and authority claims. We describe these social acts and our annotation process, highlight challenges we encountered and strategies we employed during annotation, and present analyses of the resulting data set which illustrate the utility of our corpus and identify interactions among social acts, and between participant status and social acts, in online discourse.

    Inducing Implicit Arguments via Cross-document Alignment: A Framework and its Applications

    Natural language texts frequently contain related information in different positions in discourse. As human readers, we can recognize such information across sentence boundaries and correctly infer relations between them. Given this inference capability, we understand texts that describe complex dependencies even if central aspects are not repeated in every sentence. In linguistics, certain omissions of redundant information are known under the term ellipsis and have been studied as cohesive devices in discourse (Halliday and Hasan, 1976). For computational approaches to semantic processing, such cohesive devices are problematic because methods are traditionally applied on the sentence level and barely take surrounding context into account. In this dissertation, we investigate omission phenomena on the level of predicate-argument structures. In particular, we examine instances of structures involving arguments that are not locally realized but inferable from context. The goal of this work is to automatically acquire and process such instances, which we also refer to as implicit arguments, to improve natural language processing applications. Our main contribution is a framework that identifies implicit arguments by aligning and comparing predicate-argument structures across pairs of comparable texts. As part of this framework, we develop a novel graph-based clustering approach, which detects corresponding predicate-argument structures using pairwise similarity metrics. To find discourse antecedents of implicit arguments, we further design a heuristic method that utilizes automatic annotations from various linguistic pre-processing tools. We empirically validate the utility of automatically induced instances of implicit arguments and discourse antecedents in three extrinsic evaluation scenarios. 
In the first scenario, we show that our induced pairs of arguments and antecedents can successfully be applied to improve a pre-existing model for linking implicit arguments in discourse. In two further evaluation settings, we show that induced instances of implicit arguments, together with their aligned explicit counterparts, can be used as training material for a novel model of local coherence. Given discourse-level and semantic features, this model can predict whether a specific argument should be explicitly realized to establish local coherence or whether it is inferable and hence redundant in context.
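The graph-based clustering step described in this abstract groups predicate-argument structures whose pairwise similarity is high enough. The sketch below is an illustrative stand-in, not the dissertation's implementation: it uses a toy Jaccard token-overlap similarity and takes connected components of the resulting threshold graph.

```python
# Build a graph whose nodes are (string-encoded) predicate-argument
# structures and whose edges connect pairs scoring above a similarity
# threshold; clusters are the connected components.
from itertools import combinations

def similarity(a, b):
    # Toy Jaccard similarity over token sets; a stand-in for the
    # framework's pairwise similarity metrics.
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def cluster(structures, threshold=0.5):
    adj = {s: set() for s in structures}
    for a, b in combinations(structures, 2):
        if similarity(a, b) >= threshold:
            adj[a].add(b)
            adj[b].add(a)
    seen, clusters = set(), []
    for s in structures:
        if s in seen:
            continue
        stack, comp = [s], []
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            comp.append(node)
            stack.extend(adj[node])
        clusters.append(comp)
    return clusters

docs = ["acme buys widgetco", "widgetco bought by acme", "rain expected today"]
print(len(cluster(docs, threshold=0.3)))  # → 2
```

In the aligned cluster, an argument realized in one structure but absent from its counterpart is a candidate implicit argument, whose antecedent is then sought in the surrounding discourse.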