197 research outputs found

    Lexical Variability and Compositionality: Investigating Idiomaticity with Distributional Semantic Models

    In this work we carried out an idiom type identification task on a set of 90 Italian V-NP and V-PP constructions comprising both idioms and non-idioms. Lexical variants were generated from these expressions by replacing their components with semantically related words extracted distributionally and from the Italian section of MultiWordNet. In distributional semantic spaces, idiomatic phrases turned out to be less similar to their lexical variants than non-idiomatic ones were. Different variant-based distributional measures of idiomaticity were tested. Our indices also proved reliable in identifying idioms whose lexical variants are poorly attested, or not attested at all, in our corpus.
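
The variant-based idea in this abstract can be illustrated with a minimal sketch. It assumes the index is the mean cosine similarity between an expression's vector and the vectors of its lexical variants, with lower values suggesting idiomaticity. The abstract does not specify the actual measures, so the function names and the toy vectors below are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_variant_similarity(target_vec, variant_vecs):
    """Hypothetical idiomaticity index: mean cosine between a target
    expression and its lexical variants. Lower values suggest that the
    expression behaves less like its variants, i.e. more idiomatically."""
    if not variant_vecs:
        raise ValueError("at least one variant vector is required")
    sims = [cosine(target_vec, v) for v in variant_vecs]
    return sum(sims) / len(sims)


# Toy vectors (made up for illustration, not corpus-derived).
idiom_vec = [1.0, 0.0, 0.0]
idiom_variants = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

literal_vec = [1.0, 1.0, 0.0]
literal_variants = [[1.0, 1.0, 0.1], [1.0, 0.9, 0.0]]

idiom_score = mean_variant_similarity(idiom_vec, idiom_variants)
literal_score = mean_variant_similarity(literal_vec, literal_variants)
```

With these toy inputs the idiom-like expression scores lower than the literal one, which is the direction of the effect the abstract reports.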

    Cross-domain analysis of discourse markers in European Portuguese

    This paper presents an analysis of discourse markers in two spontaneous speech corpora for European Portuguese - university lectures and map-task dialogues - and also in a collection of tweets, aiming at contributing to their categorization, scarcely existent for European Portuguese. Our results show that the selection of discourse markers is domain and speaker dependent. We also found that the most frequent discourse markers are similar in all three corpora, despite tweets containing discourse markers not found in the other two corpora. In this multidisciplinary study, comprising both a linguistic perspective and a computational approach, discourse markers are also automatically discriminated from other structural metadata events, namely sentence-like units and disfluencies. Our results show that discourse markers and disfluencies tend to co-occur in the dialogue corpus, but have a complementary distribution in the university lectures. We used three acoustic-prosodic feature sets and machine learning to automatically distinguish between discourse markers, disfluencies and sentence-like units. Our in-domain experiments achieved an accuracy of about 87% in university lectures and 84% in dialogues, in line with our previous results. The eGeMAPS features, commonly used for other paralinguistic tasks, achieved a considerable performance on our data, especially considering the small size of the feature set. Our results suggest that turn-initial discourse markers are usually easier to classify than disfluencies, a result also previously reported in the literature. We conducted a cross-domain evaluation in order to evaluate the robustness of the models across domains. The results achieved are about 11%-12% lower, but we conclude that data from one domain can still be used to classify the same events in the other. Overall, despite the complexity of this task, these are very encouraging state-of-the-art results. 
Ultimately, using exclusively acoustic-prosodic cues, discourse markers can be fairly discriminated from disfluencies and sentence-like units (SUs). In order to better understand the contribution of each feature, we have also reported the impact of the features in both the dialogues and the university lectures. Pitch features are the most relevant ones for the distinction between discourse markers and disfluencies, namely pitch slopes. These features are in line with the wide pitch range of discourse markers, in a continuum from a very compressed pitch range to a very wide one, expressed by total deaccented material or H+L* L* contours, with upstep H tones.

    "A little more than kin" - Quotations as a linguistic phenomenon : a study based on quotations from Shakespeare's Hamlet

    Quotations "oscillate between the occasional and the conventional", as Burger/Buhofer/Sialm (1982) once succinctly put it. Developed from a PhD thesis, this book explores precisely this "oscillating" character of quotations: It discusses the nature of quotations and the relationship between common quotations and phraseology from a theoretical and an empirical perspective. Shakespeare's Hamlet was chosen as a canonical text whose frequently quoted traces can be followed across centuries. Scholarly work from various disciplines leads to an understanding of quotations as moving in a space created by the two dimensions of reference and repetition: Quotations are definable by a horizontal communicative axis (reference) and a vertical, intertextual axis of manifest lineages of use (repetition). Empirically, the data led to a categorisation of quotations as verbal, thematic and onomastic, based on the question "what has been repeated: words, themes or names?" Case studies further corroborate the proposition that verbal quotations may become (almost) ordinary multi-word units if the following conditions are met: a) they lose their referential dimension, b) they develop formal and/or semantic usage patterns and/or c) they are no longer limited to their original, literary discourse.

    Uncertainty in deliberate lexical interventions

    Language managers in their different forms (language planners, terminologists, professional neologists …) have long tried to intervene in the lexical usage of speakers, with various degrees of success: Some of their lexical items (partly) penetrate language use, others do not. Based on electronic networks of practice of the Esperanto speech community, Mélanie Maradan establishes the foundation for a new method to extract speakers’ opinions on lexical items from text corpora. The method is intended as a tool for language managers to detect and explore in context the reasons why speakers might accept or reject lexical items. Mélanie Maradan holds a master’s degree in translation and terminology from the University of Geneva/Switzerland as well as a joint doctoral degree in multilingual information processing and philosophy (Dr. phil.) from the universities of Geneva and Hildesheim/Germany. Her research interests include planned languages (Esperanto studies) as well as neology and corpus linguistics. She works as a professional translator and terminologist in Switzerland.

    “You’re trolling because…” – A Corpus-based Study of Perceived Trolling and Motive Attribution in the Comment Threads of Three British Political Blogs

    This paper investigates the linguistically marked motives that participants attribute to those they call trolls in 991 comment threads of three British political blogs. The study is concerned with how these motives affect the discursive construction of trolling and trolls. Another goal of the paper is to examine whether the mainly emotional motives ascribed to trolls in the academic literature correspond with those that the participants attribute to the alleged trolls in the analysed threads. The paper identifies five broad motives ascribed to trolls: emotional/mental health-related/social reasons, financial gain, political beliefs, being employed by a political body, and unspecified political affiliation. It also points out that depending on these motives, trolling and trolls are constructed in various ways. Finally, the study argues that participants attribute motives to trolls not only to explain their behaviour but also to insult them.

    Expanding the Lexicon

    The book series is dedicated to the study of the multifaceted dynamics of wordplay as an interface phenomenon. The contributions aim to bring together approaches from various disciplines and present case studies on different communicative settings, including everyday language and literary communication, and thus offer fresh perspectives on wordplay in the context of linguistic innovation, language contact, and speaker-hearer-interaction.

    A Computational Theory of the Use-Mention Distinction in Natural Language

    To understand the language we use, we sometimes must turn language on itself, and we do this through an understanding of the use-mention distinction. In particular, we are able to recognize mentioned language: that is, tokens (e.g., words, phrases, sentences, letters, symbols, sounds) produced to draw attention to linguistic properties that they possess. Evidence suggests that humans frequently employ the use-mention distinction, and we would be severely handicapped without it; mentioned language frequently occurs for the introduction of new words, attribution of statements, explanation of meaning, and assignment of names. Moreover, just as we benefit from mutual recognition of the use-mention distinction, the potential exists for us to benefit from language technologies that recognize it as well. With a better understanding of the use-mention distinction, applications can be built to extract valuable information from mentioned language, leading to better language learning materials, precise dictionary building tools, and highly adaptive computer dialogue systems. This dissertation presents the first computational study of how the use-mention distinction occurs in natural language, with a focus on occurrences of mentioned language. Three specific contributions are made. The first is a framework for identifying and analyzing instances of mentioned language, in an effort to reconcile elements of previous theoretical work for practical use. Definitions for mentioned language, metalanguage, and quotation have been formulated, and a procedural rubric has been constructed for labeling instances of mentioned language. The second is a sequence of three labeled corpora of mentioned language, containing delineated instances of the phenomenon. The corpora illustrate the variety of mentioned language, and they enable analysis of how the phenomenon relates to sentence structure. 
Using these corpora, inter-annotator agreement studies have quantified the concurrence of human readers in labeling the phenomenon. The third contribution is a method for identifying common forms of mentioned language in text, using patterns in metalanguage and sentence structure. Although the full breadth of the phenomenon is likely to elude computational tools for the foreseeable future, some specific, common rules for detecting and delineating mentioned language have been shown to perform well.
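
As a rough illustration of what a pattern-based detector for mentioned language might look like, the sketch below pairs a metalanguage cue word with a quoted span. The cue list and the single pattern are assumptions chosen for the example; they are not the rules developed in the dissertation.

```python
import re

# Hypothetical metalanguage cue words; NOT taken from the dissertation.
CUES = r"(?:word|term|phrase|name|called|means)"

# A cue word followed by a quoted span (straight or curly double quotes).
# The quoted span is captured as the candidate mentioned language.
PATTERN = re.compile(rf'\b{CUES}\s+["“]([^"”]+)["”]', re.IGNORECASE)


def find_mentions(sentence):
    """Return quoted spans that directly follow a metalanguage cue word."""
    return [match.group(1) for match in PATTERN.finditer(sentence)]


mentions = find_mentions('The word "cheese" has six letters.')
no_mentions = find_mentions("I ate cheese for lunch.")
```

A rule this narrow misses mentioned language that is not quoted or that is introduced without a cue word, which fits the abstract's caveat that the full breadth of the phenomenon is likely to elude such tools.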