11 research outputs found

    Stylometric Studies based on Tone and Word Length Motifs

    Probing Multilingual BERT for Genetic and Typological Signals

    We probe the layers of multilingual BERT (mBERT) for phylogenetic and geographic language signals across 100 languages and compute language distances based on the mBERT representations. We 1) use the language distances to infer and evaluate language trees, finding that they are close to the reference family tree in terms of quartet tree distance; 2) perform distance matrix regression analysis, finding that the language distances are best explained by phylogenetic and worst by structural factors; and 3) present a novel measure of diachronic meaning stability (based on cross-lingual representation variability) that correlates significantly with published ranked lists based on linguistic approaches. Our results contribute to the nascent field of typological interpretability of cross-lingual text representations. Comment: COLING 202

    Metaphorical Language Change Is Self-Organized Criticality

    One way to resolve the actuation problem of metaphorical language change is to provide a statistical profile of metaphorical constructions and generative rules with antecedent conditions. Drawing on the view of language as a complex system and the dynamic view of metaphor, this paper argues that metaphorical language change qualifies as a self-organized criticality state and that the linguistic expressions of a metaphor can be profiled as a fractal with spatio-temporal correlations. Synchronously, these metaphorical expressions self-organize into a self-similar, scale-invariant fractal that follows a power-law distribution; temporally, long-range interdependence constrains the self-organization process by way of transformation rules that are intrinsic to a language system. The paper verifies this argument with statistical analyses of twelve randomly selected Chinese verb metaphors in a large-scale diachronic corpus.

    Quantifying Interpreting Types: Language Sequence Mirrors Cognitive Load Minimization in Interpreting Tasks

    Most interpreting theories claim that different interpreting types involve different processing mechanisms and procedures. However, few studies have examined their underlying differences. Although some previous quantitative results show that different interpreting types yield outputs with different lexical and syntactic features, the grammatical parsing approach is limited. Language sequences that are formed without parsing or processing under a specific linguistic approach or grammar surpass other quantitative approaches at revealing the sequential behavior of language production. As a non-grammatically-bound unit of language sequences, the frequency motif can visualize the local distribution of content and function words, and can also statistically classify languages and identify text types. The current research therefore investigates the distribution, length and position-dependent properties of frequency motifs across different interpreting outputs to uncover their sequential generation behaviors. The distribution, the length and certain position-dependent properties of these language sequences differ significantly between simultaneous and consecutive interpreting output. The features of frequency motifs show that both types of interpreting output are produced in a manner that abides by the least effort principle. The research suggests that interpreting types can be differentiated through this type of sequential language unit and offers evidence for how different task features mediate the sequential organization of interpreting output under different demands to achieve cognitive load minimization.

    Automatic diachronic distance between diatopic variants of Portuguese and Spanish

    The objective of this work is to apply a perplexity-based methodology to automatically calculate the cross-lingual distance between different historical periods of diatopic language variants. The methodology is applied to an ad hoc corpus in original spelling, balanced between fiction and non-fiction, to measure the historical distance between European and Brazilian Portuguese on the one hand, and European and Argentinian Spanish on the other. The results show very close distances, both in original spelling and in automatically transcribed spelling, between the diatopic varieties of Portuguese and Spanish, with slight convergences/divergences from the middle of the 20th century until today. The method is unsupervised and can be applied to other diatopic language varieties.

    Emotion and Plot in Xenophon’s Ephesiaka

    The ancient Greek novel Ephesiaka contains two long inset narratives, both tales of erotic suffering that mirror the romance of the main story’s protagonists. This study examines how the inset narratives contribute to an “emotional plot” through the repetition of verbal motifs of emotion, and how the novel’s deliberately simple style characterizes eros through the cumulative alternation of verbal motifs throughout individual episodes. The Ephesiaka’s plot-focused and formulaic style articulates an emotional plot inextricable from the action plot, prioritizing the visible expression of emotions rather than internal states or conflicts. While this style can be considered “paraliterary” because it guides the reader overtly in its description of outward action, the unobtrusive narration provides little value judgement, leading to modern critical disagreement about how the novel characterizes the different forms of eros portrayed in the inset narratives. Master of Art

    Tackling the Toolkit. Plotting Poetry through Computational Literary Studies

    In Tackling the Toolkit, we focus on the methodological innovations, challenges, obstacles and even shortcomings associated with applying quantitative methods to poetry specifically and poetics more broadly. Using tools including natural language processing, web ontologies, similarity detection devices and machine learning, our contributors explore not only metres, stanzas, stresses and rhythms but also genres, subgenres, lexical material and cognitive processes. Whether they are testing old theories and laws, making complex concepts machine-readable or developing new lines of textual analysis, their works challenge standard descriptions of norms and variations.

    Machine Learning Methods with Noisy, Incomplete or Small Datasets

    In many machine learning applications, the available datasets are incomplete, noisy or affected by artifacts. In supervised scenarios, label information may be of low quality, with problems such as unbalanced training sets and noisy labels. Moreover, in practice it is very common that too few data samples are available to derive useful supervised or unsupervised classifiers. All these issues are commonly referred to as the low-quality data problem. This book collects novel contributions on machine learning methods for low-quality datasets, to help disseminate new ideas for solving this challenging problem and to provide clear examples of application in real scenarios.