    Corpora creation in contrastive linguistics

    Universal and specific features of language usage can become more evident when tested against non-elicited language data on a large scale. This requirement can be met by using corpora that provide ample data to test research hypotheses in contrastive language studies in an objective and falsifiable manner. However, the criteria for corpus creation and the comparability measures used in the evaluation of available corpora present a separate problem in contrastive linguistics. The article presents an overview of the types of corpora used in Contrastive Linguistics research and describes their characteristic features. The study proceeds to look into the sources of data used in corpus creation, both in (commercially) available corpora and in data collections compiled to answer a particular research question. The article describes the techniques used in creating comparable corpora for contrastive studies and presents the comparability measures used to evaluate them. The study examines the case of building a topic-specific comparable corpus in English and Ukrainian, focused on education-related vocabulary in the languages under analysis. Corpus comparability is measured using translation equivalence and word frequency similarity. Following the procedures outlined above, the article collects a quasi-comparable (non-aligned) corpus on the topic of education with English and Ukrainian in contrast. Using the frequency comparability measure, it was established that both components of the corpus (English and Ukrainian) contain keywords related to the topic of education. (The article analyses the types of corpora used in contrastive linguistics research with the aim of identifying universal and language-specific features. It establishes the main sources of material for corpus compilation, the text selection criteria, the stages of corpus building, and the evaluation models and characteristics of corpora for contrastive studies.
The article discusses the methods used in creating corpora for contrastive research and describes the experience of compiling such corpora on English and Ukrainian material. The selection criteria, the stages of corpus building, and the prospects for their use are illustrated with corpora of education-domain vocabulary in the languages under analysis.)
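The two comparability measures named in the abstract — translation equivalence and word frequency similarity — can be sketched roughly as follows. This is a minimal illustration only: the toy corpora, the tiny lexicon, the keyword pairs, and the function names are invented for the example, and the article's actual formulas and resources are not reproduced here.

```python
from collections import Counter
import math

def translation_equivalence(src_tokens, tgt_tokens, bilingual_dict):
    """Share of source-corpus word types whose dictionary translation
    also occurs in the target corpus."""
    tgt_vocab = set(tgt_tokens)
    src_vocab = set(src_tokens)
    covered = sum(
        1 for w in src_vocab
        if any(t in tgt_vocab for t in bilingual_dict.get(w, ()))
    )
    return covered / len(src_vocab)

def keyword_freq_similarity(src_tokens, tgt_tokens, keyword_pairs):
    """Cosine similarity of the relative frequencies of paired topic
    keywords (English word, Ukrainian equivalent) in the two corpora."""
    fs, ft = Counter(src_tokens), Counter(tgt_tokens)
    vs = [fs[e] / len(src_tokens) for e, _ in keyword_pairs]
    vt = [ft[u] / len(tgt_tokens) for _, u in keyword_pairs]
    dot = sum(a * b for a, b in zip(vs, vt))
    norm = math.hypot(*vs) * math.hypot(*vt)
    return dot / norm if norm else 0.0

# Toy "education" corpora in English and Ukrainian.
en = ["school", "teacher", "exam", "school"]
uk = ["школа", "вчитель", "урок"]
lexicon = {"school": ["школа"], "teacher": ["вчитель"], "exam": ["іспит"]}

print(translation_equivalence(en, uk, lexicon))  # 2 of 3 English types covered
print(keyword_freq_similarity(en, uk,
                              [("school", "школа"), ("teacher", "вчитель")]))
```

In practice both scores would be computed over lemmatised vocabularies and topic keyword lists extracted by keyness statistics, but the toy version shows the shape of the calculation.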

    Measuring the comparability of multilingual corpora extracted from Twitter and others

    Multilingual corpora are widely exploited in several natural language processing tasks. These corpora are principally of two sorts: comparable and parallel. Comparable corpora gather texts in several languages dealing with analogous subjects which, unlike parallel corpora, are not translations of each other. In this paper, a comparative study of two stemming techniques is conducted in order to improve a comparability measure based on a bilingual dictionary. These methods are the Buckwalter Arabic Morphological Analyzer (BAMA) and a proposed approach based on Light Stemming (LS) adapted specifically to Twitter; we then combined them. We evaluated and compared these techniques on three different English-Arabic corpora: a corpus extracted from the social network Twitter, Euronews, and a parallel corpus extracted from newspapers (ANN). The experimental results show that the best comparability measure is achieved for the combination of BAMA with LS, which leads to a similarity of 61% for Twitter, 52% for Euronews and 65% for ANN. For a confidence of 40% we aligned 73.8% of the Arabic and English tweets.
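A dictionary-based comparability measure combined with light stemming, of the kind described above, can be sketched as follows. This is a rough illustration under stated assumptions: the affix lists are a common textbook subset (not necessarily those used in the paper), and the stem-keyed toy lexicon and function names are invented for the example.

```python
# Illustrative Arabic light stemmer: strip at most one common prefix
# and one common suffix, keeping a stem of at least three characters.
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "و", "ف", "ب", "ل")
SUFFIXES = ("ات", "ون", "ين", "ها", "ية", "ة", "ه")

def light_stem(word: str) -> str:
    for p in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    for s in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
            break
    return word

def dictionary_comparability(ar_tokens, en_tokens, ar_en_dict):
    """Fraction of stemmed Arabic word types that have a dictionary
    translation present in the English corpus (a toy stem-keyed lexicon
    stands in for a real bilingual dictionary here)."""
    en_vocab = {w.lower() for w in en_tokens}
    ar_stems = {light_stem(w) for w in ar_tokens}
    hits = sum(
        1 for s in ar_stems
        if any(t in en_vocab for t in ar_en_dict.get(s, ()))
    )
    return hits / len(ar_stems) if ar_stems else 0.0

# "المدرسة" -> definite article "ال" and feminine ending "ة" stripped.
print(light_stem("المدرسة"))
```

Stemming both the dictionary keys and the corpus tokens before lookup is what raises the match rate on noisy Twitter text, where inflected and cliticised forms rarely match dictionary citation forms directly.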

    Delving into the uncharted territories of Word Sense Disambiguation

    The automatic disambiguation of word senses, i.e. Word Sense Disambiguation, is a long-standing task in the field of Natural Language Processing; an AI-complete problem that took its first steps more than half a century ago and which, to date, has apparently attained human-like performance on standard evaluation benchmarks. Unfortunately, the steady improvement the task has experienced over time in terms of sheer performance has not been accompanied by adequate theoretical support, nor by careful error analysis. Furthermore, we believe that the lack of an exhaustive bird's-eye view accounting for the sort of high-end and unrealistic computational architectures that systems will soon need in order to further refine their performance could lead the field into a blind spot within a few years. In essence, taking advantage of the current moment of great accomplishments and renewed interest in the task, we argue that Word Sense Disambiguation is mature enough for researchers to really assess the extent of the results obtained so far, evaluate what is actually missing, and answer the much-sought-after question: "are current state-of-the-art systems really able to effectively solve lexical ambiguity?" Driven by the desire to be both architects of and participants in this period of reflection, we have identified a few macro-areas representative of the challenges of automatic disambiguation. From this point of view, in this thesis we propose experimental solutions and empirical tools to bring unusual and unexplored points of view to the attention of the Word Sense Disambiguation community. We hope these will offer a new perspective through which to observe the current state of disambiguation, as well as to foresee future paths along which the task can evolve.
Specifically, 1q) prompted by the growing concern that the rise in performance is closely linked to the demand for ever more unrealistic computational architectures in all areas of application of Deep Learning techniques, we 1a) provide evidence for the undisclosed potential of knowledge-based approaches via the exploitation of syntagmatic information. Moreover, 2q) driven by dissatisfaction with the use of cognitively inaccurate, finite inventories of word senses in Word Sense Disambiguation, we 2a) introduce an approach based on Definition Modeling paradigms to generate contextual definitions for target words and phrases, hence going beyond the limits set by specific lexical-semantic inventories. Finally, 3q) moved by the desire to analyse what really lies behind the idea of "machines performing disambiguation on par with their human counterparts", we 3a) put forward a detailed analysis of the shared errors affecting current state-of-the-art systems based on diverse approaches to Word Sense Disambiguation, and highlight, by means of a novel evaluation dataset tailored to represent common and critical issues shared by all systems, performance far lower than that usually reported in the current literature.

    Language and Linguistics in a Complex World Data, Interdisciplinarity, Transfer, and the Next Generation. ICAME41 Extended Book of Abstracts

    This is a collection of papers, work-in-progress reports, and other contributions that were part of the ICAME41 digital conference

    A survey on perceived speaker traits: personality, likability, pathology, and the first challenge

    The INTERSPEECH 2012 Speaker Trait Challenge aimed at a unified test-bed for perceived speaker traits – the first challenge of this kind: personality in the five OCEAN personality dimensions, likability of speakers, and intelligibility of pathologic speakers. In the present article, we give a brief overview of the state of the art in these three fields of research and describe the three sub-challenges in terms of the challenge conditions, the baseline results provided by the organisers, and a new openSMILE feature set, which has been used for computing the baselines and which has been provided to the participants. Furthermore, we summarise the approaches and the results presented by the participants to show the various techniques that are currently applied to solve these classification tasks.