132 research outputs found

    Annotation guidelines for labeling English-Dutch cognate pairs (version 1.0)

    Are Automatic Methods for Cognate Detection Good Enough for Phylogenetic Reconstruction in Historical Linguistics?

    We evaluate the performance of state-of-the-art algorithms for automatic cognate detection by comparing how useful automatically inferred cognates are for the task of phylogenetic inference compared to classical manually annotated cognate sets. Our findings suggest that phylogenies inferred from automated cognate sets come close to phylogenies inferred from expert-annotated ones, although on average, the latter are still superior. We conclude that future work on phylogenetic reconstruction can profit much from automatic cognate detection. Especially where scholars are merely interested in exploring the bigger picture of a language family’s phylogeny, algorithms for automatic cognate detection are a useful complement for current research on language phylogenies.

    Alignment Analysis of Sequential Segmentation of Lexicons to Improve Automatic Cognate Detection

    Ranking functions in information retrieval are often used in search engines to recommend the relevant answers to a query. This paper takes this notion from information retrieval and applies it to the problem of cognate detection. The main contributions of this paper are: (1) positional segmentation, which incorporates the sequential notion; (2) graphical error modelling, which deduces the transformations. Current research focuses on the classification problem, that is, distinguishing whether a pair of words are cognates. This paper focuses on a harder problem: whether we can predict a possible cognate from the given input. Our study shows that language modelling smoothing methods, when applied as retrieval functions and used in conjunction with positional segmentation and error modelling, give better results than competing baselines in both classification and prediction of cognates. Source code is at: https://github.com/pranav-ust/cognates Comment: Published at ACL-SRW 201

    Linear mappings: semantic transfer from transformer models for cognate detection and coreference resolution

    Includes bibliographical references. 2022 Fall. Embeddings or vector representations of language and their properties are useful for understanding how Natural Language Processing technology works. The usefulness of embeddings, however, depends on how contextualized or information-rich such embeddings are. In this work, I apply a novel affine (linear) mapping technique first established in the field of computer vision to embeddings generated from large Transformer-based language models. In particular, I study its use in two challenging linguistic tasks: cross-lingual cognate detection and cross-document coreference resolution. Cognate detection for two Low-Resource Languages (LRL), Assamese and Bengali, is framed as a binary classification problem using semantic (embedding-based), articulatory, and phonetic features. Linear maps for this task are extrinsically evaluated on the extent of transfer of semantic information between monolingual as well as multi-lingual models including those specialized for low-resourced Indian languages. For cross-document coreference resolution, whole-document contextual representations are generated for event and entity mentions from cross-document language models like CDLM and other BERT-variants and then linearly mapped to form coreferring clusters based on their cosine similarities. I evaluate my results on gold output based on established coreference metrics like BCUB and MUC. My findings reveal that linearly transforming vectors from one model's embedding space to another carries certain semantic information with high fidelity thereby revealing the existence of a canonical embedding space and its geometric properties for language models. Interestingly, even for a much more challenging task like coreference resolution, linear maps are able to transfer semantic information between "lighter" models or less contextual models and "larger" models with near-equivalent performance or even improved results in some cases.
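As a rough illustration of the idea in the abstract above (mapping a vector from one embedding space into another with an affine map, then comparing by cosine similarity), here is a minimal sketch. It is not the thesis's implementation: the function names `apply_affine` and `cosine`, the 2-D toy vectors, and the hand-picked map `W`, `b` are all assumptions made for illustration. In practice the map would be fitted from paired embeddings rather than written by hand.

```python
import math

def apply_affine(W, b, x):
    """Return W @ x + b for a matrix W (list of rows), bias b, and vector x."""
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * c for a, c in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(c * c for c in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy map: a 90-degree rotation with zero bias (an assumption,
# standing in for a map fitted between two models' embedding spaces).
W = [[0.0, -1.0],
     [1.0, 0.0]]
b = [0.0, 0.0]

source_vec = [1.0, 0.0]   # toy embedding in the "source" space
target_vec = [0.0, 2.0]   # toy embedding in the "target" space

mapped = apply_affine(W, b, source_vec)
similarity = cosine(mapped, target_vec)
print(mapped, similarity)
```

Cosine similarity ignores vector length, so the mapped vector and the target count as a perfect match here even though their norms differ.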

    Automated identification of borrowings in multilingual wordlists

    Although lexical borrowing is an important aspect of language evolution, there have been few attempts to automate the identification of borrowings in lexical datasets. Moreover, none of the solutions which have been proposed so far identify borrowings across multiple languages. This study proposes a new method for the task and tests it on a newly compiled large comparative dataset of 48 South-East Asian languages from Southern China. The method yields very promising results, while it is conceptually straightforward and easy to apply. This makes the approach a perfect candidate for computer-assisted exploratory studies on lexical borrowing in contact areas.

    A Global Lexical Dataset (GLED) with cognate annotation and phonological alignments

    This repository comprises a dataset developed from a subset of ASJP, in which all lemmas are presented in a broad phonological transcription, automatically annotated for cognacy, and phonologically aligned. Per-family NEXUS files with binary annotation of presence/absence of cognate sets are also available. The dataset is intended to facilitate prototyping studies and methods in quantitative historical linguistics.
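To make the "binary annotation of presence/absence of cognate sets" concrete, the following sketch builds one binary character string per language from cognate-set labels. It does not read or write the dataset's actual files. The function name `presence_absence_matrix`, the language names, and the cognate-set identifiers are hypothetical, and the output is only the kind of row that would go into a NEXUS data matrix.

```python
def presence_absence_matrix(cognates):
    """Build binary presence/absence rows.

    cognates maps each language name to the set of cognate-set ids
    attested in that language. Returns (columns, rows), where columns is
    the sorted list of all cognate-set ids and rows maps each language
    to a string of '1'/'0' characters, one per column.
    """
    columns = sorted(set().union(*cognates.values()))
    rows = {
        lang: "".join("1" if cs in attested else "0" for cs in columns)
        for lang, attested in cognates.items()
    }
    return columns, rows

# Hypothetical toy data: two concepts, each split into two cognate sets.
toy = {
    "LangA": {"hand-1", "water-1"},
    "LangB": {"hand-1", "water-2"},
    "LangC": {"hand-2", "water-2"},
}

columns, rows = presence_absence_matrix(toy)
print(columns)
for lang in sorted(rows):
    print(lang, rows[lang])
```

Each column is one cognate set, so languages sharing a `1` in a column share a cognate for that concept. Phylogenetic software typically takes this binary encoding as input.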

    Sentiment analysis for Hinglish code-mixed tweets by means of cross-lingual word embeddings

    Bangime: secret language, language isolate, or language island? A computer‐assisted case study

    We report the results of a qualitative and quantitative lexical comparison between Bangime and neighboring languages. Our results indicate that the status of the language as an isolate remains viable, and that Bangime speakers have had different levels of language contact with other Malian populations at various points throughout their history. Bangime speakers, the Bangande, claim Dogon ancestry. The Bangande portray this connection to Dogon through the fact that the language has both recent borrowings from neighboring Dogon varieties and more rooted vocabulary from Dogon languages spoken to the east from whence the Bangande claim to have come. Evidence of multilayered long‐term contact is clear: lexical items have permeated even core vocabulary. However, strikingly, the Bangande are seemingly unaware that their language is not intelligible with any Dogon variety. We hope that our findings will influence future studies on the reconstruction of the Dogon languages and other neighboring language varieties to shed light on the mysterious history of Bangime and its speakers.