132 research outputs found

Identifying cognates in English-Dutch and French-Dutch by means of orthographic information and cross-lingual word embeddings

Author: Labat Sofie
Lefever Els
Singh Pranaydeep
Publication venue: European Language Resources Association (ELRA)
Publication date: 01/01/2020

Ghent University Academic Bibliography

Annotation guidelines for labeling English-Dutch cognate pairs (version 1.0)

Author: Labat Sofie
Lefever Els
Vandevoorde Lore
Publication venue: LT3, Faculty of Arts, Humanities and Law, Ghent University
Publication date: 01/01/2019

Ghent University Academic Bibliography

Are Automatic Methods for Cognate Detection Good Enough for Phylogenetic Reconstruction in Historical Linguistics?

Author: Gerhard Jäger
Johann-Mattis List
Johannes Wahle
Taraka Rama
Publication venue: 'Modern Language Association'
Publication date: 01/01/2018

We evaluate the performance of state-of-the-art algorithms for automatic cognate detection by comparing how useful automatically inferred cognates are for the task of phylogenetic inference compared to classical manually annotated cognate sets. Our findings suggest that phylogenies inferred from automated cognate sets come close to phylogenies inferred from expert-annotated ones, although on average, the latter are still superior. We conclude that future work on phylogenetic reconstruction can profit much from automatic cognate detection. Especially where scholars are merely interested in exploring the bigger picture of a language family’s phylogeny, algorithms for automatic cognate detection are a useful complement for current research on language phylogenies.

arXiv.org e-Print Archive

Crossref

Humanities Commons

MPG.PuRe

Alignment Analysis of Sequential Segmentation of Lexicons to Improve Automatic Cognate Detection

Author: A Pranav
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2018

Ranking functions in information retrieval are often used in search engines to recommend the relevant answers to the query. This paper borrows this notion from information retrieval and applies it to the problem of cognate detection. The main contributions of this paper are: (1) positional segmentation, which incorporates the sequential notion; (2) graphical error modelling, which deduces the transformations. Current research focuses on the classification problem, that is, distinguishing whether a pair of words are cognates. This paper focuses on a harder problem: whether we can predict a possible cognate from the given input. Our study shows that language modelling smoothing methods, when applied as retrieval functions and used in conjunction with positional segmentation and error modelling, give better results than competing baselines in both classification and prediction of cognates. Source code is at: https://github.com/pranav-ust/cognates Comment: Published at ACL-SRW 201

arXiv.org e-Print Archive

Crossref

Linear mappings: semantic transfer from transformer models for cognate detection and coreference resolution

Author: Nath Abhijnan
Publication venue: Colorado State University. Libraries
Publication date: 01/01/2022

Includes bibliographical references. 2022 Fall. Embeddings or vector representations of language and their properties are useful for understanding how Natural Language Processing technology works. The usefulness of embeddings, however, depends on how contextualized or information-rich such embeddings are. In this work, I apply a novel affine (linear) mapping technique first established in the field of computer vision to embeddings generated from large Transformer-based language models. In particular, I study its use in two challenging linguistic tasks: cross-lingual cognate detection and cross-document coreference resolution. Cognate detection for two Low-Resource Languages (LRL), Assamese and Bengali, is framed as a binary classification problem using semantic (embedding-based), articulatory, and phonetic features. Linear maps for this task are extrinsically evaluated on the extent of transfer of semantic information between monolingual as well as multi-lingual models including those specialized for low-resourced Indian languages. For cross-document coreference resolution, whole-document contextual representations are generated for event and entity mentions from cross-document language models like CDLM and other BERT-variants and then linearly mapped to form coreferring clusters based on their cosine similarities. I evaluate my results on gold output based on established coreference metrics like BCUB and MUC. My findings reveal that linearly transforming vectors from one model's embedding space to another carries certain semantic information with high fidelity, thereby revealing the existence of a canonical embedding space and its geometric properties for language models. Interestingly, even for a much more challenging task like coreference resolution, linear maps are able to transfer semantic information between "lighter" models or less contextual models and "larger" models with near-equivalent performance or even improved results in some cases.

Mountain Scholar (Digital Collections of Colorado and Wyoming)

Automated identification of borrowings in multilingual wordlists

Author: Forkel R.
List J.
Publication venue: 'F1000 Research Ltd'
Publication date: 01/01/2021

Although lexical borrowing is an important aspect of language evolution, there have been few attempts to automate the identification of borrowings in lexical datasets. Moreover, none of the solutions which have been proposed so far identify borrowings across multiple languages. This study proposes a new method for the task and tests it on a newly compiled large comparative dataset of 48 South-East Asian languages from Southern China. The method yields very promising results, while it is conceptually straightforward and easy to apply. This makes the approach a perfect candidate for computer-assisted exploratory studies on lexical borrowing in contact areas.

MPG.PuRe

A Global Lexical Dataset (GLED) with cognate annotation and phonological alignments

Author: Tiago Tresoldi
Publication venue
Publication date: 01/01/2022

This repository comprises a dataset developed from a subset of ASJP, in which all lemmas are presented in a broad phonological transcription, automatically annotated for cognacy, and phonologically aligned. Per-family NEXUS files with binary annotation of presence/absence of cognate sets are also available. The dataset is intended to facilitate prototyping studies and methods in quantitative historical linguistics.

Humanities Commons

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

Sentiment analysis for Hinglish code-mixed tweets by means of cross-lingual word embeddings

Author: Lefever Els
Singh Pranaydeep
Publication venue: European Language Resources Association (ELRA)
Publication date: 01/01/2020

Ghent University Academic Bibliography

Bangime: secret language, language isolate, or language island? A computer‐assisted case study

Author: Hantgan Abbie
List Johann‐Mattis
Publication venue: 'Edinburgh University Library'
Publication date: 07/09/2022

We report the results of a qualitative and quantitative lexical comparison between Bangime and neighboring languages. Our results indicate that the status of the language as an isolate remains viable, and that Bangime speakers have had different levels of language contact with other Malian populations at various points throughout their history. Bangime speakers, the Bangande, claim Dogon ancestry. The Bangande portray this connection to Dogon through the fact that the language has both recent borrowings from neighboring Dogon varieties and more rooted vocabulary from Dogon languages spoken to the east from whence the Bangande claim to have come. Evidence of multilayered long‐term contact is clear: lexical items have permeated even core vocabulary. However, strikingly, the Bangande are seemingly unaware that their language is not intelligible with any Dogon variety. We hope that our findings will influence future studies on the reconstruction of the Dogon languages and other neighboring language varieties to shed light on the mysterious history of Bangime and its speakers.

Papers in Historical Phonology

Journal Hosting Service | The University of Edinburgh

MPG.PuRe