Trimming Phonetic Alignments Improves the Inference of Sound Correspondence Patterns from Multilingual Wordlists
Sound correspondence patterns form the basis of cognate detection and
phonological reconstruction in historical language comparison. Methods for the
automatic inference of correspondence patterns from phonetically aligned
cognate sets have been proposed, but their application to multilingual
wordlists requires extremely well annotated datasets. Since annotation is
tedious and time consuming, it would be desirable to find ways to improve
aligned cognate data automatically. Taking inspiration from trimming techniques
in evolutionary biology, which improve alignments by excluding problematic
sites, we propose a workflow that trims phonetic alignments in comparative
linguistics prior to the inference of correspondence patterns. Testing these
techniques on a large standardized collection of ten datasets with expert
annotations from different language families, we find that the best trimming
technique substantially improves the overall consistency of the alignments. The
results show a clear increase in the proportion of frequent correspondence
patterns and words exhibiting regular cognate relations.
Comment: The paper was accepted at the SIGTYP workshop 2023 co-located with
EAC
Computer-Assisted Language Comparison: State of the Art
Historical language comparison opens windows onto a human past, long before the availability of written records. Since traditional language comparison within the framework of the comparative method is largely based on manual data comparison, requiring meticulous sifting through dictionaries, word lists, and grammars, the framework is difficult to apply, especially at a time when more and more data have become available in digital form. Unfortunately, it is not possible to simply automate the process of historical language comparison, not only because computational solutions lag behind human judgments in historical linguistics, but also because they lack the flexibility that would allow them to integrate various types of information from various kinds of sources. A more promising approach is to integrate computational and classical approaches within a computer-assisted framework, “neither completely computer-driven nor ignorant of the assistance computers afford” [1, p. 4]. In this paper, we will illustrate what we consider the current state of the art of computer-assisted language comparison by presenting a workflow that starts with raw data and leads up to a stage where sound correspondence patterns across multiple languages have been identified and can be readily presented, inspected, and discussed. We illustrate this workflow with the help of a newly prepared dataset on Hmong-Mien languages. Our illustration is accompanied by Python code and instructions on how to use additional web-based tools we developed so that users can apply our workflow for their own purposes.
Automatic detection of borrowing (Open problems in computational diversity linguistics 2)
This is the third of a series of 12 blog posts published in 2019, discussing open problems in computational diversity linguistics. It discusses the problem of automatic borrowing detection.
Handling word formation in comparative linguistics
Word formation plays a central role in human language. Yet computational approaches to historical linguistics often pay little attention to it. This means that the detailed findings of classical historical linguistics are often used only in qualitative studies, but not in quantitative ones. Based on human- and machine-readable formats suggested by the CLDF initiative, we propose a framework for the annotation of cross-linguistic etymological relations that allows for the differentiation between etymologies that involve only regular sound change and those that involve linear and non-linear processes of word formation. This paper introduces this approach by means of sample datasets and a small Python library to facilitate annotation.
Representing and Computing Uncertainty in Phonological Reconstruction
Despite the inherently fuzzy nature of reconstructions in historical
linguistics, most scholars do not represent their uncertainty when proposing
proto-forms. With the increasing success of recently proposed approaches to
automating certain aspects of the traditional comparative method, the formal
representation of proto-forms has also improved. This formalization makes it
possible to address both the representation and the computation of uncertainty.
Building on recent advances in supervised phonological reconstruction, during
which an algorithm learns how to reconstruct words in a given proto-language
relying on previously annotated data, and inspired by improved methods for
automated word prediction from cognate sets, we present a new framework that
allows for the representation of uncertainty in linguistic reconstruction and
also includes a workflow for the computation of fuzzy reconstructions from
linguistic data.
Comment: To appear in: Proceedings of the 4th Workshop on Computational
Approaches to Historical Language Chang
The Heterogeneity of Reading-Related Difficulties in Chinese
The present chapter reviews cognitive-linguistic skills which are associated with various reading-related difficulties in Chinese. Research findings have shown that rapid naming and orthographic deficits are the unique marker deficits of Chinese developmental dyslexia. However, studies have indicated overlapping and dissociative deficits in dyslexia and spelling difficulties. Findings on dissociation between word reading and spelling difficulties suggest that weaknesses in orthographic processing may specifically cause difficulties in Chinese word spelling. Deficits in rapid naming are more associated with word reading fluency than reading accuracy. Beyond word-level processing, there are children who encounter difficulties in reading comprehension even with adequate decoding skills. This group of specific poor comprehenders was found to be weak in some discourse-level skills, like comprehension monitoring and inferencing. Knowledge of these findings will inform us about effective identification of and intervention for children with difficulties in one or a combination of several reading-related difficulties in Chinese.
A pragmatic approach to semantic repositories benchmarking
The aim of this paper is to benchmark various semantic repositories in order to evaluate their deployment in a commercial image retrieval and browsing application. We adopt a two-phase approach for evaluating the target semantic repositories: analytical parameters such as query language and reasoning support are used to select the pool of the target repositories, and practical parameters such as load and query response times are used to select the best match to application requirements. In addition to utilising a widely accepted benchmark for OWL repositories (UOBM), we also use a real-life dataset from the target application, which provides us with the opportunity of consolidating our findings. A distinctive advantage of this benchmarking study is that the essential requirements for the target system, such as the semantic expressivity and data scalability, are clearly defined, which allows us to claim a contribution to the benchmarking methodology for this class of applications.
Networking Phylogeny for Indo-European and Austronesian Languages
Harnessing cognitive abilities of many individuals, a language evolves upon their mutual interactions establishing a persistent social environment to which language is closely attuned. Human history is encoded in the rich sets of linguistic data by means of symmetry patterns that are not always feasibly represented by trees. Here we use the methods developed in the study of complex networks to decipher accurately symmetry records on the language phylogeny of the Indo-European and the Austronesian language families, considering, in both cases, the samples of fifty different languages. In particular, we support the Anatolian theory of Indo-European origin and the ‘express train’ model of Austronesian expansion from South-East Asia, with an essential role for the Batanes islands located between the Philippines and Taiwan.
A computer-assisted approach to the comparison of mainland southeast Asian languages
This cumulative thesis is based on three separate projects that use a computer-assisted language comparison (CALC) framework to address common obstacles to studying the history of Mainland Southeast Asian (MSEA) languages, such as sparse and non-standardized lexical data, as well as an inadequate method of cognate judgments, and to provide caveats to scholars who will use Bayesian phylogenetic analysis. The first project provides a format that standardizes the sound inventories, regulates language labels, and clarifies lexical items. This standardized format allows us to merge various forms of raw data. The format also summarizes information to assist linguists in researching the relatedness among words and inferring relationships among languages. The second project focuses on increasing the transparency of lexical data and cognate judgments with regard to compound words. The method enables the annotation of each part of a word with semantic meanings and syntactic features. In addition, four different conversion methods were developed to convert morpheme cognates into word cognates for input into the Bayesian phylogenetic analysis. The third project applies the methods used in the first project to create a workflow by merging linguistic data sets and inferring a language tree using a Bayesian phylogenetic algorithm. Furthermore, the project addresses the importance of integrating cross-disciplinary studies into historical linguistic research. Finally, the methods we proposed for managing lexical data for MSEA languages are discussed and summarized in six perspectives. The work can be seen as a milestone in reconstructing human prehistory in an area that has high linguistic and cultural diversity.