
    A cross-linguistic database of phonetic transcription systems

    Contrary to what non-practitioners might expect, the systems of phonetic notation used by linguists are highly idiosyncratic. Not only do various linguistic subfields disagree on the specific symbols they use to denote the speech sounds of languages, but considerable variation can also be found in large databases of sound inventories. Inspired by recent efforts to link cross-linguistic data across different resources with the help of reference catalogues (Glottolog, Concepticon), we present initial efforts to link different phonetic notation systems to a catalogue of speech sounds. This is achieved with the help of a database accompanied by a software framework that uses a limited but easily extendable set of non-binary feature values to allow for quick and convenient registration of different transcription systems, while at the same time linking to additional datasets with restricted inventories. Linking different transcription systems enables us to conveniently translate between them, while linking sounds to databases gives users quick access to various kinds of metadata, including feature values, statistics on phoneme inventories, and information on prosody and sound classes. In order to prove the feasibility of this enterprise, we present an initial version of our cross-linguistic database of phonetic transcription systems (CLTS), which currently registers five transcription systems and links to fifteen datasets, as well as a web application that permits users to conveniently test the power of automatic translation across transcription systems.
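    The linking idea sketched above can be made concrete with a toy example. The snippet below is our own minimal illustration, not the actual CLTS data model or API: two hypothetical transcription-system fragments are registered by mapping their symbols to a shared bundle of non-binary feature values, and translation is routed through that bundle.

```python
# Minimal sketch of feature-based linking between transcription systems.
# The inventories and feature bundles below are invented for illustration;
# the real CLTS catalogue registers far larger systems.

IPA = {
    "tʃ": ("voiceless", "postalveolar", "affricate"),
    "ʃ":  ("voiceless", "postalveolar", "fricative"),
}
NAPA = {  # Americanist-style notation
    "č": ("voiceless", "postalveolar", "affricate"),
    "š": ("voiceless", "postalveolar", "fricative"),
}

def translate(symbol, source, target):
    """Translate a symbol by routing it through the shared feature bundle."""
    features = source[symbol]
    matches = [sym for sym, feats in target.items() if feats == features]
    if not matches:
        raise KeyError(f"no sound with features {features} in target system")
    return matches[0]

print(translate("tʃ", IPA, NAPA))  # -> 'č'
```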

    Speaker Accent Modulates the Effects of Orthographic and Phonological Similarity on Auditory Processing by Learners of English

    Published: 19 May 2022
    The cognate effect refers to translation equivalents with similar forms across languages (i.e., cognates, such as “band” in English and “banda” in Spanish) being processed faster than words with dissimilar forms (such as “cloud” and “nube”). A substantial body of literature supports this claim, but it is mostly based on orthographic similarity and tested in the visual modality. In a previous study, we found an inhibitory orthographic similarity effect in the auditory modality, i.e., greater orthographic similarity led to slower response times and reduced accuracy. The aim of the present study is to explain this effect. In doing so, we explore the role of the speaker’s accent in auditory word recognition and whether native accents lead to a mismatch between the participants’ phonological representation and the stimulus. Participants carried out a lexical decision task and a typing task in which they spelled out the word they heard. Words were produced by two speakers: one with a native English accent (Standard American) and the other with a non-native accent matching that of the participants (a native Spanish speaker from Spain). We manipulated orthographic and phonological similarity orthogonally and found that accent affected both response time and accuracy and modulated the effects of similarity. Overall, the non-native accent improved performance, but it did not fully explain why high orthographic similarity items show an inhibitory effect in the auditory modality. Theoretical implications and future directions are discussed.
    This research was supported by the Basque Government through the BERC 2022-2025 program and by the Spanish State Research Agency through BCBL Severo Ochoa excellence accreditation CEX2020-001010-S. CF and EN-B are supported by MINECO predoctoral grants from the Spanish government (BES-2016-077169 and BES-2016-078896, respectively). CM was further supported by the Spanish Ministry of Economy and Competitiveness (PID2020-113926GB-I00, PSI2017-82941-P, and RED2018-102615-T), the Basque Government (PIBA18-29), and funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (Grant Agreement No. 819093 to CM).
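    Since the study hinges on quantifying orthographic similarity between translation equivalents, a small worked example may help. The sketch below uses a normalized Levenshtein similarity, a common operationalization; this is our assumption for illustration, not necessarily the exact measure used in the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1 means identical forms."""
    return 1 - levenshtein(a, b) / max(len(a), len(b))

print(similarity("band", "banda"))  # cognate pair: 0.8
print(similarity("cloud", "nube"))  # non-cognate pair: 0.0
```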

    Assessment of Dyslexia in the Urdu Language


    Studying Evolutionary Change: Transdisciplinary Advances in Understanding and Measuring Evolution

    Evolutionary processes can be found in almost any historical, i.e. evolving, system that imperfectly copies from the past. Well-studied examples originate not only in evolutionary biology but also in historical linguistics. Yet an approach that binds together studies of such evolving systems remains elusive. This thesis is an attempt to narrow this gap to some extent. An evolving system can be described using characters that identify its changing features. While the problem of a proper choice of characters is beyond the scope of this thesis and remains in the hands of experts, we concern ourselves with some theoretical as well as data-driven approaches. Given a well-chosen set of characters describing a system of different entities, such as homologous genes, i.e. genes of common origin in different species, we can build a phylogenetic tree. Consider the special case of gene clusters containing paralogous genes, i.e. genes of common origin within a species, usually located close together, such as the well-known HOX cluster. These are formed by stepwise duplication of their members, often involving unequal crossing over that forms hybrid genes. Gene conversion and possibly other mechanisms of concerted evolution further obfuscate phylogenetic relationships. Hence, it is very difficult or even impossible to disentangle the detailed history of gene duplications in gene clusters. Expanding gene clusters through unequal crossing over, as proposed by Walter Gehring, leads to distinctive patterns of genetic distances. We show that this special class of distances still allows phylogenetic information to be extracted from the data. Disregarding genome rearrangements, we find that the shortest Hamiltonian path then coincides with the ordering of paralogous genes in a cluster. This observation can be used to detect ancient genomic rearrangements of gene clusters and to distinguish gene clusters whose evolution was dominated by unequal crossing over within genes from those that expanded through other mechanisms. While the evolution of DNA or protein sequences is well studied and can be formally described, we find that this does not hold for other systems, such as language evolution, owing to a lack of detectable mechanisms that drive the evolutionary processes in those fields. Hence, it is hard to quantify distances between entities, e.g. languages, and therefore the characters describing them. Starting out with distortions of distances, we first see that poor choices of the distance measure can lead to incorrect phylogenies. Given that phylogenetic inference requires additive metrics, we can infer the correct phylogeny from a distance matrix D if there is a monotonic, subadditive function ζ such that ζ⁻¹(D) is additive. We compute the metric-preserving transformation ζ as the solution of an optimization problem. This result shows that the problem of phylogeny reconstruction is well defined even if a detailed mechanistic model of the evolutionary process is missing. Yet, this does not hinder studies of language evolution using automated tools. As the number of large digital corpora available has increased, so have the possibilities for studying them automatically. The obvious parallels between historical linguistics and phylogenetics have led many studies to adapt bioinformatics tools to linguistic needs. Here, we use jAlign, an alignment algorithm that takes the adjacency of letters into account, to calculate bigram alignments.
    Its performance is tested in different cognate recognition tasks. One major obstacle with pairwise alignments is the systematic errors they make, such as the underestimation and misplacement of gaps. Applying multiple sequence alignments instead of a pairwise algorithm implicitly includes more evolutionary information and can thus overcome the problem of correct gap placement. Multiple sequence alignments can be seen as a generalization of the string-to-string edit problem to more than two strings. With the steady increase in computational power, exact dynamic programming solutions have become feasible in practice also for 3- and 4-way alignments. For the pairwise (2-way) case, there is a clear distinction between local and global alignments. As more sequences are considered, this distinction, which can in fact be made independently for both ends of each sequence, gives rise to a rich set of partially local alignment problems. So far these have remained largely unexplored. Thus, a general formal framework that gives rise to a classification of partially local alignment problems is introduced. It leads to a generic scheme that guides the principled design of exact dynamic programming solutions for particular partially local alignment problems.
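    The claim that a distance matrix D supports phylogeny reconstruction whenever ζ⁻¹(D) is additive can be checked directly: an additive (tree-realizable) metric satisfies the four-point condition, under which, for every quadruple of taxa, the two largest of the three pairwise distance sums coincide. The sketch below is our own illustration of that check, not code from the thesis; for ζ being the square root, squaring the entries of ζ(D) recovers the additive matrix.

```python
from itertools import combinations

def is_additive(D, tol=1e-9):
    """Four-point condition: for every quadruple (i, j, k, l), the two
    largest of D[i][j]+D[k][l], D[i][k]+D[j][l], D[i][l]+D[j][k] must be
    equal; a metric is realizable by an edge-weighted tree iff this holds."""
    for i, j, k, l in combinations(range(len(D)), 4):
        sums = sorted([D[i][j] + D[k][l],
                       D[i][k] + D[j][l],
                       D[i][l] + D[j][k]])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True

# Distances along a path tree a-b-c-d with unit branch lengths are additive;
# distorting them with a concave ζ such as the square root breaks this,
# while squaring the distorted entries (applying ζ⁻¹) restores it.
D = [[0, 1, 2, 3],
     [1, 0, 1, 2],
     [2, 1, 0, 1],
     [3, 2, 1, 0]]
D_sqrt = [[x ** 0.5 for x in row] for row in D]
print(is_additive(D))       # True
print(is_additive(D_sqrt))  # False
```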

    Exploring the Cognitive Underpinnings of Word Retrieval Deficits in Dyslexia Using the Tip-of-the-Tongue Paradigm

    Over the past thirty years, a consensus has emerged that the word reading difficulties of dyslexic readers stem from deficits in phonological processing. One experimental paradigm that has provided support for this view is the finding that dyslexic readers demonstrate deficits in word retrieval from long-term memory on picture naming tasks. Dyslexic readers are able to retrieve fewer of the words in their receptive vocabularies and are less accurate than normally developing readers. However, the conclusion that dyslexic readers' difficulties in picture naming are the consequence of deficits in phonological processing is inferential. The current study uses the tip-of-the-tongue (TOT) paradigm to provide evidence that dyslexic readers demonstrate a specific deficit in the retrieval of phonological information from long-term memory. Participants consisted of 16 dyslexic children and 31 control children with a mean age of 115 months. Children were given a picture naming task consisting of 143 target words that varied in length and frequency of use. Results indicate that dyslexic children report more TOT experiences than control children. Moreover, when examined from the perspective of theoretical models of word retrieval, dyslexic children did not differ from control children in the percentage of failures at the first step of word retrieval, the retrieval of semantic information. However, dyslexic children reported a significantly higher proportion of failures at the second step of word retrieval, the retrieval of phonological representations. This is one of the first studies to provide direct evidence that dyslexia is related to a specific deficit in phonological representation.
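    The two-step logic of the analysis can be illustrated with a toy computation (our own sketch; the outcome codes and counts are invented, not the study's data): each naming attempt is classified by whether retrieval broke down at the semantic step or at the phonological step, and the groups are compared on the resulting proportions.

```python
from collections import Counter

# Hypothetical outcome codes per naming attempt:
# 'correct'          - target named successfully
# 'semantic_failure' - step 1 failure: meaning not retrieved
# 'tot'              - step 2 failure: meaning retrieved, form inaccessible
def failure_rates(outcomes):
    counts = Counter(outcomes)
    n = len(outcomes)
    return {"step1_semantic": counts["semantic_failure"] / n,
            "step2_phonological": counts["tot"] / n}

dyslexic = ["correct"] * 100 + ["semantic_failure"] * 12 + ["tot"] * 20
control = ["correct"] * 115 + ["semantic_failure"] * 12 + ["tot"] * 8
print(failure_rates(dyslexic))  # similar step-1, elevated step-2 rate
print(failure_rates(control))
```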

    Techniques for Automatic Normalization of Orthographically Variant Yiddish Texts

    Yiddish is characterized by a multitude of orthographic systems. A number of approaches to the automatic normalization of variant orthography have been explored for the processing of historical texts of languages whose orthography has since been standardized. However, these approaches have not yet been applied to Yiddish. Using a manually normalized set of 16 Yiddish documents as a training and test corpus, four techniques for automatic normalization were compared: a hand-crafted set of transformation rules, an off-the-shelf spell checker, edit distance minimization with manually set weights, and edit distance minimization with weights learned from a training set. Performance was evaluated by calculating the proportion of correctly normalized words in a test set, and by measuring precision and recall in an information retrieval test. For the given test corpus, normalization by minimization of edit distance with multi-character edit operations and learned weights was found to perform best in all tests.
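    To make the best-performing technique concrete, here is a minimal sketch of normalization by weighted edit distance with multi-character operations. The operation inventory, weights, romanized word forms, and lexicon are all invented for illustration; in the study the weights are learned from the training corpus and the material is in Yiddish script.

```python
def weighted_distance(src, tgt, ops):
    """Edit distance with unit-cost single-character edits plus an
    inventory of weighted (possibly multi-character) rewrite operations.
    ops maps (variant_substring, standard_substring) -> cost."""
    n, m = len(src), len(tgt)
    INF = float("inf")
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if d[i][j] == INF:
                continue
            if i < n:                       # deletion
                d[i + 1][j] = min(d[i + 1][j], d[i][j] + 1)
            if j < m:                       # insertion
                d[i][j + 1] = min(d[i][j + 1], d[i][j] + 1)
            if i < n and j < m:             # substitution or match
                cost = 0 if src[i] == tgt[j] else 1
                d[i + 1][j + 1] = min(d[i + 1][j + 1], d[i][j] + cost)
            for (a, b), w in ops.items():   # weighted multi-char rewrites
                if src.startswith(a, i) and tgt.startswith(b, j):
                    d[i + len(a)][j + len(b)] = min(
                        d[i + len(a)][j + len(b)], d[i][j] + w)
    return d[n][m]

OPS = {("ei", "ey"): 0.1, ("f", "v"): 0.2}    # invented weights
LEXICON = ["veyter", "shrayber"]              # invented standard forms

def normalize(variant):
    """Pick the standard form minimizing weighted edit distance."""
    return min(LEXICON, key=lambda w: weighted_distance(variant, w, OPS))

print(normalize("feiter"))  # -> 'veyter' (cost 0.3 via f->v and ei->ey)
```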

    The Semantics of Word Division in Northwest Semitic Writing Systems

    Much focus in writing systems research has been on correspondences at the level of the grapheme/phoneme. Seeking to complement these, this monograph considers the targets of graphic word-level units in natural language, focusing on ancient Northwest Semitic (NWS) writing systems, principally Hebrew, Aramaic, Phoenician, and Ugaritic. While in modern European languages word division tends to mark up morphosyntactic elements, in most NWS writing systems word division is argued to target prosodic units, whereby written ‘words’ consist of units which must be pronounced together with a single primary accent or stress. This is opposed to other possibilities, including semantic word division, as seen in Middle Egyptian hieroglyphic. The monograph starts by considering word division in a source where, unlike the rest of the material considered, the phonology is well represented: the medieval tradition of Tiberian Hebrew and Aramaic. There word division is found to mark up ‘minimal prosodic words’, i.e. units that must under any circumstances be pronounced together as a single phonological unit. After considering the Sitz im Leben of such a word division strategy, the monograph moves on to compare Tiberian word division with that in early epigraphic NWS, where it is shown that orthographic wordhood has an almost identical distribution. The most economical explanation for this is argued to be that word division has had the same underlying basis in NWS writing since the earliest times. Thereafter, word division in Ugaritic alphabetic cuneiform is considered, where two word division strategies are identified, corresponding broadly to two genres of text: poetry and prose. ‘Poetic’ word division is taken as an instance of mainstream ‘prosodic word division’, while the other is morphosyntactic in scope, anticipating later word division strategies in Europe by several centuries. Finally, the monograph considers the digital encoding of word division in NWS texts, especially the difficulties posed by, as well as potential solutions to, the problem of marking up texts with overlapping, viz. morphosyntactic and prosodic, analyses.
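    The closing problem, encoding two analyses whose units overlap, is commonly handled with standoff annotation: the base text is stored once and each analysis layer is a separate list of character spans, so prosodic and morphosyntactic units can cross each other freely. The sketch below is our own illustration of this general technique (the transliterated example and its segmentation are invented), not the encoding the monograph itself proposes.

```python
from dataclasses import dataclass

@dataclass
class Span:
    start: int   # character offset into the base text (inclusive)
    end: int     # character offset (exclusive)
    layer: str   # analysis layer this span belongs to

text = "wayyomer lo"  # transliterated example, purely illustrative
spans = [
    Span(0, 11, "prosodic-word"),         # whole unit under one stress
    Span(0, 2, "morphosyntactic-word"),   # conjunction wa-
    Span(2, 8, "morphosyntactic-word"),   # verb form (with geminate y)
    Span(9, 11, "morphosyntactic-word"),  # preposition + suffix
]

for s in spans:
    print(f"{s.layer:21} {text[s.start:s.end]!r}")
```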

    Information-theoretic causal inference of lexical flow

    This volume seeks to infer large phylogenetic networks from phonetically encoded lexical data and in this way contribute to the historical study of language varieties. The technical step that enables progress in this case is the use of causal inference algorithms. Sample sets of words from language varieties are preprocessed into automatically inferred cognate sets and then modeled as information-theoretic variables based on an intuitive measure of cognate overlap. Causal inference is then applied to these variables in order to determine the existence and direction of influence among the varieties. The directed arcs in the resulting graph structures can be interpreted as reflecting the existence and directionality of lexical flow, a unified model which subsumes inheritance and borrowing as the two main modes of transmission that shape the basic lexicon of languages. A flow-based separation criterion and domain-specific directionality detection criteria are developed to make existing causal inference algorithms more robust against imperfect cognacy data, giving rise to two new algorithms. The Phylogenetic Lexical Flow Inference (PLFI) algorithm requires the lexical features of proto-languages to be reconstructed in advance but yields fully general phylogenetic networks, whereas the more complex Contact Lexical Flow Inference (CLFI) algorithm treats proto-languages as hidden common causes and only returns hypotheses about historical contact situations between attested languages. The algorithms are evaluated both against a large lexical database of Northern Eurasia spanning many language families and against simulated data generated by a new model of language contact that builds on the opening and closing of directional contact channels as primary evolutionary events. The algorithms are found to infer the existence of contacts very reliably, whereas the inference of directionality remains difficult. This currently limits the new algorithms to a role as exploratory tools for quickly detecting salient patterns in large lexical datasets, but it should soon be possible to enhance the framework, e.g. with confidence values for each directionality decision.
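    As a toy illustration of the preprocessing step, the sketch below represents each variety by the set of cognate classes its basic vocabulary falls into and computes a pairwise overlap score; the inventories and the Jaccard-style measure are our own stand-ins for the book's information-theoretic formulation.

```python
def cognate_overlap(a, b):
    """Jaccard-style fraction of cognate classes shared by two varieties."""
    return len(a & b) / len(a | b)

# Hypothetical cognate-class inventories: 'water-1' means this variety's
# word for 'water' belongs to cognate class 1 of that concept.
V1 = {"water-1", "dog-1", "stone-2", "fish-1"}
V2 = {"water-1", "dog-1", "stone-1", "fish-1"}
V3 = {"water-2", "dog-2", "stone-1", "fish-3"}

for name, (x, y) in {"V1~V2": (V1, V2), "V1~V3": (V1, V3),
                     "V2~V3": (V2, V3)}.items():
    # High overlap suggests shared history or contact between varieties.
    print(name, round(cognate_overlap(x, y), 3))
```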

    Embedding Web-based Statistical Translation Models in Cross-Language Information Retrieval

    Although more and more language pairs are covered by machine translation services, there are still many pairs that lack translation resources. Cross-language information retrieval (CLIR) is an application which needs translation functionality of a relatively low level of sophistication, since current models for information retrieval (IR) are still based on a bag-of-words representation. The Web provides a vast resource for the automatic construction of parallel corpora, which can be used to train statistical translation models automatically. The resulting translation models can be embedded in several ways in a retrieval model. In this paper, we investigate the problem of automatically mining parallel texts from the Web and different ways of integrating the translation models within the retrieval process. Our experiments on standard test collections for CLIR show that the Web-based translation models can surpass commercial MT systems in CLIR tasks. These results open the perspective of constructing a fully automatic query translation device for CLIR at a very low cost.
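    One simple way to embed such a translation model in a bag-of-words retrieval model is weighted query translation: each source-language term is expanded into target-language terms weighted by translation probability, and documents are scored against the expanded query. The probabilities and documents below are invented; the paper's actual embedding strategies are more elaborate.

```python
# Toy translation table P(target term | source term), as might be
# estimated from Web-mined parallel text (values invented).
T = {
    "cloud": {"nuage": 0.8, "nuée": 0.2},
    "storm": {"tempête": 0.7, "orage": 0.3},
}

def translate_query(terms):
    """Expand a source-language query into weighted target-language terms."""
    weighted = {}
    for t in terms:
        for trans, p in T.get(t, {}).items():
            weighted[trans] = weighted.get(trans, 0.0) + p
    return weighted

def score(doc_tokens, weighted_query):
    """Bag-of-words score: summed weights of query terms in the document."""
    return sum(w for term, w in weighted_query.items() if term in doc_tokens)

docs = {"d1": {"tempête", "sur", "la", "mer"},
        "d2": {"nuage", "et", "orage"}}
wq = translate_query(["cloud", "storm"])
for doc_id, tokens in sorted(docs.items()):
    print(doc_id, score(tokens, wq))  # d1: 0.7, d2: 1.1
```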