1,653 research outputs found

    An Algorithm For Building Language Superfamilies Using Swadesh Lists

    Get PDF
    The main contributions of this thesis are the following: i. Developing an algorithm to generate language families and superfamilies given for each input language a Swadesh list represented using the international phonetic alphabet (IPA) notation. ii. The algorithm is novel in using the Levenshtein distance metric on the IPA representation and in the way it measures overall distance between pairs of Swadesh lists. iii. Building a Swadesh list for the author\u27s native Kinyarwanda language because a Swadesh list could not be found even after an extensive search for it. Adviser: Peter Reves

    Using Global Constraints and Reranking to Improve Cognates Detection

    Full text link
    Global constraints and reranking have not been used in cognates detection research to date. We propose methods for using global constraints by performing rescoring of the score matrices produced by state of the art cognates detection systems. Using global constraints to perform rescoring is complementary to state of the art methods for performing cognates detection and results in significant performance improvements beyond current state of the art performance on publicly available datasets with different language pairs and various conditions such as different levels of baseline state of the art performance and different data size conditions, including with more realistic large data size conditions than have been evaluated with in the past.Comment: 10 pages, 6 figures, 6 tables; published in the Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1983-1992, Vancouver, Canada, July 201

    Computational phylogenetics and the classification of South American languages

    Get PDF
    In recent years, South Americanist linguists have embraced computational phylogenetic methods to resolve the numerous outstanding questions about the genealogi- cal relationships among the languages of the continent. We provide a critical review of the methods and language classification results that have accumulated thus far, emphasizing the superiority of character-based methods over distance-based ones and the importance of develop- ing adequate comparative datasets for producing well- resolved classifications

    Automatic Identification of False Friends in Parallel Corpora: Statistical and Semantic Approach

    Get PDF
    False friends are pairs of words in two languages that are perceived as similar but have different meanings. We present an improved algorithm for acquiring false friends from sentence-level aligned parallel corpus based on statistical observations of words occurrences and co-occurrences in the parallel sentences. The results are compared with an entirely semantic measure for cross-lingual similarity between words based on using the Web as a corpus through analyzing the words’ local contexts extracted from the text snippets returned by searching in Google. The statistical and semantic measures are further combined into an improved algorithm for identification of false friends that achieves almost twice better results than previously known algorithms. The evaluation is performed for identifying cognates between Bulgarian and Russian but the proposed methods could be adopted for other language pairs for which parallel corpora and bilingual glossaries are available

    Cross-Linguistic Influence in the Bilingual Mental Lexicon: Evidence of Cognate Effects in the Phonetic Production and Processing of a Vowel Contrast.

    Get PDF
    The present study examines cognate effects in the phonetic production and processing of the Catalan back mid-vowel contrast (/o/-/ɔ/) by 24 early and highly proficient Spanish-Catalan bilinguals in Majorca (Spain). Participants completed a picture-naming task and a forced-choice lexical decision task in which they were presented with either words (e.g., /bɔsk/ "forest") or non-words based on real words, but with the alternate mid-vowel pair in stressed position ((*)/bosk/). The same cognate and non-cognate lexical items were included in the production and lexical decision experiments. The results indicate that even though these early bilinguals maintained the back mid-vowel contrast in their productions, they had great difficulties identifying non-words and real words based on the identity of the Catalan mid-vowel. The analyses revealed language dominance and cognate effects: Spanish-dominants exhibited higher error rates than Catalan-dominants, and production and lexical decision accuracy were also affected by cognate status. The present study contributes to the discussion of the organization of early bilinguals' dominant and non-dominant sound systems, and proposes that exemplar theoretic approaches can be extended to include bilingual lexical connections that account for the interactions between the phonetic and lexical levels of early bilingual individuals

    Automated identification of borrowings in multilingual wordlists

    Get PDF
    Although lexical borrowing is an important aspect of language evolution, there have been few attempts to automate the identification of borrowings in lexical datasets. Moreover, none of the solutions which have been proposed so far identify borrowings across multiple languages. This study proposes a new method for the task and tests it on a newly compiled large comparative dataset of 48 South-East Asian languages from Southern China. The method yields very promising results, while it is conceptually straightforward and easy to apply. This makes the approach a perfect candidate for computer-assisted exploratory studies on lexical borrowing in contact areas
    • 

    corecore