1,653 research outputs found
An Algorithm For Building Language Superfamilies Using Swadesh Lists
The main contributions of this thesis are the following: i. Developing an algorithm to generate language families and superfamilies given for each input language a Swadesh list represented using the international phonetic alphabet (IPA) notation. ii. The algorithm is novel in using the Levenshtein distance metric on the IPA representation and in the way it measures overall distance between pairs of Swadesh lists. iii. Building a Swadesh list for the author\u27s native Kinyarwanda language because a Swadesh list could not be found even after an extensive search for it.
Adviser: Peter Reves
Using Global Constraints and Reranking to Improve Cognates Detection
Global constraints and reranking have not been used in cognates detection
research to date. We propose methods for using global constraints by performing
rescoring of the score matrices produced by state of the art cognates detection
systems. Using global constraints to perform rescoring is complementary to
state of the art methods for performing cognates detection and results in
significant performance improvements beyond current state of the art
performance on publicly available datasets with different language pairs and
various conditions such as different levels of baseline state of the art
performance and different data size conditions, including with more realistic
large data size conditions than have been evaluated with in the past.Comment: 10 pages, 6 figures, 6 tables; published in the Proceedings of the
55th Annual Meeting of the Association for Computational Linguistics, pages
1983-1992, Vancouver, Canada, July 201
Computational phylogenetics and the classification of South American languages
In recent years, South Americanist linguists have embraced computational phylogenetic methods to resolve the numerous outstanding questions about the genealogi- cal relationships among the languages of the continent. We provide a critical review of the methods and language classification results that have accumulated thus far, emphasizing the superiority of character-based methods over distance-based ones and the importance of develop- ing adequate comparative datasets for producing well- resolved classifications
Automatic Identification of False Friends in Parallel Corpora: Statistical and Semantic Approach
False friends are pairs of words in two languages that are perceived as
similar but have different meanings. We present an improved
algorithm for acquiring false friends from sentence-level aligned parallel corpus
based on statistical observations of words occurrences and co-occurrences
in the parallel sentences. The results are compared with an entirely semantic
measure for cross-lingual similarity between words based on using the Web
as a corpus through analyzing the wordsâ local contexts extracted from the
text snippets returned by searching in Google. The statistical and semantic
measures are further combined into an improved algorithm for identification
of false friends that achieves almost twice better results than previously
known algorithms. The evaluation is performed for identifying cognates
between Bulgarian and Russian but the proposed methods could be adopted
for other language pairs for which parallel corpora and bilingual glossaries
are available
Cross-Linguistic Influence in the Bilingual Mental Lexicon: Evidence of Cognate Effects in the Phonetic Production and Processing of a Vowel Contrast.
The present study examines cognate effects in the phonetic production and processing of the Catalan back mid-vowel contrast (/o/-/É/) by 24 early and highly proficient Spanish-Catalan bilinguals in Majorca (Spain). Participants completed a picture-naming task and a forced-choice lexical decision task in which they were presented with either words (e.g., /bÉsk/ "forest") or non-words based on real words, but with the alternate mid-vowel pair in stressed position ((*)/bosk/). The same cognate and non-cognate lexical items were included in the production and lexical decision experiments. The results indicate that even though these early bilinguals maintained the back mid-vowel contrast in their productions, they had great difficulties identifying non-words and real words based on the identity of the Catalan mid-vowel. The analyses revealed language dominance and cognate effects: Spanish-dominants exhibited higher error rates than Catalan-dominants, and production and lexical decision accuracy were also affected by cognate status. The present study contributes to the discussion of the organization of early bilinguals' dominant and non-dominant sound systems, and proposes that exemplar theoretic approaches can be extended to include bilingual lexical connections that account for the interactions between the phonetic and lexical levels of early bilingual individuals
Automated identification of borrowings in multilingual wordlists
Although lexical borrowing is an important aspect of language evolution, there have been few attempts to automate the identification of borrowings in lexical datasets. Moreover, none of the solutions which have been proposed so far identify borrowings across multiple languages. This study proposes a new method for the task and tests it on a newly compiled large comparative dataset of 48 South-East Asian languages from Southern China. The method yields very promising results, while it is conceptually straightforward and easy to apply. This makes the approach a perfect candidate for computer-assisted exploratory studies on lexical borrowing in contact areas
- âŠ