An Algorithm For Building Language Superfamilies Using Swadesh Lists
The main contributions of this thesis are the following: i. Developing an algorithm to generate language families and superfamilies given for each input language a Swadesh list represented using the international phonetic alphabet (IPA) notation. ii. The algorithm is novel in using the Levenshtein distance metric on the IPA representation and in the way it measures overall distance between pairs of Swadesh lists. iii. Building a Swadesh list for the author's native Kinyarwanda language because a Swadesh list could not be found even after an extensive search for it.
Adviser: Peter Reves
Measures of lexical distance between languages
The idea of measuring distance between languages seems to have its roots in
the work of the French explorer Dumont D'Urville \cite{Urv}. He collected
comparative word lists of various languages during his voyages aboard the
Astrolabe from 1826 to 1829 and, in his work about the geographical division of
the Pacific, he proposed a method to measure the degree of relation among
languages. The method used by modern glottochronology, developed by Morris
Swadesh in the 1950s, measures distances from the percentage of shared
cognates, which are words with a common historical origin. Recently, we
proposed a new automated method which uses normalized Levenshtein distance
among words with the same meaning and averages on the words contained in a
list. More recently, another group of scholars \cite{Bak, Hol} proposed a
refinement of our definition that includes a second normalization. In this paper
we compare the information content of our definition with that of the refined
version in order to decide which of the two can be applied with greater success
to resolve relationships among languages.
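The distance described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes that each word pair is normalized by the length of the longer word, and that the second normalization divides by the average distance between words with different meanings; the abstract states neither detail. The function names and the toy word lists are hypothetical.

```python
from itertools import product


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-symbol insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, symbol_a in enumerate(a, start=1):
        current = [i]
        for j, symbol_b in enumerate(b, start=1):
            cost = 0 if symbol_a == symbol_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer word's length (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def list_distance(list_a: dict, list_b: dict) -> float:
    """Average normalized distance over the meanings present in both lists."""
    shared = sorted(list_a.keys() & list_b.keys())
    if not shared:
        raise ValueError("the two lists share no meanings")
    return sum(normalized_distance(list_a[m], list_b[m]) for m in shared) / len(shared)


def refined_distance(list_a: dict, list_b: dict) -> float:
    """Assumed second normalization: divide by the mean distance between
    words whose meanings differ, as a baseline for chance similarity."""
    shared = sorted(list_a.keys() & list_b.keys())
    unrelated = [normalized_distance(list_a[m], list_b[n])
                 for m, n in product(shared, repeat=2) if m != n]
    if not unrelated or sum(unrelated) == 0:
        raise ValueError("baseline needs at least two distinct, differing meanings")
    return list_distance(list_a, list_b) / (sum(unrelated) / len(unrelated))


# Toy strings for illustration only, not real transcriptions.
toy_a = {"water": "aqua", "one": "unu"}
toy_b = {"water": "agua", "one": "uno"}
print(list_distance(toy_a, toy_b), refined_distance(toy_a, toy_b))
```

Because every normalized distance is at most 1, the baseline in this sketch is at most 1, so the refined value is never smaller than the plain average.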
Lexical evolution rates by automated stability measure
Phylogenetic trees can be reconstructed from the matrix which contains the
distances between all pairs of languages in a family. Recently, we proposed a
new method which uses normalized Levenshtein distances among words with the same
meaning and averages over all the items of a given list. Decisions about the
number of items in the input lists for language comparison have been debated
since the beginning of glottochronology. The point is that words associated with
some of the meanings have a rapid lexical evolution. Therefore, a large
vocabulary comparison is only apparently more accurate than a smaller one since
many of the words do not carry any useful information. In principle, one should
find the optimal length of the input lists studying the stability of the
different items. In this paper we tackle the problem with an automated
methodology based only on our normalized Levenshtein distance. With this
approach, the program of an automated reconstruction of language relationships
is completed.
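To illustrate how a tree can be reconstructed from a matrix of pairwise distances, the sketch below uses UPGMA (average-linkage clustering). The abstract does not say which tree-building method the authors use, so UPGMA is an assumption chosen for its simplicity. The function name, the language labels and the distances are hypothetical.

```python
from itertools import combinations


def upgma(distances: dict):
    """Build a tree (nested tuples) from {(name_a, name_b): distance}.

    Assumes every pair of names appears exactly once in `distances`.
    """
    # Store distances under unordered keys so lookup order does not matter.
    d = {frozenset(pair): value for pair, value in distances.items()}
    # Each cluster is identified by its tree and mapped to its number of leaves.
    sizes = {}
    for pair in distances:
        for name in pair:
            sizes[name] = 1

    while len(sizes) > 1:
        # Merge the two closest clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        for other in sizes:
            if other in (a, b):
                continue
            # Size-weighted mean of the distances from the two merged clusters.
            d[frozenset((merged, other))] = (
                sizes[a] * d[frozenset((a, other))]
                + sizes[b] * d[frozenset((b, other))]
            ) / (sizes[a] + sizes[b])
        sizes[merged] = sizes.pop(a) + sizes.pop(b)

    return next(iter(sizes))


# Made-up distances between three hypothetical languages.
toy = {("A", "B"): 0.2, ("A", "C"): 0.8, ("B", "C"): 0.7}
print(upgma(toy))
```

In this toy input, A and B are merged first because they are closest, and C joins them at the root.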
LEXICOSTATISTICS OF MALAY AND MALAGASY LANGUAGES: COMPARATIVE HISTORICAL LINGUISTIC STUDY
This study examines the kinship of the Malay language and the Malagasy language. These two languages come from the same proto-language, namely Proto-Austronesian (PAN). The researchers start from the assumption that, at both the phoneme and morpheme levels, the two languages are closely related. Even though they are geographically and geopolitically separated, preliminary research on these two languages shows several shared features, one of which is that both are agglutinative languages. This study therefore attempts to find empirical evidence for the separation time between Malay and Malagasy by using language grouping methods and lexicostatistical techniques. First, the researchers collect the 300 basic vocabulary items compiled by Swadesh (1995). The method used in providing the data is the referential method, while the technique used is the note-taking technique. Second, the researchers determine which word pairs in the two languages are cognates. Third, the researchers calculate the age and separation time of the two languages. Fourth, the researchers calculate the error term to determine a more precise separation time. The results indicate that Malay and Malagasy were a single language 4223-3951 years ago and began to separate from their proto-language in 2201-1929 BC.
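The third and fourth steps can be sketched with the standard glottochronological formula t = log C / (2 log r). Here C is the share of cognate pairs and r is the retention rate per millennium. This is an illustration under assumptions, not the study's own calculation. The abstract gives neither its retention rate nor its cognate counts, so the default r = 0.805 (the constant commonly used with the 200-item list) and the counts below are assumptions, as are the function names.

```python
import math


def separation_time(cognate_share: float, retention: float = 0.805) -> float:
    """Time since separation, in millennia: t = log C / (2 log r)."""
    if not 0 < cognate_share <= 1:
        raise ValueError("cognate_share must be in (0, 1]")
    return math.log(cognate_share) / (2 * math.log(retention))


def standard_error(cognate_share: float, n_items: int) -> float:
    """Standard error of the cognate share: sqrt(C (1 - C) / n)."""
    return math.sqrt(cognate_share * (1 - cognate_share) / n_items)


def separation_range(cognates: int, n_items: int, retention: float = 0.805):
    """Lower and upper separation time (millennia) after applying the error term."""
    share = cognates / n_items
    t = separation_time(share, retention)
    # Recompute the time with the share raised by one standard error;
    # the difference is used as a symmetric margin around t.
    adjusted = separation_time(min(share + standard_error(share, n_items), 1.0), retention)
    margin = t - adjusted
    return t - margin, t + margin


# Hypothetical counts: 100 cognate pairs out of 200 compared items.
low, high = separation_range(100, 200)
print(f"{low * 1000:.0f} to {high * 1000:.0f} years")
```

A range of this kind, counted back from the year of the study, would correspond to the pair of intervals the abstract reports (years ago and years BC).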
Bayesian phylolinguistics infers the internal structure and the time-depth of the Turkic language family
Despite more than 200 years of research, the internal structure of the Turkic language family remains subject to debate. Classifications of Turkic so far are based on both classical historical–comparative linguistic and distance-based quantitative approaches. Although these studies yield an internal structure of the Turkic family, they cannot give us an understanding of the statistical robustness of the proposed branches, nor are they capable of reliably inferring absolute divergence dates, without assuming constant rates of change. Here we use computational Bayesian phylogenetic methods to build a phylogeny of the Turkic languages, express the reliability of the proposed branches in terms of probability, and estimate the time-depth of the family within credibility intervals. To this end, we collect a new dataset of 254 basic vocabulary items for thirty-two Turkic language varieties based on the recently introduced Leipzig–Jakarta list. Our application of Bayesian phylogenetic inference on lexical data of the Turkic languages is unprecedented. The resulting phylogenetic tree supports a binary structure for Turkic and replicates most of the conventional sub-branches in the Common Turkic branch. We calculate the robustness of the inferences for subgroups and individual languages whose position in the tree seems to be debatable. We infer the time-depth of the Turkic family at around 2100 years before present, thus providing a reliable quantitative basis for previous estimates based on classical historical linguistics and lexicostatistics.
Internal classification of the Alor-Pantar language family using computational methods applied to the lexicon
The non-Austronesian languages of Alor and Pantar in eastern Indonesia have been shown to be genetically related using the comparative method, but the identified phonological innovations are typologically common and do not delineate neat subgroups. We apply computational methods to recently collected lexical data and are able to identify subgroups based on the lexicon. Crucially, the lexical data are coded for cognacy based on identified phonological innovations. This methodology can succeed even where phonological innovations themselves fail to identify subgroups, showing that computational methods using lexical data can be a powerful tool supplementing the comparative method. Peer reviewed by the journal Language Dynamics and Change.