An Algorithm For Building Language Superfamilies Using Swadesh Lists
The main contributions of this thesis are the following: i. Developing an algorithm that generates language families and superfamilies from one Swadesh list per input language, each represented in International Phonetic Alphabet (IPA) notation. ii. The algorithm is novel in applying the Levenshtein distance metric to the IPA representation and in the way it measures the overall distance between pairs of Swadesh lists. iii. Building a Swadesh list for the author's native Kinyarwanda language, because none could be found even after an extensive search.
Adviser: Peter Reves
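The abstract does not say how the thesis aggregates word-level distances into a list-level distance, so the following is only a minimal sketch of the general idea. It assumes two Swadesh lists aligned by concept, and it uses a length-normalized average as the list distance; the function names `levenshtein` and `swadesh_distance`, the normalization, and the averaging are illustrative assumptions, not the author's method.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance with unit cost for insert, delete, substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def swadesh_distance(list_a: Sequence[str], list_b: Sequence[str]) -> float:
    """Mean length-normalized edit distance over concept-aligned entries.

    Assumption: list_a[k] and list_b[k] are IPA transcriptions of the same
    concept. Strings are compared code point by code point, so IPA symbols
    written with combining diacritics count as more than one unit.
    """
    if len(list_a) != len(list_b) or not list_a:
        raise ValueError("lists must be non-empty and of equal length")
    total = 0.0
    for wa, wb in zip(list_a, list_b):
        longest = max(len(wa), len(wb))
        total += levenshtein(wa, wb) / longest if longest else 0.0
    return total / len(list_a)
```

A pairwise matrix of such distances could then be passed to any standard clustering procedure to group languages; which procedure the thesis uses is not stated in the abstract.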
Phonologically Informed Edit Distance Algorithms for Word Alignment with Low-Resource Languages
We present three methods for weighting edit distance algorithms based on linguistic information. These methods base their penalties on (i) phonological features, (ii) distributional character embeddings, or (iii) differences between cognate words. We also introduce a novel method for evaluating edit distance through the task of low-resource word alignment, using edit-distance neighbors in a high-resource pivot language to inform alignments from the low-resource language. At this task, the cognate-based scheme outperforms our other methods and the Levenshtein edit distance baseline, showing that NLP applications can benefit from information about cross-linguistic phonological patterns.
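To make the first weighting scheme concrete, here is a minimal sketch of an edit distance whose substitution penalty depends on phonological features. The tiny `FEATURES` table, the set-overlap cost in `substitution_cost`, and the fixed `indel` penalty are all hypothetical choices made for illustration; the abstract does not specify the paper's actual feature set or cost function.

```python
# Hypothetical toy feature table; a real system would cover a full inventory.
FEATURES = {
    "p": frozenset({"labial", "stop"}),
    "b": frozenset({"labial", "stop", "voiced"}),
    "t": frozenset({"coronal", "stop"}),
    "d": frozenset({"coronal", "stop", "voiced"}),
    "m": frozenset({"labial", "nasal", "voiced"}),
}


def substitution_cost(x: str, y: str) -> float:
    """Cost in [0, 1]: share of features on which the two segments differ."""
    if x == y:
        return 0.0
    fx, fy = FEATURES.get(x), FEATURES.get(y)
    if fx is None or fy is None:
        return 1.0  # unknown segment: fall back to the plain Levenshtein cost
    return len(fx ^ fy) / len(fx | fy)


def weighted_edit_distance(a: str, b: str, indel: float = 1.0) -> float:
    """Edit distance with feature-weighted substitutions and flat indels."""
    prev = [j * indel for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + indel,
                            curr[j - 1] + indel,
                            prev[j - 1] + substitution_cost(ca, cb)))
        prev = curr
    return prev[-1]
```

Under these assumed weights, a pair differing only in voicing (such as "pa" and "ba") is closer than a pair differing in an unrelated segment, whereas plain Levenshtein distance would score both pairs identically.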
Automatic Identification of False Friends in Parallel Corpora: Statistical and Semantic Approach
False friends are pairs of words in two languages that are perceived as similar but have different meanings. We present an improved algorithm for acquiring false friends from a sentence-level aligned parallel corpus, based on statistical observations of word occurrences and co-occurrences in the parallel sentences. The results are compared with an entirely semantic measure of cross-lingual similarity between words, which uses the Web as a corpus by analyzing the words' local contexts extracted from the text snippets returned by Google searches. The statistical and semantic measures are further combined into an improved algorithm for identifying false friends that achieves results almost twice as good as those of previously known algorithms. The evaluation is performed for identifying cognates between Bulgarian and Russian, but the proposed methods could be adopted for other language pairs for which parallel corpora and bilingual glossaries are available.
Improving treebank-based automatic LFG induction for Spanish
We describe several improvements to the method of treebank-based LFG induction for Spanish from the Cast3LB treebank (O'Donovan et al., 2005). We discuss the different categories of problems encountered and present the solutions adopted. Some of the problems involve a simple adoption of existing linguistic analyses, as in our treatment of clitic doubling and null subjects. In other cases there is no standard LFG account for the phenomenon we wish to model and we adopt a compromise, conservative solution. This is exemplified by our treatment of Spanish periphrastic constructions. In yet another case, the less configurational nature of Spanish means that the LFG annotation algorithm has to rely mostly on Cast3LB function tags, and consequently a reliable method of adding those tags to parse trees had to be developed. This method achieves over 6% improvement over the baseline for the Cast3LB-function-tag assignment task, and over 3% improvement over the baseline for LFG f-structure construction from function-tag-enriched trees.