3,794 research outputs found

An Algorithm For Building Language Superfamilies Using Swadesh Lists

Author: Mutabazi Bill
Publication venue: DigitalCommons@University of Nebraska - Lincoln
Publication date: 23/04/2020

The main contributions of this thesis are the following: (i) an algorithm that generates language families and superfamilies, given for each input language a Swadesh list written in International Phonetic Alphabet (IPA) notation; (ii) a novel use of the Levenshtein distance metric on the IPA representation, together with a novel way of measuring the overall distance between pairs of Swadesh lists; (iii) a Swadesh list for the author's native Kinyarwanda language, built because none could be found even after an extensive search. Adviser: Peter Revesz
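As a rough illustration of the idea in this abstract, the sketch below computes the standard unit-cost Levenshtein distance between two strings and then combines per-word distances into one score for a pair of word lists. The abstract does not say how the thesis measures the overall distance between Swadesh lists, so the length-normalized mean used in `list_distance` is purely an assumption, and the function names and sample strings are hypothetical placeholders rather than real Swadesh entries.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance computed row by row with dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def list_distance(list_a, list_b):
    """Mean length-normalized distance over aligned word pairs.

    ASSUMPTION: this aggregation is illustrative only; it is not taken
    from the thesis.
    """
    if len(list_a) != len(list_b) or not list_a:
        raise ValueError("lists must be non-empty and aligned entry by entry")
    total = 0.0
    for word_a, word_b in zip(list_a, list_b):
        longest = max(len(word_a), len(word_b))
        total += levenshtein(word_a, word_b) / longest if longest else 0.0
    return total / len(list_a)


# Hypothetical placeholder strings, not real Swadesh list entries.
print(list_distance(["aba", "kata"], ["apa", "kita"]))
```

A pairwise matrix of such scores could then be fed to a clustering step to group languages, although the abstract does not state which grouping procedure the thesis uses.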

DigitalCommons@University of Nebraska

Classification-based scientific term detection in patient information

Authors: Delaere Isabelle; Hoste Veronique; Lefever Els; Vanopstal Klaar
Publication venue: John Benjamins Publishing Company
Publication date: 01/01/2010

Ghent University Academic Bibliography

Automatic Tripartite Classification of Intransitive Verbs

Authors: Soma Paul; Surtani Nitesh
Publication venue: Faculty of Computer Science, Universitas Indonesia
Publication date: 01/01/2012

Waseda University Repository

Phonologically Informed Edit Distance Algorithms for Word Alignment with Low-Resource Languages

Authors: Frank Robert; McCoy Richard T
Publication venue: ScholarWorks@UMass Amherst
Publication date: 01/01/2018

We present three methods for weighting edit distance algorithms based on linguistic information. These methods base their penalties on (i) phonological features, (ii) distributional character embeddings, or (iii) differences between cognate words. We also introduce a novel method for evaluating edit distance through the task of low-resource word alignment, in which edit-distance neighbors in a high-resource pivot language inform alignments from the low-resource language. At this task, the cognate-based scheme outperforms our other methods and the Levenshtein edit distance baseline, showing that NLP applications can benefit from information about cross-linguistic phonological patterns.
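To make the notion of a weighted edit distance concrete, here is a minimal sketch in which the substitution penalty comes from a pluggable cost function instead of a fixed cost of 1. The toy cost below, which charges less for swapping a vowel for a vowel or a consonant for a consonant, is a hypothetical stand-in for a phonological-feature penalty; it is not one of the three weighting schemes from the paper, and all names here are assumptions made for illustration.

```python
VOWELS = set("aeiou")  # crude, hypothetical vowel class for the toy cost


def toy_sub_cost(x: str, y: str) -> float:
    """Hypothetical penalty: cheaper when both symbols share a coarse class."""
    if x == y:
        return 0.0
    same_class = (x in VOWELS) == (y in VOWELS)
    return 0.5 if same_class else 1.0


def weighted_edit_distance(a, b, sub_cost=toy_sub_cost, indel_cost=1.0):
    """Edit distance with a caller-supplied substitution cost function."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * indel_cost  # delete every symbol of a[:i]
    for j in range(1, cols):
        d[0][j] = j * indel_cost  # insert every symbol of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + indel_cost,                        # deletion
                d[i][j - 1] + indel_cost,                        # insertion
                d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]),  # substitution
            )
    return d[-1][-1]


print(weighted_edit_distance("pat", "bat"))  # consonant-for-consonant swap
print(weighted_edit_distance("pat", "aat"))  # consonant-for-vowel swap
```

Passing a cost function that returns 0 for equal symbols and 1 otherwise recovers the plain Levenshtein baseline mentioned in the abstract.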

ScholarWorks@UMass Amherst

Automatic Identification of False Friends in Parallel Corpora: Statistical and Semantic Approach

Author: Nakov Svetlin
Publication venue: Institute of Mathematics and Informatics Bulgarian Academy of Sciences
Publication date: 01/01/2009

False friends are pairs of words in two languages that are perceived as similar but have different meanings. We present an improved algorithm for acquiring false friends from a sentence-level aligned parallel corpus, based on statistical observations of word occurrences and co-occurrences in the parallel sentences. The results are compared with an entirely semantic measure of cross-lingual similarity between words, which uses the Web as a corpus by analyzing the words' local contexts extracted from the text snippets returned by Google searches. The statistical and semantic measures are further combined into an improved algorithm for identifying false friends that achieves results almost twice as good as those of previously known algorithms. The evaluation is performed on identifying cognates between Bulgarian and Russian, but the proposed methods could be adopted for other language pairs for which parallel corpora and bilingual glossaries are available.

Bulgarian Digital Mathematics Library at IMI-BAS

Improving treebank-based automatic LFG induction for Spanish

Authors: Chrupała Grzegorz; van Genabith Josef
Publication venue: CSLI Publications
Publication date: 01/01/2006

We describe several improvements to the method of treebank-based LFG induction for Spanish from the Cast3LB treebank (O’Donovan et al., 2005). We discuss the different categories of problems encountered and present the solutions adopted. Some of the problems involve a simple adoption of existing linguistic analyses, as in our treatment of clitic doubling and null subjects. In other cases there is no standard LFG account for the phenomenon we wish to model, and we adopt a conservative compromise solution. This is exemplified by our treatment of Spanish periphrastic constructions. In yet another case, the less configurational nature of Spanish means that the LFG annotation algorithm has to rely mostly on Cast3LB function tags, and consequently a reliable method of adding those tags to parse trees had to be developed. This method achieves an improvement of over 6% over the baseline for the Cast3LB function-tag assignment task, and of over 3% over the baseline for LFG f-structure construction from function-tag-enriched trees.

Irish Universities

DCU Online Research Access Service