8 research outputs found

    A Crosslingual Investigation of Conceptualization in 1335 Languages

    Languages differ in how they divide up the world into concepts and words; e.g., in contrast to English, Swahili has a single concept for ‘belly’ and ‘womb’. We investigate these differences in conceptualization across 1,335 languages by aligning concepts in a parallel corpus. To this end, we propose Conceptualizer, a method that creates a bipartite directed alignment graph between source language concepts and sets of target language strings. In a detailed linguistic analysis across all languages for one concept (‘bird’) and an evaluation on gold standard data for 32 Swadesh concepts, we show that Conceptualizer has good alignment accuracy. We demonstrate the potential of research on conceptualization in NLP with two experiments. (1) We define crosslingual stability of a concept as the degree to which it has 1-1 correspondences across languages, and show that concreteness predicts stability. (2) We represent each language by its conceptualization pattern for 83 concepts, and define a similarity measure on these representations. The resulting measure for the conceptual similarity of two languages is complementary to standard genealogical, typological, and surface similarity measures. For four out of six language families, we can assign languages to their correct family based on conceptual similarity with accuracy between 54% and 87%. Comment: ACL 202

    Not by chance. Russian aspect in rule-based machine translation

    The aim of this paper is twofold: it illustrates the benefits of rule-based over statistical machine translation, and it provides a starting point for the machine translation of Russian aspect into English. Rule-based machine translation is still promising from both a computational and a theoretical point of view, because implementing rules on a computer allows theoretical assumptions about linguistic structures to be verified and improved. This is shown using the category of aspect, which is one of the main challenges for machine translation from Russian to English. A small corpus study of how human translators render Russian sentences with past-tense verbs (perfective and imperfective) shows that three-quarters of these verbs, both imperfective and perfective, are translated by English simple past forms. While this results from language-internal markedness relations, the translation of the remaining 25% requires an in-depth analysis of the various interpretations possible for Russian aspect. We propose a semantic analysis from which rules for the interpretation and translation of Russian aspect in a machine translation system can be derived. Their implementation in the machine translation system ETAP is shown using two test cases as examples.

    Sentence-alignment and application of Russian-German multi-target parallel corpora for linguistic analysis and literary studies

    This paper presents the application of multi-target parallel corpora, each consisting of a single source text and multiple target translations of it, to linguistic analysis. We discuss the alignment, interactive search, and visualization of this type of data within a dedicated tool called ALuDo (Alignment with Lucene for Dostoyevsky). This is a Java implementation that uses local grammars, ontological information, bilingual dictionaries, and statistical approaches for alignment and search. The data set in use is the Russian novel Crime and Punishment by Fyodor Dostoyevsky and three German translations of it. This bilingual corpus makes possible a range of investigations in linguistics and literary studies. Additionally, we release part of the resulting parallel corpus.