32,161 research outputs found

    Termhood-based Comparability Metrics of Comparable Corpus in Special Domain

    Resources for Cross-Language Information Retrieval (CLIR) and machine translation (MT), such as dictionaries and parallel corpora, are scarce and hard to come by for special domains. Moreover, these resources are limited to a few languages, such as English, French, and Spanish. Automatically obtaining comparable corpora for such domains could therefore be an effective answer to this problem. Comparable corpora, whose subcorpora are not translations of each other, can be easily obtained from the web. Therefore, building and using comparable corpora is often a more feasible option in multilingual information processing. Comparability metrics are one of the key issues in building and using comparable corpora. Currently, there is no widely accepted definition or measurement method for corpus comparability. In fact, different definitions or metrics of comparability might be given to suit various natural language processing tasks. A new comparability metric, namely a termhood-based metric, oriented to the task of bilingual terminology extraction, is proposed in this paper. In this method, words are ranked by termhood rather than frequency, and the cosine similarity, calculated from the ranked lists of word termhood, is used as the comparability score. Experimental results show that the termhood-based metric performs better than traditional frequency-based metrics.
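
The ranking-then-cosine idea in this abstract can be illustrated with a minimal sketch. The abstract does not say how termhood is computed or how the ranked lists are turned into vectors, so everything below is an assumption made for illustration: `termhood_scores` uses a smoothed domain-to-reference frequency ratio as a stand-in termhood measure, `seed_dictionary` is a hypothetical bilingual word map, and `top_k` is an arbitrary cutoff. It is not the authors' method.

```python
import math
from collections import Counter


def termhood_scores(domain_tokens, reference_tokens):
    """Stand-in termhood (assumption): ratio of a word's relative frequency
    in the domain corpus to its smoothed relative frequency in a general
    reference corpus."""
    dom = Counter(domain_tokens)
    ref = Counter(reference_tokens)
    dom_total = sum(dom.values())
    ref_total = sum(ref.values())
    vocab_size = len(set(dom) | set(ref))
    scores = {}
    for word, count in dom.items():
        p_dom = count / dom_total
        # Add-one smoothing so words unseen in the reference corpus do not divide by zero.
        p_ref = (ref[word] + 1) / (ref_total + vocab_size)
        scores[word] = p_dom / p_ref
    return scores


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    shared = set(u) & set(v)
    dot = sum(u[w] * v[w] for w in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def comparability(src_scores, tgt_scores, seed_dictionary, top_k=100):
    """Rank words by termhood on each side, map the top source words into
    the target vocabulary through a (hypothetical) seed dictionary, and
    compare the two termhood vectors with cosine similarity."""
    top_src = sorted(src_scores, key=src_scores.get, reverse=True)[:top_k]
    top_tgt = sorted(tgt_scores, key=tgt_scores.get, reverse=True)[:top_k]
    projected = {}
    for word in top_src:
        translation = seed_dictionary.get(word)
        if translation is not None:
            # If several source words share a translation, keep the highest score.
            projected[translation] = max(projected.get(translation, 0.0), src_scores[word])
    target = {word: tgt_scores[word] for word in top_tgt}
    return cosine(projected, target)
```

A frequency-based baseline would use raw counts in place of `termhood_scores`; the point of the sketch is only that the vectors being compared hold termhood weights for top-ranked words rather than frequencies.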

    A Survey of Paraphrasing and Textual Entailment Methods

    Paraphrasing methods recognize, generate, or extract phrases, sentences, or longer natural language expressions that convey almost the same information. Textual entailment methods, on the other hand, recognize, generate, or extract pairs of natural language expressions, such that a human who reads (and trusts) the first element of a pair would most likely infer that the other element is also true. Paraphrasing can be seen as bidirectional textual entailment and methods from the two areas are often similar. Both kinds of methods are useful, at least in principle, in a wide range of natural language processing applications, including question answering, summarization, text generation, and machine translation. We summarize key ideas from the two areas by considering in turn recognition, generation, and extraction methods, also pointing to prominent articles and resources. Comment: Technical Report, Natural Language Processing Group, Department of Informatics, Athens University of Economics and Business, Greece, 201

    Multilingual term extraction from comparable corpora : informativeness of monolingual term extraction features

    Most research on bilingual automatic term extraction (ATE) from comparable corpora treats the two components of the task separately, i.e. monolingual automatic term extraction and finding equivalent pairs cross-lingually. The latter usually relies on context vectors and is notoriously inaccurate for infrequent terms. The aim of this pilot study is to investigate whether information gathered for the former might benefit the cross-lingual linking as well, thereby illustrating the potential of a more holistic approach to ATE from comparable corpora that re-uses information across the components. To test this hypothesis, an existing dataset covering three languages and four domains was expanded. A supervised binary classifier is shown to achieve robust performance, with stable results across languages and domains.

    On the cross-linguistic equivalence of sentir(e) in Romance languages: a contrastive study in semantics

    Recent linguistic studies on perception have focused mainly on verbs referring to the dominant visual and auditory modalities (e.g. English see/look and hear/listen) and have largely ignored the minor verbs. The present paper seeks to fill this gap by comparing the complex semantics of the cognate verbs sentir(e) in three Romance languages, namely Spanish, French and Italian. Because the objective study of semantics is a problematic issue, we pay special attention to methodological problems and opt for a combined corpus approach involving both a translation corpus and comparable data. Evidence from both corpora indicates that, although the rich polysemy of the three verbs partly coincides, each individual verb has undergone semantic specializations that differentiate the morphological cognates.

    Lexical typology : a programmatic sketch

    The present paper is an attempt to lay the foundation for Lexical Typology as a new kind of linguistic typology. The goal of Lexical Typology is to investigate crosslinguistically significant patterns of interaction between lexicon and grammar.

    Transferable Positive/Negative Speech Emotion Recognition via Class-wise Adversarial Domain Adaptation

    Speech emotion recognition plays an important role in building more intelligent and human-like agents. Because speech emotion data are difficult to collect, an increasingly popular solution is to leverage a related, richer source corpus to help with the target corpus. However, domain shift between the corpora poses a serious challenge, making domain adaptation difficult even for the recognition of positive/negative emotions. In this work, we propose class-wise adversarial domain adaptation to address this challenge by reducing the shift for all classes between different corpora. Experiments on the well-known corpora EMODB and Aibo demonstrate that our method is effective even when only a very limited number of labeled target examples are provided. Comment: 5 pages, 3 figures, accepted to ICASSP 201