Termhood-based Comparability Metrics of Comparable Corpus in Special Domain
Cross-Language Information Retrieval (CLIR) and machine translation (MT)
resources, such as dictionaries and parallel corpora, are scarce and hard to
come by for special domains. Moreover, these resources are limited to a few
languages, such as English, French, and Spanish. Obtaining comparable corpora
automatically for such domains could therefore be an effective answer to this
problem. Comparable corpora, whose subcorpora are not translations of each
other, can be easily obtained from the web. Building and using comparable
corpora is therefore often a more feasible option in multilingual information
processing. Comparability metrics are one of the key issues in building and
using comparable corpora. Currently, there is no widely accepted definition or
measurement method for corpus comparability. In fact, different definitions or
metrics of comparability may be given to suit different natural language
processing tasks. A new comparability metric, namely the termhood-based
metric, oriented to the task of bilingual terminology extraction, is proposed
in this paper. In this method, words are ranked by termhood rather than
frequency, and the cosine similarity calculated from the ranked lists of word
termhood is used as the comparability score. Experimental results show that
the termhood-based metric performs better than traditional frequency-based
metrics.
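The termhood-based comparison described in this abstract can be illustrated with a minimal sketch. The abstract does not say which termhood measure the authors use, so the score below (a smoothed ratio of domain to reference relative frequency) is an assumed stand-in, and the function names and the `translate` dictionary are hypothetical. The sketch also assumes the vectors hold raw termhood scores, which may differ from how the paper builds its ranked lists.

```python
import math
from collections import Counter


def termhood_scores(domain_tokens, reference_tokens):
    """Assumed termhood: add-one-smoothed ratio of a word's relative
    frequency in the domain corpus to that in a general reference corpus."""
    dom = Counter(domain_tokens)
    ref = Counter(reference_tokens)
    dom_total = sum(dom.values())
    ref_total = sum(ref.values())
    vocab_size = len(set(dom) | set(ref))
    scores = {}
    for word, count in dom.items():
        p_dom = (count + 1) / (dom_total + vocab_size)
        p_ref = (ref[word] + 1) / (ref_total + vocab_size)
        scores[word] = p_dom / p_ref
    return scores


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    shared = set(u) & set(v)
    dot = sum(u[w] * v[w] for w in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def comparability(scores_a, scores_b, translate):
    """Map corpus B's words into corpus A's vocabulary through a
    (hypothetical) bilingual dictionary, then compare termhood vectors."""
    mapped = {}
    for word, score in scores_b.items():
        target = translate.get(word)
        if target is not None:
            # Keep the highest score when several words share a translation.
            mapped[target] = max(score, mapped.get(target, 0.0))
    return cosine(scores_a, mapped)
```

A frequency-based baseline would pass raw counts to `cosine` instead of termhood scores; the abstract reports that the termhood variant performs better for bilingual terminology extraction.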
A Survey of Paraphrasing and Textual Entailment Methods
Paraphrasing methods recognize, generate, or extract phrases, sentences, or
longer natural language expressions that convey almost the same information.
Textual entailment methods, on the other hand, recognize, generate, or extract
pairs of natural language expressions, such that a human who reads (and trusts)
the first element of a pair would most likely infer that the other element is
also true. Paraphrasing can be seen as bidirectional textual entailment and
methods from the two areas are often similar. Both kinds of methods are useful,
at least in principle, in a wide range of natural language processing
applications, including question answering, summarization, text generation, and
machine translation. We summarize key ideas from the two areas by considering
in turn recognition, generation, and extraction methods, also pointing to
prominent articles and resources.
Comment: Technical Report, Natural Language Processing Group, Department of
Informatics, Athens University of Economics and Business, Greece, 201
Multilingual term extraction from comparable corpora : informativeness of monolingual term extraction features
Most research on bilingual automatic term extraction (ATE) from comparable corpora treats the two components of the task separately, i.e. monolingual automatic term extraction and finding equivalent pairs cross-lingually. The latter usually relies on context vectors and is notoriously inaccurate for infrequent terms. The aim of this pilot study is to investigate whether using information gathered for the former might be beneficial for the cross-lingual linking as well, thereby illustrating the potential of a more holistic approach to ATE from comparable corpora with re-use of information across the components. To test this hypothesis, an existing dataset was expanded, which covers three languages and four domains. A supervised binary classifier is shown to achieve robust performance, with stable results across languages and domains.
On the cross-linguistic equivalence of sentir(e) in Romance languages: a contrastive study in semantics
Recent linguistic studies on perception have focused mainly on verbs referring to the dominant visual and auditory modalities (e.g. English see/look and hear/listen) and have largely ignored the minor verbs. The present paper seeks to fill this gap by comparing the complex semantics of the cognate verbs sentir(e) in three Romance languages, namely Spanish, French and Italian. Because the objective study of semantics is a problematic issue, we pay special attention to methodological problems and opt for a combined corpus approach involving both a translation corpus and comparable data. Evidence from both corpora indicates that, notwithstanding the fact that the rich polysemy of the three verbs partly coincides, each individual verb has undergone semantic specializations differentiating the morphological cognates.
Lexical typology : a programmatic sketch
The present paper is an attempt to lay the foundation for Lexical Typology as a new kind of linguistic typology. The goal of Lexical Typology is to investigate crosslinguistically significant patterns of interaction between lexicon and grammar.
Transferable Positive/Negative Speech Emotion Recognition via Class-wise Adversarial Domain Adaptation
Speech emotion recognition plays an important role in building more
intelligent and human-like agents. Due to the difficulty of collecting speech
emotional data, an increasingly popular solution is leveraging a related and
rich source corpus to help address the target corpus. However, domain shift
between the corpora poses a serious challenge, making domain shift adaptation
difficult to function even on the recognition of positive/negative emotions. In
this work, we propose class-wise adversarial domain adaptation to address this
challenge by reducing the shift for all classes between different corpora.
Experiments on the well-known corpora EMODB and Aibo demonstrate that our
method is effective even when only a very limited number of target labeled
examples are provided.
Comment: 5 pages, 3 figures, accepted to ICASSP 201