6,062 research outputs found
In no uncertain terms: a dataset for monolingual and multilingual automatic term extraction from comparable corpora
Automatic term extraction is a productive field of research within natural language processing, but it still faces significant obstacles regarding datasets and evaluation, which require manual term annotation. This is an arduous task, made even more difficult by the lack of a clear distinction between terms and general language, which results in low inter-annotator agreement. There is a great need for well-documented, manually validated datasets, especially in the rising field of multilingual term extraction from comparable corpora, which presents a unique new set of challenges. In this paper, a new approach is presented for both monolingual and multilingual term annotation in comparable corpora. The detailed guidelines with different term labels, the domain- and language-independent methodology and the large volumes annotated in three different languages and four different domains make this a rich resource. The resulting datasets are not just suited for evaluation purposes but can also serve as a general source of information about terms and even as training data for supervised methods. Moreover, the gold standard for multilingual term extraction from comparable corpora contains information about term variants and translation equivalents, which allows an in-depth, nuanced evaluation.
Noisy-parallel and comparable corpora filtering methodology for the extraction of bi-lingual equivalent data at sentence level
Text alignment and text quality are critical to the accuracy of Machine
Translation (MT) systems, some NLP tools, and any other text processing tasks
requiring bilingual data. This research proposes a language-independent
bi-sentence filtering approach based on Polish-to-English experiments (Polish
is not a position-sensitive language). This cleaning approach was developed on the
TED Talks corpus and also initially tested on the Wikipedia comparable corpus,
but it can be used for any text domain or language pair. The proposed approach
implements various heuristics for sentence comparison. Some of them leverage
synonyms and semantic and structural analysis of text as additional
information. Minimization of data loss was ensured. An improvement in MT system
score with text processed using the tool is discussed.
Comment: arXiv admin note: text overlap with arXiv:1509.09093, arXiv:1509.0888
Domain adaptation strategies in statistical machine translation: a brief overview
© Cambridge University Press, 2015. Statistical machine translation (SMT) is gaining interest given that it can easily be adapted to any pair of languages. One of the main challenges in SMT is domain adaptation, because translation performance drops when testing conditions deviate from training conditions. A growing body of research addresses this challenge, with a focus on exploiting all kinds of material, if available. This paper provides an overview of research that addresses the domain adaptation challenge in SMT.
Peer Reviewed. Postprint (author's final draft)
Embedding Web-based Statistical Translation Models in Cross-Language Information Retrieval
Although more and more language pairs are covered by machine translation
services, there are still many pairs that lack translation resources.
Cross-language information retrieval (CLIR) is an application which needs
translation functionality of a relatively low level of sophistication since
current models for information retrieval (IR) are still based on a
bag-of-words. The Web provides a vast resource for the automatic construction
of parallel corpora which can be used to train statistical translation models
automatically. The resulting translation models can be embedded in several ways
in a retrieval model. In this paper, we will investigate the problem of
automatically mining parallel texts from the Web and different ways of
integrating the translation models within the retrieval process. Our
experiments on standard test collections for CLIR show that the Web-based
translation models can surpass commercial MT systems in CLIR tasks. These
results open the perspective of constructing a fully automatic query
translation device for CLIR at a very low cost.
Comment: 37 pages