11,915 research outputs found
Automatic parallel corpora and bilingual terminology extraction from parallel WebSites
Parallel corpora are now so important that they need no special introduction. Unfortunately, publicly available parallel corpora are somewhat limited in range. There are large corpora on politics, legislation, medicine and other specific areas, but corpora for many other areas are missing. There is currently a large investment in using the Web as a corpus. This article presents GWB, a tool for the automatic construction of parallel corpora from the Web. We argue that it is possible to build high-quality terminological corpora automatically, just by specifying a sensible Internet domain and an appropriate set of seed keywords. GWB is a web spider that works in conjunction with a set of other open-source tools, defining a pipeline that includes document retrieval from the Web, sentence-level alignment and analysis of its quality, extraction of bilingual dictionaries and terminology, and construction of off-line dictionaries
Experiments with Russian to Kazakh sentence alignment
Sentence alignment is the final step in building parallel corpora, which arguably has the greatest impact on the quality of a resulting corpus and the accuracy of machine translation systems that use it for training. However, the quality of sentence alignment itself depends on a number of factors. In this paper we investigate the impact of several data processing techniques on the quality of sentence alignment. We develop and use a number of automatic evaluation metrics, and provide empirical evidence that application of all of the considered data processing techniques yields bitexts with the lowest ratio of noise and the highest ratio of parallel sentences
Parallel sentence retrieval from comparable corpora for biomedical text simplification
Parallel sentences provide semantically similar information which can vary on a given dimension, such as language or register. Parallel sentences with register variation (like expert and non-expert documents) can be exploited for automatic text simplification. The aim of automatic text simplification is to make given information easier to access and understand. In the biomedical field, simplification may permit patients to understand medical and health texts. Yet, no such resources are currently available. We propose to exploit comparable corpora which are distinguished by their registers (specialized and simplified versions) to detect and align parallel sentences. These corpora are in French and are related to the biomedical area. Manually created reference data show 0.76 inter-annotator agreement. Our purpose is to determine whether a given pair of specialized and simplified sentences is parallel and can be aligned or not. We treat this task as binary classification (alignment/non-alignment). We perform experiments with a controlled ratio of imbalance and on the highly unbalanced real data. Our results show that the method we present here can be used to automatically generate a corpus of parallel sentences from our comparable corpus
Learning languages from parallel corpora
This work describes a blueprint for an application that generates language learning exercises from parallel corpora. Word alignment and parallel structures allow for the automatic assessment of sentence pairs in the source and target languages, while users of the application continuously improve the quality of the data with their interactions, thus crowdsourcing parallel language learning material. Through triangulation, their assessment can be transferred to language pairs other than the original ones if multiparallel corpora are used as a source.
Several challenges need to be addressed for such an application to work, and we will discuss three of them here. First, the question of how adequate learning material can be identified in corpora has received some attention in the last decade, and we will detail what the structure of parallel corpora implies for that selection. Secondly, we will consider which type of exercises can be generated automatically from parallel corpora such that they foster learning and keep learners motivated. And thirdly, we will highlight the potential of employing users, that is both teachers and learners, as crowdsourcers to help improve the material
An annotation scheme and gold standard for Dutch-English word alignment
The importance of sentence-aligned parallel corpora has been widely acknowledged. Reference corpora in which sub-sentential translational correspondences are indicated manually are more labour-intensive to create, and hence less wide-spread. Such manually created reference alignments - also called Gold Standards - have been used in research projects to develop or test automatic word alignment systems. In most translations, translational correspondences are rather complex; for example word-by-word correspondences can be found only for a limited number of words. A reference corpus in which those complex translational correspondences are aligned manually is therefore also a useful resource for the development of translation tools and for translation studies. In this paper, we describe how we created a Gold Standard for the Dutch-English language pair. We present the annotation scheme, annotation guidelines, annotation tool and inter-annotator results. To cover a wide range of syntactic and stylistic phenomena that emerge from different writing and translation styles, our Gold Standard data set contains texts from different text types. The Gold Standard will be publicly available as part of the Dutch Parallel Corpus
Augmenting Librispeech with French Translations: A Multimodal Corpus for Direct Speech Translation Evaluation
Recent works in spoken language translation (SLT) have attempted to build
end-to-end speech-to-text translation without using source language
transcription during learning or decoding. However, while large quantities of
parallel texts (such as Europarl, OpenSubtitles) are available for training
machine translation systems, there are no large (100h) and open source parallel
corpora that include speech in a source language aligned to text in a target
language. This paper tries to fill this gap by augmenting an existing
(monolingual) corpus: LibriSpeech. This corpus, used for automatic speech
recognition, is derived from read audiobooks from the LibriVox project, and has
been carefully segmented and aligned. After gathering French e-books
corresponding to the English audio-books from LibriSpeech, we align speech
segments at the sentence level with their respective translations and obtain
236h of usable parallel data. This paper presents the details of the processing
as well as a manual evaluation conducted on a small subset of the corpus. This
evaluation shows that the automatic alignment scores are reasonably correlated
with the human judgments of the bilingual alignment quality. We believe that
this corpus (which is made available online) is useful for replicable
experiments in direct speech translation or more general spoken language
translation experiments. Comment: LREC 2018, Japa
Automatic detection of parallel sentences from comparable biomedical texts
Parallel sentences provide semantically similar information which can vary on a given dimension, such as language or register. Parallel sentences with register variation (like expert and non-expert documents) can be exploited for automatic text simplification. The aim of automatic text simplification is to make given information easier to access and understand. In the biomedical field, simplification may permit patients to understand medical and health texts. Yet, no such resources are currently available. We propose to exploit comparable corpora which are distinguished by their registers (specialized and simplified versions) to detect and align parallel sentences. These corpora are in French and are related to the biomedical area. Our purpose is to determine whether a given pair of specialized and simplified sentences is to be aligned or not. Manually created reference data show 0.76 inter-annotator agreement. We treat this task as binary classification (alignment/non-alignment). We perform experiments on balanced and imbalanced data. The results on balanced data reach up to 0.96 F-Measure. On imbalanced data, the results are lower but remain competitive when using classification models trained on balanced data. Besides, among the three datasets exploited (semantic equivalence and inclusions), the detection of equivalence pairs is more efficient
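As a reading aid for the F-Measure figure quoted in this abstract: the F-Measure of a binary alignment/non-alignment classifier is the harmonic mean of precision and recall on the "alignment" class. The sketch below is only an illustration of that definition. The function name `f_measure` and the toy labels are assumptions made here, not data or code from the paper.

```python
def f_measure(gold, predicted, positive=1):
    """Return (precision, recall, F1) for the positive ("alignment") class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy, made-up labels: 1 = sentence pair should be aligned, 0 = should not.
gold = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, f1 = f_measure(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

On imbalanced data the non-alignment class dominates, so accuracy alone would look high even for a classifier that rarely predicts "alignment"; scoring the positive class with precision and recall avoids that.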
Extended Parallel Corpus for Amharic-English Machine Translation
This paper describes the acquisition, preprocessing, segmentation, and
alignment of an Amharic-English parallel corpus. It will be useful for machine
translation of an under-resourced language, Amharic. The corpus is larger than
previously compiled corpora; it is released for research purposes. We trained
neural machine translation and phrase-based statistical machine translation
models using the corpus. In the automatic evaluation, neural machine
translation models outperform phrase-based statistical machine translation
models. Comment: Accepted to 2nd AfricanNLP workshop at EACL 202
Improving the translation environment for professional translators
When using computer-aided translation systems in a typical, professional translation workflow, there are several stages at which there is room for improvement. The SCATE (Smart Computer-Aided Translation Environment) project investigated several of these aspects, both from a human-computer interaction point of view and from a purely technological one.
This paper describes the SCATE research with respect to improved fuzzy matching, parallel treebanks, the integration of translation memories with machine translation, quality estimation, terminology extraction from comparable texts, the use of speech recognition in the translation process, and human-computer interaction and interface design for the professional translation environment. For each of these topics, we describe the experiments we performed and the conclusions drawn, providing an overview of the highlights of the entire SCATE project
Large-scale Hierarchical Alignment for Data-driven Text Rewriting
We propose a simple unsupervised method for extracting pseudo-parallel
monolingual sentence pairs from comparable corpora representative of two
different text styles, such as news articles and scientific papers. Our
approach does not require a seed parallel corpus, but instead relies solely on
hierarchical search over pre-trained embeddings of documents and sentences. We
demonstrate the effectiveness of our method through automatic and extrinsic
evaluation on text simplification from the normal to the Simple Wikipedia. We
show that pseudo-parallel sentences extracted with our method not only
supplement existing parallel data, but can even lead to competitive performance
on their own. Comment: RANLP 201
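The idea of a hierarchical search over document and sentence embeddings, as described in this abstract, can be sketched roughly as follows: first match each source document to its most similar target document, then look for similar sentences only inside the matched pair. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names, the cosine-similarity scoring, both thresholds and the toy two-dimensional vectors are all hypothetical; a real system would use pre-trained embeddings and may search differently.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(query, candidates):
    """Return (index, similarity) of the candidate most similar to query."""
    scores = [cosine(query, c) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


def hierarchical_pairs(src_docs, tgt_docs, doc_threshold=0.8, sent_threshold=0.9):
    """Each corpus is a list of (doc_vector, [sentence_vectors]).

    Returns ((src_doc, src_sent), (tgt_doc, tgt_sent)) index pairs.
    The threshold values are illustrative assumptions.
    """
    pairs = []
    if not tgt_docs:
        return pairs
    tgt_doc_vecs = [doc_vec for doc_vec, _ in tgt_docs]
    for i, (doc_vec, sents) in enumerate(src_docs):
        # Level 1: find the closest target document.
        j, doc_sim = nearest(doc_vec, tgt_doc_vecs)
        tgt_sents = tgt_docs[j][1]
        if doc_sim < doc_threshold or not tgt_sents:
            continue
        # Level 2: search sentences only within the matched document.
        for s, sent_vec in enumerate(sents):
            t, sent_sim = nearest(sent_vec, tgt_sents)
            if sent_sim >= sent_threshold:
                pairs.append(((i, s), (j, t)))
    return pairs


# Toy 2-dimensional "embeddings", invented for this example.
src_docs = [([1.0, 0.0], [[1.0, 0.1], [0.0, 1.0]]),
            ([0.0, 1.0], [[0.5, 0.5]])]
tgt_docs = [([0.9, 0.1], [[1.0, 0.0], [0.7, 0.7]]),
            ([-1.0, 0.0], [[0.5, 0.5]])]
pairs = hierarchical_pairs(src_docs, tgt_docs)
print(pairs)
```

Restricting the sentence-level search to matched documents keeps the number of comparisons far below an all-pairs search over every sentence in both corpora, which is the practical appeal of a hierarchical design.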