The Scielo database is an important source of scientific information in Latin
America, containing articles from several research domains. A striking
characteristic of Scielo is that many of its full-text contents are presented
in more than one language, thus being a potential source of parallel corpora.
In this article, we present the development of a parallel corpus from Scielo in
three languages: English, Portuguese, and Spanish. Sentences were automatically
aligned using the Hunalign algorithm for all language pairs, as well as for a
subset of trilingual articles. We demonstrate the capabilities of our corpus by
training a Statistical Machine Translation system (Moses) for each language
pair, which outperformed related work on scientific articles. Sentence
alignment was also manually evaluated, with an average of 98.8% of sentences
correctly aligned across all languages. Our parallel corpus is freely available
in the TMX format, with complementary information regarding article metadata.

Comment: Published in Proceedings of the Eleventh International Conference on
Language Resources and Evaluation (LREC 2018)
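
Since the corpus is distributed in the TMX format, a reader may want to extract sentence pairs for one language pair before training a translation system. The following is a minimal sketch of how that could be done with the Python standard library; it is not the tooling used to build the corpus. The function name `read_tmx_pairs` and the language codes `en` and `pt` are assumptions made for illustration, and the codes actually used in the released files should be checked before use.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree expands the predeclared "xml:" prefix to this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(source, src_lang, tgt_lang):
    """Yield (source, target) sentence pairs from a TMX file or file object.

    Hypothetical helper: language codes are compared in lowercase, and
    translation units lacking either language are skipped.
    """
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segments = {}
        for tuv in elem.findall("tuv"):
            # TMX 1.4 uses xml:lang; older files may use a plain lang attribute.
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                segments[lang.lower()] = "".join(seg.itertext()).strip()
        if src_lang in segments and tgt_lang in segments:
            yield segments[src_lang], segments[tgt_lang]
        elem.clear()  # release the finished translation unit to limit memory use


if __name__ == "__main__":
    # Tiny in-memory TMX document standing in for a real corpus file.
    sample = (
        b'<?xml version="1.0" encoding="UTF-8"?>'
        b'<tmx version="1.4"><header/><body>'
        b'<tu><tuv xml:lang="en"><seg>The study was approved.</seg></tuv>'
        b'<tuv xml:lang="pt"><seg>O estudo foi aprovado.</seg></tuv></tu>'
        b'</body></tmx>'
    )
    for pair in read_tmx_pairs(io.BytesIO(sample), "en", "pt"):
        print(pair)
```

Streaming with `iterparse` and clearing each translation unit after use keeps memory roughly constant, which matters for a corpus of full-text articles; passing a file path instead of the in-memory sample works the same way.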