
    Bilingual sentence alignment of pre-Qin history literature for digital humanities study

    Sentence-aligned bilingual texts of historical literature provide digital resources for related digital humanities studies, but little work has been done on sentence alignment of ancient Chinese and English. In this study, we made a preliminary attempt to align sentences of ancient Chinese and English. We used the bilingual texts of the Analects of Confucius and Zuo's Commentaries of the Spring and Autumn Annals, extracted features, and adopted a classification method to separate the bilingual candidate sentence pairs based on probability scores. The SVM-based bilingual sentence alignment model performed best on a larger amount of data when using three features, and the results confirmed the impact of the candidate dataset.

    Overview of the SOURCe project as an open educational resource

    In this paper, we present an overview of the functionalities of the SOURCe Project as it has evolved over the eight years since its creation in 2012. The SOURCe Project currently includes the search engine for the Searchable Online French-Greek Parallel Corpus for the University of Cyprus (SOURCe), the Pencil (an alignment tool), the Exercises, the Synonyms and the Library. The collection of corpora and tools is designed to be freely available as a language processing resource for language and translation teachers and learners, and also for translators. Furthermore, we discuss how its applications can form an important part of effective learning resources. This paper focuses on the development and construction of the SOURCe Project, which is mainly based on a set of parallel corpora. The design, content, and availability of these corpora aim to serve the needs of teachers and students of French as a foreign language (other languages may also be added in the future) and to facilitate future linguistic research. Our approach is a corpus linguistics one, undertaken from the perspective of language acquisition and translation studies.

    Compiling and using a parallel corpus for research in translation

    There are so many variables underlying translation that examining anything longer than a few paragraphs of translated text at a time can become a daunting task. The advent of corpus linguistics, however, has made it possible to analyse enormous quantities of translated text in unprecedented ways. In line with these advances, parallel corpora can provide access to many aspects of translation that could not previously be studied in a systematic way. The first part of this paper discusses the different types of decisions that have to be made when building a parallel corpus, with particular emphasis on compilation questions that are unique to parallel corpora as opposed to corpora in general. This is followed by an account of the choices made when creating COMPARA, a post-edited, bi-directional parallel corpus of English and Portuguese literary texts with 3 million words, freely available for research and education at http://www.linguateca.pt/COMPARA/. Finally, examples of how this parallel corpus can be (and has been) used in translation research are presented.

    Building a Parallel Corpus on the World's Oldest Banking Magazine

    We report on our processing steps to build a diachronic parallel corpus based on the world's oldest banking magazine. The magazine has been published since 1895 in German, with translations in French and partly in English and Italian. Our data sources are printed issues (until 1997), PDF issues (since 1998) and HTML files (since 2001). The corpus building poses special challenges in article boundary recognition and cross-language article and sentence alignment. Our corpus fills a gap in parallel corpora with respect to genre (magazine articles), domain (banking and economy articles), and its time span (120 years)

    Corpus-Based Machine Translation: A Study Case for the e-Government of Costa Rica

    This research paper examines the state of the art in machine translation technologies. Following an overview of the architecture and mechanisms underpinning phrase-based statistical (PB-SMT) and neural (NMT) systems, we focus on a specific use case that tests the translator's ability to make the most of the potential of these technologies, particularly that of PB-SMT. The use case urges the translator to draw on all of his/her professional knowledge, skills and best practices to improve the translation output by means of data preparation, training, assessment and refinement of the engines.

    Automatic Extraction of Linguistic Data from Digitized Documents

    BLS 39: General Session and Special Session on Space and Directionalit