Compilation and Exploitation of Parallel Corpora

Tomaž Erjavec

Authors: Tomaž Erjavec
Publication date: 1 January 2003
Publisher: University of Zagreb - University Computing Centre

Abstract

As more and more text becomes available in electronic form, it is becoming relatively easy to obtain digital texts together with their translations. The paper presents the processing steps necessary to compile such texts into parallel corpora, an extremely useful language resource. Parallel corpora can be used as a translation aid for second-language learners, translators and lexicographers, or as a data source for various language technology tools. We present our work in this direction, which is characterised by the use of open standards for text annotation, the use of publicly available third-party tools, and the wide availability of the produced resources. We explain the corpus annotation chain, which involves normalisation, tokenisation, segmentation, alignment, word-class syntactic tagging, and lemmatisation. Two results of exploiting our annotated corpora are also presented, namely a Web concordancer and the extraction of bilingual lexica.
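
To make the annotation chain more concrete, the following is a minimal Python sketch of its first four steps (normalisation, segmentation, tokenisation, alignment). It is an illustration only and not the tools used in the paper: the function names are hypothetical, the sentence pair is invented sample data, and the alignment simply pairs sentences by position, which assumes a strict one-to-one translation. Tagging and lemmatisation are left out because they generally require trained, language-specific models.

```python
import re
import unicodedata


def normalise(text):
    """Apply Unicode NFC normalisation and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", unicodedata.normalize("NFC", text)).strip()


def segment(text):
    """Naively split into sentences after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def tokenise(sentence):
    """Split a sentence into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def align(source_sents, target_sents):
    """Pair sentences by position (assumption: strict 1:1 translation)."""
    if len(source_sents) != len(target_sents):
        raise ValueError("positional alignment needs equal sentence counts")
    return list(zip(source_sents, target_sents))


def build_bitext(source_text, target_text):
    """Run the sketch pipeline and return aligned, tokenised sentence pairs."""
    src = segment(normalise(source_text))
    tgt = segment(normalise(target_text))
    return [(tokenise(s), tokenise(t)) for s, t in align(src, tgt)]


# Invented sample data, not taken from any corpus described in the paper.
english = "The cat   sleeps. The dog barks!"
slovene = "Mačka spi. Pes laja!"
bitext = build_bitext(english, slovene)

for src_tokens, tgt_tokens in bitext:
    print(src_tokens, "<->", tgt_tokens)
```

Real translations rarely match sentence for sentence, so a practical aligner would typically rely on sentence lengths or lexical cues and allow one-to-many pairings rather than raising an error.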

