38 research outputs found

    Tree Alignment through Semantic Role Annotation Projection

    Proceedings of the Workshop on Annotation and Exploitation of Parallel Corpora AEPC 2010. Editors: Lars Ahrenberg, Jörg Tiedemann and Martin Volk. NEALT Proceedings Series, Vol. 10 (2010), 73-82. © 2010 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/15893

    Improving the translation environment for professional translators

    When using computer-aided translation systems in a typical, professional translation workflow, there are several stages at which there is room for improvement. The SCATE (Smart Computer-Aided Translation Environment) project investigated several of these aspects, both from a human-computer interaction point of view and from a purely technological side. This paper describes the SCATE research with respect to improved fuzzy matching, parallel treebanks, the integration of translation memories with machine translation, quality estimation, terminology extraction from comparable texts, the use of speech recognition in the translation process, and human-computer interaction and interface design for the professional translation environment. For each of these topics, we describe the experiments we performed and the conclusions drawn, providing an overview of the highlights of the entire SCATE project.

    Parser-independent Semantic Tree Alignment

    We describe an approach for training a semantic role labeler through cross-lingual projection between different types of parse trees, with the purpose of enhancing tree alignment on the level of syntactic translation divergences. After applying an existing semantic role labeler to parse trees in a resource-rich language (English), we partially project the semantic information to the parse trees of the corresponding target sentences (specifically in Dutch), based on word alignment. After this precision-oriented projection, we apply a method for training a semantic role labeler which consists in determining a large set of features describing target predicates and predicate-role connections, independently from the type of tree annotation (phrase structure or dependencies). These features describe tree paths starting at or connecting nodes. The semantic role labeling method does not require any knowledge of the parser nor manual intervention. We evaluated the performance of the cross-lingual projection and semantic role labeling using an English parser assigning PropBank labels and Dutch manually annotated parses, and are currently studying ways to use the predicted semantic information for enhancing tree alignment. http://www.lrec-conf.org/proceedings/lrec2012/workshops/12.LREC%202012%20Advanced%20Treebanking%20Proceedings.pdf
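To illustrate what a parser-independent "tree path" feature could look like, here is a minimal sketch. It assumes a tree is stored as a child-to-parent map plus a node-label map, a representation that fits both phrase-structure and dependency trees. The function names, the `^` (up) and `!` (down) separators, and the toy tree are assumptions made for this example, not the feature set used in the paper.

```python
def path_to_root(node, parent):
    """Return the list of nodes from `node` up to the root (inclusive)."""
    path = [node]
    while parent[node] is not None:
        node = parent[node]
        path.append(node)
    return path


def tree_path(start, end, parent, label):
    """Describe the path from `start` to `end` as a string of node labels.

    '^' marks a step up towards the root, '!' marks a step down.
    """
    up = path_to_root(start, parent)
    down = path_to_root(end, parent)
    ancestors = set(up)
    # The lowest common ancestor is the first ancestor of `end`
    # that is also an ancestor of `start`.
    lca = next(n for n in down if n in ancestors)
    up_part = up[: up.index(lca) + 1]
    down_part = down[: down.index(lca)][::-1]
    return "^".join(label[n] for n in up_part) + "".join(
        "!" + label[n] for n in down_part
    )


# Toy tree (hypothetical): S -> NP -> N and S -> VP -> V
parent = {0: None, 1: 0, 2: 1, 3: 0, 4: 3}
label = {0: "S", 1: "NP", 2: "N", 3: "VP", 4: "V"}
print(tree_path(4, 2, parent, label))  # path from the predicate V to the noun N
```

Because the sketch only relies on parent links and labels, the same function can be applied to trees produced by different parsers without knowing their annotation scheme.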

    Improving fuzzy matching through syntactic knowledge

    Fuzzy matching in translation memories (TM) is mostly string-based in current CAT tools. These tools look for TM sentences highly similar to an input sentence, using edit distance to detect the differences between sentences. Current CAT tools use limited or no linguistic knowledge in this procedure. In the recently started SCATE project, which aims at improving translators’ efficiency, we apply syntactic fuzzy matching in order to detect abstract similarities and to increase the number of fuzzy matches. We parse TM sentences in order to create hierarchical structures identifying constituents and/or dependencies. We calculate TER (Translation Error Rate) between an existing human translation of an input sentence and the translation of its fuzzy match in TM. This allows us to assess the usefulness of syntactic matching with respect to string-based matching. First results hint at the potential of syntactic matching to lower TER for sentences with a low match score in a string-based setting.
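The string-based baseline described above can be sketched as follows. This is a minimal illustration assuming word-level Levenshtein distance normalised by the length of the longer sentence; actual CAT tools use their own, often proprietary, scoring, and the function names here are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def fuzzy_match_score(source, candidate):
    """Similarity in [0, 1]; 1.0 means the token sequences are identical."""
    a, b = source.split(), candidate.split()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def best_match(source, memory):
    """Return the TM sentence with the highest fuzzy match score."""
    return max(memory, key=lambda tm: fuzzy_match_score(source, tm))


memory = ["the dog sat on the mat", "a report was published yesterday"]
print(best_match("the cat sat on the mat", memory))
```

A syntactic matcher would replace the flat token sequences with parse-derived structures, but the retrieval loop would stay the same.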

    Belgisch Staatsblad Corpus: Retrieving French-Dutch Sentences from Official Documents

    We describe the compilation of a large corpus of French-Dutch sentence pairs from official Belgian documents which are available in the online version of the publication Belgisch Staatsblad/Moniteur belge, and which have been published between 1997 and 2006. After downloading files in batch, we filtered out documents which have no translation in the other language, documents which contain several languages (by checking on discriminating words), and pairs of documents with a substantial difference in length. We segmented the documents into sentences and aligned the latter, which resulted in 5 million sentence pairs (only one-to-one links were included in the parallel corpus); there are 2.4 million unique pairs. Sample-based evaluation of the sentence alignment results indicates a near 100% accuracy, which can be explained by the text genre, the procedure filtering out weakly parallel articles and the restriction to one-to-one links. The corpus is larger than a number of well-known French-Dutch resources. It is made available to the community. Further investigation is needed in order to determine the original language in which documents were written.
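The filtering steps above can be sketched roughly as follows. The word list, the hit threshold, and the maximum length ratio are illustrative assumptions; the abstract does not state the actual values or word lists used.

```python
# Hypothetical list of words that suggest French text in a Dutch document.
FRENCH_MARKERS = {"les", "des", "est", "dans", "pour"}


def contains_other_language(text, other_markers, min_hits=3):
    """Flag a document containing several words typical of the other language."""
    hits = sum(tok in other_markers for tok in text.lower().split())
    return hits >= min_hits


def length_ratio_ok(doc_a, doc_b, max_ratio=1.5):
    """Reject document pairs whose character lengths differ substantially."""
    len_a, len_b = len(doc_a), len(doc_b)
    if len_a == 0 or len_b == 0:
        return False
    return max(len_a, len_b) / min(len_a, len_b) <= max_ratio


def unique_pairs(pairs):
    """Drop repeated sentence pairs, keeping the first occurrence of each."""
    seen = set()
    result = []
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            result.append(pair)
    return result


pairs = [("Bonjour", "Goedendag"), ("Merci", "Dank u"), ("Bonjour", "Goedendag")]
print(len(unique_pairs(pairs)))
```

Legal texts repeat many formulaic sentences, which is consistent with the gap between 5 million aligned pairs and 2.4 million unique pairs reported in the abstract.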

    Language-driven bilingual term extraction for medical texts

    Domain-specific texts aligned with their translation make up a very useful resource for translators. They allow them to look up terms and their translation in context and extract terms to be fed into glossaries. Most systems performing word alignment and term extraction on bilingual texts focus on the use of statistics and minimize the introduction of linguistic knowledge. An important drawback of statistically based methods is their dependency on the length of the bilingual text. Correctly aligning low-frequency words proves difficult with these methods. In order to overcome this problem, we propose a language-driven procedure for word alignment and term extraction, applied to the field of medicine. The procedure is based on low-structured lexical and linguistic resources such as domain-specific bilingual glossaries and lemmatizers, on the presence of cognates, i.e. similar words in source and target language, and on subword alignment. The result is a word alignment in context, rather than a calculation of translation probabilities. As the word alignment only covers cognates and terms with their translations from the glossary, we statistically process the non-aligned parts, constrained in size by the aligned parts (anchors), in order to align and extract new bilingual term candidates for the glossary.
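As an illustration of how cognates might serve as alignment anchors, the sketch below compares accent-stripped word forms with the standard library's `difflib.SequenceMatcher`. The similarity measure, the 0.8 threshold, and the function names are assumptions for this example; the abstract does not specify how cognates are detected.

```python
import difflib
import unicodedata


def strip_accents(word):
    """Lowercase a word and remove combining diacritics (é -> e)."""
    decomposed = unicodedata.normalize("NFKD", word.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def similarity(a, b):
    """Character-level similarity ratio in [0, 1]."""
    return difflib.SequenceMatcher(None, strip_accents(a), strip_accents(b)).ratio()


def is_cognate(a, b, threshold=0.8):
    return similarity(a, b) >= threshold


def cognate_anchors(src_tokens, tgt_tokens, threshold=0.8):
    """Link each source token to its most similar target token, if similar enough."""
    anchors = []
    for i, src in enumerate(src_tokens):
        j = max(range(len(tgt_tokens)), key=lambda k: similarity(src, tgt_tokens[k]))
        if similarity(src, tgt_tokens[j]) >= threshold:
            anchors.append((i, j))
    return anchors


print(cognate_anchors(["la", "cardiologie", "moderne"], ["modern", "cardiology"]))
```

The resulting index pairs play the role of anchors: the unaligned stretches between them are the smaller regions that would then be processed statistically.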

    Assessing linguistically aware fuzzy matching in translation memories

    Fuzzy matching in translation memories can be performed using linguistically aware or unaware methods, or a combination of both. We designed a flexible and time-efficient framework which applies and combines linguistically unaware or aware metrics in the source and target language. We measure the correlation of fuzzy matching metric scores with the evaluation score of the suggested translation to find out how well the usefulness of a suggestion can be predicted, and we measure the difference in recall between fuzzy matching metrics by looking at the improvements in mean TER as the match score decreases. We found that combinations of fuzzy matching metrics outperform single metrics and that the best-scoring combination is a non-linear combination of the different metrics we have tested.
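The correlation measurement mentioned above can be sketched with a plain Pearson coefficient. The choice of Pearson correlation and the toy numbers below are assumptions for illustration only; they are not results or settings from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up fuzzy match scores and the TER of the corresponding suggestions.
match_scores = [0.9, 0.8, 0.7, 0.6]
ter_scores = [0.1, 0.2, 0.3, 0.4]
print(pearson(match_scores, ter_scores))
```

A strongly negative coefficient would indicate that a higher match score reliably predicts a lower TER, i.e. a more useful suggestion.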