314 research outputs found

    Comparability measurement for terminology extraction

    Get PDF
    Proceedings of the Workshop CHAT 2011: Creation, Harmonization and Application of Terminology Resources. Editors: Tatiana Gornostay and Andrejs Vasiļjevs. NEALT Proceedings Series, Vol. 12 (2011), 3-10. © 2011 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/16956

    Proceedings

    Get PDF
    Proceedings of the Workshop CHAT 2011: Creation, Harmonization and Application of Terminology Resources. Editors: Tatiana Gornostay and Andrejs Vasiļjevs. NEALT Proceedings Series, Vol. 12 (2011). © 2011 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/16956

    In no uncertain terms : a dataset for monolingual and multilingual automatic term extraction from comparable corpora

    Get PDF
    Automatic term extraction is a productive field of research within natural language processing, but it still faces significant obstacles regarding datasets and evaluation, which require manual term annotation. This is an arduous task, made even more difficult by the lack of a clear distinction between terms and general language, which results in low inter-annotator agreement. There is a large need for well-documented, manually validated datasets, especially in the rising field of multilingual term extraction from comparable corpora, which presents a unique new set of challenges. In this paper, a new approach is presented for both monolingual and multilingual term annotation in comparable corpora. The detailed guidelines with different term labels, the domain- and language-independent methodology and the large volumes annotated in three different languages and four different domains make this a rich resource. The resulting datasets are not just suited for evaluation purposes but can also serve as a general source of information about terms and even as training data for supervised methods. Moreover, the gold standard for multilingual term extraction from comparable corpora contains information about term variants and translation equivalents, which allows an in-depth, nuanced evaluation

    Terminology Integration in Statistical Machine Translation

    Get PDF
    Elektroniskā versija nesatur pielikumusPromocijas darbs apraksta autora izpētītas metodes un izstrādātus rīkus divvalodu terminoloģijas integrācijai statistiskās mašīntulkošanas sistēmās. Autors darbā piedāvā inovatīvas metodes terminu integrācijai SMT sistēmu trenēšanas fāzē (ar statiskas integrācijas palīdzību) un tulkošanas fāzē (ar dinamiskas integrācijas palīdzību). Darbā uzmanība pievērsta ne tikai metodēm terminu integrācijai SMT, bet arī metodēm valodas resursu, kas nepieciešami dažādu uzdevumu veikšanai terminu integrācijas SMT darbplūsmās, ieguvei. Piedāvātās metodes ir novērtētas automātiskas un manuālas novērtēšanas eksperimentos. Iegūtie rezultāti parāda, ka statiskās un dinamiskās integrācijas metodes ļauj būtiski uzlabot tulkošanas kvalitāti. Darbā aprakstītie rezultāti ir aprobēti vairākos pētniecības projektos un ieviesti praktiskos risinājumos. Atslēgvārdi: statistiskā mašīntulkošana, terminoloģija, starpvalodu informācijas izvilkšanaThe doctoral thesis describes methods and tools researched and developed by the author for bilingual terminology integration into statistical machine translation systems. The author presents novel methods for terminology integration in SMT systems during training (through static integration) and during translation (through dynamic integration). The work focusses not only on the SMT integration techniques, but also on methods for acquisition of linguistic resources that are necessary for different tasks involved in workflows for terminology integration in SMT systems. The proposed methods have been evaluated using automatic and manual evaluation methods. The results show that both static and dynamic integration methods allow increasing translation quality. The thesis describes also areas where the methods have been approbated in practice. Keywords: statistical machine translation, terminology, cross-lingual information extractio

    Improving the translation environment for professional translators

    Get PDF
    When using computer-aided translation systems in a typical, professional translation workflow, there are several stages at which there is room for improvement. The SCATE (Smart Computer-Aided Translation Environment) project investigated several of these aspects, both from a human-computer interaction point of view, as well as from a purely technological side. This paper describes the SCATE research with respect to improved fuzzy matching, parallel treebanks, the integration of translation memories with machine translation, quality estimation, terminology extraction from comparable texts, the use of speech recognition in the translation process, and human computer interaction and interface design for the professional translation environment. For each of these topics, we describe the experiments we performed and the conclusions drawn, providing an overview of the highlights of the entire SCATE project

    Extraction of Knowledge-Rich Contexts in Russian – A Study in the Automotive Domain

    Get PDF
    Proceedings of the 18th Nordic Conference of Computational Linguistics NODALIDA 2011. Editors: Bolette Sandford Pedersen, Gunta Nešpore and Inguna Skadiņa. NEALT Proceedings Series, Vol. 11 (2011), 311-314. © 2011 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/1695

    Extracting bilingual terms from the Web

    Get PDF
    In this paper we make two contributions. First, we describe a multi-component system called BiTES (Bilingual Term Extraction System) designed to automatically gather domain-specific bilingual term pairs from Web data. BiTES components consist of data gathering tools, domain classifiers, monolingual text extraction systems and bilingual term aligners. BiTES is readily extendable to new language pairs and has been successfully used to gather bilingual terminology for 24 language pairs, including English and all official EU languages, save Irish. Second, we describe a novel set of methods for evaluating the main components of BiTES and present the results of our evaluation for six language pairs. Results show that the BiTES approach can be used to successfully harvest quality bilingual term pairs from the Web. Our evaluation method delivers significant insights about the strengths and weaknesses of our techniques. It can be straightforwardly reused to evaluate other bilingual term extraction systems and makes a novel contribution to the study of how to evaluate bilingual terminology extraction systems

    Conference Program

    Get PDF
    Proceedings of the 18th Nordic Conference of Computational Linguistics NODALIDA 2011. Editors: Bolette Sandford Pedersen, Gunta Nešpore and Inguna Skadiņa. NEALT Proceedings Series, Vol. 11 (2011), xii-xvii. © 2011 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/16955

    Evaluation of contextual embeddings on less-resourced languages

    Get PDF
    The current dominance of deep neural networks in natural language processing is based on contextual embeddings such as ELMo, BERT, and BERT derivatives. Most existing work focuses on English; in contrast, we present here the first multilingual empirical comparison of two ELMo and several monolingual and multilingual BERT models using 14 tasks in nine languages. In monolingual settings, our analysis shows that monolingual BERT models generally dominate, with a few exceptions such as the dependency parsing task, where they are not competitive with ELMo models trained on large corpora. In cross-lingual settings, BERT models trained on only a few languages mostly do best, closely followed by massively multilingual BERT models
    corecore