    Processing Annotated TMX Parallel Corpora

    In recent years the amount of freely available multilingual corpora has grown exponentially. Unfortunately, the way these corpora are made available is very diverse, ranging from plain text files and ad hoc XML schemas to nominally standard formats such as the XML Corpus Encoding Standard, the Text Encoding Initiative, or the Translation Memory Exchange (TMX) format. In this document we argue for the use of Translation Memory Exchange documents, but we enrich their structure to support annotating the documents with additional information such as lemmas, multi-word expressions, or entities. To support the adoption of the proposed format, we present a set of tools to manipulate the different formats in an agile way.
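
    The proposal above hinges on attaching extra annotation layers (lemmas, multi-word expressions, entities) to standard TMX translation units. As a rough illustration only, the sketch below reads such an enriched translation unit with Python's standard xml.etree.ElementTree; the <prop> annotation keys and the sample values are assumptions made for this example, not the schema defined by the authors.

```python
# Minimal sketch: reading an annotated TMX translation unit with the Python
# standard library. The <prop> types ("lemmas", "entities") are hypothetical
# annotation conventions, not the format specified in the paper.
import xml.etree.ElementTree as ET

SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en">
        <prop type="lemmas">the cat sleep</prop>
        <prop type="entities">O O O</prop>
        <seg>The cats sleep</seg>
      </tuv>
      <tuv xml:lang="pt">
        <prop type="lemmas">o gato dormir</prop>
        <seg>Os gatos dormem</seg>
      </tuv>
    </tu>
  </body>
</tmx>"""

# ElementTree expands the xml: prefix to the full XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

root = ET.fromstring(SAMPLE_TMX)
for tu in root.iter("tu"):
    for tuv in tu.iter("tuv"):
        lang = tuv.get(XML_LANG)
        seg = tuv.findtext("seg")
        # Collect the annotation layers attached as <prop> elements.
        props = {p.get("type"): p.text for p in tuv.iter("prop")}
        print(lang, seg, props)
```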

    Language technologies for a multilingual Europe

    This volume of the series “Translation and Multilingual Natural Language Processing” includes most of the papers presented at the Workshop “Language Technology for a Multilingual Europe”, held at the University of Hamburg on September 27, 2011, in the framework of the conference GSCL 2011 with the topic “Multilingual Resources and Multilingual Applications”, along with several additional contributions. In addition to an overview article on Machine Translation and two contributions on the European initiatives META-NET and Multilingual Web, the volume includes six full research articles. Our intention with this workshop was to bring together various groups concerned with the umbrella topics of multilingualism and language technology, especially multilingual technologies. This encompassed, on the one hand, representatives from research and development in the field of language technologies, and, on the other hand, users from diverse areas such as, among others, industry, administration and funding agencies. The Workshop “Language Technology for a Multilingual Europe” was co-organised by the two GSCL working groups “Text Technology” and “Machine Translation” (http://gscl.info) as well as by META-NET (http://www.meta-net.eu).

    A call for the environment: A bilingual corpus-driven analysis of creative language in online texts by WWF and Greenpeace

    Since its development as a discipline in the 1980s, environmental communication has sought to inform and warn people about the threats and issues that concern nature and wildlife by providing an accurate representation of them. A variety of actors and media have contributed to spreading such communication and to raising awareness about environmental issues, both at a local and a global level. In order to reach lay audiences as well and to fulfil their persuasive purpose, environment texts have undergone a process of popularisation and have exploited every linguistic resource, including the most creative ones. This dissertation aims at investigating the use of creative solutions in environment texts published by two environmental organisations, namely WWF and Greenpeace. The investigation was carried out by designing a comparable corpus consisting of online texts in Italian, British English and American English found on the websites of these NGOs. The study focused on the titles and subheadings of those texts, which were classified and grouped according to the type of lexical creativity they contain. The analysis showed that only a minority of cases involved traditional figures of speech and idiomatic expressions that maintained their original form and meaning, while the majority contained manipulations at the semantic, structural, or phonological level. These deformations concerned collocations, idioms, and even quotes from famous books, songs, films, and other cultural or intertextual references. The most frequently used device in the corpus, however, turned out to be wordplay, followed by the exploitation of the polysemy that words acquire in a particular context. Overall, it was observed that what determines the choice of one creative device over another is not the general topic but the specific content of the text.

    Low-Resource Unsupervised NMT: Diagnosing the Problem and Providing a Linguistically Motivated Solution

    Unsupervised Machine Translation has been advancing our ability to translate without parallel data, but state-of-the-art methods assume an abundance of monolingual data. This paper investigates the scenario where monolingual data is limited as well, finding that current unsupervised methods suffer in performance under this stricter setting. We find that the performance loss originates from the poor quality of the pretrained monolingual embeddings, and we propose using linguistic information in the embedding training scheme. To support this, we look at two linguistic features that may help improve alignment quality: dependency information and sub-word information. Using dependency-based embeddings results in a complementary word representation which offers a boost in performance of around 1.5 BLEU points compared to standard word2vec when monolingual data is limited to 1 million sentences per language. We also find that the inclusion of sub-word information is crucial to improving the quality of the embeddings.
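
    The dependency-based embeddings mentioned above replace word2vec's linear context window with contexts drawn from syntactic arcs, in the style of Levy and Goldberg's word2vecf. The sketch below, which is not the authors' code, extracts such (word, context) pairs from a toy parsed sentence; the simplified CoNLL-like tuple layout is an assumption made for the example.

```python
# Rough sketch of dependency-context extraction for dependency-based
# embeddings. Each (word, context) pair could be fed to a word2vecf-style
# trainer instead of the linear-window pairs used by standard word2vec.
from typing import List, Tuple

# Columns assumed here: ID, FORM, LEMMA, UPOS, HEAD, DEPREL (simplified CoNLL).
PARSED_SENTENCE = [
    (1, "australian", "australian", "ADJ", 2, "amod"),
    (2, "scientist", "scientist", "NOUN", 3, "nsubj"),
    (3, "discovers", "discover", "VERB", 0, "root"),
    (4, "star", "star", "NOUN", 3, "obj"),
]

def dependency_contexts(sent) -> List[Tuple[str, str]]:
    """Emit (word, context) pairs from head-dependent arcs, in both directions."""
    by_id = {tok[0]: tok for tok in sent}
    pairs = []
    for tok_id, form, _lemma, _pos, head, deprel in sent:
        if head == 0:  # the root token has no lexical head
            continue
        head_form = by_id[head][1]
        # The dependent sees its head through the relation, and the head sees
        # the dependent through the inverse relation (marked with "-1").
        pairs.append((form, f"{head_form}/{deprel}"))
        pairs.append((head_form, f"{form}/{deprel}-1"))
    return pairs

for word, ctx in dependency_contexts(PARSED_SENTENCE):
    print(word, ctx)
```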

    Machine translation for institutional academic texts: Output quality, terminology translation and post-editor trust

    The present work is a feasibility study on the application of Machine Translation (MT) to institutional academic texts, specifically course catalogues, for Italian-English and German-English. The first research question of this work focuses on the feasibility of profitably applying MT to such texts. Since the benefits of good-quality MT might be counteracted by translators' preconceptions about the output, the second research question examines translator trainees' trust in an MT output as compared to a human translation (HT). Training and test sets are created for both language combinations in the institutional academic domain. The MT systems used are ModernMT and Google Translate. Overall evaluations of the output quality are carried out using automatic metrics. Results show that applying neural MT to institutional academic texts can be beneficial even when bilingual data are not available. When small amounts of sentence pairs become available, MT quality improves. Then, a gold standard data set with manual annotations of terminology (MAGMATic) is created and used for an evaluation of the output focused on terminology translation. The gold standard was publicly released to stimulate research on terminology assessment. The assessment proves that domain adaptation improves the quality of term translation. To conclude, a method to measure trust in a post-editing task is proposed and results regarding translator trainees' trust in MT are outlined. All participants are asked to work on the same text: half of them are told that it is an MT output to be post-edited, and the other half that it is an HT needing revision. Results show that there is no statistically significant difference between post-editing and HT revision in terms of number of edits and temporal effort. This suggests that a new generation of translators who have received training on MT and post-editing is not influenced by preconceptions against MT.
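
    The thesis evaluates overall output quality with automatic metrics. As a minimal sketch of that kind of evaluation, the snippet below scores a couple of toy hypothesis-reference pairs with the sacrebleu library; the choice of sacrebleu, BLEU and chrF here is an assumption for illustration, since the abstract does not name the specific metrics used.

```python
# Minimal sketch of automatic MT evaluation with sacrebleu (an assumed tool
# choice). Hypotheses stand for MT outputs, references for the corresponding
# human translations of the same course-catalogue segments.
import sacrebleu

hypotheses = [
    "The course provides an introduction to machine translation.",
    "Students acquire basic knowledge of terminology management.",
]
references = [
    "This course offers an introduction to machine translation.",
    "Students gain basic knowledge of terminology management.",
]

# sacrebleu expects one or more reference streams, hence the extra list level.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")
print(f"chrF: {chrf.score:.2f}")
```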

    Computational approaches to semantic change (Volume 6)

    Semantic change — how the meanings of words change over time — has preoccupied scholars since well before modern linguistics emerged in the late 19th and early 20th century, ushering in a new methodological turn in the study of language change. Compared to changes in sound and grammar, semantic change is the least understood. Ever since, the study of semantic change has progressed steadily, accumulating a vast store of knowledge over more than a century, encompassing many languages and language families. Historical linguists also realized early on the potential of computers as research tools, with papers at the very first international conferences on computational linguistics in the 1960s. Such computational studies still tended to be small-scale, method-oriented, and qualitative. However, recent years have witnessed a sea change in this regard. Big-data empirical quantitative investigations are now coming to the forefront, enabled by enormous advances in storage capability and processing power. Diachronic corpora have grown beyond imagination, defying exploration by traditional manual qualitative methods, and language technology has become increasingly data-driven and semantics-oriented. These developments present a golden opportunity for the empirical study of semantic change over both long and short time spans.

    Cross-language Ontology Learning: Incorporating and Exploiting Cross-language Data in the Ontology Learning Process

    Hans Hjelm. Cross-language Ontology Learning: Incorporating and Exploiting Cross-language Data in the Ontology Learning Process. NEALT Monograph Series, Vol. 1 (2009), 159 pages. © 2009 Hans Hjelm. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/10126