Search CORE

534 research outputs found

Corpus resources for translators: academic luxury or professional necessity

Author: Bowker Lynne
Publication venue: Universidade de São Paulo. Faculdade de Filosofia, Letras e Ciências Humanas
Publication date: 18/12/2004
Field of study

Since the late 1990s, corpora have become established as popular and useful translation resources within translator training institutes; however, the uptake of corpora in the world of professional translators appears to have been considerably slower. This article explores the different uses of corpora (including translation memories) in these two sectors and attempts to account for these differences. To determine how corpora are used in academics, a literature survey of papers produced by translator trainers is conducted. With regard to the use of corpora in a professional setting, this study focuses on professionals working in Canada. To find out the extent to which Canadian professional translators make use of corpora, two methods are used. Firstly, a literature survey of publications produced by Canadian translators’ associations is carried out, and secondly, a database of job advertisements is analyzed to determine how many Canadian employers are seeking candidates who are familiar with corpus-based resources. The resulting data are compared and discussed with a view to uncovering and understanding why corpus use differs in academic and professional circles.Desde o final dos anos 90, os corpora se firmaram como uma ferramenta de tradução conhecida e útil nos centros de treinamento de tradutores. No entanto, parece que sua difusão no universo do tradutor profissional tem sido bem mais lenta. O presente artigo explora os diferentes usos de corpora (inclusive as memórias de tradução) nesses dois setores e pretende apontar suas diferenças. Para determinar como os corpora são usados no meio acadêmico, foi realizado um levantamento bibliográfico de artigos produzidos por tradutores aprendizes. Com relação ao uso de corpora no meio profissional, este estudo concentra-se nos tradutores profissionais do Canadá. Para verificar até que ponto eles fazem uso de corpora, foram empregados dois procedimentos. Primeiramente, foi feito um levantamento dos artigos publicados por associações canadenses de tradutores e, numa segunda etapa, avaliou-se um banco de dados de ofertas de emprego para levantar quantos empregadores procuram candidatos familiarizados com o uso de corpora como ferramenta de trabalho. Os dados resultantes são comparados e discutidos com vistas a revelar e compreender por que o uso de corpora é diferente nos meios acadêmico e profissional

Cadernos Espinosanos (E-Journal)

Bilingual Lexicon Extraction from Comparable Corpora as Metasearch

Author: Hazem Amir
Morin Emmanuel
Peña Saldarriaga Sebastián
Publication venue: HAL CCSD
Publication date: 24/06/2011
Field of study

International audienceIn this article we present a novel way of looking at the problem of automatic acquisition of pairs of translationally equivalent words from comparable corpora. We ﬁrst present the standard and extended approaches traditionally dedicated to this task. We then reinterpret the extended method, and motivate a novel model to reformulate this approach inspired by the metasearch engines in information retrieval. The empirical results show that performances of our model are always better than the baseline obtained with the extended approach and also competitive with the standard approach

Corpus-Based contrastive analysis and translation universals: a tool for translation quality assessment Englis-Spanish

Author: Labrador De La Cruz María Belén
Rabadán Rosa
Ramón García Noelia
Publication venue: 'John Benjamins Publishing Company'
Publication date: 21/11/2018
Field of study

p. 303-328Assessing translation quality is generally seen as a difficult task because of the inadequacy of the tools available. The aim of this paper is to demonstrate the usefulness of a corpus-based contrastive methodology (ACTRES2 Project) developed at the University of León (Spain) for identifying instances of low-quality rendering of grammatical features when translating from English into Spanish using translation universals. The analysis provides information about: i) the resources available (or absence thereof) in each of the languages to express a given meaning and their relative centrality; ii) the solutions favored by translators to bridge the cross-linguistic disparities and/or gaps; iii) the erroneous or non-existent uses and structures transferred from the source language into the target language. These results can be systematized in terms of simplification, interference, or unique grammatical features. Additional areas that can benefit from this type of research are translation practice, translator training and foreign language teaching (FLT), among others.S

Leon University (Spain)

Exploring the use of parallel corpora in the complilation of specialised bilingual dictionaries of technical terms: a case study of English and isiXhosa

Author: Shoba Feziwe Martha
Publication venue
Publication date: 01/07/2018
Field of study

Text in EnglishAbstracts in English, isiXhosa and AfrikaansThe Constitution of the Republic of South Africa, Act 108 of 1996, mandates the state to take practical and positive measures to elevate the status and the use of indigenous languages. The implementation of this pronouncement resulted in a growing demand for specialised translations in fields like technology, science, commerce, law and finance. The lack of terminology and resources such as specialised bilingual dictionaries in indigenous languages, particularly isiXhosa remains a growing concern that hinders the translation and the intellectualisation of isiXhosa. A growing number of African scholars affirm the importance of specialised dictionaries in the African languages as tools for language and terminology development so that African languages can be used in the areas of science and technology. In the light of the background above, this study explored how parallel corpora can be interrogated using a bilingual concordancer, ParaConc to extract bilingual terminology that can be used to create specialised bilingual dictionaries. A corpus-based approach was selected due to its speed, efficiency and accuracy in extracting bilingual terms in their immediate contexts. In enhancing the research outcomes, Descriptive Translations Studies (DTS) and Corpus-based translation studies (CTS) were used in a complementary manner. Because the study is interdisciplinary, the function theories of lexicography that emphasise the function and needs of users were also applied. The analysis and extraction of bilingual terminology for dictionary making was successful through the use of the following ParaConc features, namely frequencies, hot word lists, hot words, search facility and concordances (Key Word in Context), among others. The findings revealed that English-isiXhosa Parallel Corpus is a repository of translation equivalents and other information categories that can make specialised dictionaries more user-friendly and multifunctional. The frequency lists were revealed as an effective method of selecting headwords for inclusion in a dictionary. The results also unraveled the complex functions of bilingual concordances where information on collocations and multiword units, sense distinction and usage examples could be easily identifiable proving that this approach is more efficient than the traditional method. The study contributes to the knowledge on corpus-based lexicography, standardisation of finance terminology resource development and making of user-friendly dictionaries that are tailor-made for different needs of users.Umgaqo-siseko weli loMzantsi Afrika ukhululele uRhulumente ukuba athabathe amanyathelo abonakalayo ekuphuhliseni nasekuphuculeni iilwimi zesiNtu. Esi sindululo sibangele ukwanda kokuguqulelwa kwamaxwebhu angezobuchwepheshe, inzululwazi, umthetho, ezemali noqoqosho angesiNgesi eguqulelwa kwiilwimi ebezifudula zingasiwe-so ezinjengesiXhosa. Ukunqongophala kwesigama kunye nezichazi-magama kube yingxaki enkulu ekuguquleleni ngakumbi izichazi-magama ezilwimi-mbini eziqulethe isigama esikhethekileyo. Iingcali ezininzi ziyangqinelana ukuba olu hlobo lwezi zichazi-magama luyimfuneko kuba ludlala iindima enkulu ekuphuhlisweni kweelwimi zesiNtu, ekuyileni isigama, nasekusetyenzisweni kwazo kumabakala obunzululwazi nobuchwepheshe. Olu phando ke luvavanya ukusetyenziswa kwekhophasi equlethe amaxwebhu esiNgesi neenguqulelo zawo zesiXhosa njengovimba wokudimbaza isigama sezemali esinokunceda ekuqulunqweni kwesichazi-magama esilwimi-mbini. Isizathu esibangele ukukhetha le ndlela yophando esebenzisa ikhompyutha kukuba iyakhawuleza, ulwazi oluthathwe kwikhophasi luchanekile, yaye isigama kwikhophasi singqamana ngqo nomxholo wamaxwebhu nto leyo eyenza kube lula ukufumana iintsingiselo nemizekelo ephilayo. Ukutyebisa olu phando indlela yekhophasi iye yaxhaswa zezinye iindlela zophando ezityunjiweyo: ufundo lwenguguqulelo oluchazayo (DTS) kunye neendlela zokuguqulela ezijoliswe kumsebenzi nakuhlobo lwabasebenzisi zinguqulelo ezo. Kanti ke ziqwalaselwe neenkqubo zophando lobhalo-zichazi-magama eziinjongo zokuqulunqa izichazi-magama ezesebenzisekayo neziluncedo kuninzi lwabasebenzisi zichazi-magama ngakumbi kwisizwe esisebenzisa iilwimi ezininzi. Ukuhlalutya nokudimbaza isigama kwikhophasi kolu phando kusetyenziswe isixhobo sekhompyutha esilungiselelwe ikhophasi enelwiimi ezimbini nangaphezulu ebizwa ngokuba yiParaConc. Iziphumo zolu phando zibonise mhlophe ukuba ikhophasi eneenguqulelo nguvimba weendidi ngendidi zamagama nolwazi olunokuphucula izichazi-magama zeli xesha. Kaloku abaguquleli basebenzise amaqhinga ngamaqhinga ukunika iinguqulelo bekhokelwa yimigomo nemithetho yoguqulelo enxuse abasebenzisi bamaxwebhu aguqulelweyo. Ubuchule beParaConc bokukwazi ukuhlela amagama ngokwendlela afumaneka ngayo kunye neenkcukacha zamanani budandalazise indlela eyiyo yokukhetha imichazwa enokungena kwisichazi-magama. Iziphumo zikwabonakalise iintlaninge yolwazi olufumaneka kwiKWIC, lwazi olo olungelula ukulufumana xa usebenzisa undlela-ndala wokwakha isichazi-magama. Esi sifundo esihlanganyele uGuqulelo olusekelwe kwiKhophasi noQulunqo-zichazi-magama zobuchwepheshe luya kuba negalelo elingathethekiyo kwindlela yokwakha izichazi-magama kwilwiimi zeSintu ngokubanzi nancakasana kwisiXhosa, nto leyo eya kothula umthwalo kubaqulunqi-zichazi-magama. Ukwakha nokuqulunqa izichazi-magama ezilwimi-mbini zezemali kuya kwandisa imithombo yesigama esinqongopheleyo kananjalo sivelise izichazi-magama eziluncedo kwisininzi sabantu.Die Grondwet van die Republiek van Suid-Afrika, Wet 108 van 1996, gee aan die staat die mandaat om praktiese en positiewe maatreëls te tref om die status en gebruik van inheemse tale te verhoog. Die implementering van hierdie uitspraak het gelei tot ’n toenemende vraag na gespesialiseerde vertalings in domeine soos tegnologie, wetenskap, handel, regte en finansies. Die gebrek aan terminologie en hulpbronne soos gespesialiseerde woordeboeke in inheemse tale, veral Xhosa, wek toenemende kommer wat die vertaling en die intellektualisering van Xhosa belemmer. ’n Toenemende aantal vakkundiges in Afrika beklemtoon die belangrikheid van gespesialiseerde woordeboeke in die Afrikatale as instrumente vir taal- en terminologie-ontwikkeling sodat Afrikatale gebruik kan word in die areas van wetenskap en tegnologie. In die lig van die voorafgaande agtergrond het hierdie studie ondersoek ingestel na hoe parallelle korpora deursoek kan word deur ’n tweetalige konkordanser (ParaConc) te gebruik om tweetalige terminologie te ontgin wat gebruik kan word in die onwikkeling van tweetalige gespesialiseerde woordeboeke. ’n Korpusgebaseerde benadering is gekies vir die spoed, doeltreffendheid en akkuraatheid waarmee dit tweetalige terme uit hulle onmiddellike kontekste kan onttrek. Beskrywende Vertaalstudies (DTS) en Korpusgebaseerde Vertaalstudies (CTS) is op ’n aanvullende wyse gebruik om die navorsingsuitkomste te verbeter. Aangesien die studie interdissiplinêr is, is die funksieteorieë van leksikografie wat die funksie en behoeftes van gebruikers beklemtoon, ook toegepas. Die analise en ontginning van tweetalige terminologie om woordeboeke te ontwikkel was suksesvol deur, onder andere, gebruik te maak van die volgende ParaConc-eienskappe, naamlik, frekwensies, hotword-lyste, hot words, die soekfunksie en konkordansies (Sleutelwoord-in-Konteks). Die bevindings toon dat ’n Engels-Xhosa Parallelle Korpus ’n bron van vertaalekwivalente en ander inligtingskategorieë is wat gespesialiseerde woordeboeke meer gebruikersvriendelik en multifunksioneel kan maak. Die frekwensielyste is geïdentifiseer as ’n doeltreffende metode om hoofwoorde te selekteer wat opgeneem kan word in ’n woordeboek. Die bevindings het ook die komplekse funksies van tweetalige konkordansers ontknoop waar inligting oor kollokasies en veelvuldigewoord-eenhede, betekenisonderskeiding en gebruiksvoorbeelde maklik identifiseer kon word wat aandui dat hierdie metode viii doeltreffender is as die tradisionele metode. Die studie dra by tot die kennisveld van korpusgebaseerde leksikografie, standaardisering van finansiële terminologie, hulpbronontwikkeling en die ontwikkeling van gebruikersvriendelike woordeboeke wat doelgemaak is vir verskillende behoeftes van gebruikers.Linguistics and Modern LanguagesD. Litt. et Phil. (Linguistics (Translation Studies)

Unisa Institutional Repository

D-TERMINE : data-driven term extraction methodologies investigated

Author: Rigouts Terryn Ayla
Publication venue: Universiteit Gent. Faculteit Letteren en Wijsbegeerte
Publication date: 01/01/2021
Field of study

Automatic term extraction is a task in the field of natural language processing that aims to automatically identify terminology in collections of specialised, domain-specific texts. Terminology is defined as domain-specific vocabulary and consists of both single-word terms (e.g., corpus in the field of linguistics, referring to a large collection of texts) and multi-word terms (e.g., automatic term extraction). Terminology is a crucial part of specialised communication since terms can concisely express very specific and essential information. Therefore, quickly and automatically identifying terms is useful in a wide range of contexts. Automatic term extraction can be used by language professionals to find which terms are used in a domain and how, based on a relevant corpus. It is also useful for other tasks in natural language processing, including machine translation. One of the main difficulties with term extraction, both manual and automatic, is the vague boundary between general language and terminology. When different people identify terms in the same text, it will invariably produce different results. Consequently, creating manually annotated datasets for term extraction is a costly, time- and effort- consuming task. This can hinder research on automatic term extraction, which requires gold standard data for evaluation, preferably even in multiple languages and domains, since terms are language- and domain-dependent. Moreover, supervised machine learning methodologies rely on annotated training data to automatically deduce the characteristics of terms, so this knowledge can be used to detect terms in other corpora as well. Consequently, the first part of this PhD project was dedicated to the construction and validation of a new dataset for automatic term extraction, called ACTER – Annotated Corpora for Term Extraction Research. Terms and Named Entities were manually identified with four different labels in twelve specialised corpora. The dataset contains corpora in three languages and four domains, leading to a total of more than 100k annotations, made over almost 600k tokens. It was made publicly available during a shared task we organised, in which five international teams competed to automatically extract terms from the same test data. This illustrated how ACTER can contribute towards advancing the state-of-the-art. It also revealed that there is still a lot of room for improvement, with moderate scores even for the best teams. Therefore, the second part of this dissertation was devoted to researching how supervised machine learning techniques might contribute. The traditional, hybrid approach to automatic term extraction relies on a combination of linguistic and statistical clues to detect terms. An initial list of unique candidate terms is extracted based on linguistic information (e.g., part-of-speech patterns) and this list is filtered based on statistical metrics that use frequencies to measure whether a candidate term might be relevant. The result is a ranked list of candidate terms. HAMLET – Hybrid, Adaptable Machine Learning Approach to Extract Terminology – was developed based on this traditional approach and applies machine learning to efficiently combine more information than could be used with a rule-based approach. This makes HAMLET less susceptible to typical issues like low recall on rare terms. While domain and language have a large impact on results, robust performance was reached even without domain- specific training data, and HAMLET compared favourably to a state-of-the-art rule-based system. Building on these findings, the third and final part of the project was dedicated to investigating methodologies that are even further removed from the traditional approach. Instead of starting from an initial list of unique candidate terms, potential terms were labelled immediately in the running text, in their original context. Two sequential labelling approaches were developed, evaluated and compared: a feature- based conditional random fields classifier, and a recurrent neural network with word embeddings. The latter outperformed the feature-based approach and was compared to HAMLET as well, obtaining comparable and even better results. In conclusion, this research resulted in an extensive, reusable dataset and three distinct new methodologies for automatic term extraction. The elaborate evaluations went beyond reporting scores and revealed the strengths and weaknesses of the different approaches. This identified challenges for future research, since some terms, especially ambiguous ones, remain problematic for all systems. However, overall, results were promising and the approaches were complementary, revealing great potential for new methodologies that combine multiple strategies

Ghent University Academic Bibliography

On semantic differences: a multivariate corpus-based study of the semantic field of inchoativity in translated and non-translated Dutch

Author: Vandevoorde Lore
Publication venue: Ghent University. Faculty of Arts and Philosophy
Publication date: 01/01/2016
Field of study

This dissertation places the study of semantic differences in translation compared to non-translation at the centre of its concerns. To date, much research in Corpus-based Translation Studies has focused on lexical and grammatical phenomena in an attempt to reveal presumed general tendencies of translation. On the semantic level, these general tendencies have rarely been investigated. Therefore, the goal of this study is to explore whether universal tendencies of translation also exist on the semantic level, thereby connecting the framework of translation universals to semantics

Ghent University Academic Bibliography

Integrated Use of Internal and External Evidence in the Alignment of Multi-Word Named Entities

Author: 九津見毅
井佐原均
佐田いち子
吉見毅彦
小谷克則
Publication venue: Logico-Linguistic Society of Japan
Publication date: 16/11/2005
Field of study

This paper proposes a method of extracting English multi-word named entities and their Japanese equivalents from a parallel corpus. The aim of our research is to extract multi-word named entities which are not listed in a dictionary of an English-to-Japanese MT system and appear infrequently in a parallel corpus. Our method makes its alignment on the basis of two kinds of external evidence provided by the context in which a bilingual pair appears, as well as two kinds of internal evidence within the pair. Each evidence is accompanied by a score, and the aggregate score is computed as a weighted sum of the scores. The appropriate weights are estimated with the logistic regression analysis. An experiment using a parallel corpus of Yomiuri Shimbun and The Daily Yomiuri satisfactorily found that 86.36% of the extracted bilingual pairs with the highest scores were judged to be correct

Waseda University Repository

An English/Spanish Glossary in Proteomics: An Approach to Science

Author: Modrego Merino Tamara
Publication venue
Publication date: 01/01/2014
Field of study

En este Trabajo de Fin de Grado (TFG) se describe y realiza la compilación de un glosario bilingüe (inglés / español) en el campo de la Proteómica, el estudio de las proteínas. Para compilar este glosario hemos seguido una elaboracion sistemática de la terminología. Por lo tanto, nuestra metodología analiza los conceptos de "género", "tipo textual" y "registro", así como la importancia de la terminología y del análisis del corpus para llevar a buen puerto la elaboración de esta herramienta terminográfica. Además, hemos seguido las indicaciones de Cabré (1993) sobre la identificación y extracción de términos del corpus, así como los pasos para compilar un glosario de Frankervert García et al. (2001). El inglés internacional es la lengua de los textos científicos; sin embargo, los expertos españoles en este campo no tienen un diccionario especializado que les facilite el trabajo de leer y escribir en inglés. Por lo tanto, este TFG pretende facilitar su labor, para que la difusión de sus resultados en un entorno internacional sea un poco más fácil.This undergraduate dissertation deals with the compilation of a bilingual (English/Spanish) glossary in the subject field of Proteomics, the study of proteins. To compile this glossary we have followed a systematic ellaboration of terminology. Therefore, our methodology analyses what ‘genre’, ‘text type’ and ‘register’ are, the importance of the field of ‘terminology’ and the relevance of the studies in the domain of ‘corpus linguistics’. We have followed Cabré’s (1993) stages for the compilation of terms out of our corpus, as well as the steps to design a glossary by Frankervert García et al. (2001). The aim of this dissertation is to give the Spanish experts a bilingual terminographic tool that would help them to understand and write their papers in English in the field of Proteomics since English is the international language of scientific texts. We have found that there is not such a tool in this specialized domain. We pretend to simplify their task, so the spread of their findings in an international environment will be a little bit easier.Departamento de Filología InglesaGrado en Estudios Inglese

Repositorio Documental de la Universidad de Valladolid

Exploring the field of inchoativity

Author: Vandevoorde Lore
Publication venue
Publication date: 01/01/2020
Field of study

Although the notion of meaning has always been at the core of translation, the invariance of meaning has, partly due to practical constraints, rarely been challenged in Corpus-based Translation Studies. In answer to this, the aim of this book is to question the invariance of meaning in translated texts: if translation scholars agree on the fact that translated language is different from non-translated language with respect to a number of grammatical and lexical aspects, would it be possible to identify differences between translated and non-translated language on the semantic level too? More specifically, this books tries to formulate an answer to the following three questions: (i) how can semantic differences in translated vs non-translated language be investigated in a corpus-based study?, (ii) are there any differences on the semantic level between translated and non-translated language? and (iii) if there are differences on the semantic level, can we ascribe them to any of the (universal) tendencies of translation? In this book, I establish a way to visually explore semantic similarity on the basis of representations of translated and non-translated semantic fields. A technique for the comparison of semantic fields of translated and non-translated language called SMM++ (based on Helge Dyvik’s Semantic Mirrors method) is developed, yielding statistics-based visualizations of semantic fields. The SMM++ is presented via the case of inchoativity in Dutch (beginnen [to begin]). By comparing the visualizations of the semantic fields on different levels (translated Dutch with French as a source language, with English as a source language and non-translated Dutch) I further explore whether the differences between translated and non-translated fields of inchoativity in Dutch can be linked to any of the well-known universals of translation. The main results of this study are explained on the basis of two cognitively inspired frameworks: Halverson’s Gravitational Pull Hypothesis and Paradis’ neurolinguistic theory of bilingualism

Institutional Repository of the Freie Universität Berlin

Cross-language Ontology Learning: Incorporating and Exploiting Cross-language Data in the Ontology Learning Process

Author: Hjelm Hans
Publication venue
Publication date: 01/01/2009
Field of study

Hans Hjelm. Cross-language Ontology Learning: Incorporating and Exploiting Cross-language Data in the Ontology Learning Process. NEALT Monograph Series, Vol. 1 (2009), 159 pages. © 2009 Hans Hjelm. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/10126

Publikationer från Stockholms universitet

Digitala Vetenskapliga Arkivet - Academic Archive On-line

DSpace at Tartu University Library