1,571 research outputs found

    Improving the translation environment for professional translators

    When using computer-aided translation systems in a typical professional translation workflow, there are several stages at which there is room for improvement. The SCATE (Smart Computer-Aided Translation Environment) project investigated several of these aspects, both from a human-computer interaction point of view and from a purely technological one. This paper describes the SCATE research with respect to improved fuzzy matching, parallel treebanks, the integration of translation memories with machine translation, quality estimation, terminology extraction from comparable texts, the use of speech recognition in the translation process, and human-computer interaction and interface design for the professional translation environment. For each of these topics, we describe the experiments we performed and the conclusions drawn, providing an overview of the highlights of the entire SCATE project.

    In no uncertain terms : a dataset for monolingual and multilingual automatic term extraction from comparable corpora

    Automatic term extraction is a productive field of research within natural language processing, but it still faces significant obstacles regarding datasets and evaluation, which require manual term annotation. This is an arduous task, made even more difficult by the lack of a clear distinction between terms and general language, which results in low inter-annotator agreement. There is a great need for well-documented, manually validated datasets, especially in the rising field of multilingual term extraction from comparable corpora, which presents a unique new set of challenges. In this paper, a new approach is presented for both monolingual and multilingual term annotation in comparable corpora. The detailed guidelines with different term labels, the domain- and language-independent methodology, and the large volumes annotated in three different languages and four different domains make this a rich resource. The resulting datasets are not only suited for evaluation purposes but can also serve as a general source of information about terms and even as training data for supervised methods. Moreover, the gold standard for multilingual term extraction from comparable corpora contains information about term variants and translation equivalents, which allows an in-depth, nuanced evaluation.

    Japanese/English Cross-Language Information Retrieval: Exploration of Query Translation and Transliteration

    Cross-language information retrieval (CLIR), where queries and documents are in different languages, has of late become one of the major topics within the information retrieval community. This paper proposes a Japanese/English CLIR system that combines query translation and retrieval modules. We currently target the retrieval of technical documents, and therefore the performance of our system is highly dependent on the quality of the translation of technical terms. However, technical term translation is still problematic in that technical terms are often compound words, and thus new terms are progressively created by combining existing base words. In addition, Japanese often represents loanwords using its special phonograms. Consequently, existing dictionaries find it difficult to achieve sufficient coverage. To counter the first problem, we produce a Japanese/English dictionary for base words, and translate compound words on a word-by-word basis. We also use a probabilistic method to resolve translation ambiguity. For the second problem, we use a transliteration method, which maps words not listed in the base word dictionary to their phonetic equivalents in the target language. We evaluate our system using a test collection for CLIR, and show that both the compound word translation and transliteration methods improve the system performance.
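
    The query-translation idea in the abstract above can be illustrated with a toy sketch. This is not the authors' implementation: the dictionary entries, the pair probabilities, the kana table, the fallback probability of 0.01, and the function names `transliterate` and `translate_compound` are all hypothetical, chosen only to show word-by-word compound translation, probabilistic selection among candidates, and a transliteration fallback for unlisted words.

```python
from itertools import product

# Hypothetical base-word dictionary (Japanese -> candidate English words).
BASE_DICT = {
    "情報": ["information", "intelligence"],
    "検索": ["retrieval", "search"],
}

# Hypothetical probabilities of adjacent English word pairs.
PAIR_PROB = {
    ("information", "retrieval"): 0.6,
    ("information", "search"): 0.2,
    ("intelligence", "search"): 0.05,
}

# Tiny, hypothetical katakana-to-roman table; a real system would map
# to English spellings rather than plain romanization.
KATAKANA = {"デ": "de", "ー": "", "タ": "ta"}


def transliterate(word):
    """Fallback for words missing from the base-word dictionary."""
    return "".join(KATAKANA.get(ch, ch) for ch in word)


def translate_compound(base_words):
    """Translate a compound word-by-word and keep the most probable sequence."""
    candidates = [BASE_DICT.get(w, [transliterate(w)]) for w in base_words]

    def score(seq):
        # Product of adjacent-pair probabilities; unseen pairs get 0.01.
        p = 1.0
        for a, b in zip(seq, seq[1:]):
            p *= PAIR_PROB.get((a, b), 0.01)
        return p

    return max(product(*candidates), key=score)


print(translate_compound(["情報", "検索"]))
print(translate_compound(["データ", "検索"]))
```

    In this sketch the compound is assumed to be already segmented into base words; segmentation, the actual probability model, and the real transliteration method are outside its scope.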

    METRICC: Harnessing Comparable Corpora for Multilingual Lexicon Development

    Research on comparable corpora has grown in recent years, bringing about the possibility of developing multilingual lexicons through the exploitation of comparable corpora to create corpus-driven multilingual dictionaries. To date, this issue has not been widely addressed. This paper focuses on the use of the mechanism of collocational networks proposed by Williams (1998) for exploiting comparable corpora. The paper first provides a description of the METRICC project, which is aimed at the automatic creation of comparable corpora, and describes one of the crawlers developed for building comparable corpora. It then discusses the power of collocational networks for multilingual corpus-driven dictionary development.

    Creating a bilingual dictionary of collocations: A learner-oriented approach

    Considering the lack of specialised dictionaries in certain fields, a creative way of teaching through corpus-based work was proposed in a seminar for master’s students of translation studies (University of Ljubljana, Slovenia). Since phraseology and terminology play an important role both in specialised translation and in the learning path of students of translation studies, this article presents an active approach aimed at creating an online lexicographic resource in languages for specific purposes by using the didactic tool and database ARTES (Aide à la Rédaction de TExtes Scientifiques/Dictionary-assisted writing tool for scientific communication) previously developed at the Université de Paris (France). About thirty Slovene students enrolled in the first year of master’s study have been participating in the bilateral project since 2018. The aims of such an activity are multiple: students learn in a practical way how to compile comparable corpora from the internet, using the online corpus software Sketch Engine, to find similar linguistic constructions in the source and target languages. They also learn to create an online bilingual phraseological and terminological dictionary to facilitate the translation of specialised texts. In this way, they acquire skills and develop knowledge in translation, terminology, and discourse phraseology. The article first describes the ARTES online database. Then, we present the teaching methodology and the students’ work, which consists of compiling corpora, extracting and translating collocations for the language pair French-Slovene, and entering them in the ARTES database. Finally, we propose an analysis of the most frequent collocation structures in both languages. The language pair considered here is French and Slovene, but the methodology can be applied to any other language pair.

    Exploring the use of parallel corpora in the compilation of specialised bilingual dictionaries of technical terms: a case study of English and isiXhosa

    Text in English; abstracts in English, isiXhosa and Afrikaans. The Constitution of the Republic of South Africa, Act 108 of 1996, mandates the state to take practical and positive measures to elevate the status and the use of indigenous languages. The implementation of this pronouncement resulted in a growing demand for specialised translations in fields like technology, science, commerce, law and finance. The lack of terminology and resources such as specialised bilingual dictionaries in indigenous languages, particularly isiXhosa, remains a growing concern that hinders the translation and the intellectualisation of isiXhosa. A growing number of African scholars affirm the importance of specialised dictionaries in the African languages as tools for language and terminology development so that African languages can be used in the areas of science and technology. In the light of the background above, this study explored how parallel corpora can be interrogated using a bilingual concordancer, ParaConc, to extract bilingual terminology that can be used to create specialised bilingual dictionaries. A corpus-based approach was selected due to its speed, efficiency and accuracy in extracting bilingual terms in their immediate contexts. To enhance the research outcomes, Descriptive Translation Studies (DTS) and Corpus-based Translation Studies (CTS) were used in a complementary manner. Because the study is interdisciplinary, the function theories of lexicography that emphasise the function and needs of users were also applied. The analysis and extraction of bilingual terminology for dictionary making was successful through the use of the following ParaConc features, among others: frequencies, hot word lists, hot words, the search facility and concordances (Key Word in Context).
The findings revealed that the English-isiXhosa Parallel Corpus is a repository of translation equivalents and other information categories that can make specialised dictionaries more user-friendly and multifunctional. The frequency lists proved to be an effective method of selecting headwords for inclusion in a dictionary. The results also revealed the complex functions of bilingual concordances, in which information on collocations and multiword units, sense distinctions and usage examples could be easily identified, suggesting that this approach is more efficient than the traditional method. The study contributes to the knowledge on corpus-based lexicography, the standardisation of finance terminology, resource development and the making of user-friendly dictionaries that are tailor-made for the different needs of users.
Umgaqo-siseko weli loMzantsi Afrika ukhululele uRhulumente ukuba athabathe amanyathelo abonakalayo ekuphuhliseni nasekuphuculeni iilwimi zesiNtu. Esi sindululo sibangele ukwanda kokuguqulelwa kwamaxwebhu angezobuchwepheshe, inzululwazi, umthetho, ezemali noqoqosho angesiNgesi eguqulelwa kwiilwimi ebezifudula zingasiwe-so ezinjengesiXhosa. Ukunqongophala kwesigama kunye nezichazi-magama kube yingxaki enkulu ekuguquleleni ngakumbi izichazi-magama ezilwimi-mbini eziqulethe isigama esikhethekileyo. Iingcali ezininzi ziyangqinelana ukuba olu hlobo lwezi zichazi-magama luyimfuneko kuba ludlala iindima enkulu ekuphuhlisweni kweelwimi zesiNtu, ekuyileni isigama, nasekusetyenzisweni kwazo kumabakala obunzululwazi nobuchwepheshe. Olu phando ke luvavanya ukusetyenziswa kwekhophasi equlethe amaxwebhu esiNgesi neenguqulelo zawo zesiXhosa njengovimba wokudimbaza isigama sezemali esinokunceda ekuqulunqweni kwesichazi-magama esilwimi-mbini. 
Isizathu esibangele ukukhetha le ndlela yophando esebenzisa ikhompyutha kukuba iyakhawuleza, ulwazi oluthathwe kwikhophasi luchanekile, yaye isigama kwikhophasi singqamana ngqo nomxholo wamaxwebhu nto leyo eyenza kube lula ukufumana iintsingiselo nemizekelo ephilayo. Ukutyebisa olu phando indlela yekhophasi iye yaxhaswa zezinye iindlela zophando ezityunjiweyo: ufundo lwenguguqulelo oluchazayo (DTS) kunye neendlela zokuguqulela ezijoliswe kumsebenzi nakuhlobo lwabasebenzisi zinguqulelo ezo. Kanti ke ziqwalaselwe neenkqubo zophando lobhalo-zichazi-magama eziinjongo zokuqulunqa izichazi-magama ezesebenzisekayo neziluncedo kuninzi lwabasebenzisi zichazi-magama ngakumbi kwisizwe esisebenzisa iilwimi ezininzi. Ukuhlalutya nokudimbaza isigama kwikhophasi kolu phando kusetyenziswe isixhobo sekhompyutha esilungiselelwe ikhophasi enelwiimi ezimbini nangaphezulu ebizwa ngokuba yiParaConc. Iziphumo zolu phando zibonise mhlophe ukuba ikhophasi eneenguqulelo nguvimba weendidi ngendidi zamagama nolwazi olunokuphucula izichazi-magama zeli xesha. Kaloku abaguquleli basebenzise amaqhinga ngamaqhinga ukunika iinguqulelo bekhokelwa yimigomo nemithetho yoguqulelo enxuse abasebenzisi bamaxwebhu aguqulelweyo. Ubuchule beParaConc bokukwazi ukuhlela amagama ngokwendlela afumaneka ngayo kunye neenkcukacha zamanani budandalazise indlela eyiyo yokukhetha imichazwa enokungena kwisichazi-magama. Iziphumo zikwabonakalise iintlaninge yolwazi olufumaneka kwiKWIC, lwazi olo olungelula ukulufumana xa usebenzisa undlela-ndala wokwakha isichazi-magama. Esi sifundo esihlanganyele uGuqulelo olusekelwe kwiKhophasi noQulunqo-zichazi-magama zobuchwepheshe luya kuba negalelo elingathethekiyo kwindlela yokwakha izichazi-magama kwilwiimi zeSintu ngokubanzi nancakasana kwisiXhosa, nto leyo eya kothula umthwalo kubaqulunqi-zichazi-magama. 
Ukwakha nokuqulunqa izichazi-magama ezilwimi-mbini zezemali kuya kwandisa imithombo yesigama esinqongopheleyo kananjalo sivelise izichazi-magama eziluncedo kwisininzi sabantu.
Die Grondwet van die Republiek van Suid-Afrika, Wet 108 van 1996, gee aan die staat die mandaat om praktiese en positiewe maatreëls te tref om die status en gebruik van inheemse tale te verhoog. Die implementering van hierdie uitspraak het gelei tot ’n toenemende vraag na gespesialiseerde vertalings in domeine soos tegnologie, wetenskap, handel, regte en finansies. Die gebrek aan terminologie en hulpbronne soos gespesialiseerde woordeboeke in inheemse tale, veral Xhosa, wek toenemende kommer wat die vertaling en die intellektualisering van Xhosa belemmer. ’n Toenemende aantal vakkundiges in Afrika beklemtoon die belangrikheid van gespesialiseerde woordeboeke in die Afrikatale as instrumente vir taal- en terminologie-ontwikkeling sodat Afrikatale gebruik kan word in die areas van wetenskap en tegnologie. In die lig van die voorafgaande agtergrond het hierdie studie ondersoek ingestel na hoe parallelle korpora deursoek kan word deur ’n tweetalige konkordanser (ParaConc) te gebruik om tweetalige terminologie te ontgin wat gebruik kan word in die ontwikkeling van tweetalige gespesialiseerde woordeboeke. ’n Korpusgebaseerde benadering is gekies vir die spoed, doeltreffendheid en akkuraatheid waarmee dit tweetalige terme uit hulle onmiddellike kontekste kan onttrek. Beskrywende Vertaalstudies (DTS) en Korpusgebaseerde Vertaalstudies (CTS) is op ’n aanvullende wyse gebruik om die navorsingsuitkomste te verbeter. Aangesien die studie interdissiplinêr is, is die funksieteorieë van leksikografie wat die funksie en behoeftes van gebruikers beklemtoon, ook toegepas. 
Die analise en ontginning van tweetalige terminologie om woordeboeke te ontwikkel was suksesvol deur, onder andere, gebruik te maak van die volgende ParaConc-eienskappe, naamlik, frekwensies, hotword-lyste, hot words, die soekfunksie en konkordansies (Sleutelwoord-in-Konteks). Die bevindings toon dat ’n Engels-Xhosa Parallelle Korpus ’n bron van vertaalekwivalente en ander inligtingskategorieë is wat gespesialiseerde woordeboeke meer gebruikersvriendelik en multifunksioneel kan maak. Die frekwensielyste is geïdentifiseer as ’n doeltreffende metode om hoofwoorde te selekteer wat opgeneem kan word in ’n woordeboek. Die bevindings het ook die komplekse funksies van tweetalige konkordansers ontknoop waar inligting oor kollokasies en veelvuldigewoord-eenhede, betekenisonderskeiding en gebruiksvoorbeelde maklik identifiseer kon word wat aandui dat hierdie metode doeltreffender is as die tradisionele metode. Die studie dra by tot die kennisveld van korpusgebaseerde leksikografie, standaardisering van finansiële terminologie, hulpbronontwikkeling en die ontwikkeling van gebruikersvriendelike woordeboeke wat doelgemaak is vir verskillende behoeftes van gebruikers.
Linguistics and Modern Languages. D. Litt. et Phil. (Linguistics (Translation Studies))

    Creación de datos multilingües para diversos enfoques basados en corpus en el ámbito de la traducción y la interpretación

    Accordingly, this research work aims at exploiting and developing new technologies and methods to better ascertain not only translators’ and interpreters’ needs, but also those of professionals and ordinary people in their daily tasks, such as corpora and terminology compilation and management. The main topics covered by this work relate to Computational Linguistics (CL), Natural Language Processing (NLP), Machine Translation (MT), Comparable Corpora, Distributional Similarity Measures (DSM), Terminology Extraction Tools (TET) and Terminology Management Tools (TMT). In particular, this work examines three main questions: 1) Is it possible to create a simpler and more user-friendly comparable corpora compilation tool? 2) How to identify the most suitable TMT and TET for a given translation or interpreting task? 3) How to automatically assess and measure the internal degree of relatedness in comparable corpora? This work is composed of thirteen peer-reviewed scientific publications, which are included in Appendix A, while the methodology used and the results obtained in these studies are summarised in the main body of this document. Doctoral thesis defence date: 22 November 2019. Corpora are playing an increasingly important role in our multilingual society. High-quality parallel corpora are a preferred resource in the language engineering and linguistics communities. Nevertheless, the lack of sufficient and up-to-date parallel corpora, especially for narrow domains and poorly resourced languages, is currently one of the major obstacles to further advancement across various areas like translation, language learning, and automatic and assisted translation. An alternative is the use of comparable corpora, which are easier and faster to compile. Corpora, in general, are extremely important for tasks like translation, extraction, inter-linguistic comparisons and discoveries, or even for building lexicographical resources.
Their objectivity, reusability, multiplicity and applicability of uses, easy handling and quick access to large volumes of data are just some of their advantages over other types of limited resources like thesauri or dictionaries. By way of example, new terms are coined on a daily basis and dictionaries cannot keep up with the rate of emergence of new terms.