1,472 research outputs found

    Bilingual contexts from comparable corpora to mine for translations of collocations

    Proceedings of the 17th International Conference on Intelligent Text Processing and Computational Linguistics, CICLing 2016. Due to the limited availability of parallel data in many languages, we propose a methodology that benefits from comparable corpora to find translation equivalents for collocations (as a specific type of difficult-to-translate multi-word expressions). Finding translations is known to be more difficult for collocations than for words. We propose a method based on bilingual context extraction and build a word (distributional) representation model drawing on these bilingual contexts (bilingual English-Spanish contexts in our case). We show that the bilingual context construction is effective for the task of translation equivalent learning and that our method outperforms a simplified distributional similarity baseline in finding translation equivalents.

    Comparing collocations in translated and learner language

    This paper compares the use of collocations by Italian learners writing in and translating into English, conceptualising the two tasks as different modes of constrained language production and adopting Halverson’s (2017) Revised Gravitational Pull hypothesis as a theoretical model. A particular focus is placed on identifying a method for comparing datasets containing translations and essays, assembled opportunistically and varying in size and structure. The study shows that lexical association scores for dependency-defined word pairs are significantly higher in translations than in essays. A qualitative analysis of a subset of collocations shared by both modes and unique to either mode shows that the shared set features more collocations with direct cross-linguistic links (connectivity), and that the source/first language seems to affect both modes similarly. We tentatively conclude that second/target language salience effects are more visible in translation than in second language use, while connectivity and source language salience affect both modes of bilingual processing similarly, regardless of the mediation variable.

    Computational Phraseology light: automatic translation of multiword expressions without translation resources

    This paper describes the first phase of a project whose ultimate goal is the implementation of a practical tool to support the work of language learners and translators by automatically identifying multiword expressions (MWEs) and retrieving their translations for any pair of languages. The task of translating multiword expressions is viewed as a two-stage process. The first stage is the extraction of MWEs in each of the languages; the second stage is a matching procedure for the extracted MWEs in each language which proposes the translation equivalents. This project pursues the development of a knowledge-poor approach for any pair of languages which does not depend on translation resources such as dictionaries, translation memories or parallel corpora, which can be time-consuming to develop or difficult to acquire, being expensive or proprietary. In line with this philosophy, the methodology developed does not rely on any dictionaries or parallel corpora, nor does it use any (bilingual) grammars. The only information comes from comparable corpora, inexpensively compiled. The first proof-of-concept stage of this project covers English and Spanish and focuses on a particular subclass of MWEs: verb-noun expressions (collocations) such as take advantage, make sense, prestar atención and tener derecho. The choice of genre was determined by the fact that newswire is a widespread genre and available in different languages. An additional motivation was the fact that the methodology was developed as language independent with the objective of applying it to and testing it for different languages. The ACCURAT toolkit (Pinnis et al. 2012; Skadina et al. 2012; Su and Babych 2012a) was employed to compile the comparable corpora automatically, and only documents above a specific threshold were considered for inclusion. More specifically, only pairs of English and Spanish documents with a comparability score (cosine similarity) higher than 0.45 were extracted.
Statistical association measures were employed to quantify the strength of the relationship between two words and to propose that a combination of a verb and a noun above a specific threshold would be a (candidate for a) multiword expression. This study focused on and compared four popular and established measures along with frequency: Log-likelihood ratio, T-Score, Log Dice and Salience. This project follows the distributional similarity premise, which stipulates that translation equivalents share common words in their contexts and that this also applies to multiword expressions. The Vector Space Model is traditionally used to represent words with their co-occurrences and to measure similarity. The vector representation for any word is constructed from the statistics of the occurrences of that word with other specific/context words in a corpus of texts. In this study, the word2vec method (Mikolov et al. 2013) was employed. Mikolov et al.’s method utilises patterns of word co-occurrences within a small window to predict similarities among words. Evaluation results are reported for both extracting MWEs and their automatic translation. A finding of the evaluation worth mentioning is that the size of the comparable corpora is more important for the performance of automatic translation of MWEs than the similarity between them, as long as the comparable corpora used are of minimal similarity.
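Two of the association measures named in this abstract can be illustrated with a minimal sketch. It assumes the commonly cited definitions of the T-score and logDice, which are not necessarily the exact variants used in the project. The function names, the counts and the threshold of 10.0 are hypothetical and chosen only for illustration.

```python
import math


def t_score(f_xy, f_x, f_y, n):
    """T-score: (observed - expected) / sqrt(observed).

    f_xy: co-occurrence count of the verb-noun pair
    f_x, f_y: corpus frequencies of the verb and the noun
    n: corpus size in tokens
    """
    if f_xy <= 0:
        raise ValueError("the pair must co-occur at least once")
    expected = f_x * f_y / n  # count expected if the words were independent
    return (f_xy - expected) / math.sqrt(f_xy)


def log_dice(f_xy, f_x, f_y):
    """logDice: 14 + log2(2 * f_xy / (f_x + f_y)); the theoretical maximum is 14."""
    if f_xy <= 0:
        raise ValueError("the pair must co-occur at least once")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


def mwe_candidates(pair_counts, verb_counts, noun_counts, threshold):
    """Keep verb-noun pairs whose logDice score reaches the (assumed) threshold."""
    scored = []
    for (verb, noun), f_xy in pair_counts.items():
        score = log_dice(f_xy, verb_counts[verb], noun_counts[noun])
        if score >= threshold:
            scored.append(((verb, noun), score))
    # Strongest associations first
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical counts, for illustration only
pair_counts = {("take", "advantage"): 50, ("take", "table"): 2}
verb_counts = {"take": 100}
noun_counts = {"advantage": 100, "table": 300}

candidates = mwe_candidates(pair_counts, verb_counts, noun_counts, threshold=10.0)
```

With these toy counts, only the pair ("take", "advantage") passes the threshold. In practice the cut-off would have to be tuned per measure and per corpus.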

    Creating a bilingual dictionary of collocations: A learner-oriented approach

    Considering the lack of specialised dictionaries in certain fields, a creative way of teaching through corpora-based work was proposed in a seminar for master’s students of translation studies (University of Ljubljana, Slovenia). Since phraseology and terminology play an important role both in specialised translation and in the learning path of students of translation studies, this article presents an active approach aimed at creating an online lexicographic resource in languages for specific purposes by using the didactic tool and database ARTES (Aide à la Rédaction de TExtes Scientifiques/Dictionary-assisted writing tool for scientific communication) previously developed at the Université de Paris (France). About thirty Slovene students enrolled in the first year of master’s study have been participating in the bilateral project since 2018. The aims of such an activity are multiple: students learn in a practical way how to compile comparable corpora from the internet, using the online corpus software Sketch Engine, to find similar linguistic constructions in the source and target languages. They also learn to create an online bilingual phraseological and terminological dictionary to facilitate the translation of specialised texts. In this way, they acquire skills and develop some knowledge in translation, terminology, and discourse phraseology. The article first describes the ARTES online database. Then, we present the teaching methodology and the students’ work, which consists of compiling corpora, extracting and translating collocations for the language pair French-Slovene, and entering them in the ARTES database. Finally, we propose an analysis of the most frequent collocation structures in both languages. The language pair considered here is French and Slovene, but the methodology can be applied to any other language pair

    Partial Perception and Approximate Understanding

    This paper discusses the assumption that humans have a narrowed perception of the external world and that, as a result, the concepts used to portray it are basically approximate in nature. Apart from perceptual vagueness, other types of vagueness are also discussed, involving the nature of things, the indeterminacy of linguistic expressions and the psycho-sociological conditioning of discourse actions in one language and in translational contexts. The second part of the paper discusses the concept of conceptual and linguistic resemblance (similarity, equivalence) and discourse approximating strategies, and proposes a Resemblance Matrix presenting ways used to narrow the approximation gap between the interacting parties in monolingual and translational discourses.

    ELECTRONIC CORPORA IN TRANSLATION BOOTCAT-BOOTSTRAPPING CORPORA AND TERMS FROM THE WEB

    In the new world of technology, the translation profession, like other disciplines, cannot be deprived of modern tools such as electronic corpora. Recently, large monolingual, comparable and parallel corpora have played a crucial role in solving various problems of linguistics, including translation. In recent years, a large number of studies within the discipline of translation studies have focused on corpora and their applications in translation classes. Such studies mainly look into the kind of information trainee translators can elicit from corpora and the effect of using corpus data on the quality of translations produced. Corpora, however, have a lot more to offer to both translation teachers and translation students. Corpus-based translation classrooms, by their very nature, can offer considerable advantages far beyond what traditional translation classes have to offer. This article aims to elaborate on the advantages of using corpora in translation classrooms for teachers and students of translation. Furthermore, we present types of corpora and a new method of compiling specialized corpora: BootCaT. Keywords: BootCaT, bootstrapping

    Tradurre formule giuridiche attraverso i corpora (Translating legal formulae through corpora)

    Fixed lexical or syntactical expressions and formulae hallmark legal language. They serve both linguistic and legal purposes, and should be rendered accordingly in a target language and legal system. Most of the time, however, formulaic expressions are translated by resorting to calques, false cognates, or phrases that are uncommon in the target legal language (and legal system). This paper explores whether and how corpus analysis can dispel doubts and help find acceptable translation candidates. As there are currently no publicly available legal corpora addressing corporate documents such as contracts and agreements, this paper seeks to bridge this gap by building and relying on an ad hoc corpus of authentic agreements written in English as a first language according to the laws of England and Wales. In this way, corpus evidence can help find equivalents and, possibly, address recurrent mistranslations from Italian into English. During the corpus analysis process, the paper shows and discusses search queries and how equivalents can be obtained. At the same time, it questions dictionary entries. The paper's findings highlight that consulting the ad hoc corpus makes it possible to find acceptable translations of Italian legal formulae and address recurrent mistranslations. English formulaic expressions, in fact, can be rendered satisfactorily thanks to the possibility of noticing word usages in context, keywords in context and collocations. Further research can encompass a wider variety of formulae and/or legal documents so that scholars and translators can be equipped with useful reference tools.

    From corpus-based collocation frequencies to readability measure

    This paper provides a broad overview of three separate but related areas of research. Firstly, corpus linguistics is a growing discipline that applies analytical results from large language corpora to a wide variety of problems in linguistics and related disciplines. Secondly, readability research, as the name suggests, seeks to understand what makes texts more or less comprehensible to readers, and aims to apply this understanding to issues such as text rating and matching of texts to readers. Thirdly, collocation is a language feature that occurs when particular words are used frequently together for other than purely grammatical reasons. The intersection of these three aspects provides the basis for on-going research within the Department of Computer and Information Sciences at the University of Strathclyde and is the motivation for this overview. Specifically, we aim, through analysis of collocation frequencies in major corpora, to afford valuable insight into the content of texts, which we believe will, in turn, provide a novel basis for estimating text readability.

    Exploring the use of parallel corpora in the compilation of specialised bilingual dictionaries of technical terms: a case study of English and isiXhosa

    Text in English. Abstracts in English, isiXhosa and Afrikaans. The Constitution of the Republic of South Africa, Act 108 of 1996, mandates the state to take practical and positive measures to elevate the status and the use of indigenous languages. The implementation of this pronouncement resulted in a growing demand for specialised translations in fields like technology, science, commerce, law and finance. The lack of terminology and resources such as specialised bilingual dictionaries in indigenous languages, particularly isiXhosa, remains a growing concern that hinders the translation and the intellectualisation of isiXhosa. A growing number of African scholars affirm the importance of specialised dictionaries in the African languages as tools for language and terminology development so that African languages can be used in the areas of science and technology. In the light of the background above, this study explored how parallel corpora can be interrogated using a bilingual concordancer, ParaConc, to extract bilingual terminology that can be used to create specialised bilingual dictionaries. A corpus-based approach was selected due to its speed, efficiency and accuracy in extracting bilingual terms in their immediate contexts. In enhancing the research outcomes, Descriptive Translation Studies (DTS) and Corpus-based Translation Studies (CTS) were used in a complementary manner. Because the study is interdisciplinary, the function theories of lexicography that emphasise the function and needs of users were also applied. The analysis and extraction of bilingual terminology for dictionary making was successful through the use of the following ParaConc features, namely frequencies, hot word lists, hot words, search facility and concordances (Key Word in Context), among others.
The findings revealed that the English-isiXhosa Parallel Corpus is a repository of translation equivalents and other information categories that can make specialised dictionaries more user-friendly and multifunctional. The frequency lists were revealed as an effective method of selecting headwords for inclusion in a dictionary. The results also unraveled the complex functions of bilingual concordances, where information on collocations and multiword units, sense distinction and usage examples could be easily identified, showing that this approach is more efficient than the traditional method. The study contributes to the knowledge on corpus-based lexicography, standardisation of finance terminology, resource development and the making of user-friendly dictionaries that are tailor-made for the different needs of users.
    Linguistics and Modern Languages. D. Litt. et Phil. (Linguistics (Translation Studies))