1,137 research outputs found

    Corpus language input, corpus processes in learning, learner corpus product. Introduction


    Embedding Web-based Statistical Translation Models in Cross-Language Information Retrieval

    Although more and more language pairs are covered by machine translation services, many pairs still lack translation resources. Cross-language information retrieval (CLIR) is an application that needs translation functionality of a relatively low level of sophistication, since current models for information retrieval (IR) are still based on a bag-of-words representation. The Web provides a vast resource for the automatic construction of parallel corpora, which can be used to train statistical translation models automatically. The resulting translation models can be embedded in a retrieval model in several ways. In this paper, we investigate the problem of automatically mining parallel texts from the Web and different ways of integrating the translation models within the retrieval process. Our experiments on standard test collections for CLIR show that the Web-based translation models can surpass commercial MT systems in CLIR tasks. These results open the perspective of constructing a fully automatic query translation device for CLIR at a very low cost. Comment: 37 pages
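
    The idea of embedding a translation model in a bag-of-words retrieval model can be illustrated with a minimal sketch. It is not the method of the paper above: the translation table, its probabilities, and the function names `translate_query` and `score_document` are all hypothetical, and the scoring is a plain weighted term match rather than any particular retrieval model.

```python
from collections import defaultdict

# Hypothetical toy translation table P(target | source).
# The words and probabilities are invented for illustration only.
TRANSLATION_TABLE = {
    "maison": {"house": 0.7, "home": 0.3},
    "blanche": {"white": 0.9, "blank": 0.1},
}


def translate_query(source_terms, table, top_k=2):
    """Map a bag of source-language terms to a weighted bag of target terms."""
    weights = defaultdict(float)
    for term in source_terms:
        candidates = table.get(term, {})
        # Keep only the top_k most probable translations of each term.
        best = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
        for target, prob in best:
            weights[target] += prob
    return dict(weights)


def score_document(doc_terms, weighted_query):
    """Sum the query weight of every document token that matches a translated term."""
    return sum(weighted_query.get(token, 0.0) for token in doc_terms)


query = translate_query(["maison", "blanche"], TRANSLATION_TABLE)
documents = {
    "d1": ["the", "white", "house"],
    "d2": ["a", "blank", "page"],
}
# Rank document identifiers by descending score.
ranking = sorted(documents, key=lambda d: score_document(documents[d], query), reverse=True)
```

    Because word order is ignored, each query term can be translated independently, which is why a relatively simple word-level translation table is enough for this kind of sketch.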

    Using corpora in scientific and technical translation training: resources to identify conventionality and promote creativity


    Integrating controlled corpus data in the classroom: A case-study of English NPs for French students in specialised translation

    This paper looks at the alternation of two complex English noun phrases in scientific English, which poses a challenge to French students in the specialised translation classroom. Indeed, no such alternation is observed in French. Starting from a preliminary study of a first series of constructions, we seek confirmation for generalisations about the constructions' preferred context of occurrence in a new sample of highly frequent constructions. We then discuss how the results of those analyses can be integrated in the translation classroom, through a new online tool aimed at raising students' awareness of this contrastive problem and helping them choose one or the other construction according to a set of corpus-based clues.

    Exploring the use of parallel corpora in the compilation of specialised bilingual dictionaries of technical terms: a case study of English and isiXhosa

    Text in English. Abstracts in English, isiXhosa and Afrikaans.
    The Constitution of the Republic of South Africa, Act 108 of 1996, mandates the state to take practical and positive measures to elevate the status and use of indigenous languages. The implementation of this pronouncement resulted in a growing demand for specialised translations in fields such as technology, science, commerce, law and finance. The lack of terminology and of resources such as specialised bilingual dictionaries in indigenous languages, particularly isiXhosa, remains a growing concern that hinders the translation and the intellectualisation of isiXhosa. A growing number of African scholars affirm the importance of specialised dictionaries in the African languages as tools for language and terminology development, so that African languages can be used in the areas of science and technology. Against this background, this study explored how parallel corpora can be interrogated using a bilingual concordancer, ParaConc, to extract bilingual terminology that can be used to create specialised bilingual dictionaries. A corpus-based approach was selected for its speed, efficiency and accuracy in extracting bilingual terms in their immediate contexts. To enhance the research outcomes, Descriptive Translation Studies (DTS) and Corpus-based Translation Studies (CTS) were used in a complementary manner. Because the study is interdisciplinary, the function theories of lexicography, which emphasise the function and needs of users, were also applied. The analysis and extraction of bilingual terminology for dictionary making was successful through the use of ParaConc features including frequencies, hot word lists, hot words, the search facility and concordances (Key Word in Context), among others.
    The findings revealed that the English-isiXhosa Parallel Corpus is a repository of translation equivalents and other information categories that can make specialised dictionaries more user-friendly and multifunctional. The frequency lists proved to be an effective method of selecting headwords for inclusion in a dictionary. The results also unravelled the complex functions of bilingual concordances, in which information on collocations and multiword units, sense distinctions and usage examples could be easily identified, indicating that this approach is more efficient than the traditional method. The study contributes to knowledge on corpus-based lexicography, the standardisation of finance terminology, resource development and the making of user-friendly dictionaries that are tailor-made for the different needs of users.
    Linguistics and Modern Languages. D. Litt. et Phil. (Linguistics (Translation Studies))
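
    The two corpus operations named in the abstract above, frequency lists and Key Word in Context (KWIC) concordances, can be sketched in a few lines. This is a monolingual illustration under simplifying assumptions, not ParaConc or the study's procedure: the functions `tokenize`, `frequency_list` and `kwic` and the sample sentence are hypothetical, and tokenisation is a naive lowercase word split.

```python
import re
from collections import Counter


def tokenize(text):
    """Naive tokeniser: lowercase the text and keep runs of word characters."""
    return re.findall(r"\w+", text.lower())


def frequency_list(text, top_n=5):
    """Return the top_n most frequent tokens with their counts (headword candidates)."""
    return Counter(tokenize(text)).most_common(top_n)


def kwic(text, keyword, window=3):
    """Return (left context, keyword, right context) for each hit of keyword."""
    tokens = tokenize(text)
    target = keyword.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token == target:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits


# Invented sample text, used only to exercise the functions.
sample = "The bank approved the loan. The loan carries interest paid to the bank."
top_terms = frequency_list(sample, top_n=3)
loan_lines = kwic(sample, "loan", window=2)
```

    In a parallel-corpus setting, each hit would additionally be paired with its aligned target-language segment, which this sketch does not attempt.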

    Les eines computacionals i el disseny de corpus orals: un diàleg vigent [Computational tools and the design of oral corpora: an ongoing dialogue]

    The design of an oral corpus and the processes of registering, codifying and treating the materials in order to build a useful resource for linguistic analysis prompt numerous decisions regarding theory and methodology. This article focuses on those stages of corpus construction that are most clearly conditioned by the computational processing necessary to make the corpus functional. In order to match the initial expectations to the real possibilities of using the tool, each feature we intend to codify must be measured against the workload and the means required to do so. It is therefore essential to take into account the available possibilities of processing and exploitation, as they have a crucial impact on decisions regarding the corpus’s construction. Based on experience acquired in the construction of the ESLORA corpus, the article looks into some of the problems arising in the process of designing an oral corpus, such as the delicacy with which oral phenomena are represented, the segmentation of the discourse, the coexistence of different simultaneous tagging systems and the particularities of annotation in a bilingual or multilingual context.
    This study was financed by the Agencia Estatal de Investigación (AEI) ‘Spanish State Research Agency’ and by the Fondo Europeo de Desarrollo Regional (FEDER) (European Regional Development Fund) through the ESLORA+ project (FFI2017-86379-P). The authors are members of the research group Gramática del español ‘Spanish Grammar’ from the University of Santiago de Compostela, which has been awarded a grant for the Strengthening and Organisation of Research Groups with Potential for Growth by the Regional Government’s Education Department (ED431B 2017/39). The study has also benefited from the participation of the ESLORA project in the Red temática en estudios de Análisis del Discurso (FFI2017-90738-REDT).

    Creación de datos multilingües para diversos enfoques basados en corpus en el ámbito de la traducción y la interpretación [Creating multilingual data for various corpus-based approaches in translation and interpreting]

    Accordingly, this research work aims at exploiting and developing new technologies and methods to better ascertain not only translators’ and interpreters’ needs, but also those of professionals and ordinary people in their daily tasks, such as corpora and terminology compilation and management. The main topics covered by this work relate to Computational Linguistics (CL), Natural Language Processing (NLP), Machine Translation (MT), Comparable Corpora, Distributional Similarity Measures (DSM), Terminology Extraction Tools (TET) and Terminology Management Tools (TMT). In particular, this work examines three main questions: 1) Is it possible to create a simpler, user-friendly comparable corpora compilation tool? 2) How to identify the most suitable TMT and TET for a given translation or interpreting task? 3) How to automatically assess and measure the internal degree of relatedness in comparable corpora? This work is composed of thirteen peer-reviewed scientific publications, which are included in Appendix A, while the methodology used and the results obtained in these studies are summarised in the main body of this document.
    Doctoral thesis defence date: 22 November 2019.
    Corpora are playing an increasingly important role in our multilingual society. High-quality parallel corpora are a preferred resource in the language engineering and linguistics communities. Nevertheless, the lack of sufficient and up-to-date parallel corpora, especially for narrow domains and poorly resourced languages, is currently one of the major obstacles to further advancement across areas such as translation, language learning, and automatic and assisted translation. An alternative is the use of comparable corpora, which are easier and faster to compile. Corpora, in general, are extremely important for tasks such as translation, extraction, inter-linguistic comparisons and discoveries, or even for lexicographical resources. Their objectivity, reusability, multiplicity and applicability of uses, easy handling and quick access to large volumes of data are just some of their advantages over other types of limited resources such as thesauri or dictionaries. By way of example, new terms are coined on a daily basis, and dictionaries cannot keep up with the rate of emergence of new terms.
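
    The third question above, measuring the internal degree of relatedness in a comparable corpus, can be illustrated with a deliberately simple sketch. The abstract does not say which distributional similarity measures the thesis uses, so the choice of Jaccard vocabulary overlap here is an assumption, and the function names `jaccard` and `mean_pairwise_relatedness` and the toy documents are hypothetical.

```python
from itertools import combinations


def jaccard(tokens_a, tokens_b):
    """Jaccard overlap between the vocabularies of two tokenised documents."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 0.0  # Treat two empty documents as unrelated by convention.
    return len(set_a & set_b) / len(set_a | set_b)


def mean_pairwise_relatedness(documents):
    """Average Jaccard overlap over all document pairs in the corpus."""
    pairs = list(combinations(documents, 2))
    if not pairs:
        return 0.0  # Fewer than two documents: no pairs to compare.
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Invented toy corpus: two related documents and one outlier.
corpus = [
    ["corpus", "translation", "terminology"],
    ["translation", "terminology", "dictionary"],
    ["football", "league", "season"],
]
relatedness = mean_pairwise_relatedness(corpus)
```

    A low average flags a corpus whose documents share little vocabulary, and per-pair scores could be used to spot outlier documents before the corpus is used for terminology extraction.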