
    Syntactic difficulties in translation

    Even though machine translation (MT) systems such as Google Translate and DeepL have improved significantly in recent years, the continuing rise in globalisation and linguistic diversity requires ever-increasing amounts of professional, error-free translation. One can imagine, for instance, that mistakes in medical leaflets can lead to disastrous consequences. Less catastrophic, but equally significant, is the lack of a consistent and creative style in MT output for literary genres. In such cases, a human translation is preferred. Translating a text is a complex procedure that involves a variety of mental processes, such as understanding the original message and its context, finding a fitting translation, and verifying that the translation is grammatical, contextually sound, and generally adequate and acceptable. From an educational perspective, it would be helpful if the translation difficulty of a given text could be predicted, for instance to ensure that texts of objectively appropriate difficulty levels are used in exams and assignments for translators. Such predictions could also prove useful in the translation industry, for example to direct more difficult texts to more experienced translators. During this PhD project, my coauthors and I investigated which linguistic properties contribute to such difficulties. Specifically, we focused on syntactic differences between a source text and its translation, that is, their (dis)similarities in terms of linguistic structure. To this end, we developed new measures that quantify such differences and made the implementation publicly available for other researchers to use. These metrics include word (group) movement (how the word order in the original text differs from that in a given translation), changes in the linguistic properties of words, and a comparison of the underlying abstract structures of a sentence and its translation. Translation difficulty cannot be measured directly, but process information can help: keystroke logging and eye-tracking data can be recorded during translation and used as a proxy for the required cognitive effort. For example, the longer a translator looks at a word, the more time and effort they likely need to process it. We investigated the effect that specific measures of syntactic similarity have on these behavioural processing features, as an indication of their effect on translation difficulty. In short: how does the syntactic (dis)similarity between a source text and a possible translation affect the translation difficulty? In our experiments, we show that different syntactic properties indeed have an effect, and that differences in syntax between a source text and its translation affect the cognitive effort required to translate that text. These effects are not identical across syntactic properties, however, suggesting that individual syntactic properties affect the translation process in different ways and that not all syntactic dissimilarities contribute equally to translation difficulty.
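    As an illustration of the second family of metrics (changes in the linguistic properties of words), here is a minimal Python sketch; the function, the label inventory, and the toy alignment are invented for illustration and this is not the publicly released implementation.

```python
# Toy sketch: fraction of aligned word pairs whose linguistic label
# (e.g. POS tag) changes between source and translation.

def label_change_ratio(src_tags, tgt_tags, alignment):
    """src_tags/tgt_tags: one label per token; alignment: (src, tgt) links."""
    if not alignment:
        return 0.0
    changed = sum(1 for i, j in alignment if src_tags[i] != tgt_tags[j])
    return changed / len(alignment)

# An aligned verb translated as a noun counts as one label change.
src = ["NOUN", "VERB", "ADJ"]
tgt = ["NOUN", "NOUN", "ADJ"]
print(label_change_ratio(src, tgt, [(0, 0), (1, 1), (2, 2)]))  # 0.333...
```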

    Towards a better integration of fuzzy matches in neural machine translation through data augmentation

    We identify a number of aspects that can boost the performance of Neural Fuzzy Repair (NFR), an easy-to-implement method to integrate translation memory matches and neural machine translation (NMT). We explore various ways of maximising the added value of retrieved matches within the NFR paradigm for eight language combinations, using Transformer NMT systems. In particular, we test the impact of different fuzzy matching techniques, sub-word-level segmentation methods and alignment-based features on overall translation quality. Furthermore, we propose a fuzzy match combination technique that aims to maximise the coverage of source words. This is supplemented with an analysis of how translation quality is affected by input sentence length and fuzzy match score. The results show that applying a combination of the tested modifications leads to a significant increase in estimated translation quality over all baselines for all language combinations.
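    For readers unfamiliar with the NFR setup, the sketch below shows the core idea of augmenting an NMT input with a retrieved translation memory match; the similarity function, threshold, and separator token are illustrative choices, not necessarily those used in the paper.

```python
# Minimal sketch of fuzzy-match augmentation: find the translation memory
# (TM) entry whose source side is most similar to the input sentence and
# append its target side to the NMT input.
from difflib import SequenceMatcher

def best_fuzzy_match(src, tm, threshold=0.5):
    """Return the target side of the closest TM entry, or None."""
    best_score, best_tgt = 0.0, None
    for tm_src, tm_tgt in tm:
        score = SequenceMatcher(None, src.split(), tm_src.split()).ratio()
        if score > best_score:
            best_score, best_tgt = score, tm_tgt
    return best_tgt if best_score >= threshold else None

def augment(src, tm, sep=" ||| "):
    """Build the augmented input: source, separator, fuzzy-match target."""
    match = best_fuzzy_match(src, tm)
    return src + sep + match if match else src

tm = [("the cat sleeps on the mat", "de kat slaapt op de mat")]
print(augment("the cat sleeps on the sofa", tm))
# -> the cat sleeps on the sofa ||| de kat slaapt op de mat
```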

    Improved treebank querying: a facelift for GrETEL

    We describe the improvements to the interface of GrETEL, an online tool for querying treebanks. We demonstrate how we used the results of two usability tests and individual user feedback to create a more user-friendly interface that meets the users' needs.

    Querying large treebanks: benchmarking GrETEL indexing

    The amount of data available for research grows rapidly, yet the technology to efficiently interpret and mine these data lags behind. For instance, when using large treebanks for linguistic research, query speed leaves much to be desired. GrETEL Indexing, or GrInding, tackles this issue. The idea behind GrInding is to make the search space as small as possible before actually starting the treebank search, by pre-processing the treebank at hand. We recursively divide the treebank into smaller parts, called subtree-banks, which are then converted into database files. All subtree-banks are organized according to their linguistic dependency pattern, and labeled as such. Additionally, general patterns are linked to more specific ones. By doing so, we create millions of databases, and given a linguistic structure we know in which databases that structure can occur, leading to a significant efficiency boost. We present the results of a benchmark experiment testing the effect of the GrInding procedure on the SoNaR-500 treebank.
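    The toy Python sketch below illustrates the indexing idea under simplified assumptions (trees as nested tuples, an in-memory dictionary instead of database files); it is not the actual GrInding implementation.

```python
# Toy sketch: index every subtree of a treebank by its local dependency
# pattern (root label plus children's labels), so a query only needs to
# search the buckets whose pattern can contain it.
from collections import defaultdict

def subtree_patterns(tree):
    """Yield (pattern, subtree) pairs for a tree of (label, children)."""
    label, children = tree
    yield (label, tuple(c[0] for c in children)), tree
    for child in children:
        yield from subtree_patterns(child)

index = defaultdict(list)  # pattern -> "subtree-bank"
sentence = ("smain", [("np", [("det", []), ("n", [])]), ("v", [])])
for pattern, subtree in subtree_patterns(sentence):
    index[pattern].append(subtree)

# A query for an NP consisting of a determiner and a noun touches one
# small bucket instead of the whole treebank:
print(index[("np", ("det", "n"))])
```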

    Predicting syntactic equivalence between source and target sentences

    The translation difficulty of a text is influenced by many different factors. Some of these are specific to the source text and related to readability, while others more directly involve translation and the relation between the source and the target text. One such factor is syntactic equivalence, which can be calculated on the basis of a source sentence and its translation. When the expected syntactic form of the target sentence is dissimilar to that of its source, translating the source sentence proves more difficult for a translator. The degree of syntactic equivalence between a word-aligned source and target sentence can be derived from the crossing alignment links, averaged by the number of alignments, either at word or at sequence level. However, when predicting the translatability of a source sentence, its translation is not available. Therefore, we train machine learning systems on a parallel English-Dutch corpus to predict the expected syntactic equivalence of an English source sentence without having access to its Dutch translation. We use traditional machine learning systems (Random Forest Regression and Support Vector Regression) combined with syntactic sentence-level features, as well as recurrent neural networks that utilise word embeddings and accurate morpho-syntactic features.
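    A minimal sketch of the word-level crossing measure described above, assuming alignment links given as (source index, target index) pairs; the exact normalisation used in the paper may differ.

```python
# Count pairs of word-alignment links that cross and average by the
# number of links: a rough indicator of syntactic (non-)equivalence.
from itertools import combinations

def cross_value(alignment):
    """alignment: list of (src_index, tgt_index) word-alignment links."""
    if not alignment:
        return 0.0
    crossings = sum(
        1
        for (s1, t1), (s2, t2) in combinations(alignment, 2)
        if (s1 - s2) * (t1 - t2) < 0  # links cross when the orders disagree
    )
    return crossings / len(alignment)

print(cross_value([(0, 0), (1, 1), (2, 2)]))  # monotone order -> 0.0
print(cross_value([(0, 2), (1, 1), (2, 0)]))  # fully inverted -> 1.0
```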

    LT3 at SemEval-2020 Task 7: comparing feature-based and transformer-based approaches to detect funny headlines

    This paper presents two different systems for SemEval shared task 7 on Assessing Humor in Edited News Headlines, sub-task 1, where the aim was to estimate the intensity of humor generated in edited headlines. Our first system is a feature-based machine learning system that combines different types of information (e.g. word embeddings, string similarity, part-of-speech tags, perplexity scores, named entity recognition) in a Nu Support Vector Regressor (NuSVR). The second system is a deep learning-based approach that uses the pre-trained language model RoBERTa to learn latent features in the news headlines that are useful for predicting the funniness of each headline. The latter system was our final submission to the competition and ranked seventh among the 49 participating teams, with a root-mean-square error (RMSE) of 0.5253.
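    As a sketch of the first system's setup, the snippet below fits scikit-learn's NuSVR on two invented toy features that stand in for the richer feature set named above; all values are fabricated for illustration.

```python
# Minimal NuSVR regression sketch: predict a funniness score from
# hand-crafted headline features (here two toy features).
import numpy as np
from sklearn.svm import NuSVR

# Hypothetical features per edited headline: [edit similarity, length]
X_train = np.array([[0.1, 8], [0.6, 12], [0.3, 5], [0.8, 10]])
y_train = np.array([0.4, 1.6, 0.9, 2.1])  # mean funniness ratings (0-3)

model = NuSVR(nu=0.5, C=1.0, kernel="rbf")
model.fit(X_train, y_train)
print(model.predict(np.array([[0.5, 9]])))  # predicted funniness
```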

    Treebank querying with GrETEL 3: bigger, faster, stronger

    We describe the new version of GrETEL (http://gretel.ccl.kuleuven.be/gretel3), an online tool that allows users to query treebanks by means of a natural language example (example-based search) or via a formal query (XPath search). The new release comprises an update to the interface and considerable improvements in the back-end search mechanism. The update of the front-end is based on user suggestions. In addition to an overall design update, major changes include a more intuitive query builder in the example-based search mode and a visualizer for syntax trees that is compatible with all modern browsers. Moreover, the results are presented to the user as soon as they are found, so users can browse the matching sentences before the treebank search is completed. We demonstrate that those changes considerably improve the query procedure. The update of the back-end mainly consists of optimizing the search algorithm for querying the (very) large SoNaR treebank. Querying this 500-million-word treebank was already possible in the previous version of GrETEL, but due to the complex search mechanism this often resulted in long query times or even a timeout before the search completed. The improved version of the search algorithm results in faster query times and more accurate search results, which greatly enhances the usability of the SoNaR treebank for linguistic research.
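    To give a flavour of the formal query mode, the snippet below runs an XPath query of the kind GrETEL accepts against a toy Alpino/LASSY-style tree, using lxml; the tree and the specific query are invented for illustration.

```python
# Find noun phrases that contain a determiner in an Alpino/LASSY-style
# syntax tree, where nodes carry @cat (category) and @rel (dependency
# relation) attributes.
from lxml import etree

tree = etree.fromstring(
    '<node cat="smain">'
    '<node cat="np" rel="su">'
    '<node rel="det" word="de"/><node rel="hd" word="kat"/>'
    '</node>'
    '<node rel="hd" word="slaapt"/>'
    '</node>'
)
for np in tree.xpath('//node[@cat="np" and node[@rel="det"]]'):
    print(np.get("rel"), [w.get("word") for w in np.xpath('.//node[@word]')])
# -> su ['de', 'kat']
```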

    LeConTra : a learner corpus of English-to-Dutch news translation

    We present LeConTra, a learner corpus consisting of English-to-Dutch news translations enriched with translation process data. Three students of a Master's programme in Translation were asked to translate 50 different English journalistic texts of approximately 250 tokens each. Because we also collected translation process data in the form of keystroke logging, our dataset can be used in different research strands, such as translation process research, learner corpus research, and corpus-based translation studies. Reference translations, without process data, are also included. The data has been manually segmented and tokenized, and manually aligned at both segment and word level, leading to a high-quality corpus with token-level process data. The data is freely accessible via the Translation Process Research DataBase, which underlines our commitment to distributing our dataset. The tool that was built for manual sentence segmentation and tokenization, Mantis, is also available as an open-source aid for data processing.
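    As one illustration of how such token-level process data can be analysed, the sketch below flags long inter-keystroke pauses, a common proxy for cognitive effort in translation process research; the log format and the threshold are hypothetical, not LeConTra's actual schema.

```python
# Toy keystroke log: (timestamp in ms, index of the token being produced).
keystrokes = [
    (0, 0), (120, 0), (240, 0), (1900, 1), (2010, 1), (2150, 1),
]

PAUSE_THRESHOLD_MS = 1000  # a commonly used cut-off; value is illustrative

# Pause preceding each keystroke, attributed to the upcoming token.
pauses = [
    (tok, t2 - t1)
    for (t1, _), (t2, tok) in zip(keystrokes, keystrokes[1:])
]
long_pauses = [(tok, d) for tok, d in pauses if d >= PAUSE_THRESHOLD_MS]
print(long_pauses)  # [(1, 1660)]: an effortful transition into token 1
```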

    Crystallization of thin polymer layers confined between two adsorbing walls

    Confined at the nanoscale, polymers crystallize much more slowly than in the bulk, and in some cases the formation of ordered structure is inhibited within extremely long experimental time scales. Here, we report on the thickness dependence of the conversion rate of the amorphous fraction of ultrathin films of poly(ethylene terephthalate) during isothermal cold crystallization. We present a new analytical method that assesses the impact of irreversible chain adsorption and permits disentangling finite-size and interfacial effects. From the µm range down to a few tens of nm, we observed an increase in crystallization time scaling with the inverse of the film thickness, which is a fingerprint of finite-size effects. Films thinner than 20 nm did not crystallize, even after prolonged annealing in the temperature range where the crystallization rate reaches its maximum value. Noticing that this threshold corresponds to the total thickness of the layer irreversibly adsorbed within our investigation time, we explain these findings by considering that chain adsorption increases the entropic barrier required for the formation of crystalline structures.
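    For concreteness, the reported finite-size scaling can be restated as a formula; the notation below is assumed for illustration and does not appear in the abstract.

```latex
% h = film thickness, tau = isothermal crystallization time (notation assumed).
% Above the ~20 nm adsorbed-layer threshold, the increase in
% crystallization time scales with the inverse of the thickness:
\[
  \Delta\tau(h) \;=\; \tau(h) - \tau_{\mathrm{bulk}} \;\propto\; \frac{1}{h},
  \qquad h \gtrsim 20\,\mathrm{nm},
\]
% while thinner films did not crystallize within the experimental window.
```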