
605 research outputs found

Eesti keele ühendverbide automaattuvastus lingvistiliste ja statistiliste meetoditega (Automatic detection of Estonian particle verbs using linguistic and statistical methods)

Author: Aedmaa Eleri
Publication date: 30/08/2019

Nowadays, applications that process human languages (including Estonian) are part of everyday life. However, computers are not yet able to understand every nuance of language. Machine translation is probably the best-known application of natural language processing. Occasionally, the worst failures of machine translation systems (e.g. Google Translate) are shared on social media. Most such cases occur when sequences longer than a single word are translated. For example, translation systems are not able to catch the correct meaning of the particle verb alt (‘from under’) minema (‘to go’), which means ‘to get deceived’, in the sentence Ta läks lepinguga alt, because a literal translation of the components of the expression is not correct. To improve the quality of machine translation systems and other useful applications, e.g. spam detection or question answering systems, such (idiomatic) multi-word expressions and their meanings must be detected reliably. The detection of multi-word expressions and their meaning is important in all languages, and therefore much research has been done in the field, especially for English. However, the suggested methods have not previously been applied to the detection of Estonian multi-word expressions. The dissertation fills that gap and applies well-known machine learning methods to detect one type of Estonian multi-word expression: the particle verb.
Based on large textual data, the thesis demonstrates that the traditional binary division of Estonian particle verbs into non-compositional (ainukordne, meaning is not predictable from the meaning of its components) and compositional (korrapärane, meaning is predictable from the meaning of its components) is not comprehensive enough. The research confirms the widely adopted view in computational linguistics that multi-word expressions form a continuum between compositional and non-compositional units. Moreover, it is shown that, in addition to context, some linguistic features, e.g. the animacy and cases of the subject and object, help computers to predict whether the meaning of a particle verb in a sentence is compositional or non-compositional. In addition, the research introduces novel resources for the Estonian language: the trained embeddings and the compositionality datasets created are available for future research.
https://www.ester.ee/record=b5252157~S

DSpace at Tartu University Library

Unified Representation for Non-compositional and Compositional Expressions

Author: Bhat Suma
Zeng Ziheng
Publication date: 29/10/2023

Accurate processing of non-compositional language relies on generating good representations for such expressions. In this work, we study the representation of language non-compositionality by proposing a language model, PIER, that builds on BART and can create semantically meaningful and contextually appropriate representations for English potentially idiomatic expressions (PIEs). PIEs are characterized by their non-compositionality and contextual ambiguity in their literal and idiomatic interpretations. Via intrinsic evaluation on embedding quality and extrinsic evaluation on PIE processing and NLU tasks, we show that representations generated by PIER result in a 33% higher homogeneity score for embedding clustering than BART, and in gains of 3.12% and 3.29% in accuracy and sequence accuracy for PIE sense classification and span detection, respectively, compared to the state-of-the-art IE representation model, GIEA. These gains are achieved without sacrificing PIER's performance on NLU tasks (+/- 1% accuracy) compared to BART.

Comment: This work is accepted to EMNLP 2023 Findings

arXiv.org e-Print Archive


Learning to Live with Machine Translation

Author: Long Hoyt
Publication venue: 'Project Muse'
Publication date: 08/06/2023

Rapid advancements in technologies of text and image generation have increasingly put the perceived autonomy of human creativity under threat. Even before ChatGPT and other large language models sent such anxieties into overdrive, literary critics were arguing for a hermeneutics of automatic writing and revisiting long-held assumptions about artistic originality. Few, however, gave much thought to these models' quirky cousin, a family branch that once ruled over the utopian dreams invested in AI: machine translation (MT). This essay reflects on why translation has been lost in all the recent talk about these models and offers a necessary corrective. It considers what a critical response to MT might look like when reframed around an understanding of current technologies and a vision of MT as potential collaborator rather than human replacement. First, it offers an overview of current neural-based MT and the theories of translation that underwrite it. It then uses literary texts as a limit case for surveying the technology's most visible gaps, providing a deep, qualitative analysis of Japanese literary texts machine translated into English. Finally, it takes a speculative turn and considers what "good enough" machine translation of a large corpus of world literature might be good for in a future of ubiquitous and ever more accessible MT. The results hint at more immediate ways that MT invites inquiry into the present conditions of world literature, but also at a future where the entanglement of human translation and agency with the material agency of the technology brings forth potentials in both

Knowledge UChicago

YouTube-ASL: A Large-Scale, Open-Domain American Sign Language-English Parallel Corpus

Author: Georg Manfred
Tanzer Garrett
Uthus David
Publication date: 26/06/2023

Machine learning for sign languages is bottlenecked by data. In this paper, we present YouTube-ASL, a large-scale, open-domain corpus of American Sign Language (ASL) videos and accompanying English captions drawn from YouTube. With ~1000 hours of videos and >2500 unique signers, YouTube-ASL is ~3x as large and has ~10x as many unique signers as the largest prior ASL dataset. We train baseline models for ASL to English translation on YouTube-ASL and evaluate them on How2Sign, where we achieve a new finetuned state of the art of 12.39 BLEU and, for the first time, report zero-shot results

arXiv.org e-Print Archive

Subtitling Humour from the Perspective of Relevance Theory: The Office in Traditional Chinese

Author: Pai Feng-shuo
Publication venue: UCL (University College London)
Publication date: 01/01/2017

Subtitling the scenes containing humorous utterances in cinematic-televisual productions encounters a myriad of challenges, because the subtitler has to face the technical constraints that characterise the professional subtitling environment and the cultural barriers when reproducing humorous utterances for viewers inhabiting another culture. Past studies tend to explore more limited humour-related areas, which means that a more comprehensive picture of this specialised field is missing. The current research investigates the subtitling of humour, drawing on the framework of relevance theory and the British sitcom The Office, translated from English dialogue into Traditional Chinese subtitles. This research enquires into whether or not relevance theory can explain the subtitling strategies activated to deal with various humorous utterances in the sitcom, and, if so, to what extent. The English-Chinese Corpus of The Office (ECCO), which contains sample texts, media files and annotations, has been constructed to perform an empirical study. To enrich the corpus with valuable annotations, a typology of humour has been developed based on the concept of frame, and a taxonomy of subtitling strategies has also been proposed. The quantitative analysis demonstrates that the principle of relevance is the main benchmark for the choice of a subtitling micro-strategy within any given macro-strategy. With the chi-square test, it further proves the existence of a statistically significant association between humour types/frames and subtitling strategies at the global level. The qualitative analysis shows that the principle of relevance can operate in a subtle way, in which the subtitler invests more cognitive efforts to enhance the acceptability of subtitles. It also develops three levels of mutual dependency between the two variables, from strong, weak to null, to classify different examples. 
Overall, this study improves our understanding of humour translation and can facilitate a change in the curricula of translator training

UCL Discovery


The computer comprehension of systematic metaphor

Author: Hutchings Richard Charles
Publication venue: University of Cambridge
Publication date: 01/01/1990

Digitisation of this thesis was sponsored by Arcadia Fund, a charitable fund of Lisbet Rausing and Peter Baldwin

Apollo (Cambridge)

OpenGrey Repository

To have done with theory? Baudrillard, or the literal confrontation with reality

Author: Zazzi Luca
Publication venue: The University of Edinburgh
Publication date: 13/11/2023

Eluding the temptation to reinterpret Jean Baudrillard once more, this work started from the ambition to consider his thought in its irreducibility, that is, in a radically literal way. Literalness is a recurring though overlooked term in Baudrillard’s oeuvre, and it is drawn from the direct concatenation of words in poetry or puns and other language games. It does not indicate a realist positivism but a principle that considers the metamorphoses and mutual alteration of things in their singularity without reducing them to a general equivalent (i.e. the meaning of words in a poem, which destroys its appearances). Reapplying the idea to Baudrillard and finding other singular routes through his “passwords” is a way to short-circuit its reductio ad realitatem and reaffirm its challenge to the hegemony of global integration. Even in the literature dedicated to it, this exercise has been rarer than the ‘hermeneutical’ one, where Baudrillard’s oeuvre was taken as a discourse to be interpreted and explained (finding an equivalent for its singularity). In plain polemic with any ideal of conformity between theory and reality (from which our present conformisms arguably derive, too), Baudrillard conceived thought not as something to be verified but as a series of hypotheses to be repeatedly radicalised; he often described it as a “spiral”, a form which challenges the codification of things, including its own. Coherent with this, the thesis does not consider Baudrillard’s work either a reflection or a prediction of reality but, instead, an out-and-out act, a precious singular object which, interrogated, ‘thinks’ us and our current events ‘back’. In the second part, Baudrillard’s hypotheses are taken further and measured in their capacity to challenge the reality of current events and phenomena. The thesis confronts the ‘hypocritical’ position of critical thinking, which accepts the present principle of reality.
It questions the interminability of our condition, where death seems thinkable only as a senseless interruption of the apparatus. It also confronts the solidarity between orthodox and alternative realities of the COVID pandemic and the Ukrainian invasion, searching for what is irreducible to the perfect osmosis of “virtual and factual”. Drawing equally from the convulsions of globalisation and the psychopathologies of academics, from DeLillo’s fiction and Baudrillard’s lesser-studied influences, this study evaluates the irreversibility of our system against the increasingly silent challenges of radical thought. It looks for what an increasingly pessimistic late Baudrillard called ‘rogue singularities’: forms which, often outside the conventional realms one would expect to find them, constitute potential sources of the fragility of global power. ‘To have done with theory’ does not mean abandoning radical thought and, together with it, the singularity of humanity. It means, as the thesis concludes, the courage to leave conventional ideas of theory and listen to less audible voices which, at the heart of this “enormous conspiracy”, whisper — as a mysterious lady in Mariupol did to Putin — “It’s all not true! It’s all for show!”

Edinburgh Research Archive

Working Styles of Student Translators in Revision and Post-editing: an Empirical-Experimental Study with Eye-tracking, Keylogging and Cue-based Retrospection

Author: Huang Jin
Publication date: 01/01/2016

In today’s translation profession, being skilful at revision (including self-revision and other-revision) and post-editing tasks is becoming essential for translators. The exploration of the working styles of student translators in the revision and post-editing processes is vital in helping us to understand the nature of these tasks, and may help in improving pedagogy. Drawing on theories from translation-related studies, cognitive psychology, and text comprehension and production, the aims of this research were to: (1) identify the basic types of reading and typing activity (physical activities) of student translators in the processes of revision and post-editing, and to measure statistically and compare the duration of these activities within and across tasks; (2) identify the underlying purposes (mental activities) behind each type of reading and typing activity; (3) categorise the basic types of working style of student translators and compare the frequency of use of each working style both within and across tasks; (4) identify the personal working styles of student translators in carrying out different tasks, and (5) identify the most efficient working style in each task. Eighteen student translators from Durham University, with Chinese as L1 and English as L2, were invited to participate in the experiment. They were asked to translate, self-revise, other-revise and post-edit three comparable texts in Translog-II with the eye-tracking plugin activated. A cue-based retrospective interview was carried out after each session to collect the student translators’ subjective and conscious data for qualitative analysis. The raw logging data were transformed into User Activity Data and were analysed both quantitatively and qualitatively. This study identified seven types of reading and typing activity in the processes of self-revision, other-revision and post-editing. Three revision phases were defined and four types of working style were recognised. 
The student translators’ personal working styles were compared in all three tasks. In addition, a tentative model of their cognitive processes in self-revision, other-revision and post-editing was developed, and the efficiency of the four working styles in each task was tested

Durham e-Theses