Eesti keele ühendverbide automaattuvastus lingvistiliste ja statistiliste meetoditega
Nowadays, applications that process human languages (including Estonian) are part of everyday life. However, computers are not yet able to understand every nuance of language. Machine translation is probably the best-known application of natural language processing. Occasionally, the worst failures of machine translation systems (e.g. Google Translate) are shared on social media. Most such failures occur when sequences longer than a single word are translated. For example, translation systems are not able to catch the correct meaning of the particle verb alt ("from under") minema ("to go"), "to get deceived", in the sentence Ta läks lepinguga alt, because the literal translation of the components of the expression is not correct. In order to improve the quality of machine translation systems and other useful applications, e.g. spam detection or question answering systems, such (idiomatic) multi-word expressions and their meanings must be reliably detected. The detection of multi-word expressions and their meanings is important in all languages, and therefore much research has been done in the field, especially for English. However, the suggested methods have not previously been applied to the detection of Estonian multi-word expressions. The dissertation fills that gap and applies well-known machine learning methods to detect one type of Estonian multi-word expression, the particle verb.
Based on large textual data, the thesis demonstrates that the traditional binary division of Estonian particle verbs into non-compositional (ainukordne, the meaning is not predictable from the meanings of the components) and compositional (korrapärane, the meaning is predictable from the meanings of the components) is not comprehensive enough. The research confirms the widely adopted view in computational linguistics that multi-word expressions form a continuum between compositional and non-compositional units. Moreover, it shows that, in addition to context, some linguistic features, e.g. the animacy and case of the subject and object, help computers to predict whether the meaning of a particle verb in a sentence is compositional or non-compositional. In addition, the research introduces novel resources for the Estonian language: the trained embeddings and the compositionality datasets created are available for future research.
https://www.ester.ee/record=b5252157~S
Unified Representation for Non-compositional and Compositional Expressions
Accurate processing of non-compositional language relies on generating good
representations for such expressions. In this work, we study the representation
of language non-compositionality by proposing a language model, PIER, that
builds on BART and can create semantically meaningful and contextually
appropriate representations for English potentially idiomatic expressions
(PIEs). PIEs are characterized by their non-compositionality and contextual
ambiguity in their literal and idiomatic interpretations. Via intrinsic
evaluation on embedding quality and extrinsic evaluation on PIE processing and
NLU tasks, we show that representations generated by PIER result in a 33% higher
homogeneity score for embedding clustering than BART, as well as 3.12% and 3.29%
gains in accuracy and sequence accuracy for PIE sense classification and span
detection compared to the state-of-the-art IE representation model, GIEA. These
gains are achieved without sacrificing PIER's performance on NLU tasks (+/- 1%
accuracy) compared to BART.
Comment: This work is accepted to EMNLP 2023 Findings
Learning to Live with Machine Translation
Rapid advancements in technologies of text and image generation have increasingly put the perceived autonomy of human creativity under threat. Even before ChatGPT and other large language models sent such anxieties into overdrive, literary critics were arguing for a hermeneutics of automatic writing and revisiting long-held assumptions about artistic originality. Few, however, gave much thought to these models' quirky cousins, a family branch that once ruled over the utopian dreams invested in AI: machine translation (MT). This essay reflects on why translation has been lost in all the recent talk about these models and offers a necessary corrective. It considers what a critical response to MT might look like when reframed around an understanding of current technologies and a vision of MT as potential collaborator rather than human replacement. First, it offers an overview of current neural-based MT and the theories of translation that underwrite it. It then uses literary texts as a limit case for surveying the technology's most visible gaps, providing a deep, qualitative analysis of Japanese literary texts machine translated into English. Finally, it takes a speculative turn and considers what "good enough" machine translation of a large corpus of world literature might be good for in a future of ubiquitous and ever more accessible MT. The results hint at more immediate ways that MT invites inquiry into the present conditions of world literature, but also into a future where the entanglement of human translation and agency with the material agency of the technology brings forth potentials in both
YouTube-ASL: A Large-Scale, Open-Domain American Sign Language-English Parallel Corpus
Machine learning for sign languages is bottlenecked by data. In this paper,
we present YouTube-ASL, a large-scale, open-domain corpus of American Sign
Language (ASL) videos and accompanying English captions drawn from YouTube.
With ~1000 hours of videos and >2500 unique signers, YouTube-ASL is ~3x as
large and has ~10x as many unique signers as the largest prior ASL dataset. We
train baseline models for ASL to English translation on YouTube-ASL and
evaluate them on How2Sign, where we achieve a new finetuned state of the art of
12.39 BLEU and, for the first time, report zero-shot results
Subtitling Humour from the Perspective of Relevance Theory: The Office in Traditional Chinese
Subtitling the scenes containing humorous utterances in cinematic-televisual productions encounters a myriad of challenges, because the subtitler has to face the technical constraints that characterise the professional subtitling environment and the cultural barriers when reproducing humorous utterances for viewers inhabiting another culture. Past studies tend to explore more limited humour-related areas, which means that a more comprehensive picture of this specialised field is missing. The current research investigates the subtitling of humour, drawing on the framework of relevance theory and the British sitcom The Office, translated from English dialogue into Traditional Chinese subtitles. This research enquires into whether or not relevance theory can explain the subtitling strategies activated to deal with various humorous utterances in the sitcom, and, if so, to what extent. The English-Chinese Corpus of The Office (ECCO), which contains sample texts, media files and annotations, has been constructed to perform an empirical study. To enrich the corpus with valuable annotations, a typology of humour has been developed based on the concept of frame, and a taxonomy of subtitling strategies has also been proposed. The quantitative analysis demonstrates that the principle of relevance is the main benchmark for the choice of a subtitling micro-strategy within any given macro-strategy. With the chi-square test, it further proves the existence of a statistically significant association between humour types/frames and subtitling strategies at the global level. The qualitative analysis shows that the principle of relevance can operate in a subtle way, in which the subtitler invests more cognitive efforts to enhance the acceptability of subtitles. It also develops three levels of mutual dependency between the two variables, from strong, weak to null, to classify different examples. 
Overall, this study improves our understanding of humour translation and can facilitate a change in the curricula of translator training
The computer comprehension of systematic metaphor
Digitisation of this thesis was sponsored by Arcadia Fund, a charitable fund of Lisbet Rausing and Peter Baldwin
To have done with theory? Baudrillard, or the literal confrontation with reality
Eluding the temptation to reinterpret Jean Baudrillard once more, this work started from the ambition to consider his thought in its irreducibility, that is, in a radically literal way. Literalness is a recurring though overlooked term in Baudrillard's oeuvre, and it is drawn from the direct concatenation of words in poetry or puns and other language games. It does not indicate a realist positivism but a principle that considers the metamorphoses and mutual alteration of things in their singularity without reducing them to a general equivalent (i.e. the meaning of words in a poem, which destroys its appearances).
Reapplying the idea to Baudrillard and finding other singular routes through his "passwords" is a way to short-circuit its reductio ad realitatem and reaffirm its challenge to the hegemony of global integration. Even in the literature dedicated to it, this exercise has been rarer than the "hermeneutical" one, where Baudrillard's oeuvre was taken as a discourse to be interpreted and explained (finding an equivalent for its singularity).
In plain polemic with any ideal of conformity between theory and reality (from which our present conformisms arguably derive, too), Baudrillard conceived thought not as something to be verified but as a series of hypotheses to be repeatedly radicalised; he often described it as a "spiral", a form which challenges the codification of things, including its own. Coherent with this, the thesis does not consider Baudrillard's work either a reflection or a prediction of reality but, instead, an out-and-out act, a precious singular object which, interrogated, "thinks" us and our current events "back".
In the second part, Baudrillard's hypotheses are taken further and measured in their capacity to challenge the reality of current events and phenomena. The thesis confronts the "hypocritical" position of critical thinking, which accepts the present principle of reality. It questions the interminability of our condition, where death seems thinkable only as a senseless interruption of the apparatus. It also confronts the solidarity between orthodox and alternative realities of the COVID pandemic and the Ukrainian invasion, searching for what is irreducible to the perfect osmosis of "virtual and factual".
Drawing equally from the convulsions of globalisation and the psychopathologies of academics, from DeLillo's fiction and Baudrillard's lesser-studied influences, this study evaluates the irreversibility of our system against the increasingly silent challenges of radical thought. It looks for what an increasingly pessimistic late Baudrillard called "rogue singularities": forms which, often outside the conventional realms one would expect to find them, constitute potential sources of the fragility of global power.
"To have done with theory" does not mean abandoning radical thought and, together with it, the singularity of humanity. It means, as the thesis concludes, the courage to leave conventional ideas of theory and listen to less audible voices which, at the heart of this "enormous conspiracy", whisper, as a mysterious lady in Mariupol did to Putin, "It's all not true! It's all for show!"
Working Styles of Student Translators in Revision and Post-editing: an Empirical-Experimental Study with Eye-tracking, Keylogging and Cue-based Retrospection
In today's translation profession, being skilful at revision (including self-revision and other-revision) and post-editing tasks is becoming essential for translators. The exploration of the working styles of student translators in the revision and post-editing processes is vital in helping us to understand the nature of these tasks, and may help in improving pedagogy. Drawing on theories from translation-related studies, cognitive psychology, and text comprehension and production, the aims of this research were to: (1) identify the basic types of reading and typing activity (physical activities) of student translators in the processes of revision and post-editing, and to measure statistically and compare the duration of these activities within and across tasks; (2) identify the underlying purposes (mental activities) behind each type of reading and typing activity; (3) categorise the basic types of working style of student translators and compare the frequency of use of each working style both within and across tasks; (4) identify the personal working styles of student translators in carrying out different tasks, and (5) identify the most efficient working style in each task.
Eighteen student translators from Durham University, with Chinese as L1 and English as L2, were invited to participate in the experiment. They were asked to translate, self-revise, other-revise and post-edit three comparable texts in Translog-II with the eye-tracking plugin activated. A cue-based retrospective interview was carried out after each session to collect the student translators' subjective and conscious data for qualitative analysis. The raw logging data were transformed into User Activity Data and were analysed both quantitatively and qualitatively.
This study identified seven types of reading and typing activity in the processes of self-revision, other-revision and post-editing. Three revision phases were defined and four types of working style were recognised. The student translators' personal working styles were compared in all three tasks. In addition, a tentative model of their cognitive processes in self-revision, other-revision and post-editing was developed, and the efficiency of the four working styles in each task was tested