605 research outputs found

    Eesti keele ühendverbide automaattuvastus lingvistiliste ja statistiliste meetoditega (Automatic detection of Estonian particle verbs with linguistic and statistical methods)

    Nowadays, applications that process human languages (including Estonian) are part of everyday life. However, computers are not yet able to understand every nuance of language. Machine translation is probably the best-known application of natural language processing. Occasionally, the worst failures of machine translation systems (e.g. Google Translate) are shared on social media. Most such cases arise when sequences longer than a single word are translated. For example, translation systems fail to capture the meaning of the particle verb alt minema (literally ‘from under’ + ‘to go’; idiomatically ‘to get deceived’) in the sentence Ta läks lepinguga alt, because a literal translation of the components of the expression is not correct. To improve the quality of machine translation systems and other useful applications, e.g. spam detection or question answering systems, such (idiomatic) multi-word expressions and their meanings must be detected reliably. The detection of multi-word expressions and their meaning is important in all languages, and much research has therefore been done in the field, especially for English. However, the suggested methods have not previously been applied to the detection of Estonian multi-word expressions. The dissertation fills that gap and applies well-known machine learning methods to detect one type of Estonian multi-word expression, the particle verb.
Based on large textual data, the thesis demonstrates that the traditional binary division of Estonian particle verbs into non-compositional (ainukordne, meaning is not predictable from the meaning of its components) and compositional (korrapärane, meaning is predictable from the meaning of its components) is not comprehensive enough. The research confirms the widely adopted view in computational linguistics that multi-word expressions form a continuum between compositional and non-compositional units. Moreover, it is shown that, in addition to context, some linguistic features, e.g. the animacy and case of the subject and object, help computers to predict whether the meaning of a particle verb in a sentence is compositional or non-compositional. In addition, the research introduces novel resources for the Estonian language: the trained embeddings and the compositionality datasets created are available for future research. https://www.ester.ee/record=b5252157~S

    Unified Representation for Non-compositional and Compositional Expressions

    Accurate processing of non-compositional language relies on generating good representations for such expressions. In this work, we study the representation of language non-compositionality by proposing a language model, PIER, that builds on BART and can create semantically meaningful and contextually appropriate representations for English potentially idiomatic expressions (PIEs). PIEs are characterized by their non-compositionality and contextual ambiguity in their literal and idiomatic interpretations. Via intrinsic evaluation on embedding quality and extrinsic evaluation on PIE processing and NLU tasks, we show that representations generated by PIER yield a 33% higher homogeneity score for embedding clustering than BART, as well as gains of 3.12% and 3.29% in accuracy and sequence accuracy for PIE sense classification and span detection compared to the state-of-the-art IE representation model, GIEA. These gains are achieved without sacrificing PIER's performance on NLU tasks (+/- 1% accuracy) compared to BART. Comment: This work is accepted to EMNLP 2023 Findings.

    YouTube-ASL: A Large-Scale, Open-Domain American Sign Language-English Parallel Corpus

    Machine learning for sign languages is bottlenecked by data. In this paper, we present YouTube-ASL, a large-scale, open-domain corpus of American Sign Language (ASL) videos and accompanying English captions drawn from YouTube. With ~1000 hours of videos and >2500 unique signers, YouTube-ASL is ~3x as large and has ~10x as many unique signers as the largest prior ASL dataset. We train baseline models for ASL to English translation on YouTube-ASL and evaluate them on How2Sign, where we achieve a new finetuned state of the art of 12.39 BLEU and, for the first time, report zero-shot results.

    Subtitling Humour from the Perspective of Relevance Theory: The Office in Traditional Chinese

    Subtitling the scenes containing humorous utterances in cinematic-televisual productions encounters a myriad of challenges, because the subtitler has to face the technical constraints that characterise the professional subtitling environment and the cultural barriers when reproducing humorous utterances for viewers inhabiting another culture. Past studies tend to explore more limited humour-related areas, which means that a more comprehensive picture of this specialised field is missing. The current research investigates the subtitling of humour, drawing on the framework of relevance theory and the British sitcom The Office, translated from English dialogue into Traditional Chinese subtitles. This research enquires into whether or not relevance theory can explain the subtitling strategies activated to deal with various humorous utterances in the sitcom, and, if so, to what extent. The English-Chinese Corpus of The Office (ECCO), which contains sample texts, media files and annotations, has been constructed to perform an empirical study. To enrich the corpus with valuable annotations, a typology of humour has been developed based on the concept of frame, and a taxonomy of subtitling strategies has also been proposed. The quantitative analysis demonstrates that the principle of relevance is the main benchmark for the choice of a subtitling micro-strategy within any given macro-strategy. With the chi-square test, it further proves the existence of a statistically significant association between humour types/frames and subtitling strategies at the global level. The qualitative analysis shows that the principle of relevance can operate in a subtle way, in which the subtitler invests more cognitive efforts to enhance the acceptability of subtitles. It also develops three levels of mutual dependency between the two variables, from strong, weak to null, to classify different examples. 
Overall, this study improves our understanding of humour translation and can facilitate a change in the curricula of translator training.

    To have done with theory? Baudrillard, or the literal confrontation with reality

    Eluding the temptation to reinterpret Jean Baudrillard once more, this work started from the ambition to consider his thought in its irreducibility, that is, in a radically literal way. Literalness is a recurring though overlooked term in Baudrillard’s oeuvre, and it is drawn from the direct concatenation of words in poetry or puns and other language games. It does not indicate a realist positivism but a principle that considers the metamorphoses and mutual alteration of things in their singularity without reducing them to a general equivalent (i.e. the meaning of words in a poem, which destroys its appearances). Reapplying the idea to Baudrillard and finding other singular routes through his “passwords” is a way to short-circuit its reductio ad realitatem and reaffirm its challenge to the hegemony of global integration. Even in the literature dedicated to it, this exercise has been rarer than the ‘hermeneutical’ one, where Baudrillard’s oeuvre was taken as a discourse to be interpreted and explained (finding an equivalent for its singularity). In open polemic with any ideal of conformity between theory and reality (from which our present conformisms arguably derive, too), Baudrillard conceived thought not as something to be verified but as a series of hypotheses to be repeatedly radicalised; he often described it as a “spiral”, a form which challenges the codification of things, including its own. Consistent with this, the thesis does not consider Baudrillard’s work either a reflection or a prediction of reality but, instead, an out-and-out act, a precious singular object which, interrogated, ‘thinks’ us and our current events ‘back’. In the second part, Baudrillard’s hypotheses are taken further and measured in their capacity to challenge the reality of current events and phenomena. The thesis confronts the ‘hypocritical’ position of critical thinking, which accepts the present principle of reality. 
It questions the interminability of our condition, where death seems thinkable only as a senseless interruption of the apparatus. It also confronts the solidarity between orthodox and alternative realities of the COVID pandemic and the Ukrainian invasion, searching for what is irreducible to the perfect osmosis of “virtual and factual”. Drawing equally from the convulsions of globalisation and the psychopathologies of academics, from DeLillo’s fiction and Baudrillard’s lesser-studied influences, this study evaluates the irreversibility of our system against the increasingly silent challenges of radical thought. It looks for what an increasingly pessimistic late Baudrillard called ‘rogue singularities’: forms which, often outside the conventional realms one would expect to find them, constitute potential sources of the fragility of global power. ‘To have done with theory’ does not mean abandoning radical thought and, together with it, the singularity of humanity. It means, as the thesis concludes, the courage to leave conventional ideas of theory and listen to less audible voices which, at the heart of this “enormous conspiracy”, whisper — as a mysterious lady in Mariupol did to Putin — “It’s all not true! It’s all for show!”

    Working Styles of Student Translators in Revision and Post-editing: an Empirical-Experimental Study with Eye-tracking, Keylogging and Cue-based Retrospection

    In today’s translation profession, being skilful at revision (including self-revision and other-revision) and post-editing tasks is becoming essential for translators. The exploration of the working styles of student translators in the revision and post-editing processes is vital in helping us to understand the nature of these tasks, and may help in improving pedagogy. Drawing on theories from translation-related studies, cognitive psychology, and text comprehension and production, the aims of this research were to: (1) identify the basic types of reading and typing activity (physical activities) of student translators in the processes of revision and post-editing, and to measure statistically and compare the duration of these activities within and across tasks; (2) identify the underlying purposes (mental activities) behind each type of reading and typing activity; (3) categorise the basic types of working style of student translators and compare the frequency of use of each working style both within and across tasks; (4) identify the personal working styles of student translators in carrying out different tasks, and (5) identify the most efficient working style in each task. Eighteen student translators from Durham University, with Chinese as L1 and English as L2, were invited to participate in the experiment. They were asked to translate, self-revise, other-revise and post-edit three comparable texts in Translog-II with the eye-tracking plugin activated. A cue-based retrospective interview was carried out after each session to collect the student translators’ subjective and conscious data for qualitative analysis. The raw logging data were transformed into User Activity Data and were analysed both quantitatively and qualitatively. This study identified seven types of reading and typing activity in the processes of self-revision, other-revision and post-editing. Three revision phases were defined and four types of working style were recognised. 
The student translators’ personal working styles were compared in all three tasks. In addition, a tentative model of their cognitive processes in self-revision, other-revision and post-editing was developed, and the efficiency of the four working styles in each task was tested.