68 research outputs found

    Creation of the Estonian Emotional Speech Corpus and the perception of emotions

    The electronic version of the dissertation does not contain the publications. The aim of the dissertation was to create a theoretical basis for the Estonian Emotional Speech Corpus and to verify, on the basis of the corpus material, the validity of the theoretical positions. The study showed how important it is to plan a corpus carefully before creating it and to analyse the outcome. The knowledge gained can be applied both by emotion researchers and by developers of speech corpora. What makes the Estonian corpus unique among speech emotion corpora is that the emotion of each sentence is labelled according to whether the emotion is carried by the sound of the sentence alone or whether its recognition from the voice may be influenced by the verbal content of the sentence. This division makes it possible to study emotions both in speech and in writing. The Estonian Emotional Speech Corpus is one of the few speech corpora of elicited, moderately expressed emotions that is documented and publicly available free of charge. The texts recorded for the corpus were read aloud by a so-called ordinary person who was not told which emotion to use while reading. Since the emotions of the sentences in the corpus were determined by listeners in tests, questions of emotion perception are central to the work. The dissertation confirms that listeners can reliably recognize moderately expressed emotions in the voice of a non-professional reader. The results support the decision to have the emotions of the corpus sentences determined by Estonian native speakers over 30 years of age, as they decode the emotion of a message better than younger listeners. The results also showed that the understanding of emotions is culture-dependent. They did not confirm an important role for empathy in recognizing emotions from the voice, but they did show a difference between men and women in emotion recognition. The corpus now exists as theoretically designed and currently contains sentences from one female voice, classified as anger, joy, sadness and neutral (see http://peeter.eki.ee:5000). As the Estonian Emotional Speech Corpus is easily extendable, it will be developed further in line with new research directions.

    The aim of the thesis was to develop a theoretical base for the Estonian Emotional Speech Corpus and to test the validity of the theoretical starting-points on the Corpus material. The Corpus is now ready as designed (see http://peeter.eki.ee:5000). The results of the research reveal the importance of detailed planning and of the design elements of the Corpus. The theoretical starting-points of the study are relevant and applicable in real situations; these results could therefore be taken into consideration in the creation of other emotional speech corpora. What makes this Corpus unique among the other corpora of its kind is the fact that its sentences have different labels according to whether their emotion is carried just by the sound of the sentence or whether the recognition of their emotion from vocal expression may be influenced by the verbal-semantic content. This classification enables the research of emotions in speech as well as in writing. The Estonian Emotional Speech Corpus is one of the few freely available, documented corpora that contains moderately expressed emotions. The Corpus abandoned acted emotions because of their possible stereotypicality and overactedness. The sentences recorded for the Corpus were read out by a so-called ordinary person, who was not told what emotion to use while reading.
The Corpus contains 1,234 Estonian sentences that have passed both reading and listening tests. Test takers identified 908 sentences that expressed anger, joy, sadness, or were neutral. As the emotions of the sentences contained in the Corpus were determined by listeners, some issues of emotion perception came to the fore: 1) Is sentence emotion identifiable purely from vocal cues, without the speaker being seen? 2) Can age affect the identification of emotion? 3) Is the identification of emotion culturally bound? 4) Does identification depend on the listeners’ empathy? For the first question, the results confirmed the supposition that listeners can recognize the moderate expression of non-acted emotions from the voice of a non-professional reader without seeing the speaker. The results also support the decision that the emotions of the sentences in the Estonian Emotional Speech Corpus should be determined by Estonian adults aged over 30 who speak Estonian as their native language, because they are more likely to have acquired the skills for decoding the culture-specific expression of emotions. Furthermore, the results imply that the understanding of emotions depends on cultural factors and social interactions, including the social norms specific to one culture; the interpretation of emotional messages is therefore learned in the course of social interaction. The research has shown that, in the recognition of emotion from vocal cues, empathy is less important than clinical results would suggest. In conducting emotion studies for speech technological purposes, it is therefore unnecessary to exclude non-empathic people from the pool of testers on the grounds that they may not recognize the emotions expressed, provided their low empathy level is not due to mental or developmental disorders. The Corpus continues to be developed according to the requirements of new research directions. As the Corpus is publicly available and accessible free of charge, its data can be used for tackling different research challenges.
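    The labelling scheme described above lends itself to a simple record-per-sentence representation. The following sketch is illustrative only: the field names, the agreement field, and the filtering helper are assumptions made for the example, not the actual schema or API of the Estonian Emotional Speech Corpus.

    from dataclasses import dataclass
    from typing import List

    # Hypothetical record for one corpus sentence; field names are illustrative,
    # not the real schema of the Estonian Emotional Speech Corpus.
    @dataclass
    class CorpusSentence:
        text: str                  # the read-aloud Estonian sentence
        emotion: str               # "anger", "joy", "sadness" or "neutral", as decided by listening tests
        vocal_only: bool           # True if the emotion is carried by the sound alone,
                                   # False if the verbal content may also cue the emotion
        listener_agreement: float  # share of test listeners who chose the winning label

    def vocal_only_subset(sentences: List[CorpusSentence], emotion: str,
                          min_agreement: float = 0.5) -> List[CorpusSentence]:
        """Select sentences whose emotion was recognized from the voice alone."""
        return [s for s in sentences
                if s.emotion == emotion and s.vocal_only and s.listener_agreement >= min_agreement]

    # Example: pick reliably identified 'anger' sentences for a prosody study.
    corpus = [
        CorpusSentence("Ma ei taha sellest enam kuulda!", "anger", True, 0.82),
        CorpusSentence("Homme sajab jälle vihma.", "sadness", False, 0.61),
    ]
    print(vocal_only_subset(corpus, "anger"))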

    Rendering multilingualism through audio subtitles : shaping a categorisation for aural strategies

    Multilingualism in films has increased in recent productions as a reflection of today's globalised world. Different translation transfer modes such as dubbing or subtitling are combined to maintain the film's multilingual essence when translated into other languages. Within media accessibility, audio subtitles, an aurally rendered version of written subtitles, are used to make access possible for audiences with vision or reading difficulties. Taking Sternberg's representation of polylingualism (1981. Polylingualism as reality and translation as mimesis. Poetics Today, 2(4), 221-239), this article offers a categorisation of the strategies that may be used to reveal multilingualism in audiovisual content through audio subtitles, similar to the way Szarkowska, Zbikowska, & Krejtz (2013. Subtitling for the deaf and hard of hearing in multilingual films. International Journal of Multilingualism, 10(3), 292-312) did with subtitles for the deaf and the hard of hearing. Taking a descriptive approach, two main strategies or effects for the delivery of audio subtitles - dubbing and voice-over - are highlighted and explained. By combining these two effects with the information provided by the audio description, the levels of the categorisation are defined from more to less multilingualism-revealing: vehicular matching, selective reproduction, verbal transposition, explicit attribution and homogenising convention.
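    The five levels of the categorisation form an ordered scale, which could be encoded along the lines sketched below; the numeric ranking and the helper function are illustrative assumptions, not part of the article.

    from enum import IntEnum

    # The five levels named in the article, ordered from most to least
    # multilingualism-revealing; the numeric values are an illustrative choice.
    class AudioSubtitleStrategy(IntEnum):
        VEHICULAR_MATCHING = 5
        SELECTIVE_REPRODUCTION = 4
        VERBAL_TRANSPOSITION = 3
        EXPLICIT_ATTRIBUTION = 2
        HOMOGENISING_CONVENTION = 1

    def more_revealing(a: AudioSubtitleStrategy, b: AudioSubtitleStrategy) -> AudioSubtitleStrategy:
        """Return whichever of two strategies reveals more of the source multilingualism."""
        return a if a >= b else b

    print(more_revealing(AudioSubtitleStrategy.VERBAL_TRANSPOSITION,
                         AudioSubtitleStrategy.EXPLICIT_ATTRIBUTION))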

    Design and evaluation of mobile computer-assisted pronunciation training tools for second language learning

    The quality of speech technology (automatic speech recognition, ASR, and text-to-speech, TTS) has considerably improved and, consequently, an increasing number of computer-assisted pronunciation training (CAPT) tools include it. However, pronunciation is one area of teaching that has not been developed enough, since there is scarce empirical evidence assessing the effectiveness of tools and games that include speech technology in the field of pronunciation training and teaching. This PhD thesis addresses the design and validation of an innovative CAPT system for smart devices for training second language (L2) pronunciation. In particular, it aims to improve learners' L2 pronunciation at the segmental level with a specific set of methodological choices, such as the connection between the learner's first and second language (L1–L2), minimal pairs, a training cycle of exposure–perception–production, individualistic and social approaches, and the inclusion of ASR and TTS technology. The experimental research conducted by applying these methodological choices with real users validates the efficiency of the CAPT prototypes developed for the four main experiments of this dissertation. Data are gathered automatically by the CAPT systems to give immediate, specific feedback to users and to analyse all results. The protocols, metrics, algorithms, and methods necessary to statistically analyse and discuss the results are also detailed. The two main L2s tested during the experimental procedure are American English and Spanish. The different CAPT prototypes designed and validated in this thesis, and the methodological choices that they implement, make it possible to measure accurately the relative pronunciation improvement of the individuals who trained with them. Raters' subjective scores and the CAPT system's objective scores show a strong correlation, which will be useful in the future for assessing large amounts of data while reducing human costs. The results also show intensive practice, supported by the significant number of activities carried out. In the case of the controlled experiments, students who worked with the CAPT tool achieved better pronunciation improvement values than their peers in the traditional in-classroom instruction group. In the case of the challenge-based CAPT learning game proposed, the most active players in the competition kept on playing until the end and achieved significant pronunciation improvement results. Departamento de Informática (Arquitectura y Tecnología de Computadores, Ciencias de la Computación e Inteligencia Artificial, Lenguajes y Sistemas Informáticos). Doctorado en Informática.
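    A central claim above is the strong correlation between raters' subjective scores and the CAPT system's objective scores. The minimal sketch below shows how such agreement can be checked; the scores and the threshold are invented for the example and are not the thesis's actual data or metrics.

    # Agreement check between human raters' pronunciation scores and a CAPT
    # system's automatic scores, on invented data (requires Python 3.10+).
    from statistics import correlation

    rater_scores = [3.5, 4.0, 2.5, 4.5, 3.0, 2.0, 4.8, 3.7]          # e.g. 1-5 rubric per learner
    capt_scores  = [0.62, 0.74, 0.41, 0.88, 0.55, 0.33, 0.91, 0.70]  # e.g. ASR-based 0-1 score

    r = correlation(rater_scores, capt_scores)
    print(f"Pearson r = {r:.2f}")
    if r > 0.7:
        print("Strong agreement: objective scoring could help assess large amounts of data.")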

    SeamlessM4T-Massively Multilingual & Multimodal Machine Translation

    What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages? While recent breakthroughs in text-based models have pushed machine translation coverage beyond 200 languages, unified speech-to-speech translation models have yet to achieve similar strides. More specifically, conventional speech-to-speech translation systems rely on cascaded systems that perform translation progressively, putting high-performing unified systems out of reach. To address these gaps, we introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-speech translation, text-to-text translation, and automatic speech recognition for up to 100 languages. To build this, we used 1 million hours of open speech audio data to learn self-supervised speech representations with w2v-BERT 2.0. Subsequently, we created a multimodal corpus of automatically aligned speech translations. Filtered and combined with human-labeled and pseudo-labeled data, we developed the first multilingual system capable of translating from and into English for both speech and text. On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation. Compared to strong cascaded models, SeamlessM4T improves the quality of into-English translation by 1.3 BLEU points in speech-to-text and by 2.6 ASR-BLEU points in speech-to-speech. Tested for robustness, our system performs better against background noises and speaker variations in speech-to-text tasks compared to the current SOTA model. Critically, we evaluated SeamlessM4T on gender bias and added toxicity to assess translation safety. Finally, all contributions in this work are open-sourced and accessible at https://github.com/facebookresearch/seamless_communication.
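    The BLEU and ASR-BLEU deltas reported above are corpus-level comparisons against a baseline system. The sketch below shows how such a delta is typically computed with the sacrebleu library on toy data; it is not the authors' evaluation pipeline, and the sentences are invented.

    # Corpus-level BLEU delta between a baseline and a candidate system,
    # computed with sacrebleu (pip install sacrebleu) on toy data.
    import sacrebleu

    references = [["The committee approved the budget on Tuesday."]]   # one reference stream
    baseline_out = ["The committee has approved budget on Tuesday."]   # baseline hypotheses
    candidate_out = ["The committee approved the budget on Tuesday."]  # candidate hypotheses

    baseline_bleu = sacrebleu.corpus_bleu(baseline_out, references).score
    candidate_bleu = sacrebleu.corpus_bleu(candidate_out, references).score
    print(f"baseline {baseline_bleu:.1f} -> candidate {candidate_bleu:.1f} "
          f"(+{candidate_bleu - baseline_bleu:.1f} BLEU)")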

    Visual Speech Synthesis using Dynamic Visemes and Deep Learning Architectures

    The aim of this work is to improve the naturalness of visual speech synthesis produced automatically from a linguistic input over existing methods. Firstly, the most important contribution is the investigation of the most suitable speech units for visual speech synthesis. We propose the use of dynamic visemes instead of phonemes or static visemes and find that dynamic visemes can generate better visual speech than either phone or static viseme units. Moreover, the best performance is obtained by a combined phoneme-dynamic viseme system. Secondly, we examine the most appropriate model between the hidden Markov model (HMM) and different deep learning models that include feedforward and recurrent structures consisting of one-to-one, many-to-one and many-to-many architectures. The results suggest that frame-by-frame synthesis from the deep learning approach outperforms state-based synthesis from HMM approaches, and that an encoder-decoder many-to-many architecture is better than the one-to-one and many-to-one architectures. Thirdly, we explore the importance of contextual features that include information at varying linguistic levels, from the frame level up to the utterance level. We found that frame-level information is the most valuable feature, as it is able to avoid discontinuities in the visual feature sequence and produces a smooth and realistic animation output. Fourthly, we found that the two most common objective measures, correlation and root mean square error, are not able to indicate the realism and naturalness of human-perceived quality. We introduce an alternative objective measure and show that the global variance is a better indicator of human perception of quality. Finally, we propose a novel method to convert a given text input and phoneme transcription into a dynamic viseme transcription in the case when a reference dynamic viseme sequence is not available. Subjective preference tests confirmed that our proposed method is able to produce animations that are statistically indistinguishable from animations produced using reference data.
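    Global variance, proposed above as a better objective indicator than correlation or RMSE, measures how much each visual-feature dimension varies over an utterance; over-smoothed synthesis typically shows reduced variance relative to natural motion. A minimal sketch under invented data and array shapes:

    # Global variance (GV) comparison between synthesised and natural visual-feature
    # trajectories; data and shapes are invented for illustration.
    import numpy as np

    def global_variance(features: np.ndarray) -> np.ndarray:
        """Per-dimension variance over time for an utterance of shape (frames, dims)."""
        return features.var(axis=0)

    rng = np.random.default_rng(0)
    natural = rng.normal(0.0, 1.0, size=(300, 20))                 # natural trajectories
    synthetic = 0.6 * natural + rng.normal(0.0, 0.1, (300, 20))    # over-smoothed output

    gv_ratio = global_variance(synthetic) / global_variance(natural)
    print(f"mean GV ratio (synth/natural): {gv_ratio.mean():.2f}")  # 1.0 would match natural variance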

    European Language Grid: An Overview

    With 24 official EU languages and many additional languages, multilingualism in Europe and an inclusive Digital Single Market can only be enabled through Language Technologies (LTs). European LT business is dominated by hundreds of SMEs and a few large players. Many are world-class, with technologies that outperform the global players. However, European LT business is also fragmented – by nation states, languages, verticals and sectors, significantly holding back its impact. The European Language Grid (ELG) project addresses this fragmentation by establishing the ELG as the primary platform for LT in Europe. The ELG is a scalable cloud platform, providing, in an easy-to-integrate way, access to hundreds of commercial and non-commercial LTs for all European languages, including running tools and services as well as data sets and resources. Once fully operational, it will enable the commercial and non-commercial European LT community to deposit and upload their technologies and data sets into the ELG, to deploy them through the grid, and to connect with other resources. The ELG will boost the Multilingual Digital Single Market towards a thriving European LT community, creating new jobs and opportunities. Furthermore, the ELG project organises two open calls for up to 20 pilot projects. It also sets up 32 national competence centres and the European LT Council for outreach and coordination purposes.

    Simple identification tools in FishBase

    Simple identification tools for fish species were included in the FishBase information system from its inception. Early tools made use of the relational model and characters like fin ray meristics. Soon pictures and drawings were added as a further help, similar to a field guide. Later came the computerization of existing dichotomous keys, again in combination with pictures and other information, and the ability to restrict possible species by country, area, or taxonomic group. Today, www.FishBase.org offers four different ways to identify species. This paper describes these tools with their advantages and disadvantages, and suggests various options for further development. It explores the possibility of a holistic and integrated computer-aided strategy.
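    As a rough illustration of the relational-model style of identification described above, the toy filter below narrows candidate species by a fin ray count and a country of occurrence; the records, field names and ranges are invented and do not reflect the actual FishBase schema or data.

    # Toy relational-style identification: filter candidate species by a meristic
    # character and by country of occurrence. Records are invented, not FishBase data.
    species = [
        {"name": "Species A", "dorsal_soft_rays": (10, 12), "countries": {"Philippines", "Indonesia"}},
        {"name": "Species B", "dorsal_soft_rays": (13, 15), "countries": {"Philippines"}},
        {"name": "Species C", "dorsal_soft_rays": (10, 11), "countries": {"Brazil"}},
    ]

    def identify(dorsal_ray_count: int, country: str):
        """Keep species whose recorded ray-count range contains the observation
        and which are reported from the given country."""
        return [s["name"] for s in species
                if s["dorsal_soft_rays"][0] <= dorsal_ray_count <= s["dorsal_soft_rays"][1]
                and country in s["countries"]]

    print(identify(11, "Philippines"))  # -> ['Species A']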

    The Pragmatic Particles enfin and écoute in French Film and TV Dialogue

    This thesis investigates the use of the pragmatic particles (PPs) 'enfin' and 'écoute' in French film dialogue, and their translations in British English subtitles. Using a corpus of nine films and eight episodes drawn from two television series – all released in the UK between 2005 and 2015, and equating to approximately twenty-two hours – the study identifies tokens across a much wider range of contexts than has previously been possible using traditional corpora. The main contribution is an analysis of PP functions. The results for 'enfin' show a different functional distribution of the particle from other corpora, with corrective 'enfin' occurring significantly less frequently. The relatively large number of tokens of performative and emotional (or affective) 'enfin' allows for an elaboration of these two categories, and a tendency is observed for 'enfin' to appear as an apparent disagreement mitigator in discussions between peers. With regard to 'écoute', it is argued that écoute1 functions as a face-threat mitigator in unequal relationships and écoute2 as a face-threatening act (FTA), although the particle is multifunctional and some tokens exhibit characteristics of both categories. Attention is given to combinations of 'enfin' and 'écoute' with other particles: while there is a clear tendency for disagreement-mitigating 'enfin' to co-occur with 'mais', and for the precision and restrictive subcategories of the corrective to co-occur with 'je veux dire', other previously documented combinations ('enfin bon' and 'ben écoute') do not occur frequently in the present corpus. The thesis also makes a significant contribution to the field of Audiovisual Translation (AVT). The English subtitles show high rates of omission for both particles, consistent with previous research, with disagreement-mitigating 'enfin' particularly vulnerable to omission. However, the analysis reveals a surprising pattern regarding 'écoute': a clear division of labour between ‘look’ (used to translate more confrontational tokens) and ‘listen’ (more conciliatory and socially distant). The study includes an experimental analysis of the subtitles relative to their character limits, demonstrating a potential new approach for researchers wishing to investigate the impact of various subtitling constraints. Arts and Humanities Research Council.
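    The analysis of subtitles relative to their character limits suggests the kind of constraint check sketched below; the per-line limit, the reading-speed ceiling and the example cue are assumptions made for illustration, not the thesis's actual parameters.

    # Check a subtitle cue against an assumed per-line character limit and an
    # assumed reading-speed ceiling; limits and the example cue are invented.
    MAX_CHARS_PER_LINE = 37   # assumed two-line professional limit
    MAX_CPS = 15.0            # assumed characters-per-second ceiling

    def check_subtitle(lines, duration_seconds):
        text = " ".join(lines)
        cps = len(text) / duration_seconds
        over_length = [l for l in lines if len(l) > MAX_CHARS_PER_LINE]
        return {"chars_per_second": round(cps, 1),
                "lines_over_limit": over_length,
                "within_constraints": not over_length and cps <= MAX_CPS}

    print(check_subtitle(["Look, I already told you,", "it's not that simple."], 2.4))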