
    Large-scale unsupervised audio pre-training for video-to-speech synthesis

    Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker. Most established approaches to date involve a two-step process, whereby an intermediate representation, such as a spectrogram, is first extracted from the video and then passed to a vocoder to produce the raw audio. Some recent work has focused on end-to-end synthesis, whereby the generation of raw audio and any intermediate representations is performed jointly. All such approaches involve training almost exclusively on audio-visual datasets, i.e., datasets in which every audio sample has a corresponding video sample. This precludes the use of abundant audio-only datasets which may not have a corresponding visual modality (e.g. audiobooks, radio podcasts, speech recognition datasets, etc.), as well as audio-only architectures that the audio machine learning community has developed over the years. In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24 kHz, and then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task. The pre-training step uses audio samples only and does not require labels or corresponding samples from other modalities (visual, text). We demonstrate that this pre-training step improves the reconstructed speech, and that it is an unexplored way to improve the quality of the generator in a cross-modal task while requiring samples from only one of the modalities. We conduct experiments using both raw audio and mel spectrograms as target outputs and benchmark our models against existing work.
    Comment: Submitted to IEE
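
    A minimal sketch of this two-step transfer, in PyTorch, is given below. It illustrates the idea rather than the paper's actual architecture: the layer shapes, the module names (AudioEncoder, AudioDecoder, VideoToSpeech), the LSTM video encoder and the L1 reconstruction objective are all assumptions made for the example.

    import torch
    import torch.nn as nn

    class AudioEncoder(nn.Module):
        """Strided 1-D convolutions that downsample a raw waveform into latent frames."""
        def __init__(self, hidden=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(1, hidden, kernel_size=16, stride=8, padding=4),
                nn.ReLU(),
                nn.Conv1d(hidden, hidden, kernel_size=16, stride=8, padding=4),
            )

        def forward(self, wav):               # wav: (batch, 1, samples)
            return self.net(wav)              # -> (batch, hidden, frames)

    class AudioDecoder(nn.Module):
        """Transposed convolutions that upsample latent frames back to a waveform."""
        def __init__(self, hidden=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.ConvTranspose1d(hidden, hidden, kernel_size=16, stride=8, padding=4),
                nn.ReLU(),
                nn.ConvTranspose1d(hidden, 1, kernel_size=16, stride=8, padding=4),
                nn.Tanh(),
            )

        def forward(self, z):                 # z: (batch, hidden, frames)
            return self.net(z)                # -> (batch, 1, samples)

    # Step 1: unsupervised pre-training on audio-only data (no labels, no video).
    encoder, decoder = AudioEncoder(), AudioDecoder()
    opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)
    wav = torch.randn(4, 1, 24000)            # stand-in batch: one second of 24 kHz audio
    recon = decoder(encoder(wav))
    loss = nn.functional.l1_loss(recon, wav)  # simple waveform reconstruction objective
    loss.backward()
    opt.step()
    torch.save(decoder.state_dict(), "pretrained_decoder.pt")

    # Step 2: transplant the pre-trained decoder into a video-to-speech model.
    class VideoToSpeech(nn.Module):
        def __init__(self, hidden=256):
            super().__init__()
            self.video_encoder = nn.LSTM(512, hidden, batch_first=True)  # placeholder video encoder
            self.audio_decoder = AudioDecoder(hidden)

        def forward(self, video_feats):       # video_feats: (batch, frames, 512)
            h, _ = self.video_encoder(video_feats)
            return self.audio_decoder(h.transpose(1, 2))

    v2s = VideoToSpeech()
    v2s.audio_decoder.load_state_dict(torch.load("pretrained_decoder.pt"))

    From here the full model would be trained on the audio-visual data, with the transplanted decoder fine-tuned or kept frozen; the point of the recipe is that Step 1 requires nothing but audio.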

    Building and Designing Expressive Speech Synthesis

    We know there is something special about speech. Our voices are not just a means of communicating; they also give a deep impression of who we are and what we might know. They can betray our upbringing, our emotional state, our state of health. They can be used to persuade and convince, to calm and to excite. As speech systems enter the social domain they are required to interact, support and mediate our social relationships with 1) each other, 2) digital information, and, increasingly, 3) AI-based algorithms and processes. Socially Interactive Agents (SIAs) are at the forefront of research and innovation in this area. There is an assumption that in the future “spoken language will provide a natural conversational interface between human beings and so-called intelligent systems” [Moore 2017, p. 283]. A considerable amount of previous research has tested this assumption, with mixed results. However, as has been pointed out, “voice interfaces have become notorious for fostering frustration and failure” [Nass and Brave 2005, p. 6].

    It is within this context, between our exceptional and intelligent human use of speech to communicate and interact with other humans, and our desire to leverage this means of communication for artificial systems, that the technology often termed expressive speech synthesis uncomfortably falls. Uncomfortably, because it is often overshadowed by issues in interactivity and the underlying intelligence of the system, something that emerges from the interaction of many of the components in a SIA. This is especially true of what we might term conversational speech, where decoupling how things are spoken from when and to whom they are spoken can seem an impossible task. The challenge is even greater in evaluation and in characterising full systems which have made use of expressive speech.

    Furthermore, when designing an interaction with a SIA, we must consider not only how SIAs should speak but also how much, and whether they should speak at all. These considerations cannot be ignored. Any speech synthesis used in the context of an artificial agent will have a perceived accent, a vocal style, an underlying emotion and an intonational model. Dimensions like accent and personality (cross-speaker parameters), as well as vocal style, emotion and intonation during an interaction (within-speaker parameters), need to be built into the design of a synthetic voice. Even a default or neutral voice has to consider these same expressive speech synthesis components. Such design parameters have a strong influence on how effectively a system will interact, how it is perceived, and its assumed ability to perform a task or function. To ignore them is to blindly accept a set of design decisions that ignores the complex effect speech has on the user’s successful interaction with a system. Thus expressive speech synthesis is a key design component in SIAs.

    This chapter explores the world of expressive speech synthesis, aiming to act as a starting point for those interested in the design, building and evaluation of such artificial speech. The debates and literature within this topic are vast and fundamentally multidisciplinary in focus, covering a wide range of disciplines such as linguistics, pragmatics, psychology, speech and language technology, robotics and human-computer interaction (HCI), to name a few. It is not our aim to synthesise these areas but to give the reader a scaffold and a starting point by exploring the critical dimensions and decisions they may need to consider when choosing to use expressive speech. To do this, the chapter explores the building of expressive synthesis, highlighting key decisions and parameters as well as emphasising future challenges in expressive speech research and development. Yet before these are expanded upon, we must first try to define what we actually mean by expressive speech.

    Specialised Languages and Multimedia. Linguistic and Cross-cultural Issues

    This book collects academic works focusing on scientific and technical discourse and on the ways in which this type of discourse appears in or is shaped by multimedia products. The originality of the book lies in the variety of approaches used and of the specialised languages investigated in relation to multimodal and multimedia genres. Contributions focus in particular on: new multimodal or multimedia forms of specialised discourse (in institutional, academic, technical, scientific, social or popular settings); linguistic features of specialised discourse in multimodal or multimedia genres; the popularisation of specialised knowledge in multimodal or multimedia genres; the impact of multimodality and multimediality on the construction of scientific and technical discourse; the impact of multimodality/multimediality in the practice and teaching of language; the impact of multimodality/multimediality in the practice and teaching of translation; new multimedia modes of knowledge dissemination; and the translation/adaptation of scientific discourse in multimedia products. This volume contributes to the theory and practice of multimodal studies and translation, with a specific focus on specialised discourse. (Manca, E.; Bianchi, F.)

    AXMEDIS 2008

    The AXMEDIS International Conference series aims to explore all subjects and topics related to cross-media and digital-media content production, processing, management, standards, representation, sharing, protection and rights management, and to address the latest developments and future trends of the technologies and their applications, impacts and exploitation. The AXMEDIS events offer venues for exchanging concepts, requirements, prototypes, research ideas, and findings which could contribute to academic research and also benefit business and industrial communities. In the Internet as well as in the digital era, cross-media production and distribution represent key developments and innovations that are fostered by emergent technologies to ensure better value for money while optimising productivity and market coverage.

    Telepresence and Transgenic Art


    Subtitling the films of Volker Schlöndorff in English

    Volker Schlöndorff is one of Germany’s foremost directors, and a prominent member of the group that formed the New German Cinema in the 1960s, a movement which rejected the ‘old film-making’ in Germany and embraced a new way of working whose main thrust was artistic rather than commercial. This thesis uses the Descriptive Translation Studies framework to examine the English subtitles for two of Schlöndorff’s best-known films: Die verlorene Ehre der Katharina Blum [The Lost Honour of Katharina Blum], directed in 1975, and Die Blechtrommel [The Tin Drum], from 1979. Using the concept of translational norms as one of its main heuristic tools, this research examines an audiovisual corpus consisting of five sets of DVD subtitles from the two films: three from Die Blechtrommel, dating from 1995, 2002 and 2010, and two from Katharina Blum, dated 2003 and 2009, thus spanning the era from the advent of digitisation and the beginning of DVD to the rise of TV and film streaming services. The data is analysed to investigate the translation strategies activated by the subtitlers when encountering culture-specific references, and then to pinpoint any diachronic trends that come to the fore, with a view to testing the earliest formulation of the retranslation hypothesis (Bensimon, 1990; Berman, 1990).