Large-scale unsupervised audio pre-training for video-to-speech synthesis
Video-to-speech synthesis is the task of reconstructing the speech signal
from a silent video of a speaker. Most established approaches to date involve a
two-step process, whereby an intermediate representation from the video, such
as a spectrogram, is extracted first and then passed to a vocoder to produce
the raw audio. Some recent work has focused on end-to-end synthesis, whereby
the generation of raw audio and any intermediate representations is performed
jointly. All such approaches involve training on data from almost exclusively
audio-visual datasets, i.e. every audio sample has a corresponding video
sample. This precludes the use of abundant audio-only datasets which may not
have a corresponding visual modality (e.g. audiobooks, radio podcasts, speech
recognition datasets etc.), as well as audio-only architectures that have been
developed by the audio machine learning community over the years. In this paper
we propose to train encoder-decoder models on more than 3,500 hours of audio
data at 24kHz, and then use the pre-trained decoders to initialize the audio
decoders for the video-to-speech synthesis task. The pre-training step uses
audio samples only and does not require labels or corresponding samples from
other modalities (visual, text). We demonstrate that this pre-training step
improves the reconstructed speech and that it is an unexplored way to improve
the quality of the generator in a cross-modal task while only requiring samples
from one of the modalities. We conduct experiments using both raw audio and mel
spectrograms as target outputs and benchmark our models with existing work.
Comment: Submitted to IEE
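The core idea of the abstract — pre-train an encoder-decoder on audio alone, then carry the decoder's weights over to initialize the audio decoder of the video-to-speech model — can be sketched as follows. This is a minimal illustration with hypothetical names and stand-in parameter dicts; the paper's actual architectures and training procedure are not specified in the abstract.

```python
# Sketch of decoder reuse across tasks (hypothetical function and
# parameter names; real systems would use trained network weights).

def pretrain_audio_autoencoder(audio_batches):
    """Stand-in for the unsupervised audio-only pre-training step.

    Needs no labels and no paired visual data: the model is trained
    to reconstruct its audio input. Returns encoder and decoder
    parameter dicts."""
    encoder = {"enc.w": 0.1}
    decoder = {"dec.w": 0.2}
    for batch in audio_batches:
        # A reconstruction loss would update encoder and decoder here.
        pass
    return encoder, decoder


def build_video_to_speech_model(pretrained_decoder=None):
    """Assemble the cross-modal model.

    The video encoder is trained from scratch on audio-visual data;
    the audio decoder is optionally initialized from the audio-only
    pre-training step instead of from random values."""
    model = {
        "video_encoder": {"venc.w": 0.0},
        "audio_decoder": {"dec.w": 0.0},  # random init placeholder
    }
    if pretrained_decoder is not None:
        model["audio_decoder"].update(pretrained_decoder)
    return model


# Pre-train on (here, empty) audio-only data, then transfer the decoder.
_, decoder_weights = pretrain_audio_autoencoder(audio_batches=[])
v2s_model = build_video_to_speech_model(pretrained_decoder=decoder_weights)
```

The point of the pattern is that the decoder only ever consumes a latent representation and emits audio, so it can be trained on abundant audio-only corpora and dropped into the cross-modal pipeline unchanged.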
Building and Designing Expressive Speech Synthesis
We know there is something special about speech. Our voices are not just a means of communicating; they also give a deep impression of who we are and what we might know. They can betray our upbringing, our emotional state, our state of health. They can be used to persuade and convince, to calm and to excite. As speech systems enter the social domain they are required to interact with, support and mediate our social relationships: 1) with each other, 2) with digital information, and, increasingly, 3) with AI-based algorithms and processes. Socially Interactive Agents (SIAs) are at the forefront of research and innovation in this area. There is an assumption that in the future "spoken language will provide a natural conversational interface between human beings and so-called intelligent systems" [Moore 2017, p. 283]. A considerable amount of previous research has tested this assumption, with mixed results. However, as has been pointed out, "voice interfaces have become notorious for fostering frustration and failure" [Nass and Brave 2005, p. 6]. It is within this context, between our exceptional and intelligent human use of speech to communicate and interact with other humans, and our desire to leverage this means of communication for artificial systems, that the technology often termed expressive speech synthesis uncomfortably falls. Uncomfortably, because it is often overshadowed by issues in interactivity and in the underlying intelligence of the system, something that emerges from the interaction of many of the components in a SIA. This is especially true of what we might term conversational speech, where decoupling how things are spoken from when and to whom they are spoken can seem an impossible task. This is an even greater challenge in evaluation and in characterising full systems which have made use of expressive speech.
Furthermore, when designing an interaction with a SIA, we must consider not only how SIAs should speak but how much, and whether they should even speak at all. These considerations cannot be ignored. Any speech synthesis that is used in the context of an artificial agent will have a perceived accent, a vocal style, an underlying emotion and an intonational model. Dimensions like accent and personality (cross-speaker parameters), as well as vocal style, emotion and intonation during an interaction (within-speaker parameters), need to be built into the design of a synthetic voice. Even a default or neutral voice has to consider these same expressive speech synthesis components. Such design parameters have a strong influence on how effectively a system will interact, how it is perceived and its assumed ability to perform a task or function. To ignore these is to blindly accept a set of design decisions that ignores the complex effect speech has on the user's successful interaction with a system. Thus expressive speech synthesis is a key design component in SIAs. This chapter explores the world of expressive speech synthesis, aiming to act as a starting point for those interested in the design, building and evaluation of such artificial speech. The debates and literature within this topic are vast and fundamentally multidisciplinary in focus, covering a wide range of disciplines such as linguistics, pragmatics, psychology, speech and language technology, robotics and human-computer interaction (HCI), to name a few. It is not our aim to synthesise these areas but to give the reader a scaffold and a starting point by exploring the critical dimensions and decisions they may need to consider when choosing to use expressive speech. To do this, the chapter explores the building of expressive synthesis, highlighting key decisions and parameters as well as emphasising future challenges in expressive speech research and development.
Yet, before these are expanded upon, we must first try to define what we actually mean by expressive speech.
Specialised Languages and Multimedia. Linguistic and Cross-cultural Issues
This book collects academic works focusing on scientific and technical discourse and on the ways in which this type of discourse appears in or is shaped by multimedia products. The originality of this book lies in the variety of approaches used and of the specialised languages investigated in relation to multimodal and multimedia genres. Contributions focus in particular on new multimodal or multimedia forms of specialised discourse (in institutional, academic, technical, scientific, social or popular settings), linguistic features of specialised discourse in multimodal or multimedia genres, the popularisation of specialised knowledge in multimodal or multimedia genres, the impact of multimodality and multimediality on the construction of scientific and technical discourse, the impact of multimodality/multimediality on the practice and teaching of language, the impact of multimodality/multimediality on the practice and teaching of translation, new multimedia modes of knowledge dissemination, and the translation/adaptation of scientific discourse in multimedia products. This volume contributes to the theory and practice of multimodal studies and translation, with a specific focus on specialised discourse. Rivista di Classe A - Volume speciale. Manca, E.; Bianchi, F.
Textual Analysis of Two Translated Transcripts: 2012 Presidential Debate and a Speech Presented by Cyrille de Lasteyrie
Delia Chiaro (2010) describes humor in two broad categories: referential and verbal. The former focuses on the meaning of a story or event and the humor embedded within; in the case of the latter, idiosyncratic features such as word play display humorous undertones. This Master's thesis examines the transformation of oral text into another language via transcription. The transcripts themselves consist of 10 minutes of the 2012 Presidential debate between François Hollande and Nicolas Sarkozy and 10 minutes of a monologue presented by French animator Cyrille de Lasteyrie. Both transcripts are linked by the commonality of humor and exhibit the two categories previously outlined. Additional attention is given to the translation challenges that arose, such as transferring the overall meaning of each idea, maintaining as much of the humor within the text as possible, and conveying each speaker's style. This study aims to provide future translators with guidance in their translation endeavors by pinpointing scholarly research and discussing the various translator techniques implemented in overcoming challenges such as metaphors and collocations
AXMEDIS 2008
The AXMEDIS International Conference series aims to explore all subjects and topics related to cross-media and digital-media content production, processing, management, standards, representation, sharing, protection and rights management, and to address the latest developments and future trends of the technologies and their applications, impacts and exploitation. The AXMEDIS events offer venues for exchanging concepts, requirements, prototypes, research ideas and findings which could contribute to academic research and also benefit business and industrial communities. In the Internet era as well as in the digital era, cross-media production and distribution represent key developments and innovations that are fostered by emergent technologies to ensure better value for money while optimising productivity and market coverage
Subtitling the films of Volker Schlöndorff in English
Volker Schlöndorff is one of Germany's foremost directors, and a prominent member of the group who formed the New German Cinema in the 1960s, a movement which rejected the "old film-making" in Germany and embraced a new way of working whose main thrust was artistic, rather than commercial. This thesis seeks to use the Descriptive Translation Studies framework to examine the English subtitles for two of Schlöndorff's best known films: Die verlorene Ehre der Katharina Blum [The Lost Honour of Katharina Blum], directed in 1975, and Die Blechtrommel [The Tin Drum], from 1979. Using the concept of translational norms as one of its main heuristic tools, this research examines an audiovisual corpus consisting of five different sets of DVD subtitles from the two films: three from Die Blechtrommel, dating from 1995, 2002 and 2010, and two from Katharina Blum, dated 2003 and 2009, thus spanning the era from the advent of digitisation and the beginning of DVD to the rise of TV and film streaming services. The data is analysed to investigate the translation strategies that have been activated by the subtitlers when encountering culture-specific references, and then to pinpoint any diachronic trends that come to the fore, with a view to testing the earliest concept of the retranslation hypothesis (Bensimon, 1990; Berman, 1990)
- …