
    Large-scale unsupervised audio pre-training for video-to-speech synthesis

    Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker. Most established approaches to date involve a two-step process, whereby an intermediate representation, such as a spectrogram, is first extracted from the video and then passed to a vocoder to produce the raw audio. Some recent work has focused on end-to-end synthesis, whereby the generation of raw audio and any intermediate representations is performed jointly. All such approaches involve training almost exclusively on audio-visual datasets, i.e., datasets in which every audio sample has a corresponding video sample. This precludes the use of abundant audio-only datasets which may not have a corresponding visual modality (e.g. audiobooks, radio podcasts, speech recognition datasets, etc.), as well as audio-only architectures that the audio machine learning community has developed over the years. In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24 kHz, and then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task. The pre-training step uses audio samples only and does not require labels or corresponding samples from other modalities (visual, text). We demonstrate that this pre-training step improves the reconstructed speech, and that it is an unexplored way to improve the quality of the generator in a cross-modal task while requiring samples from only one of the modalities. We conduct experiments using both raw audio and mel spectrograms as target outputs and benchmark our models against existing work.
    Comment: Submitted to IEE
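
    A minimal sketch of this two-step transfer, in PyTorch, is given below. It illustrates the idea rather than the paper's actual architecture: the layer shapes, the module names (AudioEncoder, AudioDecoder, VideoToSpeech), the LSTM video encoder and the L1 reconstruction objective are all assumptions made for the example.

    import torch
    import torch.nn as nn

    class AudioEncoder(nn.Module):
        """Strided 1-D convolutions that downsample a raw waveform into latent frames."""
        def __init__(self, hidden=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(1, hidden, kernel_size=16, stride=8, padding=4),
                nn.ReLU(),
                nn.Conv1d(hidden, hidden, kernel_size=16, stride=8, padding=4),
            )

        def forward(self, wav):               # wav: (batch, 1, samples)
            return self.net(wav)              # -> (batch, hidden, frames)

    class AudioDecoder(nn.Module):
        """Transposed convolutions that upsample latent frames back to a waveform."""
        def __init__(self, hidden=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.ConvTranspose1d(hidden, hidden, kernel_size=16, stride=8, padding=4),
                nn.ReLU(),
                nn.ConvTranspose1d(hidden, 1, kernel_size=16, stride=8, padding=4),
                nn.Tanh(),
            )

        def forward(self, z):                 # z: (batch, hidden, frames)
            return self.net(z)                # -> (batch, 1, samples)

    # Step 1: unsupervised pre-training on audio-only data (no labels, no video).
    encoder, decoder = AudioEncoder(), AudioDecoder()
    opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)
    wav = torch.randn(4, 1, 24000)            # stand-in batch: one second of 24 kHz audio
    recon = decoder(encoder(wav))
    loss = nn.functional.l1_loss(recon, wav)  # simple waveform reconstruction objective
    loss.backward()
    opt.step()
    torch.save(decoder.state_dict(), "pretrained_decoder.pt")

    # Step 2: transplant the pre-trained decoder into a video-to-speech model.
    class VideoToSpeech(nn.Module):
        def __init__(self, hidden=256):
            super().__init__()
            self.video_encoder = nn.LSTM(512, hidden, batch_first=True)  # placeholder video encoder
            self.audio_decoder = AudioDecoder(hidden)

        def forward(self, video_feats):       # video_feats: (batch, frames, 512)
            h, _ = self.video_encoder(video_feats)
            return self.audio_decoder(h.transpose(1, 2))

    v2s = VideoToSpeech()
    v2s.audio_decoder.load_state_dict(torch.load("pretrained_decoder.pt"))

    From here the full model would be trained on the audio-visual data, with the transplanted decoder fine-tuned or kept frozen; the point of the recipe is that Step 1 requires nothing but audio.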

    Building and Designing Expressive Speech Synthesis

    We know there is something special about speech. Our voices are not just a means of communicating; they also give a deep impression of who we are and what we might know. They can betray our upbringing, our emotional state, our state of health. They can be used to persuade and convince, to calm and to excite. As speech systems enter the social domain they are required to interact, support and mediate our social relationships with 1) each other, 2) digital information, and, increasingly, 3) AI-based algorithms and processes. Socially Interactive Agents (SIAs) are at the forefront of research and innovation in this area. There is an assumption that in the future “spoken language will provide a natural conversational interface between human beings and so-called intelligent systems” [Moore 2017, p. 283]. A considerable amount of previous research has tested this assumption, with mixed results. However, as has been pointed out, “voice interfaces have become notorious for fostering frustration and failure” [Nass and Brave 2005, p. 6].

    It is within this context, between our exceptional and intelligent human use of speech to communicate and interact with other humans, and our desire to leverage this means of communication for artificial systems, that the technology often termed expressive speech synthesis uncomfortably falls. Uncomfortably, because it is often overshadowed by issues in interactivity and the underlying intelligence of the system, something that emerges from the interaction of many of the components in a SIA. This is especially true of what we might term conversational speech, where decoupling how things are spoken from when and to whom they are spoken can seem an impossible task. The challenge is even greater in evaluation and in characterising full systems which have made use of expressive speech.

    Furthermore, when designing an interaction with a SIA, we must consider not only how SIAs should speak but also how much, and whether they should speak at all. These considerations cannot be ignored. Any speech synthesis used in the context of an artificial agent will have a perceived accent, a vocal style, an underlying emotion and an intonational model. Dimensions like accent and personality (cross-speaker parameters), as well as vocal style, emotion and intonation during an interaction (within-speaker parameters), need to be built into the design of a synthetic voice. Even a default or neutral voice has to consider these same expressive speech synthesis components. Such design parameters have a strong influence on how effectively a system will interact, how it is perceived, and its assumed ability to perform a task or function. To ignore them is to blindly accept a set of design decisions that ignores the complex effect speech has on the user’s successful interaction with a system. Thus expressive speech synthesis is a key design component in SIAs.

    This chapter explores the world of expressive speech synthesis, aiming to act as a starting point for those interested in the design, building and evaluation of such artificial speech. The debates and literature within this topic are vast and fundamentally multidisciplinary in focus, covering a wide range of disciplines such as linguistics, pragmatics, psychology, speech and language technology, robotics and human-computer interaction (HCI), to name a few. It is not our aim to synthesise these areas but to give the reader a scaffold and a starting point by exploring the critical dimensions and decisions they may need to consider when choosing to use expressive speech. To do this, the chapter explores the building of expressive synthesis, highlighting key decisions and parameters as well as emphasising future challenges in expressive speech research and development. Yet before these are expanded upon, we must first try to define what we actually mean by expressive speech.

    Specialised Languages and Multimedia. Linguistic and Cross-cultural Issues

    This book collects academic works focusing on scientific and technical discourse and on the ways in which this type of discourse appears in or is shaped by multimedia products. The originality of the book lies in the variety of approaches used and of the specialised languages investigated in relation to multimodal and multimedia genres. Contributions focus in particular on: new multimodal or multimedia forms of specialised discourse (in institutional, academic, technical, scientific, social or popular settings); linguistic features of specialised discourse in multimodal or multimedia genres; the popularisation of specialised knowledge in multimodal or multimedia genres; the impact of multimodality and multimediality on the construction of scientific and technical discourse; the impact of multimodality/multimediality in the practice and teaching of language; the impact of multimodality/multimediality in the practice and teaching of translation; new multimedia modes of knowledge dissemination; and the translation/adaptation of scientific discourse in multimedia products. This volume contributes to the theory and practice of multimodal studies and translation, with a specific focus on specialised discourse. (Manca, E.; Bianchi, F.)

    AXMEDIS 2008

    The AXMEDIS International Conference series aims to explore all subjects and topics related to cross-media and digital-media content production, processing, management, standards, representation, sharing, protection and rights management, and to address the latest developments and future trends of the technologies and their applications, impacts and exploitation. The AXMEDIS events offer venues for exchanging concepts, requirements, prototypes, research ideas, and findings which could contribute to academic research and also benefit business and industrial communities. In the Internet as well as in the digital era, cross-media production and distribution represent key developments and innovations that are fostered by emergent technologies to ensure better value for money while optimising productivity and market coverage.

    Telepresence and Transgenic Art


    Subtitling the films of Volker Schlöndorff in English

    Volker Schlöndorff is one of Germany’s foremost directors, and a prominent member of the group that formed the New German Cinema in the 1960s, a movement which rejected the ‘old film-making’ in Germany and embraced a new way of working whose main thrust was artistic rather than commercial. This thesis uses the Descriptive Translation Studies framework to examine the English subtitles for two of Schlöndorff’s best-known films: Die verlorene Ehre der Katharina Blum [The Lost Honour of Katharina Blum], directed in 1975, and Die Blechtrommel [The Tin Drum], from 1979. Using the concept of translational norms as one of its main heuristic tools, this research examines an audiovisual corpus consisting of five sets of DVD subtitles from the two films: three from Die Blechtrommel, dating from 1995, 2002 and 2010, and two from Katharina Blum, dated 2003 and 2009, thus spanning the era from the advent of digitisation and the beginning of DVD to the rise of TV and film streaming services. The data is analysed to investigate the translation strategies activated by the subtitlers when encountering culture-specific references, and then to pinpoint any diachronic trends that come to the fore, with a view to testing the earliest formulation of the retranslation hypothesis (Bensimon, 1990; Berman, 1990).