Audio for Virtual, Augmented and Mixed Realities: Proceedings of ICSA 2019 ; 5th International Conference on Spatial Audio ; September 26th to 28th, 2019, Ilmenau, Germany

Author: Verband Deutscher Tonmeister
Publication date: 20/11/2019

ICSA 2019 brings together developers, scientists, users, and content creators of and for spatial audio systems and services in a multidisciplinary setting. A special focus is on audio for so-called virtual, augmented, and mixed realities. The fields of ICSA 2019 are:
- Development and scientific investigation of technical systems and services for spatial audio recording, processing, and reproduction
- Creation of content for reproduction via spatial audio systems and services
- Use and application of spatial audio systems and content presentation services
- Media impact of content and spatial audio systems and services from the point of view of media science
ICSA 2019 is organized by VDT and TU Ilmenau with the support of the Fraunhofer Institute for Digital Media Technology IDMT.

Digitale Bibliothek Thüringen

Single-Microphone Speech Enhancement and Separation Using Deep Learning

Author: Kolbæk Morten
Publication venue: Aalborg Universitetsforlag
Publication date: 01/01/2018

VBN

Single-Microphone Speech Enhancement and Separation Using Deep Learning

Author: Kolbæk Morten
Publication date: 01/01/2018

The cocktail party problem comprises the challenging task of understanding a speech signal in a complex acoustic environment, where multiple speakers and background noise signals simultaneously interfere with the speech signal of interest. A signal processing algorithm that can effectively increase the intelligibility and quality of speech signals in such complicated acoustic situations is highly desirable, especially for applications involving mobile communication devices and hearing assistive devices. Due to the re-emergence of machine learning techniques, known today as deep learning, the challenges involved with such algorithms might be overcome. In this PhD thesis, we study and develop deep learning-based techniques for two sub-disciplines of the cocktail party problem: single-microphone speech enhancement and single-microphone multi-talker speech separation. Specifically, we conduct an in-depth empirical analysis of the generalizability of modern deep learning-based single-microphone speech enhancement algorithms. We show that the performance of such algorithms is closely linked to the training data, and that good generalizability can be achieved with carefully designed training data. Furthermore, we propose uPIT, a deep learning-based algorithm for single-microphone speech separation, and we report state-of-the-art results on a speaker-independent multi-talker speech separation task. Additionally, we show that uPIT works well for joint speech separation and enhancement without explicit prior knowledge about the noise type or number of speakers. Finally, we show that deep learning-based speech enhancement algorithms designed to minimize the classical short-time spectral amplitude mean squared error lead to enhanced speech signals which are essentially optimal in terms of STOI, a state-of-the-art speech intelligibility estimator. Comment: PhD Thesis. 233 page

arXiv.org e-Print Archive

VBN

Acoustics of ancient Greek and Roman theaters in use today

Authors: Angelakis Konstantinos; Gade Anders Christian
Publication date: 01/01/2006

Crossref

Online Research Database In Technology

Evaluating the Perceived Quality of Binaural Technology

Author: Pike Christopher William
Publication venue: University of York
Publication date: 02/01/2019

This thesis studies binaural sound reproduction from both a technical and a perceptual perspective, with the aim of improving the headphone listening experience for entertainment media audiences. A detailed review is presented of the relevant binaural technology and of the concepts and methods for evaluating perceived quality. A pilot study assesses the application of state-of-the-art binaural rendering systems to existing broadcast programmes, finding no substantial improvements in quality over conventional stereo signals. A second study gives evidence that realistic binaural simulation can be achieved without personalised acoustic calibration, showing promise for the application of binaural technology. Flexible technical apparatus is presented to allow further investigation of rendering techniques and content production processes. Two web-based studies show that appropriate combination of techniques can lead to improved experience for typical audience members, compared to stereo signals, even without personalised rendering or listener head-tracking. Recent developments in spatial audio applications are then discussed. These have made dynamic client-side binaural rendering with listener head-tracking feasible for mass audiences, but also present technical constraints. To limit distribution bandwidth and computational complexity during rendering, loudspeaker virtualisation is widely used. The effects on perceived quality of these techniques are studied in depth for the first time. A descriptive analysis experiment demonstrates that loudspeaker virtualisation during binaural rendering causes degradations to a range of perceptual characteristics and that these vary across other system conditions. A final experiment makes novel use of the check-all-that-apply method to efficiently characterise the quality of seven spatial audio representations and associated dynamic binaural rendering techniques, using single sound sources and complex dramatic scenes. 
The perceived quality of these different representations varies significantly across a wide range of characteristics and with programme material. These methods and findings can be used to improve the quality of current binaural technology applications.

White Rose E-theses Online

Quantitative assessment of spatial sound distortion by the semi-ideal recording point of a hear-through device

Authors: Christensen Flemming; Hammershøi Dorte; Hoffmann Pablo F.
Publication venue: Acoustical Society of America (ASA)
Publication date: 01/01/2013

Crossref

VBN

Musical source separation with deep learning and large-scale datasets

Author: Jansson A.

Throughout this thesis we will explore automatic music source separation by utilizing modern (at the time of writing) techniques and tools from machine learning and big data processing. The bulk of this work was carried out between 2016 and 2019. In Chapter 2 we conduct a review of source separation literature. We start by outlining a subset of applications of source separation in some depth. We describe some of the early, pioneering work in automatic source separation: Auditory Scene Analysis, and its digital counterpart, Computational Auditory Scene Analysis. We then introduce matrix decomposition-based methods such as Independent Component Analysis and Non-Negative Matrix Factorization, and pitch-informed methods where the separation algorithm is guided by pitch information that is known a priori. We briefly discuss user-guided methods, before conducting a thorough review of Deep Learning based source separation, including recurrent, convolutional, deep clustering-based, and Generative Adversarial Networks. We then proceed to describe common evaluation metrics and training datasets. Finally, we list a number of current challenges and drawbacks of current systems. Chapter 3 focuses on datasets for musical source separation. First we show the growth of dataset sizes for both machine learning in general and music information retrieval specifically. We give several examples of the complexities and idiosyncrasies that are intrinsic to music datasets. We then proceed to present a method for extracting ground truth data for source separation from large unstructured musical catalogs. In Chapter 4 we design a novel deep learning-based source separation algorithm. Motivation is provided by means of a musicological study that showed the high importance of vocals relative to other musical factors, in the minds of listeners. At the core of the vocal separation algorithm is the U-Net, a deep learning architecture that uses skip connections to preserve fine-grained detail. 
It was originally developed in the biomedical imaging domain, and later adapted to image-to-image translation. We adapt it to the source separation domain by treating spectrograms as images, and we use the dataset mining methods from Chapter 3 to generate sufficiently large training data. We evaluate our model objectively using standard evaluation metrics, and subjectively using "crowdsourced" human subjects. To the best of our knowledge, this is the first use of U-Nets for source separation. In the introduction above we proposed joint learning to optimize source separation and other objectives. In Chapter 5 we investigate one such instance: multi-task learning of vocal removal and vocal pitch tracking. We combine the vocal separation model from Chapter 4 with a state-of-the-art pitch salience estimation model, exploring several ways of combining the two models. We find that vocal pitch estimation benefits from joint learning when the two tasks are trained in sequence, with the source separation model preceding the pitch estimation model. We also report benefits from fine-tuning by iteratively applying the model. Chapter 6 extends the U-Net model to multiple instruments. In order to minimize the phase artifacts that were a common issue in Chapter 4, we modify the model to operate in the complex domain. We run experiments with several loss functions: time-domain loss, magnitude-only frequency-domain loss, and joint time and frequency-domain loss. Our experiments are evaluated both objectively and subjectively, and we carry out extensive qualitative analysis to investigate the effects of complex masking. Finally, we conclude the thesis in Chapter 7 by summarizing this work and highlighting several future directions of research.

City Research Online

Deep Learning for Distant Speech Recognition

Author: Ravanelli Mirco
Publication date: 15/12/2017

Deep learning is an emerging technology that is considered one of the most promising directions for reaching higher levels of artificial intelligence. Among other achievements, building computers that understand speech represents a crucial leap towards intelligent machines. Despite the great efforts of the past decades, however, natural and robust human-machine speech interaction still appears to be out of reach, especially when users interact with a distant microphone in noisy and reverberant environments. These disturbances severely hamper the intelligibility of a speech signal, making Distant Speech Recognition (DSR) one of the major open challenges in the field. This thesis addresses this scenario and proposes some novel techniques, architectures, and algorithms to improve the robustness of distant-talking acoustic models. We first elaborate on methodologies for realistic data contamination, with a particular emphasis on DNN training with simulated data. We then investigate approaches for better exploiting speech contexts, proposing some original methodologies for both feed-forward and recurrent neural networks. Lastly, inspired by the idea that cooperation across different DNNs could be the key for counteracting the harmful effects of noise and reverberation, we propose a novel deep learning paradigm called network of deep neural networks. The analysis of the original concepts was based on extensive experimental validations conducted on both real and simulated data, considering different corpora, microphone configurations, environments, noisy conditions, and ASR tasks. Comment: PhD Thesis Unitn, 201

arXiv.org e-Print Archive

Unitn-eprints PhD