
    Multi-scale Multi-band DenseNets for Audio Source Separation

    This paper deals with the problem of audio source separation. To handle its complex and ill-posed nature, current state-of-the-art approaches employ deep neural networks to obtain instrumental spectra from a mixture. In this study, we propose a novel network architecture that extends the recently developed densely connected convolutional network (DenseNet), which has shown excellent results on image classification tasks. To deal with the specific problem of audio source separation, an up-sampling layer, block skip connections, and band-dedicated dense blocks are incorporated on top of DenseNet. The proposed approach takes advantage of long contextual information and outperforms state-of-the-art results on the SiSEC 2016 competition by a large margin in terms of signal-to-distortion ratio. Moreover, the proposed architecture requires significantly fewer parameters and considerably less training time than other methods. Comment: to appear at WASPAA 201
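
    The abstract names three architectural additions: up-sampling layers, block skip connections, and band-dedicated dense blocks. The following is a minimal sketch (assuming PyTorch) of the band-dedicated idea only; the band boundaries, growth rate, and layer counts are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn


class DenseBlock(nn.Module):
    """Densely connected conv block: each layer sees all previous outputs."""

    def __init__(self, in_ch, growth=8, layers=3):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Sequential(
                nn.BatchNorm2d(in_ch + i * growth),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_ch + i * growth, growth, kernel_size=3, padding=1),
            )
            for i in range(layers)
        )

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)


class BandDedicatedDense(nn.Module):
    """Split the frequency axis into sub-bands and run a dedicated dense
    block per band, so each band can learn band-specific spectral patterns.
    The input spectrogram must have band_edges[-1] frequency bins."""

    def __init__(self, in_ch, band_edges=(0, 128, 256, 512)):
        super().__init__()
        self.band_edges = band_edges
        self.blocks = nn.ModuleList(
            DenseBlock(in_ch) for _ in range(len(band_edges) - 1)
        )

    def forward(self, x):  # x: (batch, ch, freq, time)
        outs = [
            blk(x[:, :, lo:hi, :])
            for blk, lo, hi in zip(self.blocks, self.band_edges, self.band_edges[1:])
        ]
        return torch.cat(outs, dim=2)  # re-assemble along the frequency axis
```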

    Studies on binaural and monaural signal analysis methods and applications

    Sound signals can carry rich information about the environment and the sound sources present in it. This thesis presents novel contributions to the analysis of binaural and monaural sound signals. Some new applications are introduced in this work, but the emphasis is on analysis methods. The three main topics of the thesis are computational estimation of sound source distance, analysis of binaural room impulse responses, and applications intended for augmented reality audio. A novel method for binaural sound source distance estimation is proposed. The method is based on learning the coherence between the sounds entering the left and right ears, and comparisons to an earlier approach are also made. It is shown that these kinds of learning methods can correctly recognize the distance of a speech sound source in most cases. Methods for analyzing binaural room impulse responses are investigated. These methods are able to locate the early reflections in time and also to estimate their directions of arrival. This challenging problem could not be tackled completely, but this part of the work is an important step towards accurate estimation of the individual early reflections from a binaural room impulse response. As the third part of the thesis, applications of sound signal analysis are studied. The most notable contributions are a novel eyes-free user interface controlled by finger snaps, and an investigation of the importance of features in audio surveillance. The results of this thesis are steps towards building machines that can obtain information about the surrounding environment based on sound. In particular, the research into sound source distance estimation functions as important basic research in this area. The applications presented could be valuable in future telecommunications scenarios, such as augmented reality audio.
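
    The distance-estimation method is described as learning the coherence between the left- and right-ear signals. Below is a minimal sketch of such a front end, assuming NumPy/SciPy; the window length and the band averaged for the feature are illustrative choices, not the thesis's exact setup.

```python
import numpy as np
from scipy.signal import coherence


def interaural_coherence_feature(left, right, fs, nperseg=1024):
    """Per-frequency magnitude-squared coherence of a binaural signal
    pair. Diffuse reverberation decorrelates the two ear signals as
    source distance grows, which a learner can exploit."""
    freqs, gamma2 = coherence(left, right, fs=fs, nperseg=nperseg)
    return freqs, gamma2


# Illustrative scalar feature for a classifier: average coherence over
# a speech-relevant band (the band choice is an assumption).
def mean_band_coherence(left, right, fs, lo=200.0, hi=4000.0):
    freqs, gamma2 = interaural_coherence_feature(left, right, fs)
    band = (freqs > lo) & (freqs < hi)
    return float(gamma2[band].mean())
```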

    Joint estimation of reverberation time and early-to-late reverberation ratio from single-channel speech signals

    The reverberation time (RT) and the early-to-late reverberation ratio (ELR) are two key parameters commonly used to characterize acoustic room environments. In contrast to conventional blind estimation methods that process the two parameters separately, we propose a model for joint estimation that predicts the RT and the ELR simultaneously from single-channel speech signals, using either full-band or sub-band frequency data; it is referred to as the joint room parameter estimator (jROPE). An artificial neural network is employed to learn the mapping from acoustic observations to RT and ELR classes. Auditory-inspired acoustic features, obtained by temporal modulation filtering of the speech time-frequency representations, are used as input for the neural network. Based on an in-depth analysis of the dependency between the RT and the ELR, a two-dimensional (RT, ELR) distribution with constrained boundaries is derived, which is then exploited to evaluate four different configurations for jROPE. Experimental results show that, in comparison to the single-task ROPE system which individually estimates the RT or the ELR, jROPE provides improved results for both tasks in various reverberant and (diffuse) noisy environments. Among the four proposed joint types, the one incorporating multi-task learning with shared input and hidden layers yields the best estimation accuracies on average. When encountering extreme reverberant conditions, with RTs and ELRs lying beyond the derived (RT, ELR) distribution, the type that treats the RT and the ELR as a single joint parameter performs particularly robustly. Compared with the state-of-the-art algorithms tested in the acoustic characterization of environments (ACE) challenge, jROPE achieves results comparable to the best for all individual tasks (RT and ELR estimation from full-band and sub-band signals).
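
    The best-performing configuration is described as multi-task learning with shared input and hidden layers feeding two classification heads. A minimal sketch of that topology, assuming PyTorch; the layer widths, class counts, and feature dimensionality are illustrative assumptions, not the paper's exact settings.

```python
import torch.nn as nn


class JointRoomParameterNet(nn.Module):
    """Shared trunk with two heads: one over RT classes, one over ELR
    classes, so both tasks shape a common representation."""

    def __init__(self, n_features=256, n_rt_classes=20, n_elr_classes=20):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(n_features, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
        )
        self.rt_head = nn.Linear(512, n_rt_classes)    # RT class logits
        self.elr_head = nn.Linear(512, n_elr_classes)  # ELR class logits

    def forward(self, x):
        h = self.shared(x)
        return self.rt_head(h), self.elr_head(h)
```

    Training such a model would sum two cross-entropy losses, one per head, so that gradients from both the RT and the ELR task update the shared layers.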

    Direct-to-Reverberant Energy Ratio Estimation Using a First-Order Microphone

    The direct-to-reverberant ratio (DRR) is an important characterization of a reverberant environment. This paper presents a novel blind DRR estimation method based on the coherence function between the sound pressure and particle velocity at a point. First, a general expression of the coherence function and the DRR is derived in the spherical harmonic domain, without imposing assumptions on the reverberation. In this paper, the DRR is expressed in terms of the coherence function as well as two parameters that are related to statistical characteristics of the reverberant environment. Then, a method to estimate the values of these two parameters using a microphone system capable of capturing first-order spherical harmonics is proposed, under three assumptions which are more realistic than the diffuse field model. Furthermore, a theoretical analysis of the use of the plane wave model for the direct path signal and its effect on DRR estimation is presented, and a rule of thumb is provided for determining whether the point source model should be used for the direct path signal. Finally, the ACE challenge dataset is used to validate the proposed DRR estimation method. The results show that the average full-band estimation error is within 2 dB, with no clear trend of bias.
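
    The quantity at the heart of the method is the coherence between sound pressure and particle velocity at a point. Below is a minimal sketch of that coherence estimate only, assuming SciPy; the closed-form mapping from this coherence (and the two environment parameters) to the DRR follows the paper's spherical-harmonic derivation and is not reproduced here.

```python
from scipy.signal import coherence


def pressure_velocity_coherence(pressure, velocity, fs, nperseg=2048):
    """Magnitude-squared coherence between p(t) and one particle-velocity
    component u(t): close to 1 when a direct path dominates, and lower
    in a strongly reverberant field."""
    freqs, gamma2 = coherence(pressure, velocity, fs=fs, nperseg=nperseg)
    return freqs, gamma2
```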

    Microphone array processing for parametric spatial audio techniques

    Reproduction of the spatial properties of recorded sound scenes is increasingly recognised as a crucial element of all emerging immersive applications, with domestic or cinema-oriented audiovisual reproduction for entertainment, telepresence and immersive teleconferencing, and augmented and virtual reality being key examples. Such applications benefit from a general spatial audio processing framework that can exploit spatial information from a variety of recording formats in order to reproduce the original sound scene in a perceptually transparent way. Directional Audio Coding (DirAC) is a recent parametric spatial sound reproduction method that fulfils many of the requirements of such a framework. It is based on a universal 3D audio format known as B-format and achieves flexible and effective perceptual reproduction for loudspeakers or headphones. Part of this work focuses on the model of DirAC and aims to extend it. Firstly, it is shown that by taking into account information about the four-channel recording array that generates the B-format signals, it is possible to improve both analysis of the sound scene and reproduction. Secondly, these findings are generalised to various recording configurations. A further generalisation of DirAC is attempted in a spatial transform domain, the spherical harmonic domain (SHD), with higher-order B-format signals. Formulating the DirAC model in the SHD combines the perceptual effectiveness of DirAC with the increased resolution of higher-order B-format and overcomes most limitations of traditional DirAC. Some novel applications of parametric processing of spatial sound are demonstrated for sound and music engineering. The first shows the potential of modifying the spatial information in a recording for creative manipulation of sound scenes, while the second shows improvement of music reproduction captured with established surround recording techniques. The effectiveness of parametric techniques in conveying distance and externalisation cues over headphones led to research into controlling the perceived distance using loudspeakers in a room. This is achieved by manipulating the direct-to-reverberant energy ratio using a compact loudspeaker array with a variable directivity pattern. Lastly, apart from reproduction of recorded sound scenes, auralisation of the spatial properties of acoustical spaces is of interest. We demonstrate that this problem is well-suited to parametric spatial analysis. The nature of room impulse responses captured with a large microphone array allows very high-resolution approaches, and such approaches for detection and localisation of multiple reflections in a single short observation window are applied and compared.
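
    The DirAC analysis stage referred to above derives, per time-frequency bin, a direction of arrival from the active intensity vector and a diffuseness estimate from the ratio of net intensity to energy density. Below is a minimal sketch of this standard first-order analysis, assuming NumPy and the traditional B-format channel scaling; normalisation constants differ between B-format conventions, so treat them as assumptions rather than the thesis's exact formulation.

```python
import numpy as np


def dirac_analysis(W, X, Y, Z):
    """W, X, Y, Z: complex STFT coefficients of the B-format channels,
    each of shape (freq, time). Returns per-bin DOA unit vectors and
    diffuseness estimates."""
    v = np.stack([X, Y, Z])                        # (3, freq, time)
    intensity = np.real(np.conj(W)[None] * v)      # active intensity, up to scale
    energy = np.abs(W) ** 2 + 0.5 * np.sum(np.abs(v) ** 2, axis=0)
    norm_i = np.linalg.norm(intensity, axis=0)
    doa = -intensity / np.maximum(norm_i, 1e-12)   # DOA opposes the energy flow
    # Single plane wave -> 0; ideal diffuse field -> 1. In practice the
    # quantities are short-time averages; this is a per-frame estimate.
    diffuseness = 1.0 - np.sqrt(2.0) * norm_i / np.maximum(energy, 1e-12)
    return doa, np.clip(diffuseness, 0.0, 1.0)
```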

    Fusion of Multimodal Information in Music Content Analysis

    Music is often processed through its acoustic realization. This is restrictive in the sense that music is clearly a highly multimodal concept, where various types of heterogeneous information can be associated with a given piece of music (a musical score, musicians' gestures, lyrics, user-generated metadata, etc.). This has recently led researchers to approach music through its various facets, giving rise to "multimodal music analysis" studies. This article gives a synthetic overview of methods that have been successfully employed in multimodal signal analysis. In particular, their use in music content processing is discussed in more detail through five case studies that highlight different multimodal integration techniques. The case studies include an example of cross-modal correlation for music video analysis, an audiovisual drum transcription system, a description of the concept of informed source separation, a discussion of multimodal dance-scene analysis, and an example of user-interactive music analysis. In the light of these case studies, some perspectives on multimodality in music processing are finally suggested.

    Enabling technologies for audio augmented reality systems

    Audio augmented reality (AAR) refers to technology that embeds computer-generated auditory content into a user's real acoustic environment. An AAR system has specific requirements that set it apart from regular human-computer interfaces: an audio playback system to allow the simultaneous perception of real and virtual sounds; motion tracking to enable interactivity and location-awareness; the design and implementation of auditory display to deliver AAR content; and spatial rendering to display spatialised AAR content. This thesis presents a series of studies on enabling technologies to meet these requirements. A binaural headset with integrated microphones is assumed as the audio playback system, as it allows mobility and precise control over the ear input signals. Here, user position and orientation tracking methods are proposed that rely on speech signals recorded at the binaural headset microphones. To evaluate the proposed methods, the head orientations and positions of three conferees engaged in a discussion were tracked. The binaural microphones improved tracking performance substantially. The proposed methods are applicable to acoustic tracking with other forms of user-worn microphones. Results from a listening test investigating the effect of auditory display parameters on user performance are reported. The parameters studied were derived from the design choices to be made when implementing auditory display. The results indicate that users are able to detect a sound sample among distractors and estimate sample numerosity accurately with both speech and non-speech audio, if the samples are presented with adequate temporal separation. Whether or not samples were separated spatially had no effect on user performance. However, with spatially separated samples, users were able to detect a sample among distractors and simultaneously localise it. The results of this study are applicable to a variety of AAR applications that require conveying sample presence or numerosity. Spatial rendering is commonly implemented by convolving virtual sounds with head-related transfer functions (HRTFs). Here, a framework is proposed that interpolates HRTFs measured at arbitrary directions and distances. The framework employs Delaunay triangulation to group HRTFs into subsets suitable for interpolation and barycentric coordinates as interpolation weights. The proposed interpolation framework allows the real-time rendering of virtual sources in the near field via HRTFs measured at various distances.
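
    The interpolation framework groups measured HRTFs with Delaunay triangulation and weights them with barycentric coordinates. A minimal sketch of that mechanism, assuming NumPy/SciPy; representing each measurement as a 3-D point (direction scaled by distance) and the array layout of the dataset are illustrative assumptions.

```python
import numpy as np
from scipy.spatial import Delaunay


def interpolate_hrtf(points, hrtfs, query):
    """points: (N, 3) measurement positions; hrtfs: (N, taps) measured
    responses; query: (3,) target position inside the measured hull.
    Returns the barycentric combination of the enclosing simplex."""
    tri = Delaunay(points)
    simplex = tri.find_simplex(query)
    if simplex < 0:
        raise ValueError("query lies outside the triangulated measurements")
    # Barycentric coordinates of the query within its enclosing simplex.
    T = tri.transform[simplex]
    b = T[:3] @ (query - T[3])
    weights = np.append(b, 1.0 - b.sum())  # four weights summing to 1
    vertices = tri.simplices[simplex]
    return weights @ hrtfs[vertices]       # weighted mix of vertex HRTFs
```

    In practice, time-domain responses are often preprocessed before weighting (for instance into minimum-phase components plus a pure delay) to avoid comb-filtering artefacts; raw responses are combined here only for brevity.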

    Principled methods for mixtures processing

    This document is my thesis for obtaining the habilitation à diriger des recherches, the French diploma required to fully supervise Ph.D. students. It summarizes the research I have done over the last 15 years and also outlines the short-term research directions and applications I want to investigate. Regarding my past research, I first describe my work on probabilistic audio modeling, including the separation of Gaussian and α-stable stochastic processes. Then, I mention my work on deep learning applied to audio, which rapidly turned into a large effort for community service. Finally, I present my contributions to machine learning, with some work on hardware compressed sensing and probabilistic generative models. My research programme involves a theoretical part that revolves around probabilistic machine learning, and an applied part that concerns the processing of time series arising in both audio and the life sciences.