
    An evaluation framework for event detection using a morphological model of acoustic scenes

    This paper introduces a model of environmental acoustic scenes which adopts a morphological approach by abstracting temporal structures of acoustic scenes. To demonstrate its potential, this model is employed to evaluate the performance of a large set of acoustic event detection systems. The model allows us to explicitly control key morphological aspects of the acoustic scene and isolate their impact on the performance of the system under evaluation. Thus, more information can be gained on the behavior of evaluated systems, providing guidance for further improvements. The proposed model is validated using submitted systems from the IEEE DCASE Challenge; results indicate that the proposed scheme is able to successfully build datasets useful for evaluating some aspects of the performance of event detection systems, most notably their robustness to new listening conditions and to increasing levels of background sounds. (Research project partly funded by ANR-11-JS03-005-01.)
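
    As a rough illustration of the kind of dataset generation such a morphological model enables, the sketch below places isolated event recordings over a background at a controlled event-to-background ratio while recording ground-truth annotations. All names and parameters here are illustrative assumptions, not the authors' implementation:

    ```python
    import numpy as np

    def simulate_scene(events, background, sr=44100, duration=10.0,
                       event_to_background_db=6.0, rng=None):
        """Mix isolated event clips over a background at a controlled level
        and return the mixture plus ground-truth (label, onset, offset)
        annotations. Illustrative sketch only, not the paper's generator."""
        if rng is None:
            rng = np.random.default_rng(0)
        n = int(sr * duration)
        scene = np.resize(background, n).astype(float)  # tile background to length
        bg_rms = np.sqrt(np.mean(scene ** 2)) + 1e-12
        annotations = []
        for label, clip in events:  # events: list of (label, np.ndarray) pairs
            start = int(rng.integers(0, n - len(clip)))
            ev_rms = np.sqrt(np.mean(clip ** 2)) + 1e-12
            # Scale each event to sit event_to_background_db above the background (RMS).
            gain = bg_rms * 10 ** (event_to_background_db / 20) / ev_rms
            scene[start:start + len(clip)] += gain * clip
            annotations.append((label, start / sr, (start + len(clip)) / sr))
        return scene, annotations
    ```

    Because the background level is an explicit parameter, it can be varied in isolation while every other aspect of the scene is held fixed, which is the property the evaluation framework exploits.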

    A morphological model for simulating acoustic scenes and its application to sound event detection

    This paper introduces a model for simulating environmental acoustic scenes that abstracts temporal structures from audio recordings. This model allows us to explicitly control key morphological aspects of the acoustic scene and to isolate their impact on the performance of the system under evaluation. Thus, more information can be gained on the behavior of an evaluated system, providing guidance for further improvements. To demonstrate its potential, this model is employed to evaluate the performance of nine state-of-the-art sound event detection systems submitted to the IEEE DCASE 2013 Challenge. Results indicate that the proposed scheme is able to successfully build datasets useful for evaluating important aspects of the performance of sound event detection systems, such as their robustness to new recording conditions and to varying levels of background audio.
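
    Building on the hypothetical `simulate_scene` generator sketched above, a robustness evaluation along a single morphological axis (here the background level) might look as follows; `detector` and `f_score` stand in for a submitted system and the challenge metric, and are assumptions rather than the DCASE tooling:

    ```python
    def robustness_curve(detector, events, background, levels_db=(-6, 0, 6, 12)):
        """Sweep the event-to-background ratio and score the detector at each
        level; all other aspects of the simulated scenes are held fixed."""
        scores = {}
        for db in levels_db:
            scene, truth = simulate_scene(events, background,
                                          event_to_background_db=db)
            predicted = detector(scene)             # hypothetical system under test
            scores[db] = f_score(predicted, truth)  # e.g. a segment-based F-score
        return scores
    ```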

    Vivos Voco: A survey of recent research on voice transformation at IRCAM

    IRCAM has long experience in the analysis, synthesis, and transformation of voice. Natural voice transformations are of great interest for many applications and can be combined with a text-to-speech system, leading to a powerful creation tool. We present research conducted at IRCAM on voice transformation over the last few years. Transformations can be achieved in a global way by modifying pitch, spectral envelope, durations, etc. While this sacrifices the possibility of attaining a specific target voice, the approach allows the production of new voices of a high degree of naturalness with different gender and age, modified vocal quality, or another speech style. These transformations can be applied in real time using ircamTools TRAX. Transformation can also be done in a more specific way in order to transform a voice towards the voice of a target speaker. Finally, we present some recent research on the transformation of expressivity.

    An Entropy Based Method for Local Time-Adaptation of the Spectrogram

    We propose a method for automatic local time-adaptation of the spectrogram of audio signals, based on the decomposition of a signal within a Gabor multi-frame through the STFT operator. The sparsity of the analysis in every individual frame of the multi-frame is evaluated through Rényi entropy measures: the best local resolution is determined by minimizing the entropy values. The overall spectrogram we obtain thus provides a locally optimal resolution that adapts over time. We give examples of the performance of our algorithm on an instrumental sound and a synthetic one, showing the improvement in spectrogram display obtained with automatic adaptation of the resolution. The analysis operator is invertible, leading to perfect reconstruction of the original signal from the analysis coefficients.
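
    A minimal sketch of the entropy-based selection step, under the simplifying assumption that the candidate resolutions are plain STFT window lengths (the paper works within a proper Gabor multi-frame):

    ```python
    import numpy as np
    from scipy.signal import stft

    def renyi_entropy(p, alpha=3.0):
        """Rényi entropy of order alpha (alpha != 1) of an energy distribution."""
        p = p / (p.sum() + 1e-12)
        return np.log2(np.sum(p ** alpha) + 1e-12) / (1.0 - alpha)

    def best_window(x, sr, windows=(256, 512, 1024, 2048, 4096), alpha=3.0):
        """Pick, for one signal segment, the analysis window whose spectrogram
        is sparsest, i.e. has minimum Rényi entropy. Illustrative sketch, not
        the authors' multi-frame implementation."""
        entropies = {}
        for n in windows:
            _, _, Z = stft(x, fs=sr, nperseg=n, noverlap=n // 2)
            entropies[n] = renyi_entropy(np.abs(Z) ** 2, alpha)
        return min(entropies, key=entropies.get)
    ```

    Applying `best_window` to successive local segments of a signal yields a resolution that evolves over time, which is the behavior the paper's spectrogram examples illustrate.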

    Mixed source model and its adapted vocal tract filter estimate for voice transformation and synthesis

    In current methods for voice transformation and speech synthesis, the vocal tract filter is usually assumed to be excited by a flat amplitude spectrum. In this article, we present a method using a mixed source model defined as a mixture of the Liljencrants-Fant (LF) model and Gaussian noise. Since it uses the LF model, the approach presented in this work is close to vocoders with an exogenous input, such as ARX-based methods or the Glottal Spectral Separation (GSS) method. Such approaches are dedicated to voice processing and promise improved naturalness compared to generic signal models. To estimate the Vocal Tract Filter (VTF) by spectral division, as in GSS, we show that a glottal source model can be combined with any envelope estimation method, unlike the ARX approach, where a least-squares AR solution is used. We therefore derive a VTF estimate which takes into account the amplitude spectra of both the deterministic and random components of the glottal source. The proposed mixed source model is controlled by a small set of intuitive and independent parameters. The relevance of this voice production model is evaluated, through listening tests, in the context of resynthesis, HMM-based speech synthesis, breathiness modification, and pitch transposition.
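
    A simplified, hypothetical sketch of the GSS-style spectral division described above; the actual method evaluates the LF model's spectrum and handles phase, whereas this magnitude-only version takes a precomputed source amplitude spectrum as input:

    ```python
    import numpy as np

    def vtf_estimate(speech_frame, source_amp_spectrum, noise_floor_db=-40.0,
                     n_fft=2048):
        """Estimate the vocal tract filter magnitude by dividing a windowed
        speech frame's spectrum by a mixed-source amplitude spectrum
        (deterministic glottal model plus a Gaussian-noise component).
        source_amp_spectrum must have length n_fft // 2 + 1."""
        win = np.hanning(len(speech_frame))
        S = np.abs(np.fft.rfft(speech_frame * win, n_fft))
        # Random component: a constant noise floor relative to the source peak,
        # added in power so the division never blows up between harmonics.
        noise = 10 ** (noise_floor_db / 20) * np.max(source_amp_spectrum)
        G = np.sqrt(source_amp_spectrum ** 2 + noise ** 2)
        return S / G
    ```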

    A reduced multiple Gabor frame for local time adaptation of the spectrogram

    In this paper we propose a method for automatic local time adaptation of the spectrogram of an audio signal, based on its decomposition within a Gabor multi-frame. The sparsity of the analyses within each individual frame is evaluated through Rényi entropy measures. According to the sparsity of the decompositions, an optimal resolution and a reduced multi-frame are determined, defining an adapted spectrogram with variable resolution and hop size. The composition of such a reduced multi-frame allows an immediate definition of a dual frame: resynthesis techniques for this adapted analysis are easily derived from the traditional phase vocoder scheme.
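
    A sketch of how such a per-segment adapted analysis might be assembled, reusing the hypothetical `best_window` selector from the earlier entry; the hop size is tied to the chosen window, and the paper's dual-frame resynthesis is not reproduced here:

    ```python
    from scipy.signal import stft

    def adapted_analysis(x, sr, segment=8192, windows=(512, 1024, 2048, 4096)):
        """Variable-resolution, variable-hop STFT: for each fixed-length
        segment, select the sparsest window and analyze with a matching hop.
        Illustrative sketch of the reduced multi-frame idea."""
        analyses = []
        for start in range(0, len(x) - segment + 1, segment):
            seg = x[start:start + segment]
            n = best_window(seg, sr, windows)            # entropy-minimizing window
            _, _, Z = stft(seg, fs=sr, nperseg=n, noverlap=n // 2)
            analyses.append((start, n, Z))               # (offset, window, coefficients)
        return analyses
    ```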

    2D/3D AudioVisual content analysis & description

    In this paper, we propose a way of using the Audio-Visual Description Profile (AVDP) of the MPEG-7 standard for 2D or stereo video and multichannel audio content description. Our aim is to provide means of using AVDP in such a way that 3D video and audio content can be correctly and consistently described. Since AVDP semantics do not include ways of dealing with 3D audiovisual content, a new semantic framework within AVDP is proposed, and examples of using AVDP to describe the results of analysis algorithms on stereo video and multichannel audio content are presented.