70 research outputs found

    Music tempo estimation using sub-band synchrony

    Tempo estimation aims to estimate the pace of a musical piece, measured in beats per minute. This paper presents a new tempo estimation method that uses coherent energy changes across multiple frequency sub-bands to identify onsets. A new measure, called the sub-band synchrony, is proposed to detect and quantify coherent amplitude changes across multiple sub-bands. Given a musical piece, our method first detects the onsets using the sub-band synchrony measure. The periodicity of the resulting onset curve, measured using the autocorrelation function, is then used to estimate the tempo value. The performance of the sub-band synchrony based tempo estimation method is evaluated on two music databases. Experimental results indicate a reasonable improvement in performance when compared to conventional methods of tempo estimation.
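
    The final step described in this abstract, reading the tempo off the periodicity of an onset curve with the autocorrelation function, can be sketched as follows. This is only a minimal illustration under stated assumptions: the sub-band synchrony measure itself is not specified in the abstract, so a synthetic impulse train stands in for the onset curve, and the function name `estimate_tempo`, the frame rate, and the BPM search range are hypothetical choices, not the authors' settings.

```python
import numpy as np


def estimate_tempo(onset_curve, frame_rate, bpm_min=40.0, bpm_max=240.0):
    """Estimate tempo (BPM) from the strongest autocorrelation peak.

    onset_curve: 1-D onset strength sequence (assumed input; the paper
                 derives it from sub-band synchrony, which is not modeled here).
    frame_rate:  number of onset-curve samples per second.
    """
    x = np.asarray(onset_curve, dtype=float)
    x = x - x.mean()  # remove the DC offset before correlating

    # Full autocorrelation; keep only the non-negative lags.
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]

    # Lags (in frames) that correspond to the allowed tempo range.
    lag_min = int(np.ceil(60.0 * frame_rate / bpm_max))
    lag_max = int(np.floor(60.0 * frame_rate / bpm_min))
    lag_max = min(lag_max, len(acf) - 1)

    # The beat period is taken as the lag with the largest autocorrelation.
    best_lag = lag_min + int(np.argmax(acf[lag_min:lag_max + 1]))
    return 60.0 * frame_rate / best_lag


# Synthetic onset curve: one impulse every 0.5 s at 100 frames per second.
frame_rate = 100.0
onsets = np.zeros(1000)
onsets[::50] = 1.0
tempo = estimate_tempo(onsets, frame_rate)
print(tempo)
```

    Restricting the search to a plausible BPM range is a common way to avoid picking the zero lag or very long lags; real onset curves usually also need handling of half-tempo and double-tempo peaks, which this sketch leaves out.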

    Deep-sound field analysis for upscaling ambisonic signals

    Higher Order Ambisonics (HOA) is a popular technique used in high quality spatial audio reproduction. Several time and frequency domain methods that exploit sparsity have been proposed in the literature. These methods use an overcomplete spherical harmonics dictionary to compute the DOA of the source. Spherical harmonic decomposition has also been used to render the spatial sound. However, at lower ambisonic orders the desired sound field can be reproduced only over a small reproduction area. Additionally, this technique is limited by low spatial resolution, which can be improved by increasing the number of loudspeakers during spatial sound reproduction. An increase in the number of loudspeakers alone is not a good choice, since it involves solving an underdetermined system of equations to improve spatial resolution. A joint method that upscales the Ambisonics order while simultaneously increasing the number of loudspeakers is a feasible solution to this problem. Deep Neural Networks have hitherto not been investigated in detail in the context of upscaling ambisonics. In this work, a novel Sequential Multi-Stage DNN (SMS-DNN) is developed for upscaling Ambisonic signals. The SMS-DNN consists of sequentially stacked DNNs, where each stacked DNN upscales the order of the signal by one. This DNN structure is motivated by the fact that the spherical components of the encoded signal are independent of each other. Additionally, for a particular direction (θ, φ) of the sound source, an increase in the spherical harmonic order only appends higher order spherical harmonic coefficients to the encoder of the previous order, while the lower order spherical harmonic coefficients remain unchanged. Hence the individual DNNs in the SMS-DNN can be trained independently for any upscaling order. Monophonic sound is acquired using a B-format (first order) ambisonic microphone. These signals are upscaled into order-N HOA encoded plane wave sounds using the SMS-DNN. The SMS-DNN allows for training of a very large number of layers, since training is performed in blocks consisting of a fixed number of layers. Hence each stage can be trained independently. Additionally, the vanishing gradient problem in DNNs with a large number of layers is effectively handled by the proposed SMS-DNN due to its sequential nature. This method does not require prior estimation of the source locations and works in multiple source scenarios. Experiments on ambisonics upscaling are conducted to evaluate the performance of the proposed method. The SMS-DNN architecture used in the experiments consists of N-1 fully connected feedforward neural networks, where each network is trained separately. Here N is the ambisonics order up to which upscaling needs to be performed. An input training dataset, in which each example is a combination of five randomly located sound sources, is also developed for training the SMS-DNN. The output training dataset consists of a higher order encoding of the same mixture of sounds with similar locations as the input data. Reconstructed sound field analysis and subjective and objective evaluations are conducted on the upscaled Ambisonic sound scenes. Mean squared error analysis of upscaled higher order reproduced fields indicates an error of up to -10 dB. As the order of upscaling is increased, the error-free reproduction area (sweet spot) increases. Average error distribution plots are also used to indicate the significance of the proposed method. MUSHRA tests and MOS (subjective evaluation) and PEAQ tests (objective evaluation) are also reported to indicate the perceptual quality of the reproduced sounds when compared to benchmark HOA reproduction.
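
    The stage structure described in this abstract (N-1 stacked networks, each raising the order by one and only appending new coefficients) can be made concrete with a small bookkeeping sketch. It relies on the standard fact that a full 3D ambisonic signal of order n has (n + 1)^2 channels, so going from order n to n + 1 appends 2(n + 1) + 1 channels. The function names and the dictionary layout are hypothetical, and whether each network in the paper outputs only the new channels or the full higher-order signal is not stated in the abstract; the sketch assumes the former.

```python
def hoa_channels(order):
    """Number of channels in a full 3D ambisonic signal of the given order."""
    return (order + 1) ** 2


def stage_plan(target_order, input_order=1):
    """List the per-stage input and output sizes for sequential upscaling.

    Assumption: stage n takes the order-n signal and predicts only the
    channels that order n + 1 appends; lower-order channels pass through.
    """
    plan = []
    for n in range(input_order, target_order):
        plan.append({
            "from_order": n,
            "to_order": n + 1,
            "in_channels": hoa_channels(n),
            "new_channels": hoa_channels(n + 1) - hoa_channels(n),
        })
    return plan


# Upscaling a first-order (B-format) signal to order 4 takes N - 1 = 3 stages.
plan = stage_plan(4)
for stage in plan:
    print(stage)
```

    Because each stage depends only on its own input order, the stages can be built and trained one at a time, which matches the independence argument made in the abstract.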

    Bridged variational autoencoders for joint modeling of images and attributes

    Generative models have recently shown the ability to generate realistic data and model the underlying distribution accurately. However, joint modeling of an image with the attributes it is labeled with requires learning a cross-modal correspondence between image and attribute data. Although the information in a set of images and the information in its attributes have completely different statistical properties, there exists an inherent correspondence that is challenging to capture. Various models have aimed at capturing this correspondence, either through joint modeling with a variational autoencoder or through separate encoder networks whose outputs are then concatenated. We present an alternative by proposing a bridged variational autoencoder that learns cross-modal correspondence by incorporating cross-modal hallucination losses in the latent space. In comparison to existing methods, we find that using a bridge connection in the latent space not only gives better generation results but also yields a highly parameter-efficient model, providing a 40% reduction in training parameters for a bimodal dataset and a nearly 70% reduction for a trimodal dataset. We validate the proposed method through comparison with state-of-the-art methods and benchmarking on standard datasets.