478 research outputs found
Linear Mixing Models for Active Listening of Music Productions in Realistic Studio Conditions
The mixing/demixing of audio signals as addressed in the signal processing literature (the "source separation" problem) and the music production in studio remain quite separated worlds. Scientific audio scene analysis rather focuses on "natural" mixtures and most often uses linear (convolutive) models of point sources placed in the same acoustic space. In contrast, the sound engineer can mix musical signals of very different nature and belonging to different acoustic spaces, and exploits many audio effects including non-linear processes. In the present paper we discuss these differences within the strongly emerging framework of active music listening, which is precisely at the crossroads of these two worlds: it consists in giving to the listener the ability to manipulate the different musical sources while listening to a musical piece. We propose a model that allows the description of a general studio mixing process as a linear stationary process of "generalized source image signals" considered as individual tracks. Such a model can be used to allow the recovery of the isolated tracks while preserving the professional sound quality of the mixture. A simple addition of these recovered tracks enables the end-user to recover the full-quality stereo mix, while these tracks can also be used for, e.g., basic remix / karaoke / soloing and re-orchestration applications.
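The additive property described in this abstract (the stereo mix is the plain sum of the per-track source images, and a remix is a re-weighted sum) can be illustrated with a minimal sketch. The array shapes, the `mix` helper and the random placeholder tracks are assumptions made for illustration only; they are not taken from the paper.

```python
import numpy as np

def mix(source_images, gains=None):
    """Sum stereo source images of shape (J, 2, T) into a stereo mix (2, T).

    `gains` holds one hypothetical user-controlled gain per track;
    the default of all ones reproduces the original mix.
    """
    source_images = np.asarray(source_images, dtype=float)
    if gains is None:
        gains = np.ones(source_images.shape[0])
    gains = np.asarray(gains, dtype=float)
    # Weighted sum over the track axis.
    return np.tensordot(gains, source_images, axes=(0, 0))

rng = np.random.default_rng(0)
images = rng.standard_normal((3, 2, 8))        # 3 placeholder tracks, stereo, 8 samples
full_mix = mix(images)                         # plain addition gives back the mix
karaoke = mix(images, gains=[0.0, 1.0, 1.0])   # mute track 0, keep the rest
```

Setting a single gain to one and the others to zero gives the "soloing" case mentioned in the abstract.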
On the Informed Source Separation Approach for Interactive Remixing in Stereo
Informed source separation (ISS) has become a popular trend in the audio signal processing community over the past few years. Its purpose is to decompose a mixture signal into its constituent parts at the desired or the best possible quality level given some metadata. In this paper we present a comparison between two ISS systems and relate the ISS approach in various configurations with conventional coding of separate tracks for interactive remixing in stereo. The compared systems are Underdetermined Source Signal Recovery (USSR) and Enhanced Audio Object Separation (EAOS). The latter forms a part of MPEG's Spatial Audio Object Coding technology. The performance is evaluated using objective difference grades computed with PEMO-Q. The results suggest that USSR performs perceptually better than EAOS and has a lower computational complexity.
Underdetermined convolutive source separation using two dimensional non-negative factorization techniques
PhD Thesis. This thesis considers underdetermined audio source separation, that is, estimating the original audio sources from the observed mixture when the number of audio sources is greater than the number of channels. The separation has been carried out using two approaches: blind audio source separation and informed audio source separation. The blind approach depends on the mixture signal only and assumes that the separation is accomplished without any prior information (or with as little as possible) about the sources. The informed approach uses an exemplar in addition to the mixture signal to emulate the targeted speech signal to be separated. Both approaches are based on two dimensional factorization techniques that decompose the signal into two tensors that are convolved in both the temporal and spectral directions. Both approaches are applied to the convolutive mixture and the highly reverberant convolutive mixture, which are more realistic than the instantaneous mixture.
In this work a novel algorithm based on the nonnegative matrix factor two dimensional deconvolution (NMF2D) with adaptive sparsity has been proposed to separate the audio sources that have been mixed in an underdetermined convolutive mixture. Additionally, a novel Gamma Exponential Process has been proposed for estimating the convolutive parameters and number of components of the NMF2D/NTF2D, and to initialize the NMF2D parameters. In addition, the effects of different window lengths have been investigated to determine the best-fit model that suits the characteristics of the audio signal. Furthermore, a novel algorithm, namely the fusion K models of full-rank weighted nonnegative tensor factor two dimensional deconvolution (K-wNTF2D), has been proposed. The K-wNTF2D is developed for its ability to model both the spectral and temporal changes, and the spatial covariance matrix that addresses the high reverberation problem. Variable sparsity that is derived from the Gibbs distribution is optimized under the Itakura-Saito divergence and adapted into the K-wNTF2D model. The tensors of this algorithm have been initialized by a novel initialization method, namely the SVD two-dimensional deconvolution (SVD2D). Finally, two novel informed source separation algorithms, namely the semi-exemplar based algorithm and the exemplar-based algorithm, have been proposed. These algorithms are based on the NMF2D model and the proposed two dimensional nonnegative matrix partial co-factorization (2DNMPCF) model. The idea of incorporating the exemplar is to inform the proposed separation algorithms about the targeted signal to be separated by initializing their parameters and guiding them. The adaptive sparsity is derived for both of the proposed algorithms. Also, a multistage version of the proposed exemplar-based algorithm has been proposed in order to further enhance the separation performance.
Results have shown that the proposed separation algorithms are very promising, more flexible, and offer an alternative model to the conventional methods.
Audio source separation for music in low-latency and high-latency scenarios
This thesis proposes specific methods to address the limitations of current music source separation methods in low-latency and high-latency scenarios. First, we focus on methods with low computational cost and low latency. We propose the use of Tikhonov regularization as a method for spectrum decomposition in the low-latency context. We compare it to existing techniques in pitch estimation and tracking tasks, which are crucial steps in many separation methods. We then use the proposed spectrum decomposition method in low-latency separation tasks targeting singing voice, bass and drums. Second, we propose several high-latency methods that improve the separation of singing voice by modeling components that are often not accounted for, such as breathiness and consonants. Finally, we explore using temporal correlations and human annotations to enhance the separation of drums and complex polyphonic music signals.
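As a rough illustration of why Tikhonov regularization suits a low-latency setting, the sketch below uses the standard closed-form ridge solution g = (BᵀB + λI)⁻¹Bᵀx, in which the decoding matrix is computed once and each incoming spectrum frame then costs a single matrix-vector product. The basis matrix, the value of `lam` and the helper name `tikhonov_decoder` are assumptions for illustration, not the thesis's actual configuration.

```python
import numpy as np

def tikhonov_decoder(basis, lam):
    """Precompute the matrix that maps a spectrum frame to component gains.

    basis: (n_bins, n_components) spectral dictionary (assumed given).
    lam:   Tikhonov (ridge) regularization weight.
    """
    n_components = basis.shape[1]
    gram = basis.T @ basis + lam * np.eye(n_components)
    # Solves gram @ D = basis.T, i.e. D = (B^T B + lam I)^-1 B^T.
    return np.linalg.solve(gram, basis.T)

rng = np.random.default_rng(0)
basis = np.abs(rng.standard_normal((64, 5)))      # placeholder dictionary
true_gains = np.array([0.0, 2.0, 0.0, 0.5, 0.0])
frame = basis @ true_gains                        # synthetic spectrum frame

decoder = tikhonov_decoder(basis, lam=1e-6)       # done once, offline
gains = decoder @ frame                           # per-frame cost: one product
```

Unlike nonnegative factorization, this closed form does not constrain the gains to be nonnegative, so any such constraint would have to be handled separately.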
Fusion of Multimodal Information in Music Content Analysis
Music is often processed through its acoustic realization. This is restrictive in the sense that music is clearly a highly multimodal concept where various types of heterogeneous information can be associated with a given piece of music (a musical score, musicians' gestures, lyrics, user-generated metadata, etc.). This has recently led researchers to apprehend music through its various facets, giving rise to "multimodal music analysis" studies. This article gives a synthetic overview of methods that have been successfully employed in multimodal signal analysis. In particular, their use in music content processing is discussed in more detail through five case studies that highlight different multimodal integration techniques. The case studies include an example of cross-modal correlation for music video analysis, an audiovisual drum transcription system, a description of the concept of informed source separation, a discussion of multimodal dance-scene analysis, and an example of user-interactive music analysis. In the light of these case studies, some perspectives of multimodality in music processing are finally suggested.
Single Channel auditory source separation with neural network
Although distinguishing different sounds in a noisy environment is a relatively easy task for humans, source separation has long been extremely difficult in audio signal processing. The problem is challenging for three reasons: the large variety of sound types, the abundance of mixing conditions, and the unclear mechanism for distinguishing sources, especially for similar sounds.
In recent years, neural network based methods have achieved impressive successes in various problems, including speech enhancement, where the task is to separate clean speech from a noisy mixture. However, current deep learning based source separators do not perform well on real recorded noisy speech and, more importantly, are not applicable in more general source separation scenarios such as overlapped speech.
In this thesis, we first propose extensions to the current mask learning network for speech enhancement, to fix the scale mismatch problem that usually occurs in real recorded audio. We solve this problem by adding two restoration layers to the existing mask learning network. We also propose a residual learning architecture for speech enhancement, further improving the network's generalization under different recording conditions. We evaluate the proposed speech enhancement models on CHiME 3 data. Without retraining the acoustic model, the best bi-directional LSTM with residual connections yields 25.13% relative WER reduction on real data and 34.03% WER on simulated data.
Then we propose a novel neural network based model called “deep clustering” for more general source separation tasks. We train a deep network to assign contrastive embedding vectors to each time-frequency region of the spectrogram in order to implicitly predict the segmentation labels of the target spectrogram from the input mixtures. This yields a deep network-based analogue to spectral clustering, in that the embeddings form a low-rank pairwise affinity matrix that approximates the ideal affinity matrix, while enabling much faster performance. At test time, the clustering step “decodes” the segmentation implicit in the embeddings by optimizing K-means with respect to the unknown assignments. Experiments on single channel mixtures from multiple speakers show that a speaker-independent model trained on two-speaker and three-speaker mixtures can improve signal quality for mixtures of held-out speakers by an average of over 10 dB.
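The low-rank affinity idea in the paragraph above can be made concrete with a small sketch. With embeddings V (one row per time-frequency bin) and one-hot source labels Y, the squared Frobenius distance between the affinities VVᵀ and YYᵀ can be evaluated without forming either N×N matrix, using the identity ‖VVᵀ − YYᵀ‖² = ‖VᵀV‖² − 2‖VᵀY‖² + ‖YᵀY‖². The sizes and random data below are placeholders, and the function name `affinity_loss` is an assumption, not the thesis's code.

```python
import numpy as np

def affinity_loss(embeddings, labels_onehot):
    """Squared Frobenius distance between V V^T and Y Y^T, low-rank form.

    embeddings:    (n_bins, dim) embedding per time-frequency bin.
    labels_onehot: (n_bins, n_sources) ideal source assignment.
    """
    V, Y = embeddings, labels_onehot
    return (np.sum((V.T @ V) ** 2)
            - 2.0 * np.sum((V.T @ Y) ** 2)
            + np.sum((Y.T @ Y) ** 2))

rng = np.random.default_rng(0)
n_bins, dim, n_sources = 50, 4, 2
labels = rng.integers(0, n_sources, size=n_bins)
Y = np.eye(n_sources)[labels]                      # one-hot assignments
V = rng.standard_normal((n_bins, dim))
V /= np.linalg.norm(V, axis=1, keepdims=True)      # unit-norm embeddings

# Reference value that builds the full (n_bins x n_bins) affinities.
naive = np.sum((V @ V.T - Y @ Y.T) ** 2)
fast = affinity_loss(V, Y)
```

The low-rank form only needs matrices of size dim × dim and dim × n_sources, which is what keeps the objective tractable for spectrograms with many bins.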
We then propose an extension of deep clustering, named the “deep attractor” network, that allows the system to perform efficient end-to-end training. In the proposed model, attractor points for each source are first created in the embedding space of the acoustic signals by finding the centroids of the sources; these attractors pull together the time-frequency bins corresponding to each source and are subsequently used to determine the similarity of each bin in the mixture to each source. The network is then trained to minimize the reconstruction error of each source by optimizing the embeddings. We show that this framework can achieve even better results.
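A minimal sketch of the attractor step described above, assuming the training-time case where source assignments are known: each attractor is the centroid of the embeddings assigned to a source, and a softmax over bin-to-attractor similarities gives soft masks. The helper name `attractor_masks`, the dot-product similarity and the toy two-dimensional embeddings are illustrative assumptions rather than the thesis's exact formulation.

```python
import numpy as np

def attractor_masks(embeddings, assignments):
    """Compute per-source attractors and soft masks.

    embeddings:  (n_bins, dim) embedding per time-frequency bin.
    assignments: (n_bins, n_sources) one-hot source membership.
    """
    counts = assignments.sum(axis=0)                        # bins per source
    attractors = (assignments.T @ embeddings) / counts[:, None]
    scores = embeddings @ attractors.T                      # dot-product similarity
    scores -= scores.max(axis=1, keepdims=True)             # numerical stability
    weights = np.exp(scores)
    masks = weights / weights.sum(axis=1, keepdims=True)    # softmax over sources
    return attractors, masks

rng = np.random.default_rng(0)
labels = np.arange(20) % 2                                  # alternate two sources
centers = np.array([[3.0, 0.0], [0.0, 3.0]])                # toy cluster centres
embeddings = centers[labels] + 0.01 * rng.standard_normal((20, 2))
assignments = np.eye(2)[labels]

attractors, masks = attractor_masks(embeddings, assignments)
```

Multiplying each mask by the mixture spectrogram would give the per-source estimates whose reconstruction error the network is trained to minimize.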
Lastly, we introduce two applications of the proposed models: singing voice separation and a smart hearing aid device. For the former, a multi-task architecture is proposed, which combines deep clustering with a classification based network. A new state-of-the-art separation result was achieved, where the signal-to-noise ratio was improved by 11.1 dB on music and 7.9 dB on singing voice. In the smart hearing aid application, we combine neural decoding with the separation network. The system first decodes the user’s attention, which is then used to guide the separator toward the target source. Both objective and subjective studies show that the proposed system can accurately decode the attention and significantly improve the user experience.
Very Low Bitrate Spatial Audio Coding with Dimensionality Reduction
In this paper, we show that tensor compression techniques based on randomization and partial observations are very useful for spatial audio object coding. In this application, we aim at transmitting several audio signals called objects from a coder to a decoder. A common strategy is to transmit only the downmix of the objects along with a small amount of information permitting reconstruction at the decoder. In practice, this is done by transmitting compressed versions of the objects' spectrograms and separating the mix with Wiener filters. Previous research used nonnegative tensor factorizations in this context, with bitrates as low as 1 kbps per object. Building on recent advances in tensor compression, we show that the computation time for encoding can be greatly reduced. Then, we demonstrate how the mixture can be exploited at the decoder to avoid the transmission of many parameters, permitting bitrates as low as 0.1 kbps per object for comparable performance.
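The decoder-side step mentioned in this abstract, separating the downmix with Wiener filters given approximate object spectrograms, can be sketched for the single-channel case as follows. Each object's mask is its power spectrogram divided by the sum over all objects, applied to the mixture's time-frequency representation. The shapes, the random placeholder data and the helper name `wiener_separate` are assumptions for illustration; the paper's actual codec and tensor compression are not reproduced here.

```python
import numpy as np

def wiener_separate(mix_stft, source_psd, eps=1e-12):
    """Single-channel Wiener-style separation.

    mix_stft:   (n_freq, n_frames) complex mixture STFT.
    source_psd: (n_objects, n_freq, n_frames) nonnegative power
                spectrograms, e.g. decoded from transmitted side information.
    """
    source_psd = np.asarray(source_psd, dtype=float)
    total = source_psd.sum(axis=0, keepdims=True) + eps   # avoid division by zero
    masks = source_psd / total
    return masks * mix_stft[np.newaxis, ...]

rng = np.random.default_rng(0)
n_objects, n_freq, n_frames = 3, 16, 10
psd = rng.uniform(0.1, 1.0, size=(n_objects, n_freq, n_frames))
mix_stft = (rng.standard_normal((n_freq, n_frames))
            + 1j * rng.standard_normal((n_freq, n_frames)))

estimates = wiener_separate(mix_stft, psd)
```

Because the masks sum to one in every time-frequency bin, the object estimates add back up to the transmitted downmix.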
Rétroingénierie du son pour l'écoute active et autres applications
This work deals with the problem of reverse audio engineering for active listening. The format under consideration corresponds to the audio CD. The musical content is viewed as the result of a concatenation of the composition, the recording, the mixing, and the mastering. The inversion of the two latter stages constitutes the core of the problem at hand. The audio signal is treated as a post-nonlinear mixture. Thus, the mixture is decompressed before being decomposed into audio tracks. The problem is tackled in an informed context: the inversion is accompanied by information which is specific to the content production. In this manner, the quality of the inversion is significantly improved. The information is reduced in size by the use of quantization and coding methods, and some facts on psychoacoustics. The proposed methods are applicable in real time and have a low complexity. The obtained results advance the state of the art and contribute new insights.