Evaluation of 2D Acoustic Signal Representations for Acoustic-Based Machine Condition Monitoring
Acoustic-based machine condition monitoring (MCM) offers an attractive alternative to conventional MCM approaches such as vibration analysis and lubrication monitoring. Classifying anomalous machine operating sounds, however, requires an effective 2D representation of the acoustic signal, and this paper explores that question. A baseline convolutional neural network (CNN) is implemented and trained on rolling element bearing acoustic fault data. Three representations are compared: the log-spectrogram, the short-time Fourier transform (STFT) magnitude and the log-Mel spectrogram. The results establish the log-Mel spectrogram and the log-spectrogram as promising candidates for further exploration.
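A minimal sketch of how the three representations can be computed with librosa; the file name and the frame parameters (n_fft, hop_length, n_mels) are illustrative assumptions, not the paper's settings:

```python
import numpy as np
import librosa

y, sr = librosa.load("bearing_recording.wav", sr=None)   # hypothetical file

stft = librosa.stft(y, n_fft=1024, hop_length=512)
stft_mag = np.abs(stft)                                   # STFT magnitude
log_spec = librosa.amplitude_to_db(stft_mag, ref=np.max)  # log-spectrogram

mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=512, n_mels=64)
log_mel = librosa.power_to_db(mel, ref=np.max)            # log-Mel spectrogram
```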
Music Structure Boundaries Estimation Using Multiple Self-Similarity Matrices as Input Depth of Convolutional Neural Networks
In this paper, we propose a new input representation for a convolutional neural network aimed at estimating music structure boundaries. For this task, previous works used a network performing late fusion of a Mel-scaled log-magnitude spectrogram and a self-similarity lag matrix. We propose instead to use square sub-matrices centered on the main diagonals of several self-similarity matrices, each one computed from a different audio descriptor, and to combine them along the depth of the input layer. We show that this representation improves the results over the use of the self-similarity lag matrix, and that using the depth of the input layer provides a convenient way to perform early fusion of audio representations.
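A rough sketch of the proposed input construction, under our own assumptions about descriptors and patch size: self-similarity matrices are computed from several descriptors, square sub-matrices centered on the main diagonal are extracted per frame, and the per-descriptor patches are stacked along the channel (depth) axis:

```python
import numpy as np

def self_similarity(features):
    """Cosine self-similarity matrix for a (time, dim) feature sequence."""
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    return f @ f.T

def diagonal_patches(ssm, half_width):
    """Square sub-matrices centered on the main diagonal, one per frame."""
    n = ssm.shape[0]
    pad = np.pad(ssm, half_width, mode="constant")
    w = 2 * half_width + 1
    return np.stack([pad[i:i + w, i:i + w] for i in range(n)])

# Hypothetical descriptors (e.g. MFCC and chroma sequences of equal length):
mfcc = np.random.randn(500, 20)
chroma = np.random.randn(500, 12)

patches = [diagonal_patches(self_similarity(f), half_width=32)
           for f in (mfcc, chroma)]
cnn_input = np.stack(patches, axis=-1)   # (frames, 65, 65, n_descriptors)
```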
Detection and restoration of click degraded audio based on high-order sparse linear prediction
Clicks are short-duration defects that affect most archived audio media. Linear prediction (LP) modeling for the representation and restoration of audio signals corrupted by click degradation has been extensively studied. When the time locations of the samples affected by clicks are known, high-order sparse linear prediction has been shown to yield significant restoration improvements over conventional LP-based approaches. For the practical use of such methods, however, identifying the time locations of the affected samples is critical. In this paper, the use of high-order sparse linear prediction for both the detection and the restoration of click-degraded audio is proposed. Results in terms of click duration estimation, SNR improvement and perceptual audio quality show that the proposed approach outperforms state-of-the-art LP-based approaches.
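One common way to realize sparse LP is L1-regularized least squares; the sketch below uses scikit-learn's Lasso as a stand-in for the paper's estimator, with illustrative order and regularization values, and flags clicks by thresholding the prediction residual:

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_lp_residual(x, order=100, alpha=1e-3):
    """Residual of a high-order sparse LP fit; large spikes suggest clicks.
    The order and alpha values are illustrative, not the paper's settings."""
    # Regression matrix: predict x[n] from x[n-1] ... x[n-order].
    X = np.column_stack([x[order - k - 1:len(x) - k - 1] for k in range(order)])
    y = x[order:]
    model = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000)
    model.fit(X, y)
    return y - model.predict(X)

x = np.random.randn(44100)                # stand-in for a degraded audio frame
res = sparse_lp_residual(x)
clicks = np.abs(res) > 4 * np.std(res)    # simple residual thresholding
```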
Music Genre Classification with ResNet and Bi-GRU Using Visual Spectrograms
Music recommendation systems have emerged as a vital component for enhancing user experience and satisfaction in music streaming services, which now dominate music consumption. The key challenge in improving these recommender systems lies in comprehending the complexity of music data, in particular for the underpinning task of music genre classification. The limitations of manual genre classification have highlighted the need for a more advanced solution: the Automatic Music Genre Classification (AMGC) system. While traditional machine learning techniques have shown potential in genre classification, they rely heavily on manually engineered features and feature selection and fail to capture the full complexity of music data. Deep learning architectures such as the traditional Convolutional Neural Network (CNN), on the other hand, are effective at capturing spatial hierarchies but struggle to capture the temporal dynamics inherent in music data. To address these challenges, this study uses visual spectrograms as input and proposes a hybrid model that combines the strengths of the Residual Neural Network (ResNet) and the Gated Recurrent Unit (GRU). The model is designed to provide a more comprehensive analysis of music data and hence potentially more accurate genre classification, offering a route to improved music recommender systems.
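A minimal PyTorch sketch of such a hybrid: a ResNet backbone extracts a 2D feature map from the spectrogram, frequency is pooled away, and a bidirectional GRU models the remaining time axis. The layer sizes and the use of resnet18 are our assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class ResNetBiGRU(nn.Module):
    """A sketch of a ResNet + Bi-GRU hybrid; hyperparameters are assumed."""
    def __init__(self, n_genres=10, hidden=128):
        super().__init__()
        cnn = resnet18(weights=None)
        cnn.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3,
                              bias=False)               # 1-channel spectrograms
        self.backbone = nn.Sequential(*list(cnn.children())[:-2])  # keep 2D map
        self.gru = nn.GRU(512, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_genres)

    def forward(self, spec):                            # spec: (B, 1, freq, time)
        fmap = self.backbone(spec)                      # (B, 512, f', t')
        seq = fmap.mean(dim=2).transpose(1, 2)          # pool freq -> (B, t', 512)
        out, _ = self.gru(seq)                          # temporal modeling
        return self.head(out.mean(dim=1))               # genre logits

logits = ResNetBiGRU()(torch.randn(2, 1, 128, 431))     # toy batch of log-Mels
```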
Notes on the use of variational autoencoders for speech and audio spectrogram modeling
Variational autoencoders (VAEs) are powerful (deep) generative artificial neural networks. They have recently been used in several papers for speech and audio processing, in particular for modeling speech/audio spectrograms. In these papers, little theoretical support is given for the chosen data representation, the decoder likelihood function, or the corresponding cost function used for training the VAE. Yet a sound theoretical statistical framework exists and has been extensively presented and discussed in papers dealing with nonnegative matrix factorization (NMF) of audio spectrograms and its application to audio source separation. In the present paper, we show how this statistical framework applies to VAE-based speech/audio spectrogram modeling, providing insights into the choice and interpretability of the data representation and model parameterization.
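As an illustration of that framework, the Itakura-Saito divergence between an observed power spectrogram and the decoder's output variance is, up to a constant, the negative log-likelihood of the zero-mean complex Gaussian model underlying IS-NMF. A sketch in PyTorch, with variable names of our choosing rather than the paper's:

```python
import torch

def itakura_saito(S, var, eps=1e-8):
    """IS divergence between a power spectrogram S and the decoder variance
    `var`; under a zero-mean complex Gaussian model this is the negative
    log-likelihood up to an additive constant."""
    r = S / (var + eps)
    return (r - torch.log(r + eps) - 1.0).sum()

def vae_loss(S, var, mu, logvar):
    """Negative ELBO: IS reconstruction term plus the usual Gaussian KL.
    A sketch of the framework discussed above, not the authors' code."""
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum()
    return itakura_saito(S, var) + kl
```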
Auditory time-frequency masking: psychoacoustical data and application to audio representations
In this paper, the results of psychoacoustical experiments on auditory time-frequency (TF) masking are presented, using stimuli (masker and target) with maximal concentration in the TF plane. The target was shifted relative to the masker along the time axis, the frequency axis, or both. The results show that a simple superposition of spectral and temporal masking functions does not provide an accurate representation of the measured TF masking function. This confirms the inaccuracy of the simple models of TF masking currently implemented in some perceptual audio codecs. In the context of audio signal processing, the present results constitute a crucial basis for predicting auditory masking in TF representations of sounds. An algorithm is proposed that removes the inaudible components in the wavelet transform of a sound while causing no audible difference to the original sound after re-synthesis. Preliminary results are promising, although further development is required.
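A deliberately crude sketch of that final idea, assuming PyWavelets and a fixed relative threshold; the actual algorithm derives its irrelevance threshold from the measured TF masking data rather than the constant used here:

```python
import numpy as np
import pywt

def remove_inaudible(x, wavelet="db8", level=6, rel_db=-40.0):
    """Zero wavelet coefficients far below the local masker level, then
    re-synthesize. The fixed relative threshold is our placeholder for the
    psychoacoustically derived masking threshold used in the paper."""
    coeffs = pywt.wavedec(x, wavelet, level=level)
    out = []
    for c in coeffs:
        thr = np.max(np.abs(c)) * 10 ** (rel_db / 20.0)
        out.append(np.where(np.abs(c) < thr, 0.0, c))
    return pywt.waverec(out, wavelet)

y_clean = remove_inaudible(np.random.randn(16384))   # toy signal
```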
Final Research Report on Auto-Tagging of Music
Deliverable D4.7, a research report, covers the work achieved by IRCAM up to M36 on the “auto-tagging of music”. The software libraries resulting from the research have been integrated into the Fincons/HearDis! Music Library Manager or are used by TU Berlin. The final software libraries are described in D4.5.
The research work on auto-tagging has concentrated on four aspects:
1) Further improving IRCAM’s machine-learning system ircamclass. This has been done by developing the new MASSS audio features and by integrating audio augmentation and audio segmentation into ircamclass. The system has then been applied to train HearDis! “soft” features (Vocals-1, Vocals-2, Pop-Appeal, Intensity, Instrumentation, Timbre, Genre, Style). This is described in Part 3.
2) Developing two sets of “hard” features (i.e. related to musical or musicological concepts) as specified by HearDis! (for integration into the Fincons/HearDis! Music Library Manager) and TU Berlin (as input for the prediction model of the GMBI attributes). Such features are either derived from previously estimated higher-level concepts (such as structure, key or chord succession) or obtained by developing new signal processing algorithms (such as HPSS, illustrated in the sketch after this list, or main melody estimation). This is described in Part 4.
3) Developing audio features to characterize the audio quality of a music track. The goal is to describe the quality of the audio independently of its apparent encoding. This is then used to estimate audio degradation or the decade of a music track, and in turn to ensure that playlists contain tracks with similar audio quality. This is described in Part 5.
4) Developing innovative algorithms to extract specific audio features to improve music mixes. So far, innovative techniques (based on various blind audio source separation algorithms and convolutional neural networks) have been developed for singing voice separation, singing voice segmentation, music structure boundary estimation, and DJ cue-region estimation. This is described in Part 6.
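As referenced in item 2, a minimal illustration of HPSS (harmonic-percussive source separation) using librosa's median-filtering implementation; this is a generic sketch with a hypothetical input file, not IRCAM's own algorithm:

```python
import librosa

y, sr = librosa.load("track.wav", sr=None)    # hypothetical input track
S = librosa.stft(y)
H, P = librosa.decompose.hpss(S)              # harmonic / percussive STFTs
y_harmonic = librosa.istft(H)
y_percussive = librosa.istft(P)
```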
Matching Pursuits with Random Sequential Subdictionaries
Matching pursuits are a class of greedy algorithms commonly used in signal processing to solve the sparse approximation problem. They rely on an atom selection step that requires the calculation of numerous projections, which can be computationally costly for large dictionaries and limits their competitiveness in coding applications. We propose using a non-adaptive random sequence of subdictionaries in the decomposition process, thus parsing a large dictionary in a probabilistic fashion with no additional projection cost and no parameter estimation. A theoretical model based on order statistics is provided, along with experimental evidence showing that the novel algorithm can be used efficiently on sparse approximation problems. An application to audio signal compression with multiscale time-frequency dictionaries is presented, along with a discussion of the complexity and practical implementations.
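A toy sketch of the idea: at each matching pursuit iteration, the atom search is restricted to a randomly drawn subdictionary, so only a fraction of the projections are computed. The subdictionary size, the dictionary construction and the iteration count below are our assumptions:

```python
import numpy as np

def mp_random_subdicts(x, dictionary, n_iter=100, sub_size=256, seed=0):
    """Matching pursuit where each atom-selection step searches only a
    random subdictionary, cutting the projections per iteration. A sketch
    of the idea described above; details differ from the paper."""
    rng = np.random.default_rng(seed)
    residual = x.copy()
    coeffs = np.zeros(dictionary.shape[1])
    for _ in range(n_iter):
        sub = rng.choice(dictionary.shape[1], size=sub_size, replace=False)
        proj = dictionary[:, sub].T @ residual       # projections on subset only
        best = sub[np.argmax(np.abs(proj))]
        c = dictionary[:, best] @ residual           # assumes unit-norm atoms
        coeffs[best] += c
        residual -= c * dictionary[:, best]
    return coeffs, residual

D = np.linalg.qr(np.random.randn(512, 512))[0]       # unit-norm toy atoms
D = np.hstack([D, np.roll(D, 1, axis=0)])            # overcomplete (512 x 1024)
x = D[:, :5] @ np.random.randn(5)                    # 5-sparse test signal
coeffs, res = mp_random_subdicts(x, D)
```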
Adaptive Noise Reduction for Sound Event Detection Using Subband-Weighted NMF
Sound event detection in real-world environments suffers from interference by non-stationary and time-varying noise. This paper presents an adaptive noise reduction method for sound event detection based on non-negative matrix factorization (NMF). First, a noise dictionary is learned from the input noisy signal using robust NMF, which supports adaptation to noise variations. The estimated noise dictionary is then used, in combination with a pre-trained event dictionary, to build a supervised source separation framework. Second, to improve separation quality, we extend the basic NMF model to a weighted form, with the aim of varying the relative importance of the different components when separating a target sound event from noise. With properly designed weights, the separation process is forced to rely more on the dominant event components, while the noise is greatly suppressed. The proposed method is evaluated on the dataset of the rare sound event detection task of the DCASE 2017 challenge and achieves results comparable to the top-ranking system based on convolutional recurrent neural networks (CRNNs). The proposed weighted NMF method shows excellent noise reduction ability and improves the F-score by 5% compared to the unweighted approach.
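A sketch of the weighted separation step under our own simplifications: activations for a fixed stacked dictionary (pre-trained event atoms plus noise atoms learned from the input) are estimated with multiplicative updates for a weighted Euclidean cost. The paper's actual weighting scheme and divergence may differ:

```python
import numpy as np

def weighted_nmf_activations(V, W, weights, n_iter=200, eps=1e-9):
    """Estimate activations H for a fixed dictionary W by multiplicative
    updates minimizing || weights * (V - W H) ||^2, where per-frequency
    weights vary the importance of different components. A sketch only."""
    F, T = V.shape
    H = np.abs(np.random.rand(W.shape[1], T))
    L = weights.reshape(F, 1)                  # subband weights, broadcast
    for _ in range(n_iter):
        num = W.T @ (L * V)
        den = W.T @ (L * (W @ H)) + eps
        H *= num / den
    return H

# Hypothetical shapes: 513 bins, 100 frames, 20 event + 10 noise atoms.
V = np.abs(np.random.rand(513, 100))           # noisy magnitude spectrogram
W = np.abs(np.random.rand(513, 30))            # stacked event + noise dictionary
H = weighted_nmf_activations(V, W, weights=np.linspace(1.0, 2.0, 513))
```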
Enhancing film sound design using audio features, regression models and artificial neural networks
Making the link between human emotion and music is challenging. Our aim was to produce an efficient system that emotionally rates songs from multiple genres. To achieve this, we conducted a series of online self-report studies using Russell's circumplex model. The first study (n = 44) identified audio features that map to arousal and valence for 20 songs; from this, we constructed a set of linear regressors. The second study (n = 158) measured the efficacy of our system, using 40 new songs to create a ground truth. Results show our approach may be effective at emotionally rating music, particularly in the prediction of valence.
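A minimal sketch of the modeling step, assuming scikit-learn and placeholder features and ratings (the study's actual feature set is not reproduced here):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder data: 20 songs x 5 audio features, e.g. tempo, RMS energy,
# spectral centroid, flux, rolloff; ratings are mean self-reports per song.
X = np.random.rand(20, 5)
arousal = np.random.rand(20)
valence = np.random.rand(20)

# One linear regressor per circumplex dimension, as in the approach above.
arousal_model = LinearRegression().fit(X, arousal)
valence_model = LinearRegression().fit(X, valence)

predicted_valence = valence_model.predict(np.random.rand(40, 5))  # new songs
```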