Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article
provides a review of the state-of-the-art deep learning techniques for audio
signal processing. Speech, music, and environmental sound processing are
considered side-by-side, in order to point out similarities and differences
between the domains, highlighting general methods, problems, key references,
and potential for cross-fertilization between areas. The dominant feature
representations (in particular, log-mel spectra and raw waveform) and deep
learning models are reviewed, including convolutional neural networks, variants
of the long short-term memory architecture, as well as more audio-specific
neural network models. Subsequently, prominent deep learning application areas
are covered, i.e. audio recognition (automatic speech recognition, music
information retrieval, environmental sound detection, localization and
tracking) and synthesis and transformation (source separation, audio
enhancement, generative models for speech, sound, and music synthesis).
Finally, key issues and future questions regarding deep learning applied to
audio signal processing are identified.
Comment: 15 pages, 2 PDF figures
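The log-mel representation highlighted above is straightforward to compute. The sketch below is a minimal NumPy-only illustration; the window size, hop, and filterbank parameters are arbitrary choices for the example (not taken from the article), and production systems typically rely on libraries such as librosa or torchaudio instead.

```python
import numpy as np

def log_mel_spectrogram(y, sr, n_fft=512, hop=256, n_mels=40):
    """Log-mel spectrogram with a simple triangular mel filterbank."""
    # Short-time Fourier transform: Hann window, magnitude squared -> power.
    window = np.hanning(n_fft)
    frames = [y[i:i + n_fft] * window
              for i in range(0, len(y) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (frames, n_fft//2+1)

    # Mel scale conversions (HTK-style formula).
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # Triangular filters equally spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    # Project power spectrogram onto the filterbank, then compress with log.
    return np.log(power @ fbank.T + 1e-10)  # (frames, n_mels)

# Example: one second of a 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
feat = log_mel_spectrogram(np.sin(2 * np.pi * 440 * t), sr)
```

The resulting (frames × mel-bands) matrix is the typical input to the convolutional and recurrent models the review discusses.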
Sample Complexity Analysis for Learning Overcomplete Latent Variable Models through Tensor Methods
We provide guarantees for learning latent variable models, emphasizing the
overcomplete regime, where the dimensionality of the latent space can exceed
the observed dimensionality. In particular, we consider multiview mixtures,
spherical Gaussian mixtures, ICA, and sparse coding models. We provide tight
concentration bounds for empirical moments through novel covering arguments. We
analyze parameter recovery through a simple tensor power update algorithm. In
the semi-supervised setting, we exploit the label or prior information to get a
rough estimate of the model parameters, and then refine it using the tensor
method on unlabeled samples. We establish that learning is possible when the
number of components scales as k = o(d^(p/2)), where d is the observed
dimension and p is the order of the observed moment employed in the tensor
method. Our concentration bound analysis also leads to minimax sample
complexity for semi-supervised learning of spherical Gaussian mixtures. In the
unsupervised setting, we use a simple initialization algorithm based on SVD of
the tensor slices, and provide guarantees under the stricter condition that
k ≤ βd (where the constant β can be larger than 1), where the
tensor method recovers the components under a polynomial running time (and
exponential in β). Our analysis establishes that a wide range of
overcomplete latent variable models can be learned efficiently with low
computational and sample complexity through tensor decomposition methods.
Comment: Title change
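The tensor power update analyzed in the abstract is easy to illustrate. The sketch below runs it on a synthetic third-order moment tensor with orthonormal components, i.e. the well-conditioned (not overcomplete) case; the dimensions and weights are invented for illustration, and the overcomplete regime the paper studies requires the additional machinery described there.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic moment tensor T = sum_i w_i * a_i (x) a_i (x) a_i
# with orthonormal components a_i (columns of A).
d, k = 8, 3
A = np.linalg.qr(rng.standard_normal((d, k)))[0]  # orthonormal columns
w = np.array([3.0, 2.0, 1.0])
T = np.einsum('i,ai,bi,ci->abc', w, A, A, A)

def power_iteration(T, n_iter=50):
    """One run of the tensor power update u <- T(I, u, u) / ||T(I, u, u)||."""
    u = rng.standard_normal(T.shape[0])
    u /= np.linalg.norm(u)
    for _ in range(n_iter):
        v = np.einsum('abc,b,c->a', T, u, u)  # contract two modes with u
        u = v / np.linalg.norm(v)
    lam = np.einsum('abc,a,b,c->', T, u, u, u)  # recovered weight w_i
    return lam, u

lam, u = power_iteration(T)
```

For an orthogonal decomposition the iteration converges quadratically to one of the components a_i with lam ≈ w_i; in practice one deflates T and repeats to recover the remaining components.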
Final Research Report on Auto-Tagging of Music
The deliverable D4.7 concerns the work achieved by IRCAM until M36 for the "auto-tagging of music". The deliverable is a research report. The software libraries resulting from the research have been integrated into Fincons/HearDis! Music Library Manager or are used by TU Berlin. The final software libraries are described in D4.5.
The research work on auto-tagging has concentrated on four aspects:
1) Further improving IRCAM's machine-learning system ircamclass. This has been done by developing the new MASSS audio features and by integrating audio augmentation and audio segmentation into ircamclass. The system has then been applied to train HearDis! "soft" features (Vocals-1, Vocals-2, Pop-Appeal, Intensity, Instrumentation, Timbre, Genre, Style). This is described in Part 3.
2) Developing two sets of "hard" features (i.e. related to musical or musicological concepts) as specified by HearDis! (for integration into Fincons/HearDis! Music Library Manager) and TU Berlin (as input for the prediction model of the GMBI attributes). Such features are either derived from previously estimated higher-level concepts (such as structure, key, or succession of chords) or obtained by developing new signal processing algorithms (such as HPSS or main melody estimation). This is described in Part 4.
3) Developing audio features to characterize the audio quality of a music track. The goal is to describe the quality of the audio independently of its apparent encoding. This is then used to estimate audio degradation or music decade. This is to be used to ensure that playlists contain tracks with similar audio quality. This is described in Part 5.
4) Developing innovative algorithms to extract specific audio features to improve music mixes. So far, innovative techniques (based on various Blind Audio Source Separation algorithms and Convolutional Neural Networks) have been developed for singing voice separation, singing voice segmentation, music structure boundary estimation, and DJ cue-region estimation. This is described in Part 6.
EC/H2020/688122/EU/Artist-to-Business-to-Business-to-Consumer Audio Branding System/ABC DJ
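The HPSS mentioned in point 2 (harmonic-percussive source separation) is commonly implemented by median filtering a power spectrogram along the time and frequency axes in the Fitzgerald style; whether the report uses exactly this variant is an assumption here. A toy sketch:

```python
import numpy as np
from scipy.ndimage import median_filter

def hpss_masks(power_spec, kernel=17):
    """Harmonic/percussive soft masks from a power spectrogram (freq x time).

    Harmonic energy is smooth along time (horizontal ridges), percussive
    energy along frequency (vertical ridges), so directional median filters
    enhance one and suppress the other.
    """
    harm = median_filter(power_spec, size=(1, kernel))  # smooth over time
    perc = median_filter(power_spec, size=(kernel, 1))  # smooth over frequency
    total = harm + perc + 1e-10
    return harm / total, perc / total

# Toy spectrogram: a horizontal line (sustained tone) plus a vertical
# line (transient click).
S = np.zeros((64, 64))
S[20, :] = 1.0   # harmonic component
S[:, 40] = 1.0   # percussive component
mh, mp = hpss_masks(S)
```

Multiplying the masks back onto the complex spectrogram and inverting the STFT yields the two separated signals.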
Laplacian Mixture Modeling for Network Analysis and Unsupervised Learning on Graphs
Laplacian mixture models identify overlapping regions of influence in
unlabeled graph and network data in a scalable and computationally efficient
way, yielding useful low-dimensional representations. By combining Laplacian
eigenspace and finite mixture modeling methods, they provide probabilistic or
fuzzy dimensionality reductions or domain decompositions for a variety of input
data types, including mixture distributions, feature vectors, and graphs or
networks. Provable optimal recovery using the algorithm is analytically shown
for a nontrivial class of cluster graphs. Heuristic approximations for scalable
high-performance implementations are described and empirically tested.
Connections to PageRank and community detection in network analysis demonstrate
the wide applicability of this approach. The origins of fuzzy spectral methods,
beginning with generalized heat or diffusion equations in physics, are reviewed
and summarized. Comparisons to other dimensionality reduction and clustering
methods for challenging unsupervised machine learning problems are also
discussed.
Comment: 13 figures, 35 references
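The Laplacian-eigenspace idea underlying such models can be shown on a toy graph. The sketch below is a drastically simplified stand-in for the paper's method: it embeds the nodes of a two-clique graph with the Fiedler vector and derives soft memberships by a logistic squashing (both the graph and the squashing are invented for illustration, not taken from the paper).

```python
import numpy as np

# Two 4-node cliques (nodes 0..3 and 4..7) joined by a single bridge edge.
n = 8
A = np.zeros((n, n))
for block in (range(0, 4), range(4, 8)):
    for i in block:
        for j in block:
            if i != j:
                A[i, j] = 1.0
A[3, 4] = A[4, 3] = 1.0  # bridge

L = np.diag(A.sum(axis=1)) - A   # unnormalized graph Laplacian
vals, vecs = np.linalg.eigh(L)   # eigenvalues in ascending order
fiedler = vecs[:, 1]             # eigenvector of the 2nd-smallest eigenvalue

# Soft (fuzzy) memberships from a logistic squashing of the 1-D embedding;
# a crude stand-in for fitting a finite mixture in the eigenspace.
membership = 1.0 / (1.0 + np.exp(-10.0 * fiedler / np.abs(fiedler).max()))
labels = (membership > 0.5).astype(int)
```

The sign pattern of the Fiedler vector recovers the two regions of influence; the continuous membership values convey how strongly each node belongs to each region.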
First Season QUIET Observations: Measurements of Cosmic Microwave Background Polarization Power Spectra at 43 GHz in the Multipole Range 25 ≤ ℓ ≤ 475
The Q/U Imaging ExperimenT (QUIET) employs coherent receivers at 43 GHz and 94 GHz, operating on the Chajnantor plateau in the Atacama Desert in Chile, to measure the anisotropy in the polarization of the cosmic microwave background (CMB). QUIET primarily targets the B modes from primordial gravitational waves. The combination of these frequencies gives sensitivity to foreground contributions from diffuse Galactic synchrotron radiation. Between 2008 October and 2010 December, over 10,000 hr of data were collected, first with the 19 element 43 GHz array (3458 hr) and then with the 90 element 94 GHz array. Each array observes the same four fields, selected for low foregrounds, together covering ≈1000 deg^2. This paper reports initial results from the 43 GHz receiver, which has an array sensitivity to CMB fluctuations of 69 μK√s. The data were extensively studied with a large suite of null tests before the power spectra, determined with two independent pipelines, were examined. Analysis choices, including data selection, were modified until the null tests passed. Cross-correlating maps with different telescope pointings is used to eliminate a bias. This paper reports the EE, BB, and EB power spectra in the multipole range ℓ = 25-475. With the exception of the lowest multipole bin for one of the fields, where a polarized foreground, consistent with Galactic synchrotron radiation, is detected with 3σ significance, the E-mode spectrum is consistent with the ΛCDM model, confirming the only previous detection of the first acoustic peak. The B-mode spectrum is consistent with zero, leading to a measurement of the tensor-to-scalar ratio of r = 0.35^(+1.06)_(−0.87). The combination of a new time-stream "double-demodulation" technique, side-fed Dragonian optics, natural sky rotation, and frequent boresight rotation leads to the lowest level of systematic contamination in the B-mode power so far reported, below the level of r = 0.1.
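The bias-elimination trick mentioned in the abstract (cross-correlating maps whose noise is independent) can be demonstrated with a toy calculation. The sketch below uses schematic 1-D "maps" with Gaussian signal and noise, not real CMB analysis on the sphere: the auto-spectrum of a single noisy map is biased upward by the noise power, while the cross-spectrum of two maps sharing the signal is not.

```python
import numpy as np

rng = np.random.default_rng(1)

# Common sky signal plus independent noise per map, mimicking two
# observations of the same field with different pointings.
npix = 100_000
signal = rng.standard_normal(npix)           # unit-variance "sky"
noise_a = 2.0 * rng.standard_normal(npix)    # noise dominates the signal
noise_b = 2.0 * rng.standard_normal(npix)
map_a, map_b = signal + noise_a, signal + noise_b

auto_power = np.mean(map_a * map_a)    # biased: signal power + noise power (~5)
cross_power = np.mean(map_a * map_b)   # unbiased: noise terms average away (~1)
```

The same principle applies mode by mode to the EE/BB/EB power spectra: cross-spectra between maps with independent noise realizations need no noise-bias subtraction.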
Image Processing and Machine Learning for Hyperspectral Unmixing: An Overview and the HySUPP Python Package
Spectral pixels are often a mixture of the pure spectra of the materials,
called endmembers, due to the low spatial resolution of hyperspectral sensors,
double scattering, and intimate mixtures of materials in the scenes. Unmixing
estimates the fractional abundances of the endmembers within the pixel.
Depending on the prior knowledge of endmembers, linear unmixing can be divided
into three main groups: supervised, semi-supervised, and unsupervised (blind)
linear unmixing. Advances in image processing and machine learning have
substantially affected unmixing. This paper provides an overview of advanced
and conventional unmixing approaches. Additionally, we draw a critical
comparison between advanced and conventional techniques from the three
categories. We compare the performance of the unmixing techniques on three
simulated and two real datasets. The experimental results reveal the advantages
of different unmixing categories for different unmixing scenarios. Moreover, we
provide an open-source Python-based package available at
https://github.com/BehnoodRasti/HySUPP to reproduce the results.
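In the supervised setting described above, linear unmixing with known endmembers reduces to a constrained regression per pixel. The sketch below solves it with nonnegative least squares on synthetic data; the endmember matrix and abundances are invented for illustration, and the sum-to-one constraint that fully constrained solvers additionally enforce is omitted here.

```python
import numpy as np
from scipy.optimize import nnls

# Linear mixing model: pixel = E @ abundances + noise, with a known
# endmember matrix E of shape (bands, endmembers).
rng = np.random.default_rng(0)
bands, k = 50, 3
E = rng.random((bands, k))                    # known endmember spectra
true_ab = np.array([0.6, 0.3, 0.1])           # fractional abundances
pixel = E @ true_ab + 0.001 * rng.standard_normal(bands)

# Nonnegative least squares enforces the physical abundance
# nonnegativity constraint.
est_ab, residual = nnls(E, pixel)
```

Semi-supervised (sparse, library-based) and unsupervised (blind) methods replace the known E with a spectral library or estimate it jointly with the abundances.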
SAR Tomography via Nonlinear Blind Scatterer Separation
Layover separation has been fundamental to many synthetic aperture radar
applications, such as building reconstruction and biomass estimation.
Retrieving the scattering profile along the mixed dimension (elevation) is
typically solved by inversion of the SAR imaging model, a process known as SAR
tomography. This paper proposes a nonlinear blind scatterer separation method
to retrieve the phase centers of the layovered scatterers, avoiding the
computationally expensive tomographic inversion. We demonstrate that
conventional linear separation methods, e.g., principal component analysis
(PCA), can only partially separate the scatterers under good conditions. These
methods produce systematic phase bias in the retrieved scatterers due to the
nonorthogonality of the scatterers' steering vectors, especially when the
intensities of the sources are similar or the number of images is low. The
proposed method artificially increases the dimensionality of the data using
kernel PCA, hence mitigating the aforementioned limitations. In the processing,
the proposed method sequentially deflates the covariance matrix using the
estimate of the brightest scatterer from kernel PCA. Simulations demonstrate
the superior performance of the proposed method over conventional PCA-based
methods in various respects. Experiments using TerraSAR-X data show an
improvement in height reconstruction accuracy by a factor of one to three,
depending on the number of looks used.
Comment: This work has been accepted by IEEE TGRS for publication