23 research outputs found
Raw Multi-Channel Audio Source Separation using Multi-Resolution Convolutional Auto-Encoders
Supervised multi-channel audio source separation requires extracting useful
spectral, temporal, and spatial features from the mixed signals. The success of
many existing systems is therefore largely dependent on the choice of features
used for training. In this work, we introduce a novel multi-channel,
multi-resolution convolutional auto-encoder neural network that works on raw
time-domain signals to determine appropriate multi-resolution features for
separating the singing voice from stereo music. Our experimental results show
that the proposed method can achieve multi-channel audio source separation
without the need for hand-crafted features or any pre- or post-processing.
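As a rough illustration of the idea (not the authors' exact architecture), the following PyTorch sketch applies parallel 1-D convolutions with different kernel sizes to a raw stereo waveform and decodes the concatenated features back to a two-channel estimate; the layer count, kernel sizes, and feature widths are assumptions.

    import torch
    import torch.nn as nn

    class MultiResConvAutoEncoder(nn.Module):
        # Parallel 1-D convolution branches with different kernel sizes read the
        # raw stereo waveform; a 1x1 convolution decodes the concatenated
        # multi-resolution features back to a two-channel (e.g. vocal) estimate.
        def __init__(self, channels=2, kernel_sizes=(16, 64, 256), feats=32):
            super().__init__()
            self.branches = nn.ModuleList(
                [nn.Conv1d(channels, feats, k, padding=k // 2) for k in kernel_sizes]
            )
            self.decoder = nn.Conv1d(feats * len(kernel_sizes), channels, 1)

        def forward(self, x):                      # x: (batch, 2, samples)
            n = x.shape[-1]
            outs = [torch.relu(b(x))[..., :n] for b in self.branches]
            return self.decoder(torch.cat(outs, dim=1))

    model = MultiResConvAutoEncoder()
    mixture = torch.randn(1, 2, 44100)             # one second of raw stereo audio
    vocal_estimate = model(mixture)                # same shape as the input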
Multi-Resolution Fully Convolutional Neural Networks for Monaural Audio Source Separation
In deep neural networks with convolutional layers, each layer typically has a
fixed-size, single-resolution receptive field (RF). Convolutional layers with a
large RF capture global information from the input features, while layers with
small RF size capture local details with high resolution from the input
features. In this work, we introduce novel deep multi-resolution fully
convolutional neural networks (MR-FCNN), where each layer has different RF
sizes to extract multi-resolution features that capture both global information
and local details from its input features. The proposed MR-FCNN is applied to
separate a target audio source from a mixture of many audio sources.
Experimental results show that using MR-FCNN improves the performance compared
to feedforward deep neural networks (DNNs) and single resolution deep fully
convolutional neural networks (FCNNs) on the audio source separation problem.
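A minimal sketch of the layer-wise multi-resolution idea follows, assuming a magnitude-spectrogram input and arbitrary layer sizes (the paper's actual depth and kernel shapes may differ): each convolutional layer in the stack uses a different kernel size, so each layer contributes a different receptive field.

    import torch
    import torch.nn as nn

    # Each layer uses a different kernel size (receptive field); depth, channel
    # counts, and kernel shapes here are assumptions, not the paper's settings.
    mr_fcnn = nn.Sequential(
        nn.Conv2d(1, 16, kernel_size=(25, 5), padding=(12, 2)), nn.ReLU(),
        nn.Conv2d(16, 16, kernel_size=(11, 3), padding=(5, 1)), nn.ReLU(),
        nn.Conv2d(16, 1, kernel_size=(3, 3), padding=(1, 1)),
    )

    mixture_spec = torch.randn(1, 1, 513, 128)   # (batch, 1, frequency bins, frames)
    target_estimate = mr_fcnn(mixture_spec)      # same shape as the input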
Contributions and limitations of using machine learning to predict noise-induced hearing loss
Purpose
Noise-induced hearing loss (NIHL) is a global issue that impacts people’s lives and health. The current review aims to clarify the contributions and limitations of applying machine learning (ML) to predict NIHL by analyzing the performance of different ML techniques and the procedure of model construction.
Methods
The authors searched PubMed, EMBASE and Scopus on November 26, 2020.
Results
Eight studies were included in the current review following defined inclusion and exclusion criteria. Sample sizes in the selected studies ranged from 150 to 10,567. The most popular models were artificial neural networks (n = 4), random forests (n = 3) and support vector machines (n = 3). The features most correlated with NIHL and used in the models were age (n = 6), duration of noise exposure (n = 5) and noise exposure level (n = 4). Five included studies used either split-sample validation (n = 3) or ten-fold cross-validation (n = 2). Reported accuracy ranged from 75.3% to 99%, with a low prediction error/root-mean-square error in 3 studies. Only 2 studies measured discrimination risk using the receiver operating characteristic (ROC) curve and/or the area under the ROC curve.
Conclusion
In spite of the high accuracy and low prediction error of machine learning models, improvements can be expected from larger sample sizes, the use of multiple algorithms, complete reporting of model construction, and sufficient evaluation of calibration and discrimination risk.
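For readers unfamiliar with the modelling pipeline the reviewed studies describe, a minimal scikit-learn sketch is given below, assuming placeholder data, the three most common features (age, exposure duration, exposure level), a random forest, and ten-fold cross-validation; it does not reproduce any individual study.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Placeholder records standing in for worker data: age (years), noise
    # exposure duration (years), exposure level (dB(A)); label 1 = NIHL.
    rng = np.random.default_rng(0)
    X = np.column_stack([
        rng.uniform(18, 65, 500),     # age
        rng.uniform(0, 40, 500),      # exposure duration
        rng.uniform(70, 100, 500),    # exposure level
    ])
    y = rng.integers(0, 2, 500)

    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    scores = cross_val_score(clf, X, y, cv=10, scoring="roc_auc")
    print(f"10-fold ROC AUC: {scores.mean():.2f} +/- {scores.std():.2f}")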
Machine learning in diagnosing middle ear disorders using tympanic membrane images: a meta-analysis
OBJECTIVE: To systematically evaluate the development of Machine Learning (ML) models and compare their diagnostic
accuracy for the classification of Middle Ear Disorders (MED) using Tympanic Membrane (TM) images.
METHODS: PubMed, EMBASE, CINAHL, and CENTRAL were searched up until November 30, 2021. Studies on the development
of ML approaches for diagnosing MED using TM images were selected according to the inclusion criteria. PRISMA guidelines
were followed with study design, analysis method, and outcomes extracted. Sensitivity, specificity, and area under the
curve (AUC) were used to summarize the performance metrics of the meta-analysis. Risk of Bias was assessed using the Quality
Assessment of Diagnostic Accuracy Studies-2 tool in combination with the Prediction Model Risk of Bias Assessment Tool.
RESULTS: Sixteen studies were included, encompassing 20,254 TM images (7,025 normal TM and 13,229 MED). The sample
size ranged from 45 to 6,066 per study. The accuracy of the 25 included ML approaches ranged from 76.00% to 98.26%.
Eleven studies (68.8%) were rated as having a low risk of bias, with the reference standard being the domain most often
rated at high risk of bias (37.5%). Sensitivity and specificity were 93% (95% CI, 90%–95%) and 85% (95% CI, 82%–88%), respectively. The
AUC for all TM images was 94% (95% CI, 91%–96%). A greater AUC was found using otoendoscopic images than otoscopic
images.
CONCLUSIONS: ML approaches perform robustly in distinguishing between normal ears and MED; however, a standardized
TM image acquisition and annotation protocol should be developed.
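As a reminder of how the pooled metrics above are defined, the sketch below computes sensitivity, specificity, and AUC for a binary normal-vs-MED classifier from placeholder labels and scores; it is illustrative only and unrelated to the included studies' data.

    import numpy as np
    from sklearn.metrics import confusion_matrix, roc_auc_score

    # Placeholder labels and scores for a normal-vs-MED classifier (1 = MED).
    y_true  = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1])
    y_score = np.array([0.1, 0.4, 0.8, 0.7, 0.9, 0.3, 0.6, 0.2, 0.55, 0.95])
    y_pred  = (y_score >= 0.5).astype(int)

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)      # recall on ears with MED
    specificity = tn / (tn + fp)      # recall on normal ears
    auc = roc_auc_score(y_true, y_score)
    print(sensitivity, specificity, auc)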
Gaussian mixture gain priors for regularized nonnegative matrix factorization in single-channel source separation
We propose a new method to incorporate statistical priors on the solution of the nonnegative matrix factorization (NMF) for single-channel source separation (SCSS) applications. The Gaussian mixture model (GMM) is used as a log-normalized gain prior model for the NMF solution. The normalization makes the prior models energy independent. In NMF-based SCSS, NMF is used to decompose the spectra of the observed mixed signal as a weighted linear combination of a set of trained basis vectors. In this work, the NMF decomposition weights are constrained to reflect statistical prior information on the weight-combination patterns that the trained basis vectors can jointly receive for each source in the observed mixed signal. The NMF solutions for the weights are encouraged to increase the log-likelihood under the trained gain-prior GMMs while reducing the NMF reconstruction error at the same time.
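A minimal NumPy/scikit-learn sketch of the regularization idea follows: the NMF gains are log-normalized (so the prior is energy independent), scored under a GMM trained on gains from clean training data, and that log-likelihood is traded off against the reconstruction error. It uses a Euclidean reconstruction cost and placeholder matrices, and does not reproduce the paper's update rules.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def log_normalized_gains(H, eps=1e-12):
        # Energy-independent gains: log of each column normalized to unit sum.
        return np.log(H / (H.sum(axis=0, keepdims=True) + eps) + eps).T

    rng = np.random.default_rng(0)
    B = np.abs(rng.standard_normal((257, 20)))        # trained basis vectors
    H_train = np.abs(rng.standard_normal((20, 400)))  # gains from clean training data
    gmm = GaussianMixture(n_components=8, random_state=0)
    gmm.fit(log_normalized_gains(H_train))

    def regularized_cost(V, B, H, lam=0.1):
        # Euclidean reconstruction error minus the weighted gain-prior log-likelihood.
        reconstruction = np.sum((V - B @ H) ** 2)
        prior = gmm.score_samples(log_normalized_gains(H)).sum()
        return reconstruction - lam * prior

    V = np.abs(rng.standard_normal((257, 100)))       # mixture magnitude spectrogram
    H = np.abs(rng.standard_normal((20, 100)))        # current gain estimate
    print(regularized_cost(V, B, H))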
Adaptation of speaker-specific bases in non-negative matrix factorization for single channel speech-music separation
This paper introduces a speaker adaptation algorithm for nonnegative matrix factorization (NMF) models. The proposed adaptation algorithm is a combination of Bayesian and subspace model adaptation. The adapted model is used to separate a speech signal from a background music signal in a single-channel recording. Training speech data from multiple speakers is used with NMF to train a set of basis vectors as a general model for speech signals. The probabilistic interpretation of NMF is used to achieve Bayesian adaptation, which adjusts the general model with respect to the actual properties of the speech signal that is observed in the mixed signal. The Bayesian-adapted model is adapted again by a linear transform, which changes the subspace that the Bayesian-adapted model spans to better match the speech signal in the mixed signal. The experimental results show that combining Bayesian with linear transform adaptation improves the separation results.
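The subspace (linear-transform) part of the adaptation can be sketched roughly as follows, assuming general speech bases, a current speech estimate, and its NMF gains are already available; the least-squares fit and non-negativity clipping below are only stand-ins for the paper's actual update, and the Bayesian adaptation step is not shown.

    import numpy as np

    rng = np.random.default_rng(0)
    F, K, T = 257, 20, 50
    B_general = np.abs(rng.standard_normal((F, K)))   # multi-speaker speech bases
    V_speech  = np.abs(rng.standard_normal((F, T)))   # speech estimate from the mixture
    H         = np.abs(rng.standard_normal((K, T)))   # current NMF gains

    # Fit a linear transform A so that (B_general @ A) @ H better matches the
    # observed speech; pseudo-inverses give a least-squares solution, and the
    # clipping keeps the adapted bases non-negative (a simplification).
    A = np.linalg.pinv(B_general) @ V_speech @ np.linalg.pinv(H)
    B_adapted = np.clip(B_general @ A, 0.0, None)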
Combining Fully Convolutional and Recurrent Neural Networks for Single Channel Audio Source Separation
Combining different models is a common strategy to build a good audio source separation system. In this work,
we combine two powerful deep neural networks for single-channel audio source separation (SCSS). Namely, we
combine fully convolutional neural networks (FCNs) and recurrent neural networks, specifically, bidirectional
long short-term memory recurrent neural networks (BLSTMs). FCNs are good at extracting useful features from
the audio data and BLSTMs are good at modeling the temporal structure of the audio signals. Our experimental
results show that combining FCNs and BLSTMs achieves better separation performance than using each model
individually.
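A minimal PyTorch sketch of this kind of hybrid, assuming a magnitude-spectrogram input and arbitrary layer sizes (not the paper's configuration): convolutional layers extract features, a bidirectional LSTM models their temporal structure, and a linear layer maps back to a source spectrogram.

    import torch
    import torch.nn as nn

    class FCNBLSTM(nn.Module):
        # Convolutional feature extraction followed by a BLSTM over frames;
        # all sizes here are assumptions, not the configuration in the paper.
        def __init__(self, n_bins=513, hidden=256):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                nn.Conv2d(16, 1, 3, padding=1), nn.ReLU(),
            )
            self.blstm = nn.LSTM(n_bins, hidden, batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * hidden, n_bins)

        def forward(self, x):                  # x: (batch, 1, frames, n_bins)
            h = self.conv(x).squeeze(1)        # (batch, frames, n_bins)
            h, _ = self.blstm(h)
            return self.out(h)                 # (batch, frames, n_bins)

    net = FCNBLSTM()
    mixture_spec = torch.randn(2, 1, 100, 513)  # two mixture spectrogram excerpts
    source_spec = net(mixture_spec)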
Single Channel Audio Source Separation using Convolutional Denoising Autoencoders
Deep learning techniques have been used recently to tackle the audio
source separation problem. In this work, we propose to use deep
fully convolutional denoising autoencoders (CDAEs) for monaural
audio source separation. We use as many CDAEs as the number
of sources to be separated from the mixed signal. Each CDAE
is trained to separate one source and treats the other sources as
background noise. The main idea is to allow each CDAE to learn
suitable spectral-temporal filters and features for its corresponding
source. Our experimental results show that CDAEs perform source
separation slightly better than the deep feedforward neural networks
(FNNs), even with fewer parameters than the FNNs.
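A minimal PyTorch sketch of the setup, with assumed source names, layer sizes, and spectrogram dimensions: one convolutional denoising auto-encoder is built per source, and each would be trained to output its own source while treating the remaining sources as background noise.

    import torch
    import torch.nn as nn

    def make_cdae(channels=16):
        # One convolutional denoising auto-encoder; layer sizes are assumptions.
        return nn.Sequential(
            nn.Conv2d(1, channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(channels, 1, 4, stride=2, padding=1),
        )

    sources = ["vocals", "drums", "bass", "other"]          # assumed source set
    cdaes = {name: make_cdae() for name in sources}         # one CDAE per source

    mixture_spec = torch.randn(1, 1, 512, 64)               # mixture spectrogram excerpt
    estimates = {name: net(mixture_spec) for name, net in cdaes.items()}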
Remixing musical audio on the web using source separation
Presented at the 2nd Web Audio Conference (WAC), April 4-6, 2016, Atlanta, Georgia.
Research in audio source separation has progressed a long
way, producing systems that are able to approximate the
component signals of sound mixtures. In recent years, many
efforts have focused on learning time-frequency masks that can be used to filter a monophonic signal in the frequency domain. Using current web audio technologies, time-frequency
masking can be implemented in a web browser in real time.
This makes it possible to apply source separation techniques to arbitrary
audio streams, such as internet radio, subject to
cross-domain security configurations. While producing good
quality separated audio from monophonic music mixtures is still challenging, current methods can be applied to remixing
scenarios, where part of the signal is emphasized or deemphasized.
This paper describes a system for remixing
musical audio on the web by applying time-frequency masks
estimated using deep neural networks. Our example prototype,
implemented in client-side JavaScript, provides reasonable
quality results for small modifications.
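The prototype itself runs in client-side JavaScript on top of the Web Audio API, but the underlying remixing operation can be sketched offline in a few lines of Python, assuming a mask has already been estimated by a separation model (a random placeholder is used here): the mixture STFT is scaled so the masked source is emphasized or de-emphasized, then inverted back to a waveform.

    import numpy as np
    from scipy.signal import stft, istft

    fs = 44100
    mixture = np.random.randn(fs * 2)                  # placeholder 2-second mixture
    f, t, X = stft(mixture, fs=fs, nperseg=1024)

    # Stand-in for a DNN-estimated time-frequency mask with values in [0, 1].
    mask = np.random.rand(*X.shape)

    gain = 1.5                                         # > 1 emphasizes the masked source
    remixed_spec = X * (mask * gain + (1.0 - mask))
    _, remixed = istft(remixed_spec, fs=fs, nperseg=1024)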