MBTFNet: Multi-Band Temporal-Frequency Neural Network For Singing Voice Enhancement
A typical neural speech enhancement (SE) approach mainly handles speech and
noise mixtures, which is not optimal for singing voice enhancement scenarios.
Music source separation (MSS) models treat vocals and various accompaniment
components equally, which may reduce performance compared to a model that
only considers vocal enhancement. In this paper, we propose a novel multi-band
temporal-frequency neural network (MBTFNet) for singing voice enhancement,
which particularly removes background music, noise and even backing vocals from
singing recordings. MBTFNet combines inter- and intra-band modeling for better
processing of full-band signals. Dual-path modeling is introduced to expand
the receptive field of the model. We propose an implicit personalized
enhancement (IPE) stage based on signal-to-noise ratio (SNR) estimation, which
further improves the performance of MBTFNet. Experiments show that our proposed
model significantly outperforms several state-of-the-art SE and MSS models.
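For readers unfamiliar with the quantity the IPE stage estimates, SNR is conventionally the ratio of signal power to noise power, expressed in decibels. The sketch below computes it from signal and noise samples that are already known separately. It is not MBTFNet's estimator, which the abstract does not describe; the function name `snr_db` and the toy sample values are illustrative assumptions.

```python
import math


def snr_db(signal, noise):
    """Return the signal-to-noise ratio in dB for two equal-length sample sequences.

    Hypothetical helper for illustration only; not taken from the MBTFNet paper.
    """
    if not signal or len(signal) != len(noise):
        raise ValueError("signal and noise must be non-empty and of equal length")
    # Mean power of each sequence.
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = sum(n * n for n in noise) / len(noise)
    if noise_power == 0:
        return math.inf  # no noise at all
    if signal_power == 0:
        return -math.inf  # no signal at all
    return 10.0 * math.log10(signal_power / noise_power)


# Toy example (assumed values): a vocal 10x larger in amplitude than the interference.
vocal = [1.0, -1.0, 1.0, -1.0]
interference = [0.1, -0.1, 0.1, -0.1]
print(round(snr_db(vocal, interference), 1))
```

A tenfold amplitude ratio corresponds to a hundredfold power ratio, so the toy example comes out at roughly 20 dB.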
Exploring convolutional, recurrent, and hybrid deep neural networks for speech and music detection in a large audio dataset
Audio signals represent a wide diversity of acoustic events, from background environmental noise to spoken
communication. Machine learning models such as neural networks have already been proposed for audio signal
modeling, where recurrent structures can take advantage of temporal dependencies. This work aims to study the
implementation of several neural network-based systems for speech and music event detection over a collection of
77,937 10-second audio segments (216 h), selected from the Google AudioSet dataset. These segments belong to
YouTube videos and have been represented as mel-spectrograms. We propose and compare two approaches. The
first one is the training of two different neural networks, one for speech detection and another for music detection.
The second approach consists of training a single neural network to tackle both tasks at the same time. The studied
architectures include fully connected, convolutional and LSTM (long short-term memory) recurrent networks.
Comparative results are provided in terms of classification performance and model complexity. We would like to
highlight the performance of convolutional architectures, especially in combination with an LSTM stage. The hybrid
convolutional-LSTM models achieve the best overall results (85% accuracy) in the three proposed tasks. Furthermore,
a distractor analysis of the results has been carried out in order to identify which events in the ontology are the most
harmful for the performance of the models, showing some difficult scenarios for the detection of music and speech.
This work has been supported by project "DSSL: Redes Profundas y Modelos
de Subespacios para Deteccion y Seguimiento de Locutor, Idioma y
Enfermedades Degenerativas a partir de la Voz" (TEC2015-68172-C2-1-P),
funded by the Ministry of Economy and Competitivity of Spain and FEDE
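The dataset size quoted in the abstract above can be checked with simple arithmetic. The sketch assumes every one of the 77,937 segments lasts exactly 10 seconds; the constant names are illustrative.

```python
# Figures taken from the abstract; exact 10-second segments are an assumption.
SEGMENT_COUNT = 77_937
SEGMENT_SECONDS = 10

total_seconds = SEGMENT_COUNT * SEGMENT_SECONDS
total_hours = total_seconds / 3600

print(total_seconds)          # total audio duration in seconds
print(round(total_hours, 1))  # total audio duration in hours
```

Under that assumption the total is about 216.5 hours, consistent with the 216 h stated in the abstract.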
Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article
provides a review of the state-of-the-art deep learning techniques for audio
signal processing. Speech, music, and environmental sound processing are
considered side-by-side, in order to point out similarities and differences
between the domains, highlighting general methods, problems, key references,
and potential for cross-fertilization between areas. The dominant feature
representations (in particular, log-mel spectra and raw waveform) and deep
learning models are reviewed, including convolutional neural networks, variants
of the long short-term memory architecture, as well as more audio-specific
neural network models. Subsequently, prominent deep learning application areas
are covered, i.e. audio recognition (automatic speech recognition, music
information retrieval, environmental sound detection, localization and
tracking) and synthesis and transformation (source separation, audio
enhancement, generative models for speech, sound, and music synthesis).
Finally, key issues and future questions regarding deep learning applied to
audio signal processing are identified.
Comment: 15 pages, 2 pdf figure
Predicting Audio Advertisement Quality
Online audio advertising is a particular form of advertising used abundantly
in online music streaming services. In these platforms, which tend to host tens
of thousands of unique audio advertisements (ads), providing high quality ads
ensures a better user experience and results in longer user engagement.
Therefore, the automatic assessment of these ads is an important step toward
audio ads ranking and better audio ads creation. In this paper we propose one
way to measure the quality of the audio ads using a proxy metric called Long
Click Rate (LCR), which is defined by the amount of time a user engages with
the follow-up display ad (that is shown while the audio ad is playing) divided
by the impressions. We later focus on predicting the audio ad quality using
only acoustic features such as harmony, rhythm, and timbre of the audio,
extracted from the raw waveform. We discuss how the characteristics of the
sound can be connected to concepts such as the clarity of the audio ad message,
its trustworthiness, etc. Finally, we propose a new deep learning model for
audio ad quality prediction, which outperforms the other discussed models
trained on hand-crafted features. To the best of our knowledge, this is the
first large-scale audio ad quality prediction study.
Comment: WSDM '18 Proceedings of the Eleventh ACM International Conference on
Web Search and Data Mining, 9 page
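The LCR definition given in the abstract above (engagement time with the follow-up display ad, divided by impressions) can be sketched as follows. This follows the abstract's wording only; the paper's precise definition may differ, and the function name `long_click_rate` and the sample log values are hypothetical.

```python
def long_click_rate(engagement_seconds, impressions):
    """Total engagement time with the follow-up display ad divided by impressions.

    Hypothetical helper based on the abstract's wording, not on the paper's code.
    """
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    return sum(engagement_seconds) / impressions


# Assumed toy log: seconds that three engaging users spent on the display ad,
# out of 200 impressions of the audio ad.
print(long_click_rate([12.0, 30.0, 18.0], impressions=200))  # 0.3
```

Because the denominator counts all impressions rather than only engaged users, an ad that is rarely clicked scores low even when the few engaged users stay for a long time.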