Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article
provides a review of the state-of-the-art deep learning techniques for audio
signal processing. Speech, music, and environmental sound processing are
considered side-by-side, in order to point out similarities and differences
between the domains, highlighting general methods, problems, key references,
and potential for cross-fertilization between areas. The dominant feature
representations (in particular, log-mel spectra and raw waveform) and deep
learning models are reviewed, including convolutional neural networks, variants
of the long short-term memory architecture, as well as more audio-specific
neural network models. Subsequently, prominent deep learning application areas
are covered, i.e. audio recognition (automatic speech recognition, music
information retrieval, environmental sound detection, localization and
tracking) and synthesis and transformation (source separation, audio
enhancement, generative models for speech, sound, and music synthesis).
Finally, key issues and future questions regarding deep learning applied to
audio signal processing are identified.
Comment: 15 pages, 2 PDF figures
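The log-mel spectrogram named above as a dominant feature representation can be sketched with NumPy alone. This is a minimal illustration, not code from the review; the frame size, hop, and filter count below are arbitrary choices of mine (common speech-processing defaults at 16 kHz):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(signal, sr=16000, n_fft=512, hop=160, n_mels=40):
    """Frame the signal, take the magnitude STFT, apply a triangular
    mel filterbank, and return log energies (one row per frame)."""
    window = np.hanning(n_fft)
    frames = [signal[i:i + n_fft] * window
              for i in range(0, len(signal) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # (n_frames, n_fft//2 + 1)

    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, c):
            fbank[m - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fbank[m - 1, k] = (hi - k) / max(hi - c, 1)

    return np.log(power @ fbank.T + 1e-10)

# One second of a 440 Hz tone sampled at 16 kHz.
t = np.arange(16000) / 16000.0
feats = log_mel_spectrogram(np.sin(2 * np.pi * 440 * t))
```

In practice a library such as librosa is typically used for this; the point of the sketch is only to make the pipeline (frame, window, FFT, mel filterbank, log) explicit.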
Enhancing the Performance of Single-Channel Blind Source Separation by Using ConvTransFormer
In the field of audio signal processing, this study introduces a ConvTransFormer architecture aimed at improving single-channel blind source separation (SCBSS). The architecture combines a multiple simple-weak attention mechanism with the triple-gating feature of a Gated Attention Unit (GAU), allowing the model to target specific segments of the input sequence more effectively. It is evaluated on the WSJ0-2mix dataset, a standard benchmark in the field. The results show substantial gains in key performance metrics: a scale-invariant signal-to-noise ratio improvement (SI-SNRi) of 16.5 dB and a signal-to-distortion ratio improvement (SDRi) of 16.8 dB, both central indicators of separation quality in SCBSS. The proposed ConvTransFormer surpasses existing methods on both metrics, offering a more effective and precise approach to isolating individual sound sources from a single-channel input.
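The SI-SNR metric quoted above is conventionally computed by projecting the estimate onto the reference and comparing target energy to residual energy. A minimal sketch of that convention follows; it is my own illustration of the standard scale-invariant definition used on WSJ0-2mix, not the paper's code:

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB.

    Zero-mean both signals, project the estimate onto the reference
    (so any rescaling of the estimate cancels out), and compare the
    energy of the projected target to the energy of the residual."""
    est = est - est.mean()
    ref = ref - ref.mean()
    target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    residual = est - target
    return 10.0 * np.log10(np.dot(target, target) /
                           (np.dot(residual, residual) + eps))

# Demo: a sine reference with an orthogonal cosine disturbance at 1/10
# the amplitude, which should score close to 20 dB.
t = np.arange(8000)
ref = np.sin(2 * np.pi * t / 80)
improved = si_snr(ref + 0.1 * np.cos(2 * np.pi * t / 80), ref)
```

The reported SI-SNRi and SDRi are *improvements*, i.e. the metric on the separated output minus the metric on the unprocessed mixture.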
Joint speech and overlap detection: a benchmark over multiple audio setup and speech domains
Voice activity and overlapped speech detection (respectively VAD and OSD) are
key pre-processing tasks for speaker diarization. The final segmentation
performance highly relies on the robustness of these sub-tasks. Recent studies
have shown VAD and OSD can be trained jointly using a multi-class
classification model. However, these works are often restricted to a specific
speech domain, lacking information about the generalization capacities of the
systems. This paper proposes a new, comprehensive benchmark of VAD and
OSD models across multiple audio setups (single/multi-channel) and
speech domains (e.g. media, meetings). Our 2/3-class systems, which
combine a Temporal Convolutional Network with speech representations
adapted to the setup, outperform state-of-the-art results. We show that
jointly training the two tasks matches the F1-score of two dedicated VAD
and OSD systems while reducing the training cost. This single
architecture can also be used for both single- and multichannel speech
processing.
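The multi-class formulation described above (one classifier whose label set covers both tasks) can be illustrated with a toy label-mapping sketch. The helper names and the 3-class encoding details are my own illustration, not the paper's code:

```python
import numpy as np

def three_class_labels(speaker_counts):
    """Map per-frame active-speaker counts to a joint VAD/OSD label:
    0 = non-speech, 1 = single speaker, 2 = overlap (two or more)."""
    return np.minimum(np.asarray(speaker_counts), 2)

def decode(labels):
    """Derive both detectors' outputs from the shared label sequence,
    which is what lets one model serve both tasks."""
    vad = labels >= 1   # any speech activity
    osd = labels == 2   # overlapped speech only
    return vad, osd

# Toy frame sequence: silence, one speaker, overlap, back to silence.
frames = [0, 0, 1, 1, 2, 3, 1, 0]
labels = three_class_labels(frames)
vad, osd = decode(labels)
```

A 2-class variant would collapse labels 1 and 2 for VAD-only training; the benefit of the 3-class version is that a single posterior yields both segmentations.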
CHORUS Deliverable 2.2: Second report - identification of multi-disciplinary key issues for gap analysis toward EU multimedia search engines roadmap
After addressing the state of the art during the first year of CHORUS and establishing the existing landscape in
multimedia search engines, we identified and analyzed gaps within the European research effort during our second year.
In this period we focused on three directions, notably technological issues, user-centred issues and use cases, and
socio-economic and legal aspects. These were assessed through two central studies: firstly, a concerted vision of the
functional breakdown of a generic multimedia search engine, and secondly, representative use-case descriptions with a
related discussion of the requirements they pose as technological challenges. Both studies were carried out in
cooperation and consultation with the community at large through EC concertation meetings (the multimedia search
engines cluster), several meetings with our Think-Tank, presentations at international conferences, and surveys
addressed to coordinators of EU projects and national initiatives. Based on the feedback obtained, we identified two
types of gaps, namely core technological gaps that involve research challenges, and “enablers”, which are not
necessarily technical research challenges but have an impact on innovation progress. New socio-economic trends are
presented, as well as emerging legal challenges.
Interpretation of Multiparty Meetings: The AMI and AMIDA Projects
The AMI and AMIDA projects are collaborative EU projects concerned with the automatic recognition and interpretation of multiparty meetings. This paper provides an overview of the advances we have made in these projects, with a particular focus on the multimodal recording infrastructure, the publicly available AMI corpus of annotated meeting recordings, and the speech recognition framework that we have developed for this domain.