Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article
provides a review of the state-of-the-art deep learning techniques for audio
signal processing. Speech, music, and environmental sound processing are
considered side-by-side, in order to point out similarities and differences
between the domains, highlighting general methods, problems, key references,
and potential for cross-fertilization between areas. The dominant feature
representations (in particular, log-mel spectra and raw waveform) and deep
learning models are reviewed, including convolutional neural networks, variants
of the long short-term memory architecture, as well as more audio-specific
neural network models. Subsequently, prominent deep learning application areas
are covered, i.e. audio recognition (automatic speech recognition, music
information retrieval, environmental sound detection, localization and
tracking) and synthesis and transformation (source separation, audio
enhancement, generative models for speech, sound, and music synthesis).
Finally, key issues and future questions regarding deep learning applied to
audio signal processing are identified.
Comment: 15 pages, 2 PDF figures
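The log-mel spectrogram named above as a dominant feature representation can be sketched with NumPy alone. This is a minimal illustration, not code from the review; the frame size, hop, and filter count below are arbitrary choices of mine (common speech-processing defaults at 16 kHz):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(signal, sr=16000, n_fft=512, hop=160, n_mels=40):
    """Frame the signal, take the magnitude STFT, apply a triangular
    mel filterbank, and return log energies (one row per frame)."""
    window = np.hanning(n_fft)
    frames = [signal[i:i + n_fft] * window
              for i in range(0, len(signal) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # (n_frames, n_fft//2 + 1)

    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, c):
            fbank[m - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fbank[m - 1, k] = (hi - k) / max(hi - c, 1)

    return np.log(power @ fbank.T + 1e-10)

# One second of a 440 Hz tone sampled at 16 kHz.
t = np.arange(16000) / 16000.0
feats = log_mel_spectrogram(np.sin(2 * np.pi * 440 * t))
```

In practice a library such as librosa is typically used for this; the point of the sketch is only to make the pipeline (frame, window, FFT, mel filterbank, log) explicit.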
Enhancing the Performance of Single-Channel Blind Source Separation by Using ConvTransFormer
In the field of audio signal processing, this study introduces a ConvTransFormer architecture aimed at improving single-channel blind source separation (SCBSS). The architecture combines a multiple simple-weak attention mechanism with the triple-gating feature of a Gated Attention Unit (GAU), allowing the model to target specific segments of the input sequence more effectively. It is evaluated on the WSJ0-2mix dataset, a standard benchmark in the field. The results show substantial gains in key performance metrics: a scale-invariant signal-to-noise ratio improvement (SI-SNRi) of 16.5 dB and a signal-to-distortion ratio improvement (SDRi) of 16.8 dB, both central indicators of separation quality in SCBSS. The proposed ConvTransFormer surpasses existing methods on both metrics, offering a more effective and precise approach to isolating individual sound sources from a single-channel input.
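The SI-SNR metric quoted above is conventionally computed by projecting the estimate onto the reference and comparing target energy to residual energy. A minimal sketch of that convention follows; it is my own illustration of the standard scale-invariant definition used on WSJ0-2mix, not the paper's code:

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB.

    Zero-mean both signals, project the estimate onto the reference
    (so any rescaling of the estimate cancels out), and compare the
    energy of the projected target to the energy of the residual."""
    est = est - est.mean()
    ref = ref - ref.mean()
    target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    residual = est - target
    return 10.0 * np.log10(np.dot(target, target) /
                           (np.dot(residual, residual) + eps))

# Demo: a sine reference with an orthogonal cosine disturbance at 1/10
# the amplitude, which should score close to 20 dB.
t = np.arange(8000)
ref = np.sin(2 * np.pi * t / 80)
improved = si_snr(ref + 0.1 * np.cos(2 * np.pi * t / 80), ref)
```

The reported SI-SNRi and SDRi are *improvements*, i.e. the metric on the separated output minus the metric on the unprocessed mixture.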
Joint speech and overlap detection: a benchmark over multiple audio setup and speech domains
Voice activity and overlapped speech detection (respectively VAD and OSD) are
key pre-processing tasks for speaker diarization. The final segmentation
performance highly relies on the robustness of these sub-tasks. Recent studies
have shown VAD and OSD can be trained jointly using a multi-class
classification model. However, these works are often restricted to a specific
speech domain, lacking information about the generalization capacities of the
systems. This paper proposes a new, comprehensive benchmark of VAD and
OSD models across multiple audio setups (single/multi-channel) and
speech domains (e.g. media, meetings). Our 2/3-class systems, which
combine a Temporal Convolutional Network with speech representations
adapted to the setup, outperform state-of-the-art results. We show that
jointly training the two tasks matches the F1-score of two dedicated VAD
and OSD systems while reducing the training cost. This single
architecture can also be used for both single- and multichannel speech
processing.
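The multi-class formulation described above (one classifier whose label set covers both tasks) can be illustrated with a toy label-mapping sketch. The helper names and the 3-class encoding details are my own illustration, not the paper's code:

```python
import numpy as np

def three_class_labels(speaker_counts):
    """Map per-frame active-speaker counts to a joint VAD/OSD label:
    0 = non-speech, 1 = single speaker, 2 = overlap (two or more)."""
    return np.minimum(np.asarray(speaker_counts), 2)

def decode(labels):
    """Derive both detectors' outputs from the shared label sequence,
    which is what lets one model serve both tasks."""
    vad = labels >= 1   # any speech activity
    osd = labels == 2   # overlapped speech only
    return vad, osd

# Toy frame sequence: silence, one speaker, overlap, back to silence.
frames = [0, 0, 1, 1, 2, 3, 1, 0]
labels = three_class_labels(frames)
vad, osd = decode(labels)
```

A 2-class variant would collapse labels 1 and 2 for VAD-only training; the benefit of the 3-class version is that a single posterior yields both segmentations.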
CHORUS Deliverable 2.2: Second report - identification of multi-disciplinary key issues for gap analysis toward EU multimedia search engines roadmap
After addressing the state of the art during the first year of CHORUS and establishing the existing landscape in
multimedia search engines, we identified and analyzed gaps within the European research effort during our second year.
In this period we focused on three directions, notably technological issues, user-centred issues and use cases, and
socio-economic and legal aspects. These were assessed through two central studies: firstly, a concerted vision of the
functional breakdown of a generic multimedia search engine, and secondly, representative use-case descriptions with a
related discussion of the requirements they pose as technological challenges. Both studies were carried out in
cooperation and consultation with the community at large through EC concertation meetings (the multimedia search
engines cluster), several meetings with our Think-Tank, presentations at international conferences, and surveys
addressed to coordinators of EU projects and national initiatives. Based on the feedback obtained, we identified two
types of gaps, namely core technological gaps that involve research challenges, and “enablers”, which are not
necessarily technical research challenges but have an impact on innovation progress. New socio-economic trends are
presented, as well as emerging legal challenges.
Interpretation of Multiparty Meetings: The AMI and AMIDA Projects
The AMI and AMIDA projects are collaborative EU projects concerned with the automatic recognition and interpretation of multiparty meetings. This paper provides an overview of the advances we have made in these projects, with a particular focus on the multimodal recording infrastructure, the publicly available AMI corpus of annotated meeting recordings, and the speech recognition framework that we have developed for this domain.