
    Deep Learning for Audio Signal Processing

    Given the recent surge in developments of deep learning, this article provides a review of the state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered side by side, in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and potential for cross-fertilization between areas. The dominant feature representations (in particular, log-mel spectra and raw waveform) and deep learning models are reviewed, including convolutional neural networks, variants of the long short-term memory architecture, as well as more audio-specific neural network models. Subsequently, prominent deep learning application areas are covered, i.e. audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization and tracking) and synthesis and transformation (source separation, audio enhancement, generative models for speech, sound, and music synthesis). Finally, key issues and future questions regarding deep learning applied to audio signal processing are identified. Comment: 15 pages, 2 PDF figures
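
    As a quick illustration of the log-mel representation named above as a dominant feature, the following sketch computes log-mel features with librosa; the file path and frame settings (16 kHz, 25 ms window, 10 ms hop, 64 mel bands) are illustrative assumptions, not parameters from the reviewed article.

        import librosa
        import numpy as np

        # Load and resample to 16 kHz (the file path is illustrative only)
        y, sr = librosa.load("speech.wav", sr=16000)

        # Mel spectrogram: squared-magnitude STFT projected onto a mel filter bank
        # (25 ms window, 10 ms hop, 64 mel bands -- common choices for speech models)
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=64)

        # Log compression to decibels gives the "log-mel" input used by many audio DNNs
        log_mel = librosa.power_to_db(mel, ref=np.max)
        print(log_mel.shape)  # (64, n_frames)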

    Enhancing the Performance of Single-Channel Blind Source Separation by Using ConvTransFormer

    In the specialized field of audio signal processing, this study introduces a ConvTransFormer architecture aimed at enhancing the performance of single-channel blind source separation (SCBSS). The architecture combines the strengths of a multiple simple-weak attention mechanism with the triple-gating feature of a Gated Attention Unit (GAU) within the ConvTransFormer. This combination allows for a more focused and effective targeting of specific segments within the input sequence. The efficacy of the ConvTransFormer architecture is evaluated on the WSJ0-2mix dataset, a standard benchmark in the field. The results demonstrate substantial improvements in key performance metrics: a scale-invariant signal-to-noise ratio improvement (SI-SNRi) of 16.5 and a signal-to-distortion ratio improvement (SDRi) of 16.8. These metrics are crucial indicators of source-separation quality in SCBSS. The findings indicate that the proposed ConvTransFormer architecture surpasses existing methods on both SI-SNRi and SDRi. This advancement marks a significant step forward for SCBSS, offering new avenues for more effective and precise audio signal processing, especially in scenarios where isolating individual sound sources from a single-channel input is essential.
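
    For reference on the reported metrics, the following is a minimal NumPy sketch of SI-SNR and its improvement over the unprocessed mixture (SI-SNRi), following the standard definition used on WSJ0-2mix; it is not code from the paper itself.

        import numpy as np

        def si_snr(estimate: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
            """Scale-invariant SNR in dB between an estimated source and its reference."""
            estimate = estimate - estimate.mean()
            target = target - target.mean()
            # Project the estimate onto the target to obtain the scaled reference
            s_target = (np.dot(estimate, target) / (np.dot(target, target) + eps)) * target
            e_noise = estimate - s_target
            return 10.0 * np.log10((np.dot(s_target, s_target) + eps) / (np.dot(e_noise, e_noise) + eps))

        def si_snr_improvement(estimate: np.ndarray, mixture: np.ndarray, target: np.ndarray) -> float:
            """SI-SNRi: gain of the separated estimate over the unprocessed mixture."""
            return si_snr(estimate, target) - si_snr(mixture, target)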

    Joint speech and overlap detection: a benchmark over multiple audio setup and speech domains

    Voice activity detection and overlapped speech detection (VAD and OSD, respectively) are key pre-processing tasks for speaker diarization: the final segmentation performance relies heavily on the robustness of these sub-tasks. Recent studies have shown that VAD and OSD can be trained jointly using a multi-class classification model. However, these works are often restricted to a specific speech domain, providing little information about the generalization capacity of the systems. This paper proposes a new, comprehensive benchmark of different VAD and OSD models on multiple audio setups (single/multi-channel) and speech domains (e.g. media, meetings). Our 2/3-class systems, which combine a Temporal Convolutional Network with speech representations adapted to the setup, outperform state-of-the-art results. We show that joint training of the two tasks offers F1-scores similar to two dedicated VAD and OSD systems while reducing the training cost. This single architecture can also be used for both single- and multi-channel speech processing.
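
    To make the joint multi-class formulation concrete, here is a simplified PyTorch sketch of a frame-level 3-class classifier (non-speech, single-speaker speech, overlapped speech) built from dilated 1-D convolutions in the spirit of a Temporal Convolutional Network; the layer sizes and structure are illustrative assumptions, not the benchmarked architecture.

        import torch
        import torch.nn as nn

        class TCNBlock(nn.Module):
            """One dilated 1-D convolution block with a residual connection."""
            def __init__(self, channels: int, dilation: int):
                super().__init__()
                self.conv = nn.Conv1d(channels, channels, kernel_size=3,
                                      padding=dilation, dilation=dilation)
                self.norm = nn.BatchNorm1d(channels)
                self.act = nn.ReLU()

            def forward(self, x):
                return x + self.act(self.norm(self.conv(x)))

        class JointVadOsd(nn.Module):
            """Frame-level classifier over 3 classes: non-speech / speech / overlap."""
            def __init__(self, feat_dim: int = 64, channels: int = 128,
                         n_blocks: int = 4, n_classes: int = 3):
                super().__init__()
                self.proj = nn.Conv1d(feat_dim, channels, kernel_size=1)
                self.tcn = nn.Sequential(*[TCNBlock(channels, 2 ** i) for i in range(n_blocks)])
                self.head = nn.Conv1d(channels, n_classes, kernel_size=1)

            def forward(self, feats):                      # feats: (batch, feat_dim, n_frames)
                return self.head(self.tcn(self.proj(feats)))  # logits: (batch, n_classes, n_frames)

        # Usage on a batch of frame-level features (dimensions are illustrative)
        logits = JointVadOsd()(torch.randn(2, 64, 500))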

    CHORUS Deliverable 2.2: Second report - identification of multi-disciplinary key issues for gap analysis toward EU multimedia search engines roadmap

    After addressing the state of the art during the first year of CHORUS and establishing the existing landscape of multimedia search engines, we have identified and analyzed gaps in the European research effort during our second year. In this period we focused on three directions, namely technological issues, user-centred issues and use-cases, and socio-economic and legal aspects. These were assessed through two central studies: firstly, a concerted vision of the functional breakdown of a generic multimedia search engine, and secondly, representative use-case descriptions with a related discussion of the requirements they pose for technological challenges. Both studies were carried out in cooperation and consultation with the community at large through EC concertation meetings (multimedia search engines cluster), several meetings with our Think-Tank, presentations at international conferences, and surveys addressed to EU project coordinators as well as national initiative coordinators. Based on the feedback obtained, we identified two types of gaps: core technological gaps that involve research challenges, and “enablers”, which are not necessarily technical research challenges but have an impact on innovation progress. New socio-economic trends are presented, as well as emerging legal challenges.

    Interpretation of Multiparty Meetings: The AMI and AMIDA Projects

    The AMI and AMIDA projects are collaborative EU projects concerned with the automatic recognition and interpretation of multiparty meetings. This paper provides an overview of the advances we have made in these projects, with a particular focus on the multimodal recording infrastructure, the publicly available AMI corpus of annotated meeting recordings, and the speech recognition framework that we have developed for this domain.