
    Deep Learning for Audio Signal Processing

    Given the recent surge in deep learning developments, this article reviews state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered side by side in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and the potential for cross-fertilization between areas. The dominant feature representations (in particular, log-mel spectra and raw waveforms) and deep learning models are reviewed, including convolutional neural networks, variants of the long short-term memory architecture, and more audio-specific neural network models. Subsequently, prominent deep learning application areas are covered, namely audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization and tracking) and synthesis and transformation (source separation, audio enhancement, and generative models for speech, sound, and music synthesis). Finally, key issues and future questions regarding deep learning applied to audio signal processing are identified. Comment: 15 pages, 2 PDF figures
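
    As a concrete illustration of the log-mel representation the review identifies as dominant, the following minimal sketch computes log-mel features with librosa; the file name and all frame parameters (16 kHz rate, 25 ms window, 10 ms hop, 64 bands) are illustrative assumptions, not values from the article.

    ```python
    import librosa

    # Load audio and compute a log-mel spectrogram; "example.wav" and all
    # parameter values are illustrative assumptions.
    y, sr = librosa.load("example.wav", sr=16000)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=400, hop_length=160, n_mels=64)
    log_mel = librosa.power_to_db(mel)  # log compression
    print(log_mel.shape)                # (n_mels, n_frames)
    ```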

    Polyphonic Sound Event Detection by using Capsule Neural Networks

    Artificial sound event detection (SED) aims to mimic the human ability to perceive and understand what is happening in the surroundings. Deep learning now offers valuable techniques for this goal, such as convolutional neural networks (CNNs). The capsule neural network (CapsNet) architecture was recently introduced in the image processing field with the intent of overcoming some known limitations of CNNs, specifically their limited robustness to affine transformations (i.e., perspective, size, orientation) and the detection of overlapping images. This motivated the authors to employ CapsNets for the polyphonic-SED task, in which multiple sound events occur simultaneously. Specifically, we propose to exploit the capsule units to represent a set of distinctive properties of each individual sound event. Capsule units are connected through so-called "dynamic routing", which encourages learning part-whole relationships and improves detection performance in a polyphonic context. This paper reports extensive evaluations carried out on three publicly available datasets, showing that the CapsNet-based algorithm not only outperforms standard CNNs but also achieves the best results compared with state-of-the-art algorithms.
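
    To make the "dynamic routing" mechanism concrete, here is a minimal NumPy sketch of routing-by-agreement in the style of Sabour et al. (2017); the shapes and iteration count are illustrative, and this is not the paper's exact implementation.

    ```python
    import numpy as np

    def squash(s, eps=1e-8):
        """CapsNet non-linearity: keeps direction, squashes length into [0, 1)."""
        norm2 = np.sum(s ** 2, axis=-1, keepdims=True)
        return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + eps)

    def dynamic_routing(u_hat, n_iters=3):
        """u_hat: predictions from lower capsules, shape (n_lower, n_upper, dim)."""
        n_lower, n_upper, _ = u_hat.shape
        b = np.zeros((n_lower, n_upper))  # routing logits
        for _ in range(n_iters):
            c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # coupling coeffs
            s = (c[..., None] * u_hat).sum(axis=0)  # weighted sum per upper capsule
            v = squash(s)                           # upper-capsule outputs
            b += (u_hat * v[None]).sum(axis=-1)     # agreement update
        return v
    ```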

    Attention-based cross-modal fusion for audio-visual voice activity detection in musical video streams

    Many previous audio-visual voice-related works focus on speech, ignoring the singing voice in the growing number of musical video streams on the Internet. For processing diverse musical video data, voice activity detection is a necessary step. This paper attempts to detect the speech and singing voices of target performers in musical video streams using audio-visual information. To integrate the audio and visual modalities, a multi-branch network is proposed that learns audio and image representations; the representations are fused by attention based on semantic similarity to shape the acoustic representations through the probability of anchor vocalization. Experiments show that the proposed audio-visual multi-branch network far outperforms the audio-only model in challenging acoustic environments, indicating that cross-modal information fusion based on semantic correlation is sensible and effective. Comment: Accepted by INTERSPEECH 2021
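
    The fusion idea (reweighting acoustic features by cross-modal semantic agreement) can be sketched as follows in PyTorch; the cosine-similarity gating shown here is an illustrative stand-in, not the paper's exact attention formulation.

    ```python
    import torch
    import torch.nn.functional as F

    def similarity_gated_fusion(audio_emb, visual_emb):
        """audio_emb, visual_emb: (batch, time, dim) outputs of the two branches."""
        sim = F.cosine_similarity(audio_emb, visual_emb, dim=-1)  # (batch, time)
        gate = torch.sigmoid(sim).unsqueeze(-1)  # per-frame weight in (0, 1)
        return audio_emb * gate                  # shaped acoustic representation
    ```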

    Pre-processing techniques for improved detection of vocalization sounds in a neonatal intensive care unit

    The sounds occurring in the noisy acoustical environment of a Neonatal Intensive Care Unit (NICU) are thought to affect the growth and neurodevelopment of preterm infants. Automatic sound detection in a NICU is a novel and challenging problem, and it is an essential step in investigating how preterm infants react to auditory stimuli of the NICU environment. In this paper, we present our work on an automatic system for the detection of vocalization sounds, which are extensively present in NICUs. The proposed system reduces the presence of irrelevant sounds prior to detection. Several pre-processing techniques are compared, based on spectral subtraction, non-negative matrix factorization, or a combination of both. The vocalization sounds are detected from the enhanced audio signal using either generative or discriminative classification models. An audio database acquired in a real-world NICU environment is used to assess the performance of the detection system in terms of frame-level miss and false alarm rates. Including the enhancement pre-processing step leads to up to a 17.54% relative improvement over the baseline.
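
    Of the pre-processing techniques compared, spectral subtraction is the simplest to sketch. The version below estimates the noise magnitude from the first half-second of the recording; the parameter values and the noise-estimation heuristic are assumptions for illustration, not the paper's configuration.

    ```python
    import numpy as np
    import librosa

    def spectral_subtraction(y, sr, noise_s=0.5, n_fft=512, hop=128, floor=0.05):
        """Subtract an estimated noise magnitude spectrum, keeping a spectral floor."""
        S = librosa.stft(y, n_fft=n_fft, hop_length=hop)
        mag, phase = np.abs(S), np.angle(S)
        n_frames = max(1, int(noise_s * sr / hop))
        noise_mag = mag[:, :n_frames].mean(axis=1, keepdims=True)  # noise estimate
        clean = np.maximum(mag - noise_mag, floor * mag)           # avoid negatives
        return librosa.istft(clean * np.exp(1j * phase), hop_length=hop)
    ```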

    AudioPairBank: Towards A Large-Scale Tag-Pair-Based Audio Content Analysis

    Recently, sound recognition has been used to identify sounds such as "car" and "river". However, sounds have nuances that may be better described by adjective-noun pairs such as "slow car" and verb-noun pairs such as "flying insects", which are underexplored. Therefore, in this work we investigate the relation between audio content and both adjective-noun and verb-noun pairs. Due to the lack of datasets with these kinds of annotations, we collected and processed the AudioPairBank corpus, consisting of a combined total of 1,123 pairs and over 33,000 audio files. One contribution is the previously unavailable documentation of the challenges and implications of collecting audio recordings with these types of labels. A second contribution is to show the degree of correlation between the audio content and the labels through sound recognition experiments, which yielded results of 70% accuracy, thereby also providing a performance benchmark. The results and study in this paper encourage further exploration of the nuances in audio and are meant to complement similar research performed on images and text in multimedia analysis. Comment: This paper is a revised version of "AudioSentibank: Large-scale Semantic Ontology of Acoustic Concepts for Audio Content Analysis".
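
    The abstract does not specify the recognizer used, but a minimal baseline for pair-label sound recognition in this spirit could look like the following; the features, file names, and labels are placeholders, not details from the paper.

    ```python
    import numpy as np
    import librosa
    from sklearn.svm import SVC

    def clip_features(path):
        """Mean-pooled MFCCs as a fixed-length clip descriptor (illustrative choice)."""
        y, sr = librosa.load(path, sr=None)
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).mean(axis=1)

    # Placeholder clips with adjective-noun / verb-noun pair labels
    paths = ["slow_car_001.wav", "flying_insects_001.wav"]
    labels = ["slow_car", "flying_insects"]
    clf = SVC(kernel="rbf").fit(np.stack([clip_features(p) for p in paths]), labels)
    ```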

    Affective games: a multimodal classification system

    Affective gaming is a relatively new field of research that exploits human emotions to influence gameplay for an enhanced player experience. Changes in a player's psychology are reflected in their behaviour and physiology; hence, recognition of such variation is a core element of affective games. Complementary sources of affect offer more reliable recognition, especially in contexts where one modality is partial or unavailable. As multimodal recognition systems, affect-aware games are subject to the practical difficulties faced by traditional trained classifiers. In addition, game-specific challenges in data collection and performance arise from the need to sustain an acceptable level of immersion. Most existing scenarios employ sensors that offer limited freedom of movement, resulting in less realistic experiences. Recent advances now offer technology that allows players to communicate more freely and naturally with the game and, furthermore, to control it without input devices. However, the affective game industry is still in its infancy and has yet to catch up with the life-like level of adaptation provided by graphics and animation.

    Speech analysis for Ambient Assisted Living: technical and user design of a vocal order system

    The evolution of ICT has led to the emergence of the smart home: a home equipped with data-processing technology that anticipates the needs of its inhabitants while trying to maintain their comfort and safety by acting on the house and by implementing connections with the outside world. Smart homes equipped with ambient intelligence technology therefore constitute a promising direction for enabling the growing number of elderly people to continue living in their own homes as long as possible. However, the technological solutions requested by this part of the population have to suit their specific needs and capabilities. These smart homes tend to be equipped with devices whose interfaces are increasingly complex and difficult for the user to control. The people most likely to benefit from these new technologies are those losing autonomy, such as people with disabilities or elderly people with cognitive deficiencies (e.g., Alzheimer's disease), yet these are the people least able to use complex interfaces, owing to their handicap or their lack of ICT understanding. It therefore becomes essential to facilitate daily life and access to the whole home automation system through the smart home. The usual tactile interfaces should be supplemented by accessible ones, in particular a system that reacts to the voice; such interfaces are also useful when the person cannot move easily. Vocal orders will provide the following functionality:
    - assistance through a traditional or vocal order;
    - indirect order regulation for better energy management;
    - a reinforced link with relatives through interfaces dedicated and adapted to the person losing autonomy;
    - greater safety through the detection of distress situations and of intrusions into the house.
    This chapter describes the steps needed to design such an audio ambient system. The first step concerns acceptability and user objections: we report a user evaluation assessing the acceptance of, and apprehension towards, this new technology. The experiment tested three important aspects of speech interaction (voice commands, communication with the outside world, and the home automation system interrupting a person's activity); it was conducted in a smart home with voice commands simulated using a Wizard of Oz technique and yielded information of great interest. The second step is a general presentation of audio sensing technology for ambient assisted living, developing different aspects of sound and speech processing and presenting the applications and challenges. The third step addresses speech recognition in the home environment. Automatic speech recognition (ASR) systems have reached good performance with close-talking microphones (e.g., a headset), but performance decreases significantly as soon as the microphone is moved away from the speaker's mouth (e.g., when the microphone is set in the ceiling). This deterioration is due to a broad variety of effects, including reverberation and the presence of undetermined background noise from sources such as TV, radio, and other devices. This part presents a system for vocal order recognition in a distant-speech context, evaluated through experiments in a dedicated flat. The chapter concludes with a discussion of the interest of the speech modality for Ambient Assisted Living.
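
    As a toy illustration of vocal order recognition on top of a distant-speech ASR front end, the following sketch maps a (possibly noisy) ASR hypothesis onto a small closed grammar of orders; the command set and acceptance threshold are invented for illustration and are not from the chapter.

    ```python
    import difflib

    COMMANDS = ["turn on the light", "turn off the light",
                "close the blinds", "call for help"]  # hypothetical order grammar

    def match_order(asr_hypothesis, threshold=0.7):
        """Return the closest known vocal order, or None if out of grammar."""
        score, command = max(
            (difflib.SequenceMatcher(None, asr_hypothesis.lower(), c).ratio(), c)
            for c in COMMANDS)
        return command if score >= threshold else None

    print(match_order("turn on the lights"))  # -> "turn on the light"
    ```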

    Glottal-synchronous speech processing

    Glottal-synchronous speech processing is a field of speech science in which the pseudoperiodicity of voiced speech is exploited. Traditionally, speech processing involves segmenting and processing short speech frames of predefined length; this may fail to exploit the inherent periodic structure of voiced speech, which glottal-synchronous speech frames have the potential to harness. Glottal-synchronous frames are often derived from the glottal closure instants (GCIs) and glottal opening instants (GOIs). The SIGMA algorithm was developed for the detection of GCIs and GOIs from the electroglottograph (EGG) signal, with a measured accuracy of up to 99.59%. For GCI and GOI detection from speech signals, the YAGA algorithm provides a measured accuracy of up to 99.84%. Multichannel speech-based approaches are shown to be more robust to reverberation than single-channel algorithms. The GCIs are applied to real-world applications including speech dereverberation, where SNR is improved by up to 5 dB, and prosodic manipulation, where the importance of voicing detection in glottal-synchronous algorithms is demonstrated by subjective testing. The GCIs are further exploited in a new area of data-driven speech modelling, providing new insights into speech production and a set of tools to aid deployment in real-world applications. The technique is shown to be applicable in areas of speech coding, identification, and artificial bandwidth extension of telephone speech.
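
    The SIGMA and YAGA algorithms themselves are more involved, but the core idea of locating GCIs in an EGG recording can be approximated by picking strong negative peaks of the differentiated EGG (dEGG), as in this simplified sketch; the thresholds and maximum F0 are illustrative assumptions, not parameters of SIGMA.

    ```python
    import numpy as np
    from scipy.signal import find_peaks

    def gci_from_egg(egg, sr, max_f0=500.0, rel_height=0.3):
        """Approximate GCIs as sharp negative dEGG peaks (simplified; not SIGMA)."""
        degg = np.diff(egg)                  # differentiated EGG
        distance = max(1, int(sr / max_f0))  # no two closures within one period
        height = rel_height * np.max(-degg)  # relative amplitude threshold
        peaks, _ = find_peaks(-degg, height=height, distance=distance)
        return peaks / sr                    # GCI times in seconds
    ```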