9,945 research outputs found

    Deep Learning for Audio Signal Processing

    Full text link
    Given the recent surge in developments of deep learning, this article provides a review of the state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered side-by-side, in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and potential for cross-fertilization between areas. The dominant feature representations (in particular, log-mel spectra and raw waveform) and deep learning models are reviewed, including convolutional neural networks, variants of the long short-term memory architecture, as well as more audio-specific neural network models. Subsequently, prominent deep learning application areas are covered, i.e. audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization and tracking) and synthesis and transformation (source separation, audio enhancement, generative models for speech, sound, and music synthesis). Finally, key issues and future questions regarding deep learning applied to audio signal processing are identified.Comment: 15 pages, 2 pdf figure

    BigEAR: Inferring the Ambient and Emotional Correlates from Smartphone-based Acoustic Big Data

    Get PDF
    This paper presents a novel BigEAR big data framework that employs psychological audio processing chain (PAPC) to process smartphone-based acoustic big data collected when the user performs social conversations in naturalistic scenarios. The overarching goal of BigEAR is to identify moods of the wearer from various activities such as laughing, singing, crying, arguing, and sighing. These annotations are based on ground truth relevant for psychologists who intend to monitor/infer the social context of individuals coping with breast cancer. We pursued a case study on couples coping with breast cancer to know how the conversations affect emotional and social well being. In the state-of-the-art methods, psychologists and their team have to hear the audio recordings for making these inferences by subjective evaluations that not only are time-consuming and costly, but also demand manual data coding for thousands of audio files. The BigEAR framework automates the audio analysis. We computed the accuracy of BigEAR with respect to the ground truth obtained from a human rater. Our approach yielded overall average accuracy of 88.76% on real-world data from couples coping with breast cancer.Comment: 6 pages, 10 equations, 1 Table, 5 Figures, IEEE International Workshop on Big Data Analytics for Smart and Connected Health 2016, June 27, 2016, Washington DC, US

    DESIGN AND EVALUATION OF HARMONIC SPEECH ENHANCEMENT AND BANDWIDTH EXTENSION

    Get PDF
    Improving the quality and intelligibility of speech signals continues to be an important topic in mobile communications and hearing aid applications. This thesis explored the possibilities of improving the quality of corrupted speech by cascading a log Minimum Mean Square Error (logMMSE) noise reduction system with a Harmonic Speech Enhancement (HSE) system. In HSE, an adaptive comb filter is deployed to harmonically filter the useful speech signal and suppress the noisy components to noise floor. A Bandwidth Extension (BWE) algorithm was applied to the enhanced speech for further improvements in speech quality. Performance of this algorithm combination was evaluated using objective speech quality metrics across a variety of noisy and reverberant environments. Results showed that the logMMSE and HSE combination enhanced the speech quality in any reverberant environment and in the presence of multi-talker babble. The objective improvements associated with the BWE were found to be minima

    The listening talker: A review of human and algorithmic context-induced modifications of speech

    Get PDF
    International audienceSpeech output technology is finding widespread application, including in scenarios where intelligibility might be compromised - at least for some listeners - by adverse conditions. Unlike most current algorithms, talkers continually adapt their speech patterns as a response to the immediate context of spoken communication, where the type of interlocutor and the environment are the dominant situational factors influencing speech production. Observations of talker behaviour can motivate the design of more robust speech output algorithms. Starting with a listener-oriented categorisation of possible goals for speech modification, this review article summarises the extensive set of behavioural findings related to human speech modification, identifies which factors appear to be beneficial, and goes on to examine previous computational attempts to improve intelligibility in noise. The review concludes by tabulating 46 speech modifications, many of which have yet to be perceptually or algorithmically evaluated. Consequently, the review provides a roadmap for future work in improving the robustness of speech output
    corecore