133 research outputs found

    Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments

    Get PDF
    Eliminating the negative effect of non-stationary environmental noise is a long-standing research topic for automatic speech recognition that stills remains an important challenge. Data-driven supervised approaches, including ones based on deep neural networks, have recently emerged as potential alternatives to traditional unsupervised approaches and with sufficient training, can alleviate the shortcomings of the unsupervised methods in various real-life acoustic environments. In this light, we review recently developed, representative deep learning approaches for tackling non-stationary additive and convolutional degradation of speech with the aim of providing guidelines for those involved in the development of environmentally robust speech recognition systems. We separately discuss single- and multi-channel techniques developed for the front-end and back-end of speech recognition systems, as well as joint front-end and back-end training frameworks

    Adaptive Audio Classification Framework for in-Vehicle Environment with Dynamic Noise Characteristics

    Get PDF
    With ever-increasing number of car-mounted electric devices that are accessed, managed, and controlled with smartphones, car apps are becoming an important part of the automotive industry. Audio classification is one of the key components of car apps as a front-end technology to enable human-app interactions. Existing approaches for audio classification, however, fall short as the unique and time-varying audio characteristics of car environments are not appropriately taken into account. Leveraging recent advances in mobile sensing technology that allows for an active and accurate driving environment detection, in this thesis, we develop an audio classification framework for mobile apps that categorizes an audio stream into music, speech, speech and music, and noise, adaptability depending on different driving environments. A case study is performed with four different driving environments, i.e., highway, local road, crowded city, and stopped vehicle. More than 420 minutes of audio data are collected including various genres of music, speech, speech and music, and noise from the driving environments

    Deep Spoken Keyword Spotting:An Overview

    Get PDF
    Spoken keyword spotting (KWS) deals with the identification of keywords in audio streams and has become a fast-growing technology thanks to the paradigm shift introduced by deep learning a few years ago. This has allowed the rapid embedding of deep KWS in a myriad of small electronic devices with different purposes like the activation of voice assistants. Prospects suggest a sustained growth in terms of social use of this technology. Thus, it is not surprising that deep KWS has become a hot research topic among speech scientists, who constantly look for KWS performance improvement and computational complexity reduction. This context motivates this paper, in which we conduct a literature review into deep spoken KWS to assist practitioners and researchers who are interested in this technology. Specifically, this overview has a comprehensive nature by covering a thorough analysis of deep KWS systems (which includes speech features, acoustic modeling and posterior handling), robustness methods, applications, datasets, evaluation metrics, performance of deep KWS systems and audio-visual KWS. The analysis performed in this paper allows us to identify a number of directions for future research, including directions adopted from automatic speech recognition research and directions that are unique to the problem of spoken KWS

    Deep sleep: deep learning methods for the acoustic analysis of sleep-disordered breathing

    Get PDF
    Sleep-disordered breathing (SDB) is a serious and prevalent condition that results from the collapse of the upper airway during sleep, which leads to oxygen desaturations, unphysiological variations in intrathoracic pressure, and sleep fragmentation. Its most common form is obstructive sleep apnoea (OSA). This has a big impact on quality of life, and is associated with cardiovascular morbidity. Polysomnography, the gold standard for diagnosing SDB, is obtrusive, time-consuming and expensive. Alternative diagnostic approaches have been proposed to overcome its limitations. In particular, acoustic analysis of sleep breathing sounds offers an unobtrusive and inexpensive means to screen for SDB, since it displays symptoms with unique acoustic characteristics. These include snoring, loud gasps, chokes, and absence of breathing. This thesis investigates deep learning methods, which have revolutionised speech and audio technology, to robustly screen for SDB in typical sleep conditions using acoustics. To begin with, the desirable characteristics for an acoustic corpus of SDB, and the acoustic definition of snoring are considered to create corpora for this study. Then three approaches are developed to tackle increasingly complex scenarios. Firstly, with the aim of leveraging a large amount of unlabelled SDB data, unsupervised learning is applied to learn novel feature representations with deep neural networks for the classification of SDB events such as snoring. The incorporation of contextual information to assist the classifier in producing realistic event durations is investigated. Secondly, the temporal pattern of sleep breathing sounds is exploited using convolutional neural networks to screen participants sleeping by themselves for OSA. The integration of acoustic features with physiological data for screening is examined. Thirdly, for the purpose of achieving robustness to bed partner breathing sounds, recurrent neural networks are used to screen a subject and their bed partner for SDB in the same session. Experiments conducted on the constructed corpora show that the developed systems accurately classify SDB events, screen for OSA with high sensitivity and specificity, and screen a subject and their bed partner for SDB with encouraging performance. In conclusion, this thesis makes promising progress in improving access to SDB diagnosis through low-cost and non-invasive methods

    A new system to detect coronavirus social distance violation

    Get PDF
    In this paper, a novel solution to avoid new infections is presented. Instead of tracing users’ locations, the presence of individuals is detected by analysing the voices, and people’s faces are detected by the camera. To do this, two different Android applications were implemented. The first one uses the camera to detect people’s faces whenever the user answers or performs a phone call. Firebase Platform will be used to detect faces captured by the camera and determine its size and estimate their distance to the phone terminal. The second application uses voice biometrics to differentiate the users’ voice from unknown speakers and creates a neural network model based on 5 samples of the user’s voice. This feature will only be activated whenever the user is surfing the Internet or using other applications to prevent undesired contacts. Currently, the patient’s tracking is performed by geolocation or by using Bluetooth connection. Although face detection and voice recognition are existing methods, this paper aims to use them and integrate both in a single device. Our application cannot violate privacy since it does not save the data used to carry out the detection and does not associate this data to people

    Machine Learning for Human Activity Detection in Smart Homes

    Get PDF
    Recognizing human activities in domestic environments from audio and active power consumption sensors is a challenging task since on the one hand, environmental sound signals are multi-source, heterogeneous, and varying in time and on the other hand, the active power consumption varies significantly for similar type electrical appliances. Many systems have been proposed to process environmental sound signals for event detection in ambient assisted living applications. Typically, these systems use feature extraction, selection, and classification. However, despite major advances, several important questions remain unanswered, especially in real-world settings. A part of this thesis contributes to the body of knowledge in the field by addressing the following problems for ambient sounds recorded in various real-world kitchen environments: 1) which features, and which classifiers are most suitable in the presence of background noise? 2) what is the effect of signal duration on recognition accuracy? 3) how do the SNR and the distance between the microphone and the audio source affect the recognition accuracy in an environment in which the system was not trained? We show that for systems that use traditional classifiers, it is beneficial to combine gammatone frequency cepstral coefficients and discrete wavelet transform coefficients and to use a gradient boosting classifier. For systems based on deep learning, we consider 1D and 2D CNN using mel-spectrogram energies and mel-spectrograms images, as inputs, respectively and show that the 2D CNN outperforms the 1D CNN. We obtained competitive classification results for two such systems and validated the performance of our algorithms on public datasets (Google Brain/TensorFlow Speech Recognition Challenge and the 2017 Detection and Classification of Acoustic Scenes and Events Challenge). Regarding the problem of the energy-based human activity recognition in a household environment, machine learning techniques to infer the state of household appliances from their energy consumption data are applied and rule-based scenarios that exploit these states to detect human activity are used. Since most activities within a house are related with the operation of an electrical appliance, this unimodal approach has a significant advantage using inexpensive smart plugs and smart meters for each appliance. This part of the thesis proposes the use of unobtrusive and easy-install tools (smart plugs) for data collection and a decision engine that combines energy signal classification using dominant classifiers (compared in advanced with grid search) and a probabilistic measure for appliance usage. It helps preserving the privacy of the resident, since all the activities are stored in a local database. DNNs received great research interest in the field of computer vision. In this thesis we adapted different architectures for the problem of human activity recognition. We analyze the quality of the extracted features, and more specifically how model architectures and parameters affect the ability of the automatically extracted features from DNNs to separate activity classes in the final feature space. Additionally, the architectures that we applied for our main problem were also applied to text classification in which we consider the input text as an image and apply 2D CNNs to learn the local and global semantics of the sentences from the variations of the visual patterns of words. This work helps as a first step of creating a dialogue agent that would not require any natural language preprocessing. Finally, since in many domestic environments human speech is present with other environmental sounds, we developed a Convolutional Recurrent Neural Network, to separate the sound sources and applied novel post-processing filters, in order to have an end-to-end noise robust system. Our algorithm ranked first in the Apollo-11 Fearless Steps Challenge.Horizon 2020 research and innovation program under the Marie Sklodowska-Curie grant agreement No. 676157, project ACROSSIN

    Machine Learning and Signal Processing Design for Edge Acoustic Applications

    Get PDF

    Machine Learning and Signal Processing Design for Edge Acoustic Applications

    Get PDF
    corecore