
    Automatic Environmental Sound Recognition: Performance versus Computational Cost

    In the context of the Internet of Things (IoT), sound sensing applications are required to run on embedded platforms where product pricing and form factor impose hard constraints on the available computing power. Whereas Automatic Environmental Sound Recognition (AESR) algorithms are most often developed with limited consideration for computational cost, this article investigates which AESR algorithm can make the most of a limited amount of computing power by comparing sound classification performance as a function of computational cost. Results suggest that Deep Neural Networks yield the best sound classification accuracy across a range of computational costs, while Gaussian Mixture Models offer reasonable accuracy at a consistently small cost, and Support Vector Machines fall between the two in their compromise between accuracy and computational cost.
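    A minimal sketch in Python of the accuracy-versus-cost comparison the article describes. The classifier names come from the abstract, but the accuracy figures, per-frame FLOP counts, and budget below are invented placeholders, not the article's measurements:

        # Hypothetical accuracy / cost figures for the three classifier families;
        # the article measures these empirically, the numbers here are made up.
        classifiers = {
            "GMM": {"accuracy": 0.78, "flops_per_frame": 2.0e4},
            "SVM": {"accuracy": 0.82, "flops_per_frame": 1.5e5},
            "DNN": {"accuracy": 0.88, "flops_per_frame": 8.0e5},
        }

        budget = 1.0e5  # assumed FLOPs available per frame on the embedded platform

        for name, stats in sorted(classifiers.items(),
                                  key=lambda kv: kv[1]["accuracy"], reverse=True):
            fits = stats["flops_per_frame"] <= budget
            print(f"{name}: accuracy={stats['accuracy']:.2f}, "
                  f"cost={stats['flops_per_frame']:.1e} FLOPs/frame, "
                  f"within budget: {fits}")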

    Fusion for Audio-Visual Laughter Detection

    Laughter is a highly variable signal that can express a spectrum of emotions, which makes its automatic detection a challenging but interesting task. We perform automatic laughter detection using audio-visual data from the AMI Meeting Corpus, combining (fusing) the results of separate audio and video classifiers at the decision level. The video classifier uses features based on the principal components of 20 tracked facial points; for audio we use the widely used PLP and RASTA-PLP features. Our results indicate that RASTA-PLP features outperform PLP features for laughter detection in audio. We compared classifiers based on hidden Markov models (HMMs), Gaussian mixture models (GMMs), and support vector machines (SVMs), and found that RASTA-PLP combined with a GMM gave the best performance for the audio modality, while video features classified with an SVM gave the best single-modality performance. Decision-level fusion resulted in laughter detection with significantly better performance than either single-modality classifier.
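    Decision-level fusion of this kind can be as simple as a weighted combination of the two classifiers' posterior scores. A minimal sketch, assuming weighted-sum fusion (the abstract does not specify the exact fusion rule) and made-up posteriors rather than real AMI Meeting Corpus outputs:

        import numpy as np

        def fuse_decisions(p_audio, p_video, w=0.6):
            """Weighted-sum fusion of laughter posteriors from independent
            audio and video classifiers (decision-level fusion)."""
            return w * np.asarray(p_audio) + (1.0 - w) * np.asarray(p_video)

        # Made-up posteriors for three segments (not AMI Meeting Corpus output).
        p_audio = [0.9, 0.2, 0.6]   # e.g. GMM on RASTA-PLP features
        p_video = [0.7, 0.1, 0.8]   # e.g. SVM on facial-point PCA features
        fused = fuse_decisions(p_audio, p_video)
        print(fused, fused > 0.5)   # fused posteriors and final laughter decisions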

    Inferring Room Semantics Using Acoustic Monitoring

    Knowledge of the environmental context of a user, i.e. the user's indoor location and the semantics of their environment, can facilitate the development of many location-aware applications. In this paper, we propose an acoustic monitoring technique that infers semantic knowledge about an indoor space over time, using audio recordings from it. Our technique uses the impulse response of these spaces as well as the ambient sounds produced in them to determine a semantic label for each space. As we process more recordings, we update our confidence in the assigned label. We evaluate our technique on a dataset of single-speaker human speech recordings obtained in different types of rooms at three university buildings. In our evaluation, the confidence for the true label generally outstripped the confidence for all other labels, and in some cases converged to 100% with fewer than 30 samples. (2017 IEEE International Workshop on Machine Learning for Signal Processing, Sept. 25–28, 2017, Tokyo, Japan.)
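    The confidence update over time can be read as a recursive Bayesian update of a posterior over candidate room labels. A minimal sketch; the label set and per-recording likelihoods below are hypothetical stand-ins for the paper's actual acoustic models:

        import numpy as np

        def update_confidence(prior, likelihoods):
            """One Bayesian update of the per-label confidence after a new
            recording; likelihoods[k] is p(recording | label k)."""
            posterior = prior * likelihoods
            return posterior / posterior.sum()

        labels = ["office", "lecture hall", "corridor"]       # hypothetical labels
        confidence = np.full(len(labels), 1.0 / len(labels))  # uniform prior

        # Made-up per-recording likelihoods standing in for real acoustic scores.
        for lik in ([0.6, 0.3, 0.1], [0.7, 0.2, 0.1], [0.5, 0.4, 0.1]):
            confidence = update_confidence(confidence, np.array(lik))

        print(dict(zip(labels, confidence.round(3))))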

    Assessment of severe apnoea through voice analysis, automatic speech, and speaker recognition techniques

    The electronic version of this article is the complete one and can be found online at http://asp.eurasipjournals.com/content/2009/1/982531. This study is part of an ongoing collaborative effort between the medical and the signal processing communities to promote research on applying standard Automatic Speech Recognition (ASR) techniques to the automatic diagnosis of patients with severe obstructive sleep apnoea (OSA). Early detection of severe apnoea cases is important so that patients can receive early treatment, and effective ASR-based detection could dramatically cut medical testing time. Working with a carefully designed speech database of healthy and apnoea subjects, we describe an acoustic search for distinctive apnoea voice characteristics. We also study abnormal nasalization in OSA patients by modelling vowels in nasal and non-nasal phonetic contexts using Gaussian Mixture Model (GMM) pattern recognition on speech spectra. Finally, we present experimental findings regarding the discriminative power of GMMs applied to severe apnoea detection. We achieved an 81% correct classification rate, which is very promising and underpins the interest in this line of inquiry. The activities described in this paper were funded by the Spanish Ministry of Science and Technology as part of the TEC2006-13170-C02-02 project.
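    A minimal sketch of the GMM classification step, using scikit-learn rather than whatever toolkit the authors used: one GMM per class is trained on spectral feature vectors, and a test utterance goes to the class with the higher average log-likelihood. The feature dimensionality, component count, and data are placeholders:

        import numpy as np
        from sklearn.mixture import GaussianMixture

        # One GMM per class, trained on spectral feature vectors (e.g. MFCCs);
        # the features here are random placeholders, not real apnoea data.
        rng = np.random.default_rng(0)
        X_healthy = rng.normal(0.0, 1.0, size=(500, 13))
        X_apnoea = rng.normal(0.5, 1.2, size=(500, 13))

        gmm_healthy = GaussianMixture(n_components=8,
                                      covariance_type="diag").fit(X_healthy)
        gmm_apnoea = GaussianMixture(n_components=8,
                                     covariance_type="diag").fit(X_apnoea)

        # Classify an unseen utterance by the higher mean log-likelihood per frame.
        X_test = rng.normal(0.5, 1.2, size=(100, 13))
        print("apnoea" if gmm_apnoea.score(X_test) > gmm_healthy.score(X_test)
              else "healthy")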

    Analysis of very low quality speech for mask-based enhancement

    The complexity of the speech enhancement problem has motivated many different solutions. However, most techniques address situations in which the target speech is fully intelligible and the background noise energy is low in comparison with that of the speech. Thus, while current enhancement algorithms can improve the perceived quality, the intelligibility of the speech is not increased significantly and may even be reduced. Recent research shows that the intelligibility of very noisy speech can be improved by the use of a binary mask, in which a binary weight is applied to each time-frequency bin of the input spectrogram. There are several alternative goals for the binary mask estimator, based either on the Signal-to-Noise Ratio (SNR) of each time-frequency bin or on the speech signal characteristics alone. Our approach to the binary mask estimation problem aims to preserve the important speech cues independently of the noise present by identifying time-frequency regions that contain significant speech energy. The speech power spectrum varies greatly for different types of speech sound: the energy of voiced speech sounds is concentrated in the harmonics of the fundamental frequency, while that of unvoiced sounds is, in contrast, distributed across a broad range of frequencies. To identify the presence of speech energy in a noisy speech signal we have therefore developed two detection algorithms. The first is a robust algorithm that identifies voiced speech segments and estimates their fundamental frequency; the second detects the presence of sibilants and estimates their energy distribution. In addition, we have developed a robust algorithm to estimate the active level of the speech. The outputs of these algorithms are combined with other features estimated from the noisy speech to form the input to a classifier which estimates a mask that accurately reflects the time-frequency distribution of speech energy even at low SNR levels. We evaluate a mask-based speech enhancer on a range of speech and noise signals and demonstrate a consistent increase in an objective intelligibility measure with respect to noisy speech.
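    One of the mask definitions mentioned above, the SNR-based (oracle) binary mask, can be written down directly when the clean speech and noise are known separately; the thesis instead estimates the mask from features of the noisy signal. A minimal sketch with an assumed 0 dB local criterion and toy magnitudes:

        import numpy as np

        def snr_binary_mask(speech_mag, noise_mag, lc_db=0.0):
            """Oracle SNR-based binary mask: keep a time-frequency bin when
            its local SNR exceeds the local criterion lc_db (in dB)."""
            snr_db = 20.0 * np.log10(speech_mag / (noise_mag + 1e-12) + 1e-12)
            return (snr_db > lc_db).astype(float)

        # Toy 4-frequency x 3-frame magnitude spectrograms.
        speech = np.array([[2.0, 0.1, 1.5],
                           [0.2, 1.8, 0.1],
                           [1.1, 0.1, 0.9],
                           [0.1, 0.1, 0.1]])
        noise = np.full_like(speech, 0.5)
        mask = snr_binary_mask(speech, noise)
        enhanced = mask * (speech + noise)  # mask applied to the noisy mixture
        print(mask)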

    Characterization of speaker recognition in noisy channels

    Speaker recognition is a frequently overlooked form of biometric security. Text-independent speaker identification is used by financial services, forensic experts, and human-computer interaction developers to extract information that is transmitted along with a spoken message, such as the identity, gender, age, and emotional state of a speaker. Speech features are classified as either low-level or high-level characteristics. High-level speech features are associated with syntax, dialect, and the overall meaning of a spoken message. In contrast, low-level features such as pitch and phonemic spectra are associated much more with the physiology of the human vocal tract, and are also the easiest and least computationally intensive characteristics of speech to extract. Once features are extracted, modern speaker recognition systems attempt to fit statistical classification models to them; one widely used model is the Gaussian Mixture Model (GMM). The current standard for testing speaker recognition systems is set by NIST in the frequently updated NIST Speaker Recognition Evaluation (NIST-SRE), whose results are ultimately presented as Detection Error Tradeoff (DET) curves and detection cost function scores. This thesis presents a new method of measuring the effects of channel impediments on the quality of identifications made by GMM-based speaker recognition systems. With the exception of the NIST-SRE, no standardized or extensive testing of speaker recognition systems in noisy channels has been conducted. Thorough testing of speaker recognition systems will be conducted in channel model simulators, and the NIST-SRE error metric will be evaluated against a newly proposed metric for gauging the performance and improvements of speaker recognition systems.
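    A DET curve of the kind reported under NIST-SRE is obtained by sweeping a decision threshold over genuine-trial and impostor-trial scores and recording the miss and false-alarm rates at each setting. A minimal sketch with synthetic scores, not NIST-SRE data; the equal error rate (EER) falls where the two rates cross:

        import numpy as np

        def det_points(target_scores, impostor_scores):
            """Miss and false-alarm rates swept over a decision threshold:
            the raw points behind a DET curve."""
            thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
            p_miss = np.array([(target_scores < t).mean() for t in thresholds])
            p_fa = np.array([(impostor_scores >= t).mean() for t in thresholds])
            return p_miss, p_fa

        # Synthetic genuine-trial and impostor-trial scores.
        rng = np.random.default_rng(1)
        p_miss, p_fa = det_points(rng.normal(2.0, 1.0, 1000),
                                  rng.normal(0.0, 1.0, 1000))
        i = np.argmin(np.abs(p_miss - p_fa))   # equal error rate operating point
        print(f"EER ~ {100 * (p_miss[i] + p_fa[i]) / 2:.1f}%")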

    Time–Frequency Cepstral Features and Heteroscedastic Linear Discriminant Analysis for Language Recognition

    The shifted delta cepstrum (SDC) is a widely used feature extraction method for language recognition (LRE). With a high context width due to the incorporation of multiple frames, SDC outperforms traditional delta and acceleration feature vectors. However, it also introduces correlation into the concatenated feature vector, which increases redundancy and may degrade the performance of backend classifiers. In this paper, we first propose a time-frequency cepstral (TFC) feature vector, obtained by performing a temporal discrete cosine transform (DCT) on the cepstrum matrix and selecting the transformed elements in a zigzag scan order. Beyond this, we increase discriminability through heteroscedastic linear discriminant analysis (HLDA) on the full cepstrum matrix. By imposing block-diagonal matrix constraints, the large HLDA problem is reduced to several smaller HLDA problems, yielding a block diagonal HLDA (BDHLDA) algorithm with much lower computational complexity. The BDHLDA method is finally extended to the GMM domain, using the simpler TFC features during re-estimation to provide significantly improved computation speed. Experiments on the NIST 2003 and 2007 LRE corpora show that TFC is more effective than SDC, and that GMM-based BDHLDA results in a lower equal error rate (EER) and minimum average cost (Cavg) than either the TFC or SDC approach.
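    A minimal sketch of the TFC extraction step as described: a temporal DCT over a block of cepstral frames followed by zigzag selection of the transformed elements. The anti-diagonal zigzag ordering and the number of retained coefficients are assumptions made for illustration:

        import numpy as np
        from scipy.fftpack import dct

        def tfc_features(cepstrum, n_keep=40):
            """Temporal DCT over a block of cepstral frames, then zigzag
            selection of low-order coefficients; cepstrum is (frames, ceps)."""
            C = dct(cepstrum, type=2, norm="ortho", axis=0)  # DCT along time
            n_t, n_q = C.shape
            # Zigzag by anti-diagonals: low time/quefrency indices first.
            order = sorted(((i, j) for i in range(n_t) for j in range(n_q)),
                           key=lambda ij: (ij[0] + ij[1], ij[0]))
            return np.array([C[i, j] for i, j in order[:n_keep]])

        block = np.random.randn(20, 13)   # placeholder cepstrum matrix
        print(tfc_features(block).shape)  # (40,)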

    Acoustic Scene Classification

    This work was supported by the Centre for Digital Music Platform (grant EP/K009559/1) and a Leadership Fellowship (EP/G007144/1), both from the United Kingdom Engineering and Physical Sciences Research Council.