Robust Sound Event Classification using Deep Neural Networks
The automatic recognition of sound events by computers is an important aspect of emerging applications such as automated surveillance, machine hearing and auditory scene understanding. Recent advances in machine learning, as well as in computational models of the human auditory system, have contributed to advances in this increasingly popular research field. Robust sound event classification, the ability to recognise sounds under real-world noisy conditions, is an especially challenging task. Classification methods translated from the speech recognition domain, using features such as mel-frequency cepstral coefficients, have been shown to perform reasonably well for the sound event classification task, although spectrogram-based or auditory image analysis techniques reportedly achieve superior performance in noise.
This paper outlines a sound event classification framework that compares auditory image front-end features with spectrogram image-based front-end features, using support vector machine and deep neural network classifiers. Performance is evaluated on a standard robust classification task at different levels of corrupting noise, and with several system enhancements, and is shown to compare very well with current state-of-the-art classification techniques.
An automated lexical stress classification tool for assessing dysprosody in childhood apraxia of speech
Childhood apraxia of speech (CAS) commonly affects the production of lexical stress contrast in polysyllabic words. Automated classification tools have the potential to increase reliability and efficiency in measuring lexical stress. Here, factors affecting the accuracy of a custom-built deep neural network (DNN)-based classification tool are evaluated. Sixteen children with typical development (TD) and 26 with CAS produced 50 polysyllabic words. Words with strong–weak (SW, e.g., dinosaur) or weak–strong (WS, e.g., banana) stress were fed to the classification tool, and the accuracy was measured (a) against expert judgment, (b) for speaker group, and (c) with/without prior knowledge of phonemic errors in the sample. The influence of segmental features and participant factors on tool accuracy was analysed. Linear mixed modelling showed a significant interaction between group and stress type, surviving adjustment for age and CAS severity. For TD speech, agreement for SW and WS words was >80%, but for CAS speech agreement was higher for SW (>80%) than for WS (~60%). Prior knowledge of segmental errors conferred no clear advantage. Automatic lexical stress classification shows promise for identifying errors in children’s speech at diagnosis or with treatment-related change, but accuracy for WS words in apraxic speech needs improvement. Further training of algorithms using larger sets of labelled data containing impaired speech and WS words may increase accuracy.
Self-Supervised Vision-Based Detection of the Active Speaker as Support for Socially-Aware Language Acquisition
This paper presents a self-supervised method for visual detection of the
active speaker in a multi-person spoken interaction scenario. Active speaker
detection is a fundamental prerequisite for any artificial cognitive system
attempting to acquire language in social settings. The proposed method is
intended to complement the acoustic detection of the active speaker, thus
improving the system robustness in noisy conditions. The method can detect an
arbitrary number of possibly overlapping active speakers based exclusively on
visual information about their faces. Furthermore, the method does not rely on
external annotations, thus complying with cognitive development. Instead, the
method uses information from the auditory modality to support learning in the
visual domain. This paper reports an extensive evaluation of the proposed
method using a large multi-person face-to-face interaction dataset. The results
show good performance in a speaker-dependent setting. However, in a
speaker-independent setting the proposed method yields significantly lower
performance. We believe that the proposed method represents an essential
component of any artificial cognitive system or robotic platform engaging in
social interactions.
Comment: 10 pages, IEEE Transactions on Cognitive and Developmental Systems
Code-Switched Urdu ASR for Noisy Telephonic Environment using Data Centric Approach with Hybrid HMM and CNN-TDNN
Call centers hold huge amounts of audio data that can yield valuable business
insights, but transcribing phone calls manually is a tedious task. An effective
Automatic Speech Recognition (ASR) system can accurately transcribe these
calls, enabling easy search through call history for specific context and
content, automatic call monitoring, and improved QoS through keyword search and
sentiment analysis. ASR for call centers requires more robustness because
telephonic environments are generally noisy. Moreover, many low-resourced
languages are on the verge of extinction and could be preserved with the help
of ASR technology. Urdu is among the most widely spoken languages in the
world, with 231,295,440 speakers worldwide, yet it remains a
resource-constrained language in ASR. Regional call-center conversations take
place in the local language mixed with English numbers and technical terms,
which generally causes a "code-switching" problem. Hence, this paper describes
an implementation framework for a resource-efficient Automatic Speech
Recognition/Speech-to-Text system in a noisy call-center environment using a
chain hybrid HMM and CNN-TDNN for the code-switched Urdu language. The hybrid
HMM-DNN approach allowed us to exploit the advantages of neural networks with
less labelled data. Adding a CNN to the TDNN has been shown to work better in
noisy environments because the CNN's additional frequency dimension captures
extra information from noisy speech, thus improving accuracy. We collected
data from various open sources and labelled some of the unlabelled data after
analysing its general context and content, covering the Urdu language as well
as commonly used words from other languages, primarily English, and were able
to achieve a WER of 5.2% in noisy as well as clean environments, on isolated
words or numbers as well as on continuous spontaneous speech.
Comment: 32 pages, 19 figures, 2 tables, preprint
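For readers unfamiliar with the WER figure quoted in this abstract, the following is a minimal sketch of how word error rate is commonly computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the recogniser output, divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions; the abstract does not describe the authors' scoring tool or any text normalisation they applied.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (not from the paper): whitespace tokenisation only,
    no case folding or punctuation stripping.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (initially none) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of five reference words.
    print(word_error_rate("please call number five two",
                          "please call number five too"))
```

Under this definition, a WER of 5.2% corresponds to roughly one word error per nineteen reference words. Because insertions are counted, WER can exceed 100% when the hypothesis is much longer than the reference.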
Hidden Markov models and neural networks for speech recognition
The hidden Markov model (HMM) is one of the most successful modeling approaches for acoustic events in speech recognition, and more recently it has proven useful for several problems in biological sequence analysis. Although the HMM is good at capturing the temporal nature of processes such as speech, it has a very limited capacity for recognizing complex patterns involving more than first-order dependencies in the observed data sequences. This is due to the first-order state process and the assumption of state-conditional independence between observations. Artificial Neural Networks (NNs) are almost the opposite: they cannot model dynamic, temporally extended phenomena very well, but are good at static classification and regression tasks. Combining the two frameworks in a sensible way can therefore lead to a more powerful model with better classification abilities. The overall aim of this work has been to develop a probabilistic hybrid of hidden Markov models and neural networks and ..
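The two assumptions named in this abstract can be made concrete with the standard HMM factorization, sketched here in notation that is an assumption of this note rather than the author's: $q_t$ is the hidden state at time $t$, $x_t$ the observation, and $T$ the sequence length.

```latex
% Joint probability of a state sequence and an observation sequence
% under a first-order HMM (symbols q_t, x_t, T are illustrative).
P(x_{1:T}, q_{1:T})
  = P(q_1)\, P(x_1 \mid q_1)
    \prod_{t=2}^{T} P(q_t \mid q_{t-1})\, P(x_t \mid q_t)
```

The factor $P(q_t \mid q_{t-1})$ is the first-order state process, and $P(x_t \mid q_t)$ expresses state-conditional independence: given the current state, an observation does not depend on any other observation or state. Any dependency between observations therefore has to be carried through the state sequence.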