258 research outputs found

    Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments

    Eliminating the negative effect of non-stationary environmental noise is a long-standing research topic for automatic speech recognition that still remains an important challenge. Data-driven supervised approaches, including ones based on deep neural networks, have recently emerged as potential alternatives to traditional unsupervised approaches and, with sufficient training, can alleviate the shortcomings of the unsupervised methods in various real-life acoustic environments. In this light, we review recently developed, representative deep learning approaches for tackling non-stationary additive and convolutional degradation of speech with the aim of providing guidelines for those involved in the development of environmentally robust speech recognition systems. We separately discuss single- and multi-channel techniques developed for the front-end and back-end of speech recognition systems, as well as joint front-end and back-end training frameworks.

    Studies on noise robust automatic speech recognition

    Noise in everyday acoustic environments such as cars, traffic environments, and cafeterias remains one of the main challenges in automatic speech recognition (ASR). As a research theme, it has received wide attention in conferences and scientific journals focused on speech technology. This article collection reviews both the classic and novel approaches suggested for noise robust ASR. The articles are literature reviews written for the spring 2009 seminar course on noise robust automatic speech recognition (course code T-61.6060) held at TKK.

    A Novel Sound Reconstruction Technique based on a Spike Code (event) Representation

    This thesis focuses on the re-generation of sound from a spike based coding system. Three different types of spike based coding system have been analyzed. Two of them are biologically inspired spike based coding systems, i.e. the spikes are generated in a similar way to how our auditory nerves generate spikes. They have been called AN (Auditory Nerve) spikes and AN Onset (Amplitude Modulated Onset) spikes. Sounds have been re-generated from spikes generated by both of those spike coding techniques. A related event based coding technique has been developed by Koickal; sounds have been re-generated from spikes generated by Koickal's spike coding technique and the results compared. Our brain does not reconstruct sound from the spikes received from auditory nerves; it interprets them. But by reconstructing sounds from these spike coding techniques, we will be able to identify which spike based technique is better and more efficient for coding different types of sounds. Many issues and challenges arise in reconstructing sound from spikes and they are discussed. The AN spike technique generates the most spikes of the techniques tested, followed by Koickal's technique (54.4% lower) and the AN Onset technique (85.6% lower). Both subjective and objective types of testing have been carried out to assess the quality of reconstructed sounds from these three spike coding techniques. Four types of sounds have been used in the subjective test: string, percussion, male voice and female voice. In the objective test, these four types and many other types of sounds have been included. From the results, it has been established that AN spikes generate the best quality of decoded sounds but produce many more spikes than the others. AN Onset spikes generate better quality of decoded sounds than Koickal's technique for most sounds, except choir-type sounds and noises, while producing 68.5% fewer spikes than Koickal's spikes. This provides evidence that AN Onset spikes can outperform Koickal's spikes for most of the sound types.

    A review of Yorùbá Automatic Speech Recognition

    Automatic Speech Recognition (ASR) has recorded appreciable progress both in technology and application. Despite this progress, there still exists a wide performance gap between human speech recognition (HSR) and ASR, which has inhibited its full adoption in real-life situations. A brief review of research progress on Yorùbá Automatic Speech Recognition (ASR) is presented in this paper, focusing on variability as a factor contributing to the performance gap between HSR and ASR, with a view to examining the advances recorded and the major obstacles, and charting a way forward for the development of ASR for Yorùbá that is comparable to that of other tone languages and of developed nations. This is done through extensive surveys of the literature on ASR with a focus on Yorùbá. Though appreciable progress has been recorded in the advancement of ASR in the developed world, the reverse is the case for most developing nations, especially those of Africa. Yorùbá, like most languages in Africa, lacks both the human and material resources needed for the development of a functional ASR system, much less for taking advantage of its potential benefits. Results reveal that attaining the ultimate goal of ASR performance comparable to human level requires a deep understanding of variability factors.

    Neural Basis and Computational Strategies for Auditory Processing

    Our senses are our window to the world, and hearing is the window through which we perceive the world of sound. While seemingly effortless, the process of hearing involves complex transformations by which the auditory system consolidates acoustic information from the environment into perceptual and cognitive experiences. Studies of auditory processing try to elucidate the mechanisms underlying the function of the auditory system, and infer computational strategies that are valuable both clinically and intellectually, hence contributing to our understanding of the function of the brain. In this thesis, we adopt both an experimental and computational approach in tackling various aspects of auditory processing. We first investigate the neural basis underlying the function of the auditory cortex, and explore the dynamics and computational mechanisms of cortical processing. Our findings offer physiological evidence for a role of primary cortical neurons in the integration of sound features at different time constants, and possibly in the formation of auditory objects. Based on physiological principles of sound processing, we explore computational implementations in tackling specific perceptual questions. We exploit our knowledge of the neural mechanisms of cortical auditory processing to formulate models addressing the problems of speech intelligibility and auditory scene analysis. The intelligibility model focuses on a computational approach for evaluating loss of intelligibility, inspired by mammalian physiology and human perception. It is based on a multi-resolution filter-bank implementation of cortical response patterns, which extends into a robust metric for assessing loss of intelligibility in communication channels and speech recordings. This same cortical representation is extended further to develop a computational scheme for auditory scene analysis. The model maps perceptual principles of auditory grouping and stream formation into a computational system that combines aspects of bottom-up, primitive sound processing with an internal representation of the world. It is based on a framework of unsupervised adaptive learning with Kalman estimation. The model is extremely valuable in exploring various aspects of sound organization in the brain, allowing us to gain interesting insight into the neural basis of auditory scene analysis, as well as practical implementations for sound separation in "cocktail-party" situations.

    Predicting Speech Intelligibility

    Hearing impairment, and specifically sensorineural hearing loss, is an increasingly prevalent condition, especially amongst the ageing population. It occurs primarily as a result of damage to hair cells that act as sound receptors in the inner ear and causes a variety of hearing perception problems, most notably a reduction in speech intelligibility. Accurate diagnosis of hearing impairments is a time-consuming process and is complicated by the reliance on indirect measurements based on patient feedback due to the inaccessible nature of the inner ear. The challenges of designing hearing aids to counteract sensorineural hearing losses are further compounded by the wide range of severities and symptoms experienced by hearing impaired listeners. Computer models of the auditory periphery have been developed, based on phenomenological measurements from auditory-nerve fibres using a range of test sounds and varied conditions. It has been demonstrated that auditory-nerve representations of vowels in normal and noise-damaged ears can be ranked by a subjective visual inspection of how the impaired representations differ from the normal. This thesis seeks to expand on this procedure to use full word tests rather than single vowels, and to replace manual inspection with an automated approach using a quantitative measure. It presents a measure that can predict speech intelligibility in a consistent and reproducible manner. This new approach has practical applications as it could allow speech-processing algorithms for hearing aids to be objectively tested in early stage development without having to resort to extensive human trials. Simulated hearing tests were carried out by substituting real listeners with the auditory model. A range of signal processing techniques were used to measure the model's auditory-nerve outputs by presenting them spectro-temporally as neurograms. A neurogram similarity index measure (NSIM) was developed that allowed the impaired outputs to be compared to a reference output from a normal hearing listener simulation. A simulated listener test was developed, using standard listener test material, and was validated for predicting normal hearing speech intelligibility in quiet and noisy conditions. Two types of neurograms were assessed: temporal fine structure (TFS), which retained spike timing information; and average discharge rate or temporal envelope (ENV). Tests were carried out to simulate a wide range of sensorineural hearing losses and the results were compared to real listeners' unaided and aided performance. Simulations to predict speech intelligibility performance of NAL-RP and DSL 4.0 hearing aid fitting algorithms were undertaken. The NAL-RP hearing aid fitting algorithm was adapted using a chimaera sound algorithm which aimed to improve the TFS speech cues available to aided hearing impaired listeners. NSIM was shown to quantitatively rank neurograms with better performance than a relative mean squared error and other similar metrics. Simulated performance intensity functions predicted speech intelligibility for normal and hearing impaired listeners. The simulated listener tests demonstrated that NAL-RP and DSL 4.0 performed with similar speech intelligibility restoration levels. Using NSIM and a computational model of the auditory periphery, speech intelligibility can be predicted for both normal and hearing impaired listeners and novel hearing aids can be rapidly prototyped and evaluated prior to real listener tests.

    Towards Cognizant Hearing Aids: Modeling of Content, Affect and Attention


    Biophysical modeling of a cochlear implant system: progress on closed-loop design using a novel patient-specific evaluation platform

    The modern cochlear implant is one of the most successful neural stimulation devices, which partially mimics the workings of the auditory periphery. In the last few decades it has created a paradigm shift in hearing restoration of the deaf population, which has led to more than 324,000 cochlear implant users today. Despite its great success, there is great disparity in patient outcomes without clear understanding of the aetiology of this variance in implant performance. Furthermore, speech recognition in adverse conditions or music appreciation is still not attainable with today's commercial technology. This motivates the research for the next generation of cochlear implants that takes advantage of recent developments in electronics, neuroscience, nanotechnology, micro-mechanics, polymer chemistry and molecular biology to deliver high fidelity sound. The main difficulties in determining the root of the problem in the cases where the cochlear implant does not perform well are twofold: first, there is no clear paradigm for how the electrical stimulation is perceived as sound by the brain, and second, there is limited understanding of the plasticity effects, or learning, of the brain in response to electrical stimulation. These significant knowledge limitations impede the design of novel cochlear implant technologies, as the technical specifications that can lead to better performing implants remain undefined. The motivation of the work presented in this thesis is to compare and contrast the cochlear implant neural stimulation with the operation of the physiological healthy auditory periphery up to the level of the auditory nerve. As such, design of novel cochlear implant systems can become feasible by gaining insight into the question 'how well does a specific cochlear implant system approximate the healthy auditory periphery?', circumventing the necessity of complete understanding of the brain's comprehension of patterned electrical stimulation delivered from a generic cochlear implant device. A computational model, termed Digital Cochlea Stimulation and Evaluation Tool ('DiCoStET'), has been developed to provide an objective estimate of cochlear implant performance based on neuronal activation measures, such as vector strength and average activation. A patient-specific cochlea 3D geometry is generated using a model derived from a single anatomical measurement from a patient, using non-invasive high resolution computed tomography (HRCT), and anatomically invariant human metrics and relations. Human measurements of the neuron route within the inner ear enable an innervation pattern to be modelled which joins the space from the organ of Corti to the spiral ganglion, subsequently descending into the auditory nerve bundle. An electrode is inserted in the cochlea at a depth that is determined by the user of the tool. The geometric relation between the stimulation sites on the electrode and the spiral ganglion is used to estimate an activating function that will be unique for the specific patient's cochlear shape and electrode placement. This 'transfer function', so to speak, between electrode and spiral ganglion serves as a 'digital patient' for validating novel cochlear implant systems. The novel computational tool is intended for use by bioengineers, surgeons, audiologists and neuroscientists alike. In addition to 'DiCoStET', a second computational model is presented in this thesis, aiming at enhancing the understanding of the physiological mechanisms of hearing, specifically the workings of the auditory synapse. The purpose of this model is to provide insight into the sound encoding mechanisms of the synapse. A hypothetical mechanism is suggested in the release of neurotransmitter vesicles that permits the auditory synapse to encode temporal patterns of sound separately from sound intensity. DiCoStET was used to examine the performance of two different types of filters used for spectral analysis in the cochlear implant system, the Gammatone type filter and the Butterworth type filter. The model outputs suggest that the Gammatone type filter performs better than the Butterworth type filter. Furthermore, two stimulation strategies, the Continuous Interleaved Stimulation (CIS) and Asynchronous Interleaved Stimulation (AIS), have been compared. The estimated neuronal stimulation spatiotemporal patterns for each strategy suggest that the overall stimulation pattern is not greatly affected by the temporal sequence change. However, the finer detail of neuronal activation is different between the two strategies, and when compared to healthy neuronal activation patterns the conjecture is made that the sequential stimulation of CIS hinders the transmission of sound fine structure information to the brain. The effect of the two models developed is the feasibility of collaborative work emanating from various disciplines, especially electrical engineering, auditory physiology and neuroscience, for the development of novel cochlear implant systems. This is achieved by using the concept of a 'digital patient' whose artificial neuronal activation is compared to a healthy scenario in a computationally efficient manner to allow practical simulation times.

    Deep Learning for Distant Speech Recognition

    Deep learning is an emerging technology that is considered one of the most promising directions for reaching higher levels of artificial intelligence. Among other achievements, building computers that understand speech represents a crucial leap towards intelligent machines. Despite the great efforts of the past decades, however, a natural and robust human-machine speech interaction still appears to be out of reach, especially when users interact with a distant microphone in noisy and reverberant environments. The latter disturbances severely hamper the intelligibility of a speech signal, making Distant Speech Recognition (DSR) one of the major open challenges in the field. This thesis addresses the latter scenario and proposes some novel techniques, architectures, and algorithms to improve the robustness of distant-talking acoustic models. We first elaborate on methodologies for realistic data contamination, with a particular emphasis on DNN training with simulated data. We then investigate approaches for better exploiting speech contexts, proposing some original methodologies for both feed-forward and recurrent neural networks. Lastly, inspired by the idea that cooperation across different DNNs could be the key for counteracting the harmful effects of noise and reverberation, we propose a novel deep learning paradigm called network of deep neural networks. The analysis of the original concepts was based on extensive experimental validations conducted on both real and simulated data, considering different corpora, microphone configurations, environments, noisy conditions, and ASR tasks. Comment: PhD Thesis Unitn, 201