Prediction of speech intelligibility based on a correlation metric in the envelope power spectrum domain
A powerful tool to investigate speech perception is the use of speech intelligibility prediction models. Recently, a model was presented, termed the correlation-based speech-based envelope power spectrum model (sEPSMcorr) [1], based on the auditory processing of the multi-resolution speech-based Envelope Power Spectrum Model (mr-sEPSM) [2], combined with the correlation back-end of the Short-Time Objective Intelligibility measure (STOI) [3]. The sEPSMcorr accurately predicts normal-hearing (NH) data for a broad range of listening conditions, e.g., additive noise, phase jitter and ideal binary mask processing.
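As an illustration only, the sketch below shows the general shape of such a correlation back-end: short-time envelope segments of the clean and degraded signals are compared by normalised correlation and averaged. It is a single-band toy, not the published sEPSMcorr (which operates on a multi-resolution, multi-band envelope power representation), and the segment length is an arbitrary choice.

```python
import numpy as np
from scipy.signal import hilbert

def envelope(x):
    """Temporal envelope via the magnitude of the analytic signal."""
    return np.abs(hilbert(x))

def correlation_metric(clean, degraded, seg_len=512):
    """Toy correlation back-end: mean short-time normalised correlation
    between the envelopes of clean and degraded speech. Single-band
    simplification; seg_len is illustrative."""
    env_c, env_d = envelope(clean), envelope(degraded)
    rhos = []
    for i in range(len(env_c) // seg_len):
        c = env_c[i * seg_len:(i + 1) * seg_len]
        d = env_d[i * seg_len:(i + 1) * seg_len]
        c, d = c - c.mean(), d - d.mean()
        denom = np.linalg.norm(c) * np.linalg.norm(d)
        if denom > 0:
            rhos.append(np.dot(c, d) / denom)
    return float(np.mean(rhos))  # higher -> predicted higher intelligibility
```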
Deep Spiking Neural Network model for time-variant signals classification: a real-time speech recognition approach
Speech recognition has become an important task
to improve the human-machine interface. Taking into account
the limitations of current automatic speech recognition systems,
such as non-real-time cloud-based solutions or power demand,
recent interest in neural networks and bio-inspired systems has
motivated the implementation of new techniques.
Among them, a combination of spiking neural networks and
neuromorphic auditory sensors offers an alternative way to carry
out the human-like speech processing task. In this approach,
a spiking convolutional neural network model was implemented,
in which the connection weights were calculated by training
a convolutional neural network with specific activation functions,
using firing-rate-based static images built from the spiking information
obtained from a neuromorphic cochlea.
The system was trained and tested with a large dataset
that contains "left" and "right" speech commands, achieving
89.90% accuracy. A novel spiking neural network model has been
proposed to adapt the network that was trained with static
images to a non-static processing approach, making it possible
to classify audio signals and time series in real time.
Ministerio de Economía y Competitividad TEC2016-77785-
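To make the "firing-rate-based static images" concrete, here is a minimal sketch of how spike events from a neuromorphic cochlea could be binned into such an image; the channel count, bin count and duration are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def spikes_to_rate_image(spike_times, spike_channels,
                         n_channels=64, n_bins=32, duration=1.0):
    """Bin cochlea spike events into a (channels x time-bins) firing-rate
    image, the kind of static input described for CNN training.
    All dimensions here are illustrative."""
    image = np.zeros((n_channels, n_bins))
    bin_edges = np.linspace(0.0, duration, n_bins + 1)
    for t, ch in zip(spike_times, spike_channels):
        b = np.searchsorted(bin_edges, t, side="right") - 1
        if 0 <= b < n_bins and 0 <= ch < n_channels:
            image[ch, b] += 1
    return image / (duration / n_bins)  # counts -> spikes per second
```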
A Neural-Network Framework for the Design of Individualised Hearing-Loss Compensation
Even though sound processing in the human auditory system is complex and
highly non-linear, hearing aids (HAs) still rely on simplified descriptions of
auditory processing or hearing loss to restore hearing. Standard HA
amplification strategies succeed in restoring audibility of faint sounds, but
fall short of providing targeted treatments for complex sensorineural
deficits. To address this challenge, biophysically realistic models of human
auditory processing can be adopted in the design of individualised HA
strategies, but these are typically non-differentiable and computationally
expensive. Therefore, this study proposes a differentiable DNN framework that
can be used to train DNN-based HA models based on biophysical
auditory-processing differences between normal-hearing and hearing-impaired
models. We investigate the restoration capabilities of our DNN-based
hearing-loss compensation for different loss functions, to optimally compensate
for a mixed outer-hair-cell (OHC) loss and cochlear-synaptopathy (CS)
impairment. After evaluating which trained DNN-HA model yields the best
restoration outcomes on simulated auditory responses and speech
intelligibility, we applied the same training procedure to two milder
hearing-loss profiles with OHC loss or CS alone. Our results show that
auditory-processing restoration was possible for all considered hearing-loss
cases, with OHC loss proving easier to compensate than CS. Several objective
metrics were considered to estimate the expected perceptual benefit after
processing, and these simulations hold promise for improved
speech-in-noise understanding in hearing-impaired listeners who use our
DNN-HA processing. Since our framework can be tuned to the hearing-loss
profiles of individual listeners, we enter an era where truly individualised,
DNN-based hearing-restoration strategies can be developed and tested
experimentally.
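The core training idea lends itself to a schematic sketch: the DNN hearing-aid model is optimised so that a hearing-impaired auditory model's response to the processed signal approaches the normal-hearing model's response to the original. The PyTorch loop below assumes placeholder differentiable models (nh_model, hi_model) and an MSE loss, which is only one of several loss choices such a study could compare; none of the names are the framework's actual API.

```python
import torch

def train_dnn_ha(ha_model, nh_model, hi_model, loader, epochs=10, lr=1e-4):
    """Schematic of the framework's core idea: train the DNN hearing aid
    (ha_model) so that the hearing-impaired auditory model's response to
    the processed signal matches the normal-hearing model's response to
    the unprocessed signal. nh_model/hi_model stand in for differentiable
    auditory models; MSE is an illustrative loss choice."""
    opt = torch.optim.Adam(ha_model.parameters(), lr=lr)
    for _ in range(epochs):
        for audio in loader:
            target = nh_model(audio).detach()     # normal-hearing reference
            response = hi_model(ha_model(audio))  # impaired response to processed audio
            loss = torch.mean((response - target) ** 2)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return ha_model
```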
Discrimination of Speech From Non-Speech Based on Multiscale Spectro-Temporal Modulations
We describe a content-based audio classification algorithm based on novel multiscale spectro-temporal modulation features inspired by a model of auditory cortical processing. The task explored is to discriminate speech from non-speech consisting of animal vocalizations, music and environmental sounds. Although this is a relatively easy task for humans, it is still difficult to automate well, especially in noisy and reverberant environments. The auditory model captures basic processes occurring from the early cochlear stages to the central cortical areas. The model generates a multidimensional spectro-temporal representation of the sound, which is then analyzed by a multi-linear dimensionality reduction technique and classified by a Support Vector Machine (SVM). Generalization of the system to signals with high levels of additive noise and reverberation is evaluated and compared to two existing approaches [1], [2]. The results demonstrate the advantages of the auditory model over the other two systems, especially at low SNRs and high reverberation.
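A toy stand-in for this pipeline might look as follows: crude spectro-temporal modulation features (the 2-D FFT of a log-spectrogram) fed to an SVM. This replaces the cortical rate-scale model and the multi-linear dimensionality reduction with much simpler surrogates; n_keep and the spectrogram settings are arbitrary choices, not the paper's.

```python
import numpy as np
from scipy.signal import spectrogram
from sklearn.svm import SVC

def modulation_features(x, fs, n_keep=16):
    """Crude spectro-temporal modulation features: magnitude of the
    2-D FFT of a log-spectrogram, keeping only the lowest modulation
    frequencies. A simple surrogate for the cortical rate-scale
    analysis; n_keep is illustrative."""
    _, _, S = spectrogram(x, fs=fs, nperseg=256, noverlap=128)
    M = np.abs(np.fft.fft2(np.log(S + 1e-10)))
    return M[:n_keep, :n_keep].ravel()

# Usage sketch: extract features per labelled clip, then fit an SVM.
# X = np.array([modulation_features(clip, fs) for clip in clips])
# clf = SVC(kernel="rbf").fit(X, labels)  # labels: speech vs non-speech
```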
Spectral discontinuity in concatenative speech synthesis – perception, join costs and feature transformations
This thesis explores the problem of determining an objective measure to represent human perception of spectral discontinuity in concatenative speech synthesis. Such measures are used as join costs to quantify the compatibility of speech units for concatenation in unit selection synthesis. No previous study has reported a spectral measure that satisfactorily correlates with human perception of discontinuity. An analysis of the limitations of existing measures and our understanding of the human auditory system were used to guide the strategies adopted to advance a solution to this problem.
A listening experiment was conducted using a database of concatenated speech with results indicating the perceived continuity of each concatenation. The results of this experiment were used to correlate proposed measures of spectral continuity with the perceptual results. A number of standard speech parametrisations and distance measures were tested as measures of spectral continuity and analysed to identify their limitations. Time-frequency resolution was found to limit the performance of standard speech parametrisations.
As a solution to this problem, measures of continuity based on the wavelet transform were proposed and tested, as wavelets offer superior time-frequency resolution to standard spectral measures. A further limitation of standard speech parametrisations is that they are typically computed from the magnitude spectrum. However, the auditory system combines information relating to the magnitude spectrum, phase spectrum and spectral dynamics. The potential of phase and spectral dynamics as measures of spectral continuity was investigated. One widely adopted approach to detecting discontinuities is to compute the Euclidean distance between feature vectors about the join in concatenated speech. The detection of an auditory event, such as the detection of a discontinuity, involves processing high up the auditory pathway in the central auditory system. The basic Euclidean distance cannot model such behaviour. A study was conducted to investigate feature transformations with sufficient processing complexity to mimic high-level auditory processing. Neural networks and principal component analysis were investigated as feature transformations.
Wavelet-based measures were found to outperform all measures of continuity based on standard speech parametrisations. Phase- and spectral-dynamics-based measures were found to correlate with human perception of discontinuity in the test database, although neither measure was found to contribute a significant increase in performance when combined with standard measures of continuity. Neural network feature transformations were found to significantly outperform all other measures tested in this study, producing correlations with perceptual results in excess of 90%.
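The baseline the thesis improves upon can be stated in a few lines: a Euclidean distance between the feature vectors on either side of the join. The sketch below assumes precomputed frame-level features of any parametrisation (e.g. MFCCs); the thesis's contribution, schematically, is what to put in place of those raw features before this distance is taken.

```python
import numpy as np

def euclidean_join_cost(feats_left, feats_right):
    """Baseline join cost: Euclidean distance between the feature
    vector of the last frame before the join and the first frame
    after it. feats_* are (frames x dims) arrays of any standard
    speech parametrisation."""
    return float(np.linalg.norm(feats_left[-1] - feats_right[0]))

# Schematically, the thesis's best result comes from passing the raw
# features through a learned transform (e.g. a neural-network
# embedding) before computing this distance.
```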
Recognizing Speech in a Novel Accent: The Motor Theory of Speech Perception Reframed
The motor theory of speech perception holds that we perceive the speech of
another in terms of a motor representation of that speech. However, when we
have learned to recognize a foreign accent, it seems plausible that recognition
of a word rarely involves reconstruction of the speech gestures of the speaker
rather than the listener. To better assess the motor theory and this
observation, we proceed in three stages. Part 1 places the motor theory of
speech perception in a larger framework based on our earlier models of the
adaptive formation of mirror neurons for grasping, and for viewing extensions
of that mirror system as part of a larger system for neuro-linguistic
processing, augmented by the present consideration of recognizing speech in a
novel accent. Part 2 then offers a novel computational model of how a listener
comes to understand the speech of someone speaking the listener's native
language with a foreign accent. The core tenet of the model is that the
listener uses hypotheses about the word the speaker is currently uttering to
update probabilities linking the sound produced by the speaker to phonemes in
the native language repertoire of the listener. This, on average, improves the
recognition of later words. This model is neutral regarding the nature of the
representations it uses (motor vs. auditory). It serves as a reference point
for the discussion in Part 3, which proposes a dual-stream neuro-linguistic
architecture to revisit claims for and against the motor theory of speech
perception and the relevance of mirror neurons, and extracts some implications
for the reframing of the motor theory.
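The model's core tenet can be caricatured in a few lines of bookkeeping: word hypotheses supply phoneme labels for the sounds just heard, and accumulated counts approximate P(phoneme | sound). The one-to-one sound-phoneme alignment and the count-based posterior below are simplifying assumptions for illustration, not the paper's actual update rule.

```python
from collections import defaultdict

def update_accent_map(counts, heard_sounds, hypothesized_phonemes):
    """Toy version of the core tenet: given a hypothesis about the word
    being uttered, accumulate evidence linking each heard sound to the
    native phoneme it aligned with. Normalised counts approximate
    P(phoneme | sound), improving recognition of later words.
    One-to-one alignment is a simplification."""
    for sound, phoneme in zip(heard_sounds, hypothesized_phonemes):
        counts[sound][phoneme] += 1
    return counts

def phoneme_posterior(counts, sound):
    """Empirical P(phoneme | sound) from the accumulated counts."""
    total = sum(counts[sound].values())
    return {p: c / total for p, c in counts[sound].items()} if total else {}

# counts = defaultdict(lambda: defaultdict(int))
# update_accent_map(counts, ["d"], ["th"])  # accented /th/ heard as "d"
```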
A spiking neural network for real-time Spanish vowel phonemes recognition
This paper explores the capabilities of a neuromorphic approach applied to real-time speech processing. A spiking
recognition neural network composed of three types of neurons is proposed. These neurons are based on an
integrate-and-fire model and are capable of recognizing auditory frequency patterns, such as vowel phonemes;
words are recognized as sequences of vowel phonemes. To demonstrate real-time operation, a complete
spiking recognition neural network has been described in VHDL for detecting certain Spanish words, and it has
been tested on an FPGA platform. This is a stand-alone and fully hardware system, which allows it to be embedded in a
mobile system. To stimulate the network, a spiking digital-filter-based cochlea has been implemented in VHDL.
In the implementation, an Address Event Representation (AER) is used for transmitting information between
neurons.
Ministerio de Economía y Competitividad TEC2012-37868-C04-02/0
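For readers without VHDL, a software caricature of the integrate-and-fire unit such a network builds on might look like this; the weight, leak and threshold values are illustrative, not taken from the paper.

```python
import numpy as np

def integrate_and_fire(input_spikes, weight=0.3, leak=0.02, threshold=1.0):
    """Software sketch of an integrate-and-fire unit: accumulate weighted
    input spikes, leak the membrane potential each step, and fire (then
    reset) on crossing threshold. All constants are illustrative."""
    v = 0.0
    out = np.zeros(len(input_spikes))
    for t, s in enumerate(input_spikes):
        v = max(v + weight * s - leak, 0.0)
        if v >= threshold:
            out[t] = 1.0
            v = 0.0  # reset after firing
    return out
```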
Low-Level Information and High-Level Perception: The Case of Speech in Noise
Auditory information is processed in a fine-to-crude hierarchical scheme, from low-level acoustic information to high-level abstract representations, such as phonological labels. We now ask whether fine acoustic information, which is not retained at high levels, can still be used to extract speech from noise. Previous theories suggested either full availability of low-level information or availability that is limited by task difficulty. We propose a third alternative, based on the Reverse Hierarchy Theory (RHT), originally derived to describe the relations between the processing hierarchy and visual perception. RHT asserts that only the higher levels of the hierarchy are immediately available for perception. Direct access to low-level information requires specific conditions, and can be achieved only at the cost of concurrent comprehension. We tested the predictions of these three views in a series of experiments in which we measured the benefits of utilizing low-level binaural information for speech perception, and compared them to those predicted by a model of the early auditory system. Only auditory RHT could account for the full pattern of the results, suggesting that similar defaults and tradeoffs underlie the relations between hierarchical processing and perception in the visual and auditory modalities.
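The kind of low-level binaural information at stake can be illustrated with a toy N0Sπ example: when the masking noise is identical at the two ears and the signal is in antiphase, an equalization-cancellation-style subtraction removes the external noise and leaves the signal. The internal-noise level below is an arbitrary assumption standing in for the limits a real early-auditory model would impose.

```python
import numpy as np

rng = np.random.default_rng(0)
fs, n = 16000, 16000
t = np.arange(n) / fs
signal = 0.1 * np.sin(2 * np.pi * 500 * t)       # 500 Hz tone (toy "speech")
noise = rng.normal(0.0, 1.0, n)                  # interaurally identical masker (N0)
internal_l = rng.normal(0.0, 0.05, n)            # small uncorrelated internal noise
internal_r = rng.normal(0.0, 0.05, n)

# N0Spi: same external noise at both ears, signal in antiphase (Spi).
left = noise + signal + internal_l
right = noise - signal + internal_r

# EC-style cancellation: the correlated external noise drops out,
# leaving the signal plus residual internal noise.
residual_masker = (internal_l - internal_r) / 2

def snr_db(s, m):
    return 10 * np.log10(np.mean(s**2) / np.mean(m**2))

print(snr_db(signal, noise + internal_l))  # monaural SNR (poor)
print(snr_db(signal, residual_masker))     # SNR after binaural cancellation (large gain)
```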