Bio-inspired broad-class phonetic labelling
Recent studies have shown that correct labeling of phonetic classes may help current Automatic Speech Recognition (ASR) when combined with classical parsing automata based on Hidden Markov Models (HMM). In the present paper a method for Phonetic Class Labeling (PCL) based on bio-inspired speech processing is described. The methodology is based on the automatic detection of formants and formant trajectories, after a careful separation of the vocal and glottal components of speech, and on the operation of characteristic-frequency (CF) neurons in the cochlear nucleus and cortical complex of the human auditory apparatus. Examples of phonetic class labeling are given and the applicability of the method to speech processing is discussed.
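As a rough illustration of the classical signal-processing ingredient mentioned above (formant and formant-trajectory detection), here is a minimal LPC-based formant sketch in Python. It is a generic textbook recipe, not the paper's bio-inspired PCL pipeline; the synthetic test signal, resonance frequencies, and filter order are illustrative assumptions.

```python
import numpy as np
import scipy.signal as sig
import librosa

sr = 16000
# Synthetic "vowel": a pulse-like 120 Hz source passed through two resonators
# placed near 700 Hz and 1200 Hz (illustrative formant targets, not real speech).
t = np.arange(sr) / sr
y = sig.square(2 * np.pi * 120 * t)
for f_res, bw in [(700, 80), (1200, 100)]:
    r = np.exp(-np.pi * bw / sr)
    theta = 2 * np.pi * f_res / sr
    y = sig.lfilter([1.0], [1.0, -2 * r * np.cos(theta), r * r], y)

frame = y[1000:1000 + 512] * np.hamming(512)
frame = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])   # pre-emphasis
a = librosa.lpc(frame, order=10)                              # all-pole model fit
roots = np.roots(a)
roots = roots[np.imag(roots) > 0]                             # one root per conjugate pair
formants = np.sort(np.angle(roots) * sr / (2 * np.pi))        # pole angle -> frequency (Hz)
print(formants[formants > 90][:3])    # poles should cluster near the 700/1200 Hz resonances
```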
A physiologically inspired model for solving the cocktail party problem.
At a cocktail party, we can broadly monitor the entire acoustic scene to detect important cues (e.g., our names being called, or the fire alarm going off), or selectively listen to a target sound source (e.g., a conversation partner). It has recently been observed that individual neurons in the avian field L (an analog of the mammalian auditory cortex) can display broad spatial tuning to single targets and selective tuning to a target embedded in spatially distributed sound mixtures. Here, we describe a model inspired by these experimental observations and apply it to process mixtures of human speech sentences. This processing is realized in the neural spiking domain. It converts binaural acoustic inputs into cortical spike trains using a multi-stage model composed of a cochlear filter-bank, a midbrain spatial-localization network, and a cortical network. The output spike trains of the cortical network are then converted back into an acoustic waveform, using a stimulus reconstruction technique. The intelligibility of the reconstructed output is quantified using an objective measure of speech intelligibility. We apply the algorithm to single and multi-talker speech to demonstrate that the physiologically inspired algorithm is able to achieve intelligible reconstruction of an "attended" target sentence embedded in two other non-attended masker sentences. The algorithm is also robust to masker level and displays performance trends comparable to humans. The ideas from this work may help improve the performance of hearing assistive devices (e.g., hearing aids and cochlear implants), speech-recognition technology, and computational algorithms for processing natural scenes cluttered with spatially distributed acoustic objects.
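For readers who want a concrete starting point, here is a minimal sketch of only the first stage of such a pipeline: a cochlea-like gammatone filter-bank followed by half-wave rectification. The channel count, ERB spacing, and rectification step are assumptions for illustration; the midbrain spatial-localization network, the cortical spiking network, and the stimulus-reconstruction step of the actual model are not reproduced here.

```python
import numpy as np
from scipy.signal import gammatone, lfilter

def erb_space(f_low, f_high, n):
    """Center frequencies equally spaced on the ERB-rate scale (Glasberg & Moore)."""
    erb = lambda f: 21.4 * np.log10(0.00437 * f + 1.0)
    inv = lambda e: (10.0 ** (e / 21.4) - 1.0) / 0.00437
    return inv(np.linspace(erb(f_low), erb(f_high), n))

def cochleagram(x, fs, n_channels=32):
    """Band-pass each channel with a gammatone filter, then half-wave rectify."""
    out = np.empty((n_channels, len(x)))
    for i, cf in enumerate(erb_space(80.0, fs / 2 * 0.9, n_channels)):
        b, a = gammatone(cf, 'iir', fs=fs)        # 4th-order gammatone centered at cf
        out[i] = np.maximum(lfilter(b, a, x), 0)  # crude inner-hair-cell stage
    return out

fs = 16000
x = np.random.randn(fs)                # stand-in for one ear's acoustic input
print(cochleagram(x, fs).shape)        # (32 channels, 16000 samples)
```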
A Computational Model of Auditory Feature Extraction and Sound Classification
This thesis introduces a computer model that incorporates responses similar to those found in the cochlea, in sub-cortical auditory processing, and in auditory cortex. The principal aim of this work is to show that this can form the basis for a biologically plausible mechanism of auditory stimulus classification. We will show that this classification is robust to stimulus variation and time compression. In addition, the response of the system is shown to support multiple, concurrent, behaviourally relevant classifications of natural stimuli (speech).
The model incorporates transient enhancement, an ensemble of spectro-temporal filters, and a simple measure analogous to the idea of visual salience to produce a quasi-static description of the stimulus suitable either for classification with an analogue artificial neural network or, using appropriate rate coding, for a classifier based on artificial spiking neurons. We also show that the spectro-temporal ensemble can be derived from a limited class of 'formative' stimuli, consistent with a developmental interpretation of ensemble formation. In addition, ensembles chosen on information-theoretic grounds consist of filters with relatively simple geometries, which is consistent with reports of responses in mammalian thalamus and auditory cortex.
A powerful feature of this approach is that the ensemble response, from which salient auditory events are identified, amounts to a stimulus-ensemble-driven method of segmentation that respects the envelope of the stimulus and leads to a quasi-static representation of auditory events suitable for spike-rate coding.
We also present evidence that the encoded auditory events may form the basis of a representation of similarity, or second-order isomorphism, which implies a representational space that respects similarity relationships between stimuli, including novel stimuli.
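To make the idea of a spectro-temporal filter ensemble and a salience-style event measure concrete, the sketch below applies a tiny set of hand-made 2-D Gabor-like filters to a toy spectrogram and thresholds the summed rectified response. Everything here (filter shapes, threshold, toy input) is an illustrative assumption, not the thesis's learned ensemble or its spiking read-out.

```python
import numpy as np
import scipy.signal as sig

def gabor_strf(shape=(15, 15), orientation=0.0, wavelength=8.0):
    """A 2-D Gabor patch used as a crude spectro-temporal receptive field."""
    f, t = np.mgrid[-(shape[0] // 2):shape[0] // 2 + 1,
                    -(shape[1] // 2):shape[1] // 2 + 1]
    u = t * np.cos(orientation) + f * np.sin(orientation)
    return np.exp(-(t ** 2 + f ** 2) / (2 * 4.0 ** 2)) * np.cos(2 * np.pi * u / wavelength)

# Toy log-spectrogram (frequency x time); a real cochleagram would go here.
spec = np.abs(np.random.randn(64, 200))
ensemble = [gabor_strf(orientation=o) for o in (0.0, np.pi / 4, np.pi / 2)]

# Ensemble response and a per-frame "salience" trace (summed rectified energy).
responses = np.stack([sig.convolve2d(spec, k, mode='same') for k in ensemble])
salience = np.maximum(responses, 0).sum(axis=(0, 1))
events = np.flatnonzero(salience > salience.mean() + 2 * salience.std())
print(responses.shape, events[:10])   # candidate "auditory event" frames
```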
Biologically inspired speaker verification
Speaker verification is an active research problem that has been addressed using a variety of different classification techniques. However, in general, methods inspired by the human auditory system tend to show better verification performance than other methods. In this thesis three biologically inspired speaker verification algorithms are presented.
SEGAN: Speech Enhancement Generative Adversarial Network
Current speech enhancement techniques operate on the spectral domain and/or exploit some higher-level feature. The majority of them tackle a limited number of noise conditions and rely on first-order statistics. To circumvent these issues, deep networks are being increasingly used, thanks to their ability to learn complex functions from large example sets. In this work, we propose the use of generative adversarial networks for speech enhancement. In contrast to current techniques, we operate at the waveform level, training the model end-to-end, and incorporate 28 speakers and 40 different noise conditions into the same model, such that model parameters are shared across them. We evaluate the proposed model using an independent, unseen test set with two speakers and 20 alternative noise conditions. The enhanced samples confirm the viability of the proposed model, and both objective and subjective evaluations confirm its effectiveness. With that, we open the exploration of generative architectures for speech enhancement, which may progressively incorporate further speech-centric design choices to improve their performance.
Comment: 5 pages, 4 figures, accepted in INTERSPEECH 201
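A deliberately tiny sketch of adversarial training on raw waveforms is given below for orientation. It uses a least-squares GAN objective with an L1 term, but the network sizes and training details are placeholders; it is not the SEGAN architecture, which uses a much deeper encoder-decoder generator with skip connections and a latent code.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(                       # noisy waveform in, enhanced waveform out
            nn.Conv1d(1, 16, 31, padding=15), nn.PReLU(),
            nn.Conv1d(16, 16, 31, padding=15), nn.PReLU(),
            nn.Conv1d(16, 1, 31, padding=15), nn.Tanh())
    def forward(self, noisy):
        return self.net(noisy)

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(                       # judges (noisy, clean-or-enhanced) pairs
            nn.Conv1d(2, 16, 31, stride=4), nn.LeakyReLU(0.2),
            nn.Conv1d(16, 32, 31, stride=4), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, 1))
    def forward(self, noisy, candidate):
        return self.net(torch.cat([noisy, candidate], dim=1))

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
mse = nn.MSELoss()

clean = torch.randn(4, 1, 16384)                        # stand-ins for clean audio chunks
noisy = clean + 0.3 * torch.randn_like(clean)           # synthetic noisy counterparts

# Discriminator step: real pairs -> 1, enhanced pairs -> 0 (LSGAN targets).
d_real = D(noisy, clean)
d_fake = D(noisy, G(noisy).detach())
loss_d = mse(d_real, torch.ones_like(d_real)) + mse(d_fake, torch.zeros_like(d_fake))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: fool D, plus an L1 term pulling the output toward the clean signal.
enhanced = G(noisy)
d_out = D(noisy, enhanced)
loss_g = mse(d_out, torch.ones_like(d_out)) + 100 * nn.functional.l1_loss(enhanced, clean)
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
print(float(loss_d), float(loss_g))
```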
Feature extraction based on bio-inspired model for robust emotion recognition
Emotional state identification is an important issue for achieving more natural interactive speech systems. Ideally, these systems should also be able to work in real environments, in which some kind of noise is generally present. Several bio-inspired representations have been applied to artificial systems for speech processing under noise conditions. In this work, an auditory signal representation is used to obtain a novel bio-inspired set of features for emotional speech signals. These characteristics, together with other spectral and prosodic features, are used for emotion recognition under noise conditions. Neural models were trained as classifiers and the results were compared to the well-known mel-frequency cepstral coefficients. Results show that, using the proposed representations, it is possible to significantly improve the robustness of an emotion recognition system. The results were also validated in a speaker-independent scheme and with two emotional speech corpora.
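The abstract compares its bio-inspired features against mel-frequency cepstral coefficients and prosodic descriptors; the snippet below shows what such a conventional baseline feature vector might look like using librosa. The specific feature choices and summary statistics are assumptions for illustration, not the paper's feature set.

```python
import numpy as np
import librosa

def baseline_features(y, sr):
    """MFCC + simple prosodic descriptors, summarised per utterance."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)     # cepstral envelope
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)          # pitch track (prosody)
    energy = librosa.feature.rms(y=y)[0]                   # frame energy (prosody)
    # Mean/std per stream so a static classifier gets a fixed-length vector.
    stats = lambda m: np.concatenate([np.atleast_1d(m.mean(axis=-1)),
                                      np.atleast_1d(m.std(axis=-1))])
    return np.concatenate([stats(mfcc), stats(f0), stats(energy)])

sr = 16000
y = np.random.randn(sr).astype(np.float32)     # stand-in for one emotional utterance
print(baseline_features(y, sr).shape)          # fixed-length feature vector
```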
Bio-Inspired Multi-Layer Spiking Neural Network Extracts Discriminative Features from Speech Signals
Spiking neural networks (SNNs) enable power-efficient implementations due to their sparse, spike-based coding scheme. This paper develops a bio-inspired SNN that uses unsupervised learning to extract discriminative features from speech signals, which can subsequently be used in a classifier. The architecture consists of a spiking convolutional/pooling layer followed by a fully connected spiking layer for feature discovery. The convolutional layer of leaky integrate-and-fire (LIF) neurons represents primary acoustic features. The fully connected layer is equipped with a probabilistic spike-timing-dependent plasticity learning rule. This layer represents the discriminative features through probabilistic LIF neurons. To assess the discriminative power of the learned features, they are used in a hidden Markov model (HMM) for spoken digit recognition. The experimental results show performance above 96%, which compares favorably with popular statistical feature extraction methods. Our results provide a novel demonstration of unsupervised feature acquisition in an SNN.
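For readers unfamiliar with the two building blocks named in this abstract, here is a minimal, hypothetical sketch of a leaky integrate-and-fire neuron and a pair-based STDP weight update in NumPy. The time constants and learning rates are arbitrary, and the paper's probabilistic STDP rule and convolutional layout are not reproduced.

```python
import numpy as np

def lif_spikes(input_current, dt=1e-3, tau=0.02, v_thresh=1.0):
    """Integrate the input current and emit a spike (1) whenever the threshold is crossed."""
    v, spikes = 0.0, np.zeros_like(input_current)
    for t, i_t in enumerate(input_current):
        v += dt * (-v + i_t) / tau               # leaky integration toward the input
        if v >= v_thresh:
            spikes[t], v = 1.0, 0.0              # spike, then reset the membrane
    return spikes

def stdp_update(w, t_pre, t_post, a_plus=0.01, a_minus=0.012, tau=0.02):
    """Strengthen w when the pre spike precedes the post spike, weaken it otherwise."""
    dt = t_post - t_pre
    dw = a_plus * np.exp(-dt / tau) if dt >= 0 else -a_minus * np.exp(dt / tau)
    return np.clip(w + dw, 0.0, 1.0)

rng = np.random.default_rng(0)
current = rng.random(200) * 2.0                  # stand-in for a filtered speech feature
spikes = lif_spikes(current)
w = stdp_update(0.5, t_pre=0.010, t_post=0.015)  # causal pairing -> potentiation
print(int(spikes.sum()), round(w, 4))
```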