Bio-inspired multisensory integration of social signals
Understanding emotions is a core aspect of human communication. Our social behaviours
are closely linked to expressing our own emotions and to understanding others' emotional and mental
states through social signals. Emotions are expressed in a multisensory manner: humans
use social signals from different sensory modalities such as facial expressions, vocal changes, and
body language. The human brain integrates all relevant information into a new multisensory
percept from which it derives emotional meaning.
There is great interest in emotion recognition across fields such as HCI, gaming,
marketing, and assistive technologies, and this demand is driving an increase in research on multisensory
emotion recognition. The majority of existing work extracts meaningful
features from each modality and applies fusion techniques at either the feature level or the decision
level. However, these techniques fail to capture the constant cross-talk and feedback
between modalities. Such cross-talk is particularly crucial in continuous emotion
recognition, where one modality can predict, enhance, and complete another.
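As a point of reference for these two conventional strategies, here is a minimal sketch of feature-level versus decision-level fusion; the feature dimensions and the stand-in classifier are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative per-modality features for one sample (shapes are assumptions).
audio_feat = rng.standard_normal(40)   # e.g. spectral features
video_feat = rng.standard_normal(60)   # e.g. facial-landmark features

def classify(features, n_classes=6):
    """Stand-in linear classifier: a fixed random projection plus softmax,
    a placeholder for a trained per-modality model."""
    w = np.random.default_rng(len(features)).standard_normal((n_classes, len(features)))
    logits = w @ features
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Early (feature-level) fusion: concatenate features, classify once.
early_probs = classify(np.concatenate([audio_feat, video_feat]))

# Late (decision-level) fusion: classify each modality, average the decisions.
late_probs = (classify(audio_feat) + classify(video_feat)) / 2

print(early_probs.argmax(), late_probs.argmax())
```

Neither variant lets one modality influence the other during learning, which is the limitation the thesis targets.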
This thesis proposes novel architectures for multisensory emotion recognition inspired by
multisensory integration in the brain. First, we explore bio-inspired unsupervised
learning for unisensory emotion recognition in the audio and visual modalities. We then propose
three multisensory integration models, based on different pathways for multisensory integration
in the brain: integration by convergence, early cross-modal enhancement, and integration
through neural synchrony. The proposed models are designed and implemented using third-generation
neural networks, Spiking Neural Networks (SNNs), with unsupervised learning. The
models are evaluated on widely adopted third-party datasets and compared to state-of-the-art
multimodal fusion techniques such as early, late, and deep learning fusion. Evaluation results
show that the three proposed models achieve results comparable to state-of-the-art supervised
learning techniques. More importantly, this thesis presents models that sustain constant cross-talk
between modalities during the training phase: each modality can predict, complement, and
enhance the others through constant feedback. This cross-talk yields insight
into emotions beyond what traditional fusion techniques provide.
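The abstract does not name the unsupervised learning rule; spike-timing-dependent plasticity (STDP) is the standard unsupervised rule for SNNs, so the following sketch assumes it, with illustrative constants rather than values from the thesis.

```python
import numpy as np

# Pair-based STDP: potentiate when a presynaptic spike precedes a postsynaptic
# one, depress when it follows. All constants here are illustrative.
A_PLUS, A_MINUS = 0.01, 0.012   # learning rates
TAU_PLUS = TAU_MINUS = 20.0     # time constants (ms)

def stdp_dw(t_pre, t_post):
    """Weight change for one pre/post spike pair (times in ms)."""
    dt = t_post - t_pre
    if dt > 0:   # pre fires before post -> potentiation
        return A_PLUS * np.exp(-dt / TAU_PLUS)
    else:        # post fires before pre -> depression
        return -A_MINUS * np.exp(dt / TAU_MINUS)

# Example: presynaptic spike at 10 ms, postsynaptic spike at 15 ms.
w = 0.5
w = np.clip(w + stdp_dw(10.0, 15.0), 0.0, 1.0)  # weights kept in [0, 1]
print(w)  # slightly potentiated
```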
Spike encoding techniques for IoT time-varying signals benchmarked on a neuromorphic classification task
Spiking Neural Networks (SNNs), known for their potential to enable low energy consumption and low computational cost, can bring significant advantages to the realm of embedded machine learning for edge applications. However, input coming from standard digital sensors must be encoded into spike trains before it can be processed with neuromorphic computing technologies. We present here a detailed comparison of available spike encoding techniques for the translation of time-varying signals into the event-based signal domain, tested on two different datasets, both acquired through commercially available digital devices: the Free Spoken Digit dataset (FSD), consisting of 8-kHz audio files, and the WISDM dataset, composed of 20-Hz recordings of human activity through mobile and wearable inertial sensors. We propose a complete pipeline to benchmark these encoding techniques by performing time-dependent signal classification through a Spiking Convolutional Neural Network (sCNN), including a signal preprocessing step consisting of a bank of filters inspired by the human cochlea, feature extraction by production of a sonogram, transfer learning via an equivalent ANN, and model compression schemes aimed at resource optimization. The resulting performance comparison and analysis provides a powerful practical tool, empowering developers to select the most suitable coding method based on the type of data and the desired processing algorithms, and further expands the applicability of neuromorphic computational paradigms to embedded sensor systems widely employed in the IoT and industrial domains.
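As an illustration of one encoding family commonly included in such comparisons, here is a generic sketch of delta (temporal-contrast) coding; the threshold and the toy signal are assumptions, not the paper's configuration.

```python
import numpy as np

def delta_encode(signal, threshold=0.1):
    """Delta (temporal-contrast) encoding: emit a +1/-1 event whenever the
    signal moves more than `threshold` away from the last encoded value."""
    events, last = [], signal[0]
    for t, x in enumerate(signal[1:], start=1):
        if x - last >= threshold:
            events.append((t, +1))   # ON event
            last += threshold
        elif last - x >= threshold:
            events.append((t, -1))   # OFF event
            last -= threshold
    return events

# Example on a synthetic slow signal (a stand-in for 20-Hz inertial data).
t = np.linspace(0, 1, 20)
print(delta_encode(np.sin(2 * np.pi * t)))
```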
Brain at work: time, sparseness and superposition principles
Abstract: Many studies have explored the mechanisms through which the brain encodes sensory inputs to allow
coherent behaviour. The brain could identify stimuli via a hierarchical stream of activity leading to a cardinal
neuron responsive to one particular object. The opportunity to record from numerous neurons has offered
investigators the capability of examining the functioning of many cells simultaneously. These approaches
suggest encoding processes that are parallel rather than serial. Binding the many features of a stimulus may be accomplished through an induced synchronization of cells' action potentials. These interpretations are supported by experimental data and offer many advantages, but also several shortcomings. We argue for a coding mechanism based on a sparse synchronization paradigm. We show that synchronization of spikes is a fast and efficient mode of encoding the representation of objects based on feature binding. We introduce the view that sparse synchronization coding is an interesting avenue for probing brain encoding mechanisms, as it allows the functional establishment of multilayered and time-conditioned neuronal networks, or multislice networks. We propose a model based on integrate-and-fire spiking neurons.
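A minimal sketch of how an integrate-and-fire unit can act as a coincidence detector, firing only for near-synchronous input, which is the basic operation behind synchrony-based feature binding; all parameters are illustrative, not from the article.

```python
import numpy as np

def lif_coincidence(spike_trains, dt=1.0, tau=5.0, w=0.6, v_th=1.0):
    """Leaky integrate-and-fire neuron driven by several binary spike trains.
    With a short membrane time constant, it fires only when inputs coincide."""
    n_steps = spike_trains.shape[1]
    v, out = 0.0, []
    for t in range(n_steps):
        v *= np.exp(-dt / tau)             # membrane leak
        v += w * spike_trains[:, t].sum()  # integrate synchronous input
        if v >= v_th:                      # threshold crossing -> output spike
            out.append(t)
            v = 0.0                        # reset
    return out

# Three inputs spiking together at t=10 drive a spike; a lone spike at t=30 does not.
trains = np.zeros((3, 50))
trains[:, 10] = 1
trains[0, 30] = 1
print(lif_coincidence(trains))  # [10]
```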
Bio-motivated features and deep learning for robust speech recognition
International Mention in the doctoral degree. In spite of the enormous leap forward that Automatic Speech
Recognition (ASR) technologies have experienced over the last five years,
their performance under harsh environmental conditions is still far from
that of humans, preventing their adoption in several real applications.
In this thesis the challenge of robustness of modern automatic speech
recognition systems is addressed along two main research lines.
The first focuses on modeling the human auditory system to
improve the robustness of the feature extraction stage, yielding novel
auditory-motivated features. Two main contributions are produced.
On the one hand, a model of the masking behaviour of the Human
Auditory System (HAS) is introduced, based on non-linear filtering
of a speech spectro-temporal representation applied simultaneously
to both the frequency and time domains. This filtering is accomplished
using image processing techniques, in particular mathematical
morphology operations with a specifically designed Structuring Element
(SE) that closely resembles the masking phenomena that take
place in the cochlea. On the other hand, the temporal patterns of
auditory-nerve firings are modeled. Most conventional acoustic features
are based on short-time energy per frequency band, discarding
the information contained in the temporal patterns. Our contribution
is the design of several types of feature extraction schemes based on
the synchrony effect of auditory-nerve activity, showing that modeling
this effect can indeed improve speech recognition accuracy in
the presence of additive noise. Both models are further integrated into
the well-known Power Normalized Cepstral Coefficients (PNCC).
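To illustrate the morphological-filtering idea, here is a sketch using a generic flat structuring element from SciPy; the thesis designs a specific SE shaped to mimic cochlear masking, which this toy element does not attempt to reproduce.

```python
import numpy as np
from scipy.ndimage import grey_opening

# Toy log-spectrogram: rows = frequency bands, columns = time frames.
rng = np.random.default_rng(0)
spec = rng.uniform(0, 1, size=(40, 100))
spec[18:22, 30:60] += 3.0  # a strong spectro-temporal component

# Generic flat structuring element spanning both frequency and time;
# the thesis's SE has a shape designed to resemble cochlear masking.
se = np.ones((3, 5))

# Morphological opening (erosion followed by dilation) suppresses components
# smaller than the SE, emulating masking of weak neighbours by strong ones.
masked = grey_opening(spec, footprint=se)
print(masked.shape)  # (40, 100)
```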
The second research line addresses the problem of robustness in
noisy environments by means of Deep Neural Network
(DNN)-based acoustic modeling and, in particular, Convolutional
Neural Network (CNN) architectures. A deep residual network
scheme is proposed and adapted for our purposes, allowing Residual
Networks (ResNets), originally intended for image processing tasks,
to be used in speech recognition, where the network input is small
in comparison with usual image dimensions. We have observed that
ResNets on their own already enhance the robustness of the whole system
against noisy conditions. Moreover, our experiments demonstrate
that their combination with the auditory-motivated features devised
in this thesis provides significant improvements in recognition accuracy
in comparison to other state-of-the-art CNN-based ASR systems
under mismatched conditions, while maintaining performance in
matched scenarios.
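A sketch of a basic residual block of the kind that could be adapted to small spectro-temporal inputs; channel counts, kernel sizes, and input dimensions are illustrative, not the thesis configuration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: y = ReLU(F(x) + x). Small 3x3 kernels and no
    aggressive downsampling suit inputs much smaller than typical images
    (e.g. 40 mel bands x ~100 frames)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # identity shortcut

# Example: a batch of 8 feature maps of 40 frequency bands x 100 frames.
x = torch.randn(8, 16, 40, 100)
print(ResidualBlock(16)(x).shape)  # torch.Size([8, 16, 40, 100])
```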
The proposed methods have been thoroughly tested and compared
with other state-of-the-art proposals on a variety of datasets and
conditions. The obtained results show that our methods outperform
other state-of-the-art approaches and reveal that they are suitable for
practical applications, especially where the operating conditions are
unknown.
The objective of this thesis is to propose solutions to the problem
of robust speech recognition; to this end, two lines of research
have been pursued.
In the first line, novel feature extraction schemes are proposed, based on modeling the behaviour
of the human auditory system, especially the masking and synchrony
phenomena. In the second, recognition rates are improved through
deep learning techniques used in conjunction with the proposed features.
The main objective of the proposed methods is to improve the
accuracy of the recognition system when the operating conditions
are unknown, although the opposite case has also been addressed.
Specifically, our main proposals are the following:
Simulating the human auditory system in order to improve
recognition rates under difficult conditions, mainly
in high-noise situations, by proposing novel
feature extraction schemes.
Along this line, our main proposals are detailed below:
• Modeling the masking behaviour of the human
auditory system, using image processing techniques on
the spectrum; in particular, designing a morphological
filter that captures this effect.
• Modeling the synchrony effect that takes place in the
auditory nerve.
• The integration of both models into the well-known Power
Normalized Cepstral Coefficients (PNCC).
The application of deep learning techniques with the aim
of making the system more robust against noise, in particular
through deep convolutional neural networks, such as
residual networks.
Finally, the application of the proposed features in
combination with deep neural networks, with the main objective
of obtaining significant improvements when training
and test conditions do not match. Programa Oficial de Doctorado en Multimedia y Comunicaciones. Chair: Javier Ferreiros López. Secretary: Fernando Díaz de María. Committee member: Rubén Solera Ureña.
Neural Models of Subcortical Auditory Processing
An important feature of the auditory system is its ability to distinguish many simultaneous
sound sources. The primary goal of this work was to understand how a robust, preattentive
analysis of the auditory scene is accomplished by the subcortical auditory system.
Reasonably accurate modelling of the morphology and organisation of the relevant auditory
nuclei was seen as being of great importance. The formulation of plausible models and their
subsequent simulation was found to be invaluable in elucidating biological processes and in
highlighting areas of uncertainty.
In the thesis, a review of important aspects of mammalian auditory processing is presented
and used as a basis for the subsequent modelling work. For each aspect of auditory
processing modelled, psychophysical results are described and existing models reviewed,
before the models used here are described and simulated. The auditory processes
modelled include the peripheral system, the production of tonotopic maps of the
spectral content of complex acoustic stimuli, and the mapping of modulation frequency or periodicity. A
model of the formation of sequential associations between successive sounds is described
and shown to be capable of emulating a wide range of psychophysical
behaviour. The grouping of related spectral components and the development of pitch
perception are also investigated. Finally, a critical assessment of the work and ideas for future
developments are presented.
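One standard way such periodicity maps are computed in subcortical models is autocorrelation of the channel signals; the following sketch assumes that mechanism, which may differ from the model used in the thesis, and restricts the lag search to a plausible pitch range.

```python
import numpy as np

def periodicity(signal, fs, fmin=50.0, fmax=500.0):
    """Estimate the dominant periodicity via the autocorrelation peak,
    searching only lags corresponding to the [fmin, fmax] Hz range."""
    ac = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + np.argmax(ac[lo:hi])  # best-correlating lag in range
    return fs / lag                   # periodicity in Hz

# A 200 Hz tone sampled at 8 kHz should yield ~200 Hz.
fs = 8000
t = np.arange(0, 0.05, 1 / fs)
print(periodicity(np.sin(2 * np.pi * 200 * t), fs))  # 200.0
```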
The principal contributions of this work are the further development of a model for pitch
perception and the development of a novel architecture for the sequential association of
those groups. In the process of developing these ideas, further insights into subcortical
auditory processing were gained, and explanations for a number of puzzling psychophysical
characteristics suggested.
Royal Naval Engineering College, Manadon, Plymouth.
Biologically inspired speaker verification
Speaker verification is an active research problem that has been addressed using a variety of different classification techniques. In general, however, methods inspired by the human auditory system tend to show better verification performance than other methods. In this thesis, three biologically inspired speaker verification algorithms are presented.
A Comparative Study of Computational Models of Auditory Peripheral System
An in-depth study of computational models of the auditory peripheral
system from three different research groups, Carney, Meddis and Hemmert,
is presented here. The aim is to find out which model fits the data best and
which properties of the models are relevant for speech recognition. As a
first approximation, different tests with tones were performed on seven
models. We then evaluated the results of these models in the presence
of speech: two models were studied in depth through an automatic
speech recognition (ASR) system, in clean and noisy backgrounds and for a
diversity of sound levels. The post-stimulus time histogram helps us see how
the models that improved offset adaptation exhibit the "dead time". For
its part, the synchronization evaluation for tones and modulated signals
highlighted the better results of the models with offset adaptation. Finally,
tuning curves and Q10dB (added to the ASR results), on the contrary, indicated
that selectivity is not a property needed for speech recognition. Moreover,
the evaluation of the models with ASR demonstrated that models with offset
adaptation outperform the others and that using cat or human tuning makes
little difference for speech recognition. With these results, we conclude that
the model that best fits the data is, for the most part, the one described by Zilany et al.
(2009), and that the indispensable property for speech recognition is
good offset adaptation, which offers better synchronization and better ASR
results. For the ASR system it makes no big difference whether offset adaptation comes
from a shift of the auditory nerve response or from a power-law adaptation
in the synapse. Vendrell Llopis, N. (2010). A Comparative Study of Computational Models of Auditory Peripheral System. http://hdl.handle.net/10251/20433
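To make the post-stimulus time histogram evaluation mentioned in the abstract above concrete, here is a minimal sketch over repeated trials; the spike times, bin width, and response shape are synthetic illustrations.

```python
import numpy as np

def psth(trials, bin_ms, duration_ms):
    """Post-stimulus time histogram: average spike count per bin across
    trials, expressed as a firing rate in spikes per second."""
    edges = np.arange(0, duration_ms + bin_ms, bin_ms)
    counts = sum(np.histogram(t, bins=edges)[0] for t in trials)
    return counts / (len(trials) * bin_ms / 1000.0)

# Synthetic auditory-nerve-like responses: strong onset burst, sparser sustained firing.
rng = np.random.default_rng(0)
trials = [np.concatenate([rng.uniform(0, 10, 8),      # onset burst (0-10 ms)
                          rng.uniform(10, 100, 5)])   # sustained (10-100 ms)
          for _ in range(50)]
print(psth(trials, bin_ms=10, duration_ms=100))  # high first bin, lower tail
```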