24,111 research outputs found
End-to-end Audiovisual Speech Activity Detection with Bimodal Recurrent Neural Models
Speech activity detection (SAD) plays an important role in current speech
processing systems, including automatic speech recognition (ASR). SAD is
particularly difficult in environments with acoustic noise. A practical
solution is to incorporate visual information, increasing the robustness of the
SAD approach. An audiovisual system has the advantage of being robust to
different speech modes (e.g., whisper speech) or background noise. Recent
advances in audiovisual speech processing using deep learning have opened
opportunities to capture in a principled way the temporal relationships between
acoustic and visual features. This study explores this idea proposing a
\emph{bimodal recurrent neural network} (BRNN) framework for SAD. The approach
models the temporal dynamic of the sequential audiovisual data, improving the
accuracy and robustness of the proposed SAD system. Instead of estimating
hand-crafted features, the study investigates an end-to-end training approach,
where acoustic and visual features are directly learned from the raw data
during training. The experimental evaluation considers a large audiovisual
corpus with over 60.8 hours of recordings, collected from 105 speakers. The
results demonstrate that the proposed framework leads to absolute improvements
up to 1.2% under practical scenarios over a VAD baseline using only audio
implemented with deep neural network (DNN). The proposed approach achieves
92.7% F1-score when it is evaluated using the sensors from a portable tablet
under noisy acoustic environment, which is only 1.0% lower than the performance
obtained under ideal conditions (e.g., clean speech obtained with a high
definition camera and a close-talking microphone).Comment: Submitted to Speech Communicatio
Synchronization and Redundancy: Implications for Robustness of Neural Learning and Decision Making
Learning and decision making in the brain are key processes critical to
survival, and yet are processes implemented by non-ideal biological building
blocks which can impose significant error. We explore quantitatively how the
brain might cope with this inherent source of error by taking advantage of two
ubiquitous mechanisms, redundancy and synchronization. In particular we
consider a neural process whose goal is to learn a decision function by
implementing a nonlinear gradient dynamics. The dynamics, however, are assumed
to be corrupted by perturbations modeling the error which might be incurred due
to limitations of the biology, intrinsic neuronal noise, and imperfect
measurements. We show that error, and the associated uncertainty surrounding a
learned solution, can be controlled in large part by trading off
synchronization strength among multiple redundant neural systems against the
noise amplitude. The impact of the coupling between such redundant systems is
quantified by the spectrum of the network Laplacian, and we discuss the role of
network topology in synchronization and in reducing the effect of noise. A
range of situations in which the mechanisms we model arise in brain science are
discussed, and we draw attention to experimental evidence suggesting that
cortical circuits capable of implementing the computations of interest here can
be found on several scales. Finally, simulations comparing theoretical bounds
to the relevant empirical quantities show that the theoretical estimates we
derive can be tight.Comment: Preprint, accepted for publication in Neural Computatio
Recognizing recurrent neural networks (rRNN): Bayesian inference for recurrent neural networks
Recurrent neural networks (RNNs) are widely used in computational
neuroscience and machine learning applications. In an RNN, each neuron computes
its output as a nonlinear function of its integrated input. While the
importance of RNNs, especially as models of brain processing, is undisputed, it
is also widely acknowledged that the computations in standard RNN models may be
an over-simplification of what real neuronal networks compute. Here, we suggest
that the RNN approach may be made both neurobiologically more plausible and
computationally more powerful by its fusion with Bayesian inference techniques
for nonlinear dynamical systems. In this scheme, we use an RNN as a generative
model of dynamic input caused by the environment, e.g. of speech or kinematics.
Given this generative RNN model, we derive Bayesian update equations that can
decode its output. Critically, these updates define a 'recognizing RNN' (rRNN),
in which neurons compute and exchange prediction and prediction error messages.
The rRNN has several desirable features that a conventional RNN does not have,
for example, fast decoding of dynamic stimuli and robustness to initial
conditions and noise. Furthermore, it implements a predictive coding scheme for
dynamic inputs. We suggest that the Bayesian inversion of recurrent neural
networks may be useful both as a model of brain function and as a machine
learning tool. We illustrate the use of the rRNN by an application to the
online decoding (i.e. recognition) of human kinematics
- …