4 research outputs found
Twin Regularization for online speech recognition
Online speech recognition is crucial for developing natural human-machine
interfaces. This modality, however, is significantly more challenging than
off-line ASR, since real-time/low-latency constraints inevitably hinder the use
of future information, which is known to be very helpful for making robust
predictions. A popular solution to mitigate this issue consists of feeding
neural acoustic models with context windows that gather some future frames.
This introduces a latency which depends on the number of employed look-ahead
features. This paper explores a different approach, based on estimating the
future rather than waiting for it. Our technique encourages the hidden
representations of a unidirectional recurrent network to embed some useful
information about the future. Inspired by a recently proposed technique called
Twin Networks, we add a regularization term that forces forward hidden states
to be as close as possible to cotemporal backward ones, computed by a "twin"
neural network running backwards in time. The experiments, conducted on a
number of datasets, recurrent architectures, input features, and acoustic
conditions, have shown the effectiveness of this approach. One important
advantage is that our method does not introduce any additional computation at
test time compared to standard unidirectional recurrent networks.
Comment: Accepted at INTERSPEECH 2018
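To make the regularizer concrete, the following is a minimal PyTorch sketch of the idea; the dimensions, the GRU cells, the mean-squared penalty, and the choice to detach the backward states are illustrative assumptions, not the paper's exact recipe.

    import torch
    import torch.nn as nn

    # Illustrative sizes (not from the paper).
    batch, num_frames, feat_dim, hidden_dim = 8, 100, 40, 256

    # Online forward RNN and its "twin" running backwards in time.
    fwd_rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)
    bwd_rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)

    x = torch.randn(batch, num_frames, feat_dim)

    h_fwd, _ = fwd_rnn(x)                        # left-to-right hidden states
    h_bwd, _ = bwd_rnn(torch.flip(x, dims=[1]))  # twin consumes the reversed input
    h_bwd = torch.flip(h_bwd, dims=[1])          # re-align so states are cotemporal

    # Twin regularization: pull each forward state toward the cotemporal
    # backward state. Detaching the backward states treats them as targets
    # (one possible choice; the paper's exact gradient flow may differ).
    twin_loss = ((h_fwd - h_bwd.detach()) ** 2).mean()

    # total_loss = task_loss + lambda_twin * twin_loss  # lambda_twin: tuning weight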
Interpretable Convolutional Filters with SincNet
Deep learning currently plays a crucial role in the progress toward higher levels of
artificial intelligence. This paradigm allows neural networks to learn complex
and abstract representations that are progressively obtained by combining
simpler ones. Nevertheless, the internal "black-box" representations
automatically discovered by current neural architectures often suffer from a
lack of interpretability, making the study of explainable machine learning
techniques of primary interest. This paper summarizes our recent efforts to
develop a more interpretable neural model for directly processing speech from
the raw waveform. In particular, we propose SincNet, a novel Convolutional
Neural Network (CNN) that encourages the first layer to discover more
meaningful filters by exploiting parametrized sinc functions. In contrast to
standard CNNs, which learn all the elements of each filter, only the low and high
cutoff frequencies of band-pass filters are directly learned from data. This
inductive bias offers a very compact way to derive a customized filter-bank
front-end that depends on only a few parameters, each with a clear physical meaning.
Our experiments, conducted on both speaker and speech recognition, show that
the proposed architecture converges faster, performs better, and is more
interpretable than standard CNNs.
Comment: In Proceedings of NIPS@IRASL 2018. arXiv admin note: substantial text
overlap with arXiv:1808.00158
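For intuition, the following is a minimal PyTorch sketch of such a sinc-parametrized band-pass kernel; the filter length, Hamming window, and frequency normalization are illustrative assumptions, and the constraints SincNet places on the cutoffs (e.g., keeping the high cutoff above the low one) are omitted.

    import math
    import torch

    def sinc(x):
        # sinc(x) = sin(pi*x) / (pi*x); the x = 0 singularity is avoided numerically.
        x = math.pi * x
        x = torch.where(x == 0, torch.full_like(x, 1e-20), x)
        return torch.sin(x) / x

    def sinc_bandpass(f_low, f_high, filt_len=101):
        # Band-pass kernel parametrized only by its two cutoff frequencies,
        # normalized to [0, 0.5] of the sampling rate.
        n = torch.arange(-(filt_len // 2), filt_len // 2 + 1, dtype=torch.float32)
        # Difference of two low-pass sinc filters = ideal band-pass filter.
        bp = 2 * f_high * sinc(2 * f_high * n) - 2 * f_low * sinc(2 * f_low * n)
        return bp * torch.hamming_window(filt_len, periodic=False)  # smooth truncation

    # The only learnable parameters of one filter: two cutoffs with a clear meaning.
    f1 = torch.tensor(0.05, requires_grad=True)  # low cutoff
    f2 = torch.tensor(0.15, requires_grad=True)  # high cutoff
    kernel = sinc_bandpass(f1, f2)               # usable as a 1-D conv kernel on raw audio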
Speech and Speaker Recognition from Raw Waveform with SincNet
Deep neural networks can learn complex and abstract representations that are
progressively obtained by combining simpler ones. A recent trend in speech and
speaker recognition consists of discovering these representations directly
from raw audio samples. Unlike standard hand-crafted features
such as MFCCs or FBANK, the raw waveform can potentially help neural networks
discover better and more customized representations. The high-dimensional raw
inputs, however, can make training significantly more challenging. This paper
summarizes our recent efforts to develop a neural architecture that efficiently
processes speech from audio waveforms. In particular, we propose SincNet, a
novel Convolutional Neural Network (CNN) that encourages the first layer to
discover meaningful filters by exploiting parametrized sinc functions. In
contrast to standard CNNs, which learn all the elements of each filter, only
the low and high cutoff frequencies of band-pass filters are directly learned from
data. This inductive bias offers a very compact way to derive a customized
front-end that depends on only a few parameters, each with a clear physical meaning.
Our experiments, conducted on both speaker and speech recognition, show that
the proposed architecture converges faster, performs better, and is more
computationally efficient than standard CNNs.
Comment: arXiv admin note: substantial text overlap with arXiv:1811.09725 and
arXiv:1808.00158
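A quick back-of-the-envelope comparison shows why this parametrization is compact; the filter count and length below are assumed for illustration, not the paper's configuration.

    # Compactness of the sinc parametrization under assumed settings:
    # 80 first-layer filters, each 251 samples long.
    num_filters, filt_len = 80, 251

    standard_cnn_params = num_filters * filt_len  # every filter tap is learned
    sincnet_params = num_filters * 2              # only f_low and f_high per filter

    print(standard_cnn_params)  # 20080
    print(sincnet_params)       # 160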
The PyTorch-Kaldi Speech Recognition Toolkit
The availability of open-source software is playing a remarkable role in the
popularization of speech recognition and deep learning. Kaldi, for instance, is
nowadays an established framework used to develop state-of-the-art speech
recognizers. PyTorch is used to build neural networks with the Python language
and has recently spawned tremendous interest within the machine learning
community thanks to its simplicity and flexibility.
The PyTorch-Kaldi project aims to bridge the gap between these popular
toolkits, trying to inherit the efficiency of Kaldi and the flexibility of
PyTorch. PyTorch-Kaldi is not only a simple interface between these software
packages; it also embeds several useful features for developing modern speech
recognizers.
For instance, the code is specifically designed to naturally plug in
user-defined acoustic models. As an alternative, users can exploit several
pre-implemented neural networks that can be customized using intuitive
configuration files. PyTorch-Kaldi supports multiple feature and label streams
as well as combinations of neural networks, enabling the use of complex neural
architectures. The toolkit is publicly released along with rich documentation
and is designed to work properly both locally and on HPC clusters.
Experiments conducted on several datasets and tasks show that
PyTorch-Kaldi can effectively be used to develop modern state-of-the-art speech
recognizers.
Comment: Accepted at ICASSP 2019
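As a rough picture of what plugging in a user-defined acoustic model amounts to, the following is a minimal PyTorch sketch; the class name, layer sizes, and interface are assumptions for illustration and do not reflect PyTorch-Kaldi's actual configuration API.

    import torch
    import torch.nn as nn

    class MyAcousticModel(nn.Module):
        # Sketch of a user-defined acoustic model: acoustic feature frames in,
        # per-frame log-posteriors over context-dependent states out (to be
        # passed to a Kaldi decoder). Name, sizes, and interface are illustrative.

        def __init__(self, feat_dim=40, hidden_dim=512, num_states=3456):
            super().__init__()
            self.rnn = nn.GRU(feat_dim, hidden_dim, num_layers=2, batch_first=True)
            self.out = nn.Linear(hidden_dim, num_states)

        def forward(self, feats):                  # feats: (batch, frames, feat_dim)
            h, _ = self.rnn(feats)
            return torch.log_softmax(self.out(h), dim=-1)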