Analyzing Hidden Representations in End-to-End Automatic Speech Recognition Systems
Neural models have become ubiquitous in automatic speech recognition systems.
While neural networks are typically used as acoustic models in more complex
systems, recent studies have explored end-to-end speech recognition systems
based on neural networks, which can be trained to directly predict text from
input acoustic features. Although such systems are conceptually elegant and
simpler than traditional systems, it is less obvious how to interpret the
trained models. In this work, we analyze the speech representations learned by
a deep end-to-end model that is based on convolutional and recurrent layers,
and trained with a connectionist temporal classification (CTC) loss. We use a
pre-trained model to generate frame-level features, which are fed to a
classifier trained on frame-wise phone classification. We evaluate
representations from different layers of the deep model and compare their
quality for predicting phone labels. Our experiments shed light on important
aspects of the end-to-end model such as layer depth, model complexity, and
other design choices.
Comment: NIPS 201
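The probing setup described above can be sketched as: freeze a pretrained encoder, take its per-frame activations, and fit only a linear softmax classifier on phone labels. Everything below (feature dimensions, the synthetic features and labels) is illustrative; the paper's actual features come from the layers of the CTC-trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for frame-level activations from one layer of a
# pretrained CTC model: 1000 frames, 64-dim features, 10 phone classes.
n_frames, n_feat, n_phones = 1000, 64, 10
features = rng.normal(size=(n_frames, n_feat))
labels = rng.integers(0, n_phones, size=n_frames)

# Linear softmax probe trained with plain gradient descent; the encoder
# itself stays frozen -- only this classifier is fit.
W = np.zeros((n_feat, n_phones))
b = np.zeros(n_phones)
onehot = np.eye(n_phones)[labels]
for _ in range(200):
    logits = features @ W + b
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    grad = (probs - onehot) / n_frames
    W -= 0.5 * (features.T @ grad)
    b -= 0.5 * grad.sum(axis=0)

accuracy = (probs.argmax(axis=1) == labels).mean()
print(f"frame-level probe accuracy: {accuracy:.2f}")
```

Comparing this probe's accuracy across layers is what lets the paper rank representation quality by depth.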
Masked Conditional Neural Networks for Environmental Sound Classification
The ConditionaL Neural Network (CLNN) exploits the nature of the temporal
sequencing of the sound signal represented in a spectrogram, and its variant
the Masked ConditionaL Neural Network (MCLNN) induces the network to learn in
frequency bands by embedding a filterbank-like sparseness over the network's
links using a binary mask. Additionally, the masking automates the exploration
of different feature combinations concurrently analogous to handcrafting the
optimum combination of features for a recognition task. We have evaluated the
MCLNN performance using the Urbansound8k dataset of environmental sounds.
Additionally, we present a collection of manually recorded sounds for rail and
road traffic, YorNoise, to investigate the confusion rates among machine
generated sounds possessing low-frequency components. MCLNN has achieved
competitive results without augmentation and using 12% of the trainable
parameters utilized by an equivalent model based on state-of-the-art
Convolutional Neural Networks on the Urbansound8k. We extended the Urbansound8k
dataset with YorNoise, where experiments have shown that common tonal
properties affect the classification performance.
Comment: Conditional Neural Networks, CLNN, Masked Conditional Neural
Networks, MCLNN, Restricted Boltzmann Machine, RBM, Conditional Restricted
Boltzmann Machine, CRBM, Deep Belief Nets, Environmental Sound Recognition,
ESR, YorNoise
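The filterbank-like binary mask over the network's links can be sketched as a band of ones that slides across the input dimension; `bandwidth` and `overlap` here are illustrative parameters, not the paper's exact parameterization.

```python
import numpy as np

def filterbank_mask(n_in, n_out, bandwidth, overlap):
    """Binary mask whose active band slides across the input dimension,
    mimicking the filterbank-like sparseness described for the MCLNN.
    Wraps around at the edges for simplicity."""
    mask = np.zeros((n_in, n_out), dtype=int)
    step = max(bandwidth - overlap, 1)
    for j in range(n_out):
        start = (j * step) % n_in
        for k in range(bandwidth):
            mask[(start + k) % n_in, j] = 1
    return mask

mask = filterbank_mask(n_in=12, n_out=6, bandwidth=4, overlap=2)
weights = np.random.default_rng(1).normal(size=(12, 6))
masked_weights = weights * mask   # inactive links are forced to zero
print(mask.sum(axis=0))           # each hidden unit sees `bandwidth` inputs
```

Because each hidden unit sees a different frequency band, the mask makes the network explore different feature (band) combinations in parallel, which is the automation of handcrafted feature selection the abstract refers to.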
An Auto Encoder For Audio Dolphin Communication
Research in dolphin communication and cognition requires detailed inspection
of audible dolphin signals. The manual analysis of these signals is cumbersome
and time-consuming. We seek to automate parts of the analysis using modern deep
learning methods. We propose to learn an autoencoder constructed from
convolutional and recurrent layers trained in an unsupervised fashion. The
resulting model embeds patterns in audible dolphin communication. In several
experiments, we show that the embeddings can be used for clustering as well as
signal detection and signal type classification.
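The unsupervised embedding idea can be sketched with a minimal linear autoencoder trained on reconstruction error alone; the paper's model uses convolutional and recurrent layers, and the data below is a synthetic stand-in for dolphin-signal features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for feature vectors of dolphin signals:
# 200 examples drawn from two latent signal types.
centers = rng.normal(size=(2, 32))
data = np.stack([centers[i % 2] + 0.1 * rng.normal(size=32)
                 for i in range(200)])

def recon_mse(W):
    recon = (data @ W) @ W.T
    return float(np.mean((recon - data) ** 2))

# Minimal linear autoencoder with a 2-d bottleneck and tied weights,
# trained unsupervised by gradient descent on reconstruction error.
W = 0.01 * rng.normal(size=(32, 2))
mse_before = recon_mse(W)
for _ in range(2000):
    z = data @ W                      # encode
    err = z @ W.T - data              # reconstruction residual
    grad = (data.T @ err @ W + err.T @ data @ W) / len(data)
    W -= 1e-4 * grad
mse_after = recon_mse(W)

embeddings = data @ W                 # 2-d embeddings, e.g. for clustering
print(mse_before, "->", mse_after)
```

The resulting `embeddings` are what downstream clustering, detection, or type classification would consume, mirroring the experiments in the abstract.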
Recent Advances in Physical Reservoir Computing: A Review
Reservoir computing is a computational framework suited for
temporal/sequential data processing. It is derived from several recurrent
neural network models, including echo state networks and liquid state machines.
A reservoir computing system consists of a reservoir for mapping inputs into a
high-dimensional space and a readout for pattern analysis from the
high-dimensional states in the reservoir. The reservoir is fixed and only the
readout is trained with a simple method such as linear regression and
classification. Thus, the major advantage of reservoir computing compared to
other recurrent neural networks is fast learning, resulting in low training
cost. Another advantage is that the reservoir without adaptive updating is
amenable to hardware implementation using a variety of physical systems,
substrates, and devices. In fact, such physical reservoir computing has
attracted increasing attention in diverse fields of research. The purpose of
this review is to provide an overview of recent advances in physical reservoir
computing by classifying them according to the type of the reservoir. We
discuss the current issues and perspectives related to physical reservoir
computing, in order to further expand its practical applications and develop
next-generation machine learning systems.
Comment: 62 pages, 13 figures
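The reservoir/readout split described above is easy to sketch as an echo state network: a fixed random recurrent network mapping inputs into a high-dimensional state, plus a readout fit by ridge regression. The task (one-step-ahead prediction of a sine) and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed random reservoir (never trained), scaled to spectral radius < 1
# so the echo state property plausibly holds.
n_res, n_in = 100, 1
W_in = rng.uniform(-0.5, 0.5, size=(n_res, n_in))
W = rng.normal(size=(n_res, n_res))
W *= 0.9 / max(abs(np.linalg.eigvals(W)))

# Input: a sine wave; target: the same signal one step ahead.
u = np.sin(np.arange(500) * 0.1)[:, None]
states = np.zeros((len(u), n_res))
x = np.zeros(n_res)
for t in range(len(u)):
    x = np.tanh(W_in @ u[t] + W @ x)   # reservoir update (fixed weights)
    states[t] = x

# Only the linear readout is trained, here by ridge regression.
washout = 50
X, y = states[washout:-1], u[washout + 1:]
W_out = np.linalg.solve(X.T @ X + 1e-6 * np.eye(n_res), X.T @ y)

pred = X @ W_out
print("readout MSE:", float(np.mean((pred - y) ** 2)))
```

Because only `W_out` is fit, training reduces to one linear solve, which is exactly the low training cost the review highlights; a physical reservoir simply replaces the simulated state update with a measured physical system.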
A Closer Look at Weak Label Learning for Audio Events
Audio content analysis in terms of sound events is an important research
problem for a variety of applications. Recently, the development of weak
labeling approaches for audio or sound event detection (AED) and availability
of large scale weakly labeled dataset have finally opened up the possibility of
large scale AED. However, a deeper understanding of how weak labels affect the
learning for sound events is still missing from the literature. In this work, we
first describe a CNN based approach for weakly supervised training of audio
events. The approach follows some basic design principles desirable in a
learning method relying on weakly labeled audio. We then describe important
characteristics, which naturally arise in weakly supervised learning of sound
events. We show how these aspects of weak labels affect the generalization of
models. More specifically, we study how characteristics such as label density
and corruption of labels affect weakly supervised training for audio events.
We also study the feasibility of directly obtaining weakly labeled data from
the web without any manual labels and compare it with a manually labeled
dataset. The analysis and understanding of these factors should be taken into
account in the development of future weak label learning methods. AudioSet, a
large scale weakly labeled dataset for sound events, is used in our
experiments.
Comment: 10 pages
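A core ingredient of weakly supervised AED is pooling frame-level scores into a single clip-level prediction, since a weak label only says the event occurs *somewhere* in the clip. Max pooling, shown here, is one common choice; the paper's exact pooling function may differ.

```python
import numpy as np

def clip_probability(frame_scores):
    """Map frame-level event scores (logits) to one clip-level probability.
    With max pooling, the clip is positive if any single frame is."""
    return 1.0 / (1.0 + np.exp(-np.max(frame_scores)))

# A weak label carries no timing: only one confident frame is needed to
# drive the clip-level prediction toward "event present".
frame_scores = np.array([-3.0, -2.5, 4.0, -1.0])
print(clip_probability(frame_scores))
```

This is also where the label-density and label-corruption effects studied in the paper enter: pooling decides how much a few (possibly mislabeled) frames dominate the clip-level loss.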
On the Robustness of Speech Emotion Recognition for Human-Robot Interaction with Deep Neural Networks
Speech emotion recognition (SER) is an important aspect of effective
human-robot collaboration and has received considerable attention from the
research community. For example, many neural network-based architectures have
been proposed recently, pushing performance to a new level. However, the
applicability
of such neural SER models trained only on in-domain data to noisy conditions is
currently under-researched. In this work, we evaluate the robustness of
state-of-the-art neural acoustic emotion recognition models in human-robot
interaction scenarios. We hypothesize that a robot's ego noise, room
conditions, and various acoustic events that can occur in a home environment
can significantly affect the performance of a model. We conduct several
experiments on the iCub robot platform and propose several novel ways to reduce
the gap between the model's performance during training and testing in
real-world conditions. Furthermore, we observe large improvements in the model
performance on the robot and demonstrate the necessity of introducing several
data augmentation techniques like overlaying background noise and loudness
variations to improve the robustness of the neural approaches.
Comment: Submitted to IROS'18, Madrid, Spain
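The two augmentation techniques the abstract names, background-noise overlay and loudness variation, can be sketched as follows. The SNR-based mixing rule is a standard recipe, not necessarily the paper's exact procedure.

```python
import numpy as np

def overlay_noise(speech, noise, snr_db):
    """Mix background noise into a speech signal at a target SNR (dB)."""
    noise = noise[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

def vary_loudness(signal, gain_db):
    """Loudness variation: apply a flat gain in dB."""
    return signal * 10 ** (gain_db / 20)

rng = np.random.default_rng(0)
speech = np.sin(np.arange(16000) * 0.05)        # toy 1-second signal
noise = rng.normal(size=16000)                  # stand-in for ego noise
augmented = overlay_noise(vary_loudness(speech, -6.0), noise, snr_db=10.0)
```

Training on such perturbed copies is what narrows the train/test gap the paper reports for real robot conditions, where recorded ego noise and room responses would replace the synthetic `noise` here.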
A general-purpose deep learning approach to model time-varying audio effects
Audio processors whose parameters are modified periodically over time are
often referred to as time-varying or modulation-based audio effects. Most
existing methods for modeling these types of effect units are optimized for a
very specific circuit and cannot be efficiently generalized to other
time-varying
effects. Based on convolutional and recurrent neural networks, we propose a
deep learning architecture for generic black-box modeling of audio processors
with long-term memory. We explore the capabilities of deep neural networks to
learn such long temporal dependencies and we show the network modeling various
linear and nonlinear, time-varying and time-invariant audio effects. In order
to measure the performance of the model, we propose an objective metric based
on the psychoacoustics of modulation frequency perception. We also analyze what
the model is actually learning and how the given task is accomplished.
Comment: audio files: https://mchijmma.github.io/modeling-time-varying
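A crude stand-in for a modulation-frequency-based metric: extract an amplitude envelope, take its spectrum, and compare target against model output. The paper's psychoacoustically motivated metric is more elaborate; this only illustrates why such a metric distinguishes, say, a tremolo from an unmodulated tone even when sample-wise error is small.

```python
import numpy as np

def modulation_spectrum(x, frame=256):
    """Crude modulation spectrum: rectify, smooth to an envelope, then
    take the FFT magnitude of the mean-removed envelope."""
    env = np.convolve(np.abs(x), np.ones(frame) / frame, mode="valid")
    return np.abs(np.fft.rfft(env - env.mean()))

def modulation_error(target, output):
    """Normalized distance between modulation spectra."""
    t, o = modulation_spectrum(target), modulation_spectrum(output)
    return float(np.mean((t - o) ** 2) / np.mean(t ** 2))

sr = 8000
t = np.arange(sr) / sr
carrier = np.sin(2 * np.pi * 440 * t)
tremolo = (1 + 0.5 * np.sin(2 * np.pi * 5 * t)) * carrier  # 5 Hz modulation
print(modulation_error(tremolo, carrier))
```

A waveform-domain loss barely notices a slow 5 Hz amplitude wobble, whereas the envelope spectrum exposes it as a sharp peak, which is the rationale for evaluating modulation effects this way.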
ReNet: A Recurrent Neural Network Based Alternative to Convolutional Networks
In this paper, we propose a deep neural network architecture for object
recognition based on recurrent neural networks. The proposed network, called
ReNet, replaces the ubiquitous convolution+pooling layer of the deep
convolutional neural network with four recurrent neural networks that sweep
horizontally and vertically in both directions across the image. We evaluate
the proposed ReNet on three widely-used benchmark datasets: MNIST, CIFAR-10 and
SVHN. The results suggest that ReNet is a viable alternative to the deep
convolutional neural network, and that further investigation is needed.
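The directional sweeps can be sketched with a plain tanh RNN run left-to-right and right-to-left over the rows of a patch grid; in ReNet the vertical pair then consumes the concatenated horizontal result after a transpose. All sizes and weights below are illustrative, and only the horizontal layer is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_sweep(rows, W_x, W_h):
    """Run a simple tanh RNN along axis 1 of `rows` (shape: n_rows,
    n_steps, d_in); return the hidden state at every step."""
    n_rows, n_steps, _ = rows.shape
    d_h = W_h.shape[0]
    h = np.zeros((n_rows, d_h))
    out = np.zeros((n_rows, n_steps, d_h))
    for t in range(n_steps):
        h = np.tanh(rows[:, t] @ W_x + h @ W_h)
        out[:, t] = h
    return out

# Toy "image" of feature patches: an 8x8 grid with 3 channels per patch.
img = rng.normal(size=(8, 8, 3))
d_h = 4
W_x = 0.3 * rng.normal(size=(3, d_h))
W_h = 0.3 * rng.normal(size=(d_h, d_h))

# Horizontal pair: left->right, then right->left (via a flip), with the
# two hidden sequences concatenated channel-wise as in ReNet.
lr = rnn_sweep(img, W_x, W_h)
rl = rnn_sweep(img[:, ::-1], W_x, W_h)[:, ::-1]
horiz = np.concatenate([lr, rl], axis=-1)   # shape (8, 8, 2 * d_h)
print(horiz.shape)
```

Because each hidden state depends on the whole row (and, after the vertical pair, the whole image), one ReNet layer has a global receptive field, unlike a local convolution+pooling layer.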
Deep Learning for Sensor-based Activity Recognition: A Survey
Sensor-based activity recognition seeks high-level knowledge about human
activities from multitudes of low-level sensor readings. Conventional pattern
recognition approaches have made tremendous progress in recent years. However,
those methods often rely heavily on heuristic hand-crafted feature extraction,
which can hinder their generalization performance. Additionally, existing
methods struggle with unsupervised and incremental learning tasks. Recent
advances in deep learning have made it possible to perform automatic high-level
feature extraction, achieving promising performance in many areas. Since then,
deep learning based methods have been widely adopted for sensor-based activity
recognition tasks. This paper surveys recent advances in deep learning based
sensor-based activity recognition. We summarize existing literature from three
aspects: sensor modality, deep model, and application. We also present detailed
insights on existing work and propose grand challenges for future research.
Comment: 10 pages, 2 figures, and 5 tables; submitted to Pattern Recognition
Letters (second revision)
An Optimized Recurrent Unit for Ultra-Low-Power Keyword Spotting
There is growing interest in being able to run neural networks on sensors,
wearables and internet-of-things (IoT) devices. However, the computational
demands of neural networks make them difficult to deploy on
resource-constrained edge devices.
To address this challenge, our work introduces a new recurrent unit architecture that
is specifically adapted for on-device low power acoustic event detection (AED).
The proposed architecture is based on the gated recurrent unit (`GRU') but
features optimizations that make it implementable on ultra-low power
micro-controllers such as the Arm Cortex M0+.
Our new architecture, the Embedded Gated Recurrent Unit (eGRU) is
demonstrated to be highly efficient and suitable for short-duration AED and
keyword spotting tasks. A single eGRU cell is 60x faster and 10x smaller than a
GRU cell. Despite its optimizations, eGRU compares well with GRU across tasks
of varying complexities.
The practicality of eGRU is investigated in a wearable acoustic event
detection application. An eGRU model is implemented and tested on the Arm
Cortex M0-based Atmel ATSAMD21E18 processor. The Arm M0+ implementation of the
eGRU model compares favorably with a full precision GRU that is running on a
workstation. The embedded eGRU model achieves a classification accuracy of
95.3%, which is only 2% less than that of the full precision GRU.
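For reference, here is a standard GRU cell next to one illustrative low-cost variant: softsign in place of tanh/sigmoid (so no exp() is needed on an M0+-class micro-controller) and no reset gate. This sketches the *kind* of optimization described, not the paper's exact eGRU definition.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h, p):
    """Standard full-precision GRU cell."""
    z = sigmoid(x @ p["Wz"] + h @ p["Uz"])          # update gate
    r = sigmoid(x @ p["Wr"] + h @ p["Ur"])          # reset gate
    n = np.tanh(x @ p["Wn"] + (r * h) @ p["Un"])    # candidate state
    return (1 - z) * h + z * n

def softsign(x):
    return x / (1.0 + np.abs(x))

def egru_like_cell(x, h, p):
    """Illustrative eGRU-style cell: cheap softsign activations and no
    reset gate. Hypothetical simplification, not the published eGRU."""
    z = 0.5 * (softsign(x @ p["Wz"] + h @ p["Uz"]) + 1)  # cheap gate
    n = softsign(x @ p["Wn"] + h @ p["Un"])              # candidate state
    return (1 - z) * h + z * n

rng = np.random.default_rng(0)
d_in, d_h = 8, 16
p = {k: 0.2 * rng.normal(size=(d_in if k.startswith("W") else d_h, d_h))
     for k in ["Wz", "Uz", "Wr", "Ur", "Wn", "Un"]}
x, h = rng.normal(size=d_in), np.zeros(d_h)
print(gru_cell(x, h, p))
print(egru_like_cell(x, h, p))
```

Dropping a gate removes a third of the matrix multiplies per step, and avoiding transcendental functions is what makes such a cell practical on an integer-oriented Cortex M0+ core.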